# llamadart
llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.
## Features
- **High Performance**: Powered by llama.cpp's optimized C++ kernels.
- **Zero Configuration**: Uses the modern Pure Native Asset mechanism; no manual build scripts or platform folders required.
- **Cross-Platform**: Full support for Android, iOS, macOS, Linux, and Windows.
- **GPU Acceleration**:
  - Apple: Metal (macOS/iOS)
  - Android/Linux/Windows: Vulkan
- **Multimodal Support**: Run vision and audio models (LLaVA, Gemma 3, Qwen2-VL) with integrated media processing.
- **Resumable Downloads**: Robust, background-safe model downloads with parallel chunking and `.meta`-based persistence tracking.
- **LoRA Support**: Apply fine-tuned adapters (GGUF) dynamically at runtime.
- **Web Support**: Run inference in the browser via WASM (powered by `wllama` v2).
- **Dart-First API**: Streamlined architecture with decoupled backends.
- **Logging Control**: Toggle native engine output or use granular filtering on Web.
- **High Coverage**: Robust test suite with 80%+ global core coverage.
## Architecture
llamadart 0.3.0+ uses a modern, decoupled architecture designed for flexibility and platform independence:
- **LlamaEngine**: The primary high-level orchestrator. It handles the model lifecycle, tokenization, and chat templating, and manages the inference stream.
- **ChatSession**: A stateful wrapper for `LlamaEngine` that automatically manages conversation history and system prompts and enforces context-window limits (sliding window).
- **LlamaBackend**: A platform-agnostic interface that allows swapping implementation details:
  - `NativeLlamaBackend`: Uses Dart FFI and background Isolates for high-performance desktop/mobile inference.
  - `WebLlamaBackend`: Uses WebAssembly and the `wllama` JS library for in-browser inference.
- **LlamaBackendFactory**: Automatically selects the appropriate backend for the current platform (see the sketch below).
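The `LlamaBackend()` calls in the snippets below rely on this factory behavior. If you ever need to pin a backend explicitly, a minimal sketch (whether `NativeLlamaBackend` and `WebLlamaBackend` expose public default constructors is an assumption, not confirmed by this README):

```dart
import 'package:llamadart/llamadart.dart';

// Default: LlamaBackend() resolves to the right backend for this platform
// (NativeLlamaBackend on desktop/mobile, WebLlamaBackend on the web).
final engine = LlamaEngine(LlamaBackend());

// Explicit selection (assumed constructors; check the API docs):
// final native = LlamaEngine(NativeLlamaBackend()); // Dart FFI + Isolates
// final web = LlamaEngine(WebLlamaBackend());       // WASM via wllama
```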
## Quick Start

llamadart runs on all major platforms:

| Platform | Architecture(s) | GPU Backend | Status |
|---|---|---|---|
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (device), x86_64 (simulator) | Metal (device), CPU (simulator) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM | CPU | ✅ Tested |
## Installation
Add llamadart to your `pubspec.yaml`:

```yaml
dependencies:
  llamadart: ^0.4.0
```
### Zero Setup (Native Assets)
llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (`dart run` or `flutter run`), the package automatically:

- Detects your target platform and architecture.
- Downloads the appropriate pre-compiled binary from GitHub.
- Bundles it seamlessly into your application.
No manual binary downloads, CMake configuration, or platform-specific project changes are needed.
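In practice a plain run is all it takes. Note that on some Flutter SDK channels Native Assets may still sit behind a feature flag; whether you need it depends on your SDK version:

```sh
# First run: the build hook detects the target, fetches the matching
# prebuilt binary from GitHub, and bundles it automatically.
flutter run

# Only if your SDK still gates Native Assets behind a flag (SDK-dependent):
flutter config --enable-native-assets
```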
## Usage
### 1. Simple Usage
The easiest way to get started is by using the default `LlamaBackend`.
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects the Native or Web backend
  final engine = LlamaEngine(LlamaBackend());
  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
```
### 2. Advanced Usage (ChatSession)
Use `ChatSession` for most chat applications. It automatically manages conversation history and system prompts, and handles context-window limits.
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt and optional tools
    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful assistant.',
      toolRegistry: myToolRegistry, // Optional
    );

    // Just send user text; history and tools are handled automatically.
    // The model decides when to use tools or respond directly.
    await for (final token in session.chat('What is the capital of France?')) {
      stdout.write(token);
    }
  } finally {
    await engine.dispose();
  }
}
```
### 3. Tool Calling
llamadart supports tool calling, where the model can invoke registered external functions to answer questions.
```dart
final registry = ToolRegistry([
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
]);

final session = ChatSession(engine, toolRegistry: registry);

// "how's the weather in London?" -> calls get_weather -> "It is 22°C and sunny in London"
await for (final token in session.chat("how's the weather in London?")) {
  stdout.write(token);
}
```
### 4. Multimodal Usage (Vision/Audio)
llamadart supports multimodal models (vision and audio) via `LlamaChatMessage.multimodal`.
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('vision-model.gguf');
    // Load the multimodal projector that pairs with the vision model
    await engine.loadMultimodalProjector('mmproj.gguf');

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.multimodal(
        role: LlamaChatRole.user,
        parts: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use singleTurn for one-off multimodal requests
    final response = await ChatSession.singleTurn(engine, messages);
    print(response);
  } finally {
    await engine.dispose();
  }
}
```
## Model-Specific Notes
### Moondream 2 & Phi-2

These models use a tokenizer configuration where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

- **Disables Auto-BOS**: Prevents the model from stopping immediately upon generation.
- **Manual Templates**: Automatically applies the required `Question:` / `Answer:` format if the model metadata lacks a chat template.
- **Stop Sequences**: Injects `Question:` as a stop sequence to prevent rambling in multi-turn conversations.
## Resource Management
Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.
- **Explicit Disposal**: Always call `await engine.dispose()` when you are finished with an engine instance.
- **Native Stability**: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
- **Hot Restart Support**: In Flutter, placing the engine inside a `Provider` or `State` and calling `dispose()` in the appropriate lifecycle method ensures stability across Hot Restarts:
```dart
@override
void dispose() {
  _engine.dispose();
  super.dispose();
}
```
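For instance, a minimal `StatefulWidget` sketch (the widget scaffolding here is illustrative; only `LlamaEngine`, `LlamaBackend`, and `dispose()` come from this README):

```dart
import 'package:flutter/widgets.dart';
import 'package:llamadart/llamadart.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  // The engine is owned by this State and released in dispose() below.
  final _engine = LlamaEngine(LlamaBackend());

  @override
  void dispose() {
    // Releases native memory and background Isolates, including on Hot Restart.
    _engine.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    // Placeholder UI; a real screen would stream tokens from _engine here.
    return const SizedBox.shrink();
  }
}
```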
## Low-Rank Adaptation (LoRA)
llamadart supports applying multiple LoRA adapters dynamically at runtime.
- **Dynamic Scaling**: Adjust the strength (`scale`) of each adapter on the fly.
- **Isolate-Safe**: Native adapters are managed in a background Isolate to prevent UI jank.
- **Efficient**: Multiple LoRAs share the memory of a single base model.
Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
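The adapter API itself isn't shown in this README; as a rough sketch only (the `loadLoraAdapter` and `setLoraScale` names below are illustrative assumptions, not confirmed llamadart API):

```dart
// Sketch only: loadLoraAdapter and setLoraScale are assumed names,
// not confirmed llamadart API. Check the package docs for the real calls.
final engine = LlamaEngine(LlamaBackend());
await engine.loadModel('base-model.gguf');

// Apply a GGUF adapter on top of the shared base model (assumed call).
final adapter = await engine.loadLoraAdapter('adapter.gguf', scale: 1.0);

// Dynamic scaling: tune the adapter's strength at runtime (assumed call).
await engine.setLoraScale(adapter, 0.5);
```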
## Testing & Quality
This project maintains a high standard of quality with 80%+ global test coverage.
- **Multi-Platform Testing**: All tests run across the Dart VM and Chrome automatically.
- **CI/CD**: Automatic analysis, linting, and cross-platform test execution on every PR.
```sh
# Run all tests (VM and Chrome)
dart test

# Run tests with coverage
dart test --coverage=coverage
```
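To turn the raw coverage output into an LCOV report for standard tooling, the usual `package:coverage` workflow applies (this is generic Dart tooling, not something specific to llamadart):

```sh
# Convert raw coverage data into an LCOV report (standard package:coverage tooling)
dart pub global activate coverage
dart pub global run coverage:format_coverage \
  --lcov --in=coverage --report-on=lib -o coverage/lcov.info
```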
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.
## License
This project is licensed under the MIT License - see the LICENSE file for details.