# llamadart

A Dart/Flutter plugin for llama.cpp: run LLM inference on any platform using GGUF models.
llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.
## ✨ Features
- 🚀 High Performance: Powered by llama.cpp's optimized C++ kernels.
- 🛠️ Zero Configuration: Uses the modern Pure Native Asset mechanism; no manual build scripts or platform folders required.
- 📱 Cross-Platform: Full support for Android, iOS, macOS, Linux, and Windows.
- ⚡ GPU Acceleration:
  - Apple: Metal (macOS/iOS)
  - Android/Linux/Windows: Vulkan
- LoRA Support: Apply fine-tuned adapters (GGUF) dynamically at runtime.
- 🌐 Web Support: Run inference in the browser via WASM (powered by wllama v2).
- 🎯 Dart-First API: Streamlined architecture with decoupled backends.
- 📋 Logging Control: Toggle native engine output or use granular filtering on Web.
- 🧪 High Coverage: Robust test suite with 80%+ global core coverage.
## 🏗️ Architecture
llamadart 0.3.0+ uses a modern, decoupled architecture designed for flexibility and platform independence:
- `LlamaEngine`: The primary high-level orchestrator. It handles the model lifecycle, tokenization, and chat templating, and manages the inference stream.
- `LlamaBackend`: A platform-agnostic interface that allows swapping implementation details:
  - `NativeLlamaBackend`: Uses Dart FFI and background Isolates for high-performance desktop/mobile inference.
  - `WebLlamaBackend`: Uses WebAssembly and the wllama JS library for in-browser inference.
- `LlamaBackendFactory`: Automatically selects the appropriate backend for your current platform.
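For illustration, here is a minimal sketch of wiring these pieces together by hand instead of relying on `LlamaBackendFactory`. It assumes the backends' default constructors and uses Flutter's `kIsWeb` for the platform check:

```dart
import 'package:flutter/foundation.dart' show kIsWeb;
import 'package:llamadart/llamadart.dart';

LlamaEngine createEngine() {
  // Choose the backend explicitly; LlamaBackendFactory performs the
  // same platform-based selection automatically. (Real cross-compiled
  // code would likely use conditional imports; this is just a sketch.)
  final LlamaBackend backend =
      kIsWeb ? WebLlamaBackend() : NativeLlamaBackend();
  return LlamaEngine(backend);
}
```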
## 🚀 Supported Platforms
| Platform | Architecture(s) | GPU Backend | Status |
|---|---|---|---|
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (Device), x86_64 (Sim) | Metal (Device), CPU (Sim) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM | CPU | ✅ Tested |
## 📦 Installation
Add `llamadart` to your `pubspec.yaml`:
```yaml
dependencies:
  llamadart: ^0.3.0
```
### Zero Setup (Native Assets)
llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (`dart run` or `flutter run`), the package automatically:
- Detects your target platform and architecture.
- Downloads the appropriate pre-compiled binary from GitHub.
- Bundles it seamlessly into your application.
No manual binary downloads, CMake configuration, or platform-specific project changes are needed.
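In other words, the ordinary run commands are all it takes; the first invocation performs the download:

```bash
# The first run triggers the build hook, which fetches and
# bundles the matching pre-compiled native binary.
dart run       # pure Dart
flutter run    # Flutter app
```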
## 🛠️ Usage
### 1. Simple Usage
The easiest way to get started is with the default `LlamaBackend`:
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects the Native or Web backend
  final engine = LlamaEngine(LlamaBackend());
  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
```
### 2. Advanced Usage (Decoupled Engine)
Instantiate a specific backend and pass it to `LlamaEngine` for more granular control, such as swapping backends or manual context management:
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // Explicitly select the Native (or Web) backend
  final backend = NativeLlamaBackend();
  final engine = LlamaEngine(backend);
  try {
    await engine.loadModel('model.gguf');

    // High-level chat interface (handles templates and stop sequences)
    final messages = [
      LlamaChatMessage(role: 'system', content: 'You are a poetic assistant.'),
      LlamaChatMessage(role: 'user', content: 'Tell a story about a cat.'),
    ];
    await for (final text in engine.chat(messages)) {
      print(text);
    }
  } finally {
    await engine.dispose();
  }
}
```
## 🧹 Resource Management
Because llamadart allocates significant native memory and runs background worker Isolates/threads, correct lifecycle management is essential.
- Explicit Disposal: Always call `await engine.dispose()` when you are finished with an engine instance.
- Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
- Hot Restart Support: In Flutter, placing the engine inside a `Provider` or `State` and calling `dispose()` in the appropriate lifecycle method ensures stability across Hot Restarts:
```dart
@override
void dispose() {
  _engine.dispose();
  super.dispose();
}
```
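Putting this together, a minimal `StatefulWidget` sketch might look like the following; the model path is a placeholder and error handling is omitted:

```dart
import 'package:flutter/material.dart';
import 'package:llamadart/llamadart.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  // One engine per State instance.
  final _engine = LlamaEngine(LlamaBackend());

  @override
  void initState() {
    super.initState();
    // Fire-and-forget here; a real app would track this Future
    // to show a loading indicator.
    _engine.loadModel('path/to/model.gguf'); // placeholder path
  }

  @override
  void dispose() {
    // Releases native memory and stops background Isolates.
    _engine.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) => const Placeholder();
}
```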
## 🎨 Low-Rank Adaptation (LoRA)
llamadart supports applying multiple LoRA adapters dynamically at runtime.
- Dynamic Scaling: Adjust the strength (`scale`) of each adapter on the fly.
- Isolate-Safe: Native adapters are managed in a background Isolate to prevent UI jank.
- Efficient: Multiple LoRAs share the memory of a single base model.
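As a purely hypothetical sketch of what this looks like in code — the method names `loadLora` and `setLoraScale` below are illustrative assumptions, not the package's confirmed API; consult the API reference for the real calls:

```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // HYPOTHETICAL sketch: loadLora/setLoraScale are illustrative
  // names, not the package's confirmed API.
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('base-model.gguf');

    // Apply a fine-tuned adapter on top of the shared base weights.
    await engine.loadLora('my-adapter.gguf', scale: 0.8);

    // Dynamic scaling: adjust the adapter's strength on the fly.
    await engine.setLoraScale('my-adapter.gguf', 0.4);
  } finally {
    await engine.dispose();
  }
}
```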
Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
## 🧪 Testing & Quality
This project maintains a high standard of quality with 80%+ global test coverage.
- Native Tests: Integration tests using real GGUF models via FFI.
- Web Tests: Browser-based unit and integration tests using Chrome.
- CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.
```bash
# Run all native tests
dart test

# Run web tests (requires Chrome)
dart test -p chrome test/web_backend_unit_test.dart
```
## 🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.
## 📄 License
This project is licensed under the MIT License; see the LICENSE file for details.