llamadart
llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It lets you run GGUF LLMs locally across native platforms and web (CPU/WebGPU bridge path).
Features
- High Performance: Powered by `llama.cpp` kernels.
- Zero Configuration: Uses Pure Native Assets; no manual CMake or platform project edits.
- Cross-Platform: Android, iOS, macOS, Linux, Windows, and web.
- GPU Acceleration:
  - Apple: Metal
  - Android/Linux/Windows: Vulkan by default, with optional target-specific modules
  - Web: WebGPU via bridge runtime (with CPU fallback)
- Multimodal Support: Vision/audio model runtime support.
- LoRA Support: Runtime GGUF adapter application.
- Split Logging Control: Dart logs and native logs can be configured independently.
Start Here (Plugin Users)
1. Add dependency
dependencies:
  llamadart: ^0.6.1
2. Run with defaults
On first dart run / flutter run, llamadart will:
- Detect platform/architecture.
- Download the matching native runtime bundle from `leehack/llamadart-native`.
- Wire it into your app via native assets.
No manual binary download or C++ build steps are required.
3. Optional: choose backend modules per target (non-Apple)
Add the following to your app's pubspec.yaml:
hooks:
  user_defines:
    llamadart:
      llamadart_native_backends:
        platforms:
          android-arm64: [vulkan] # opencl is opt-in
          linux-x64: [vulkan, cuda]
          windows-x64: [vulkan, cuda]
If a requested module is unavailable for a target, llamadart logs a warning and falls back to target defaults.
4. Minimal first model load
import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('path/to/model.gguf');
    await for (final token in engine.generate('Hello')) {
      print(token);
    }
  } finally {
    await engine.dispose();
  }
}
Platform Defaults and Configurability
| Target | Default runtime backends | Configurable in pubspec.yaml |
|---|---|---|
| android-arm64 / android-x64 | cpu, vulkan | yes |
| linux-arm64 / linux-x64 | cpu, vulkan | yes |
| windows-arm64 / windows-x64 | cpu, vulkan | yes |
| macos-arm64 / macos-x86_64 | cpu, metal | no |
| ios-arm64 / ios simulators | cpu, metal | no |
| web | webgpu, cpu (bridge router) | n/a |
Full module matrix (available modules by target)
Backend module matrix from pinned native tag b8099:
| Target | Available backend modules in bundle |
|---|---|
| android-arm64 | cpu, vulkan, opencl |
| android-x64 | cpu, vulkan, opencl |
| linux-arm64 | cpu, vulkan, blas |
| linux-x64 | cpu, vulkan, blas, cuda, hip |
| windows-arm64 | cpu, vulkan, blas |
| windows-x64 | cpu, vulkan, blas, cuda |
| macos-arm64 | n/a (single consolidated native lib) |
| macos-x86_64 | n/a (single consolidated native lib) |
| ios-arm64 | n/a (single consolidated native lib) |
| ios-arm64-sim | n/a (single consolidated native lib) |
| ios-x86_64-sim | n/a (single consolidated native lib) |
Recognized backend names for `llamadart_native_backends`:
`vulkan`, `cpu`, `opencl`, `cuda`, `blas`, `metal`, `hip`
Accepted aliases:
`vk` -> `vulkan`, `ocl` -> `opencl`, `open-cl` -> `opencl`
Notes:
- Module availability depends on the pinned native release bundle and may change when the native tag updates.
- Configurable targets always keep `cpu` bundled as a fallback.
- Android keeps OpenCL available for opt-in, but defaults to Vulkan.
- `KleidiAI` and `ZenDNN` are CPU-path optimizations in `llama.cpp`, not standalone backend module files.
- `example/chat_app` backend settings show runtime-detected backends/devices (what initialized), not only bundled module files.
- `example/chat_app` no longer exposes an `Auto` selector; it lists concrete detected backends.
- Legacy saved `Auto` preferences in `example/chat_app` are auto-migrated at runtime.
- Apple targets are intentionally non-configurable in this hook path and use consolidated native libraries.
- The native-assets hook refreshes emitted files on each build; if you are upgrading from older cached outputs, run `flutter clean` once.
If you change `llamadart_native_backends`, run `flutter clean` once so stale native-asset outputs do not override the new bundle selection.
Web Backend Notes (Router)
The default web backend uses `WebGpuLlamaBackend` as a router for the WebGPU and CPU paths.
- Web mode is currently experimental and depends on an external JS bridge runtime.
- Bridge API contract: WebGPU bridge contract.
- Runtime assets are published via `leehack/llama-web-bridge-assets`.
- `example/chat_app` prefers local bridge assets, then falls back to jsDelivr.
- Browser Cache Storage is used for repeated model loads when `useCache` is enabled (default).
- `loadMultimodalProjector` is supported on web for URL-based model/mmproj assets.
- `supportsVision` and `supportsAudio` reflect loaded projector capabilities.
- LoRA runtime adapters are not currently supported on web.
- `setLogLevel` / `setNativeLogLevel` changes take effect on the next model load.
If your app targets both native and web, gate feature toggles by capability checks.
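A minimal capability-gating sketch, assuming `supportsVision` / `supportsAudio` are exposed as synchronous getters on `LlamaEngine` once a model and projector are loaded (their exact location in the API may differ), and using Flutter's `kIsWeb` to disable LoRA controls on web:

```dart
import 'package:flutter/foundation.dart' show kIsWeb;
import 'package:llamadart/llamadart.dart';

/// Illustrative feature flags derived from runtime capability checks.
class FeatureFlags {
  const FeatureFlags({
    required this.vision,
    required this.audio,
    required this.lora,
  });

  final bool vision;
  final bool audio;
  final bool lora;
}

FeatureFlags detectFeatures(LlamaEngine engine) {
  return FeatureFlags(
    // Assumption: these getters reflect the loaded projector capabilities.
    vision: engine.supportsVision,
    audio: engine.supportsAudio,
    // LoRA runtime adapters are not currently supported on web.
    lora: !kIsWeb,
  );
}
```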
Linux Runtime Prerequisites
Linux targets may need host runtime dependencies based on selected backends:
- `cpu`: no extra GPU runtime dependency.
- `vulkan`: Vulkan loader + a valid GPU driver/ICD.
- `blas`: OpenBLAS runtime (`libopenblas.so.0`).
- `cuda` (linux-x64): NVIDIA driver + compatible CUDA runtime libs.
- `hip` (linux-x64): ROCm runtime libs (for example `libhipblas.so.2`).
Example (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install -y libvulkan1 vulkan-tools libopenblas0
Example (Fedora/RHEL/CentOS):
sudo dnf install -y vulkan-loader vulkan-tools openblas
Example (Arch Linux):
sudo pacman -S --needed vulkan-icd-loader vulkan-tools openblas
Quick verification:
for f in .dart_tool/lib/libggml-*.so; do
LD_LIBRARY_PATH=.dart_tool/lib ldd "$f" | grep "not found" || true
done
Docker-based Linux link/runtime validation (power users and maintainers)
# 1) Prepare linux-x64 native modules in .dart_tool/lib
docker run --rm --platform linux/amd64 \
-v "$PWD:/workspace" \
-v "/absolute/path/to/model.gguf:/models/your.gguf:ro" \
-w /workspace/example/llamadart_cli \
ghcr.io/cirruslabs/flutter:stable \
bash -lc '
rm -rf .dart_tool /workspace/.dart_tool/lib &&
dart pub get &&
dart run bin/llamadart_cli.dart --model /models/your.gguf --no-interactive --predict 1 --gpu-layers 0
'
# 2) Baseline CPU/Vulkan/BLAS link-check
docker run --rm --platform linux/amd64 \
-v "$PWD:/workspace" \
-w /workspace/example/llamadart_cli \
ghcr.io/cirruslabs/flutter:stable \
bash -lc '
apt-get update &&
apt-get install -y --no-install-recommends libvulkan1 vulkan-tools libopenblas0 &&
/workspace/scripts/check_native_link_deps.sh .dart_tool/lib \
libggml-cpu.so libggml-vulkan.so libggml-blas.so
'
# Optional CUDA module link-check without GPU execution
docker build --platform linux/amd64 \
-f docker/validation/Dockerfile.cuda-linkcheck \
-t llamadart-linkcheck-cuda .
docker run --rm --platform linux/amd64 \
-v "$PWD:/workspace" \
-w /workspace/example/llamadart_cli \
llamadart-linkcheck-cuda \
bash -lc '
/workspace/scripts/check_native_link_deps.sh .dart_tool/lib \
libggml-cuda.so libggml-blas.so libggml-vulkan.so
'
# Optional HIP module link-check without GPU execution
docker build --platform linux/amd64 \
-f docker/validation/Dockerfile.hip-linkcheck \
-t llamadart-linkcheck-hip .
docker run --rm --platform linux/amd64 \
-v "$PWD:/workspace" \
-w /workspace/example/llamadart_cli \
llamadart-linkcheck-hip \
bash -lc '
export LD_LIBRARY_PATH=".dart_tool/lib:/opt/rocm/lib:/opt/rocm-6.3.0/lib:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH:-}" &&
/workspace/scripts/check_native_link_deps.sh .dart_tool/lib libggml-hip.so
'
Notes:
- Docker can validate module packaging and shared-library resolution.
- GPU execution still requires host device/runtime passthrough.
- CUDA validation requires NVIDIA runtime-enabled container execution.
- HIP validation requires ROCm passthrough.
Runtime Repositories (Maintainer Context)
llamadart has decoupled runtime ownership:
- Native source/build/release: `leehack/llamadart-native`
- Web bridge source/build: `leehack/llama-web-bridge`
- Web bridge runtime assets: `leehack/llama-web-bridge-assets`
- This repository consumes pinned published artifacts from those repositories.
Core abstractions in this package:
- `LlamaEngine`: orchestrates model lifecycle, generation, and templates.
- `ChatSession`: stateful helper for chat history and sliding-window context.
- `LlamaBackend`: platform-agnostic backend interface with native/web routing.
Breaking Changes in 0.6.x
If you are upgrading from 0.5.x, review the changes below before migrating.
High-impact changes:
- Removed legacy custom template-handler/override APIs from `ChatTemplateEngine`:
  - `registerHandler(...)`, `unregisterHandler(...)`, `clearCustomHandlers(...)`
  - `registerTemplateOverride(...)`, `unregisterTemplateOverride(...)`, `clearTemplateOverrides(...)`
- Removed legacy per-call handler routing: `customHandlerId` and parse `handlerId`.
- Render/parse paths no longer silently downgrade to content-only output when a handler/parser fails; failures are surfaced to the caller.
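For example, a caller that previously relied on the silent downgrade now has to handle the surfaced failure itself. A minimal sketch (the exact error type is not specified here, so a broad catch is used):

```dart
import 'package:llamadart/llamadart.dart';

Future<String?> renderPromptOrNull(
  LlamaEngine engine,
  List<LlamaChatMessage> messages,
) async {
  try {
    final result = await engine.chatTemplate(messages);
    return result.prompt;
  } catch (_) {
    // In 0.6.x a failing handler/parser surfaces here instead of silently
    // producing content-only output; decide how to recover explicitly.
    return null;
  }
}
```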
Usage
1. Simple Usage
The easiest way to get started is by using the default `LlamaBackend`.
import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects Native or Web backend
  final engine = LlamaEngine(LlamaBackend());
  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
2. Advanced Usage (ChatSession)
Use ChatSession for most chat applications. It automatically manages conversation history, system prompts, and handles context window limits.
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt
    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful assistant.',
    );

    // Send a message
    await for (final chunk in session.create(
      [LlamaTextContent('What is the capital of France?')],
    )) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
3. Tool Calling
llamadart supports intelligent tool calling where the model can use external functions to help it answer questions.
final tools = [
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
];

final session = ChatSession(engine);

// Pass tools per-request
await for (final chunk in session.create(
  [LlamaTextContent("how's the weather in London?")],
  tools: tools,
)) {
  final delta = chunk.choices.first.delta;
  if (delta.content != null) stdout.write(delta.content);
}
Notes:
- Built-in template handlers automatically select model-specific tool-call grammar and parser behavior; you usually do not need to set `GenerationParams.grammar` manually for normal tool use.
- Some handlers use lazy grammar activation (triggered when a tool-call prefix appears) to match llama.cpp behavior.
- If you implement a custom handler grammar, prefer Dart raw strings (`r'''...'''`) for GBNF blocks to avoid escaping bugs, as in the sketch below.
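For illustration, a tiny GBNF grammar kept in a Dart raw string; the grammar itself is only an example, and passing it through a `GenerationParams(grammar: ...)` constructor is an assumption based on the `GenerationParams.grammar` field mentioned above:

```dart
import 'package:llamadart/llamadart.dart';

// Raw string: backslashes and quotes inside the GBNF are not re-escaped by Dart.
const exampleGrammar = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9_ ]* "\""
ws     ::= [ \t\n]*
''';

// Assumed wiring; the exact GenerationParams constructor shape may differ.
final params = GenerationParams(grammar: exampleGrammar);
```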
3.5 Template Routing (Strict llama.cpp parity)
Template/render/parse routing is intentionally strict to match llama.cpp:
- Built-in format detection and built-in handlers are always used.
- `customTemplate` is supported per call.
- Legacy custom handler/override registry APIs were removed.
If you need deterministic template customization, use `customTemplate`, `chatTemplateKwargs`, and `templateNow`:
final result = await engine.chatTemplate(
  [
    const LlamaChatMessage.fromText(
      role: LlamaChatRole.user,
      text: 'hello',
    ),
  ],
  customTemplate: '{{ "CUSTOM:" ~ messages[0]["content"] }}',
  chatTemplateKwargs: {'my_flag': true, 'tenant': 'demo'},
  templateNow: DateTime.utc(2026, 1, 1),
);
print(result.prompt);
3.6 Logging Control
Use separate log levels for Dart and native output when debugging:
import 'package:llamadart/llamadart.dart';
final engine = LlamaEngine(LlamaBackend());
// Dart-side logs (template routing, parser diagnostics, etc.)
await engine.setDartLogLevel(LlamaLogLevel.info);
// Native llama.cpp / ggml logs
await engine.setNativeLogLevel(LlamaLogLevel.warn);
// Convenience: set both at once
await engine.setLogLevel(LlamaLogLevel.none);
4. Multimodal Usage (Vision/Audio)
llamadart supports multimodal models (vision and audio) using LlamaChatMessage.withContent.
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('vision-model.gguf');
    await engine.loadMultimodalProjector('mmproj.gguf');

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.withContent(
        role: LlamaChatRole.user,
        content: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use stateless engine.create for one-off multimodal requests
    final response = engine.create(messages);
    await for (final chunk in response) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
Web-specific note:
- Load model/mmproj with URL-based assets (`loadModelFromUrl` + a URL projector).
- For user-picked browser files, send media as bytes (`LlamaImageContent(bytes: ...)`, `LlamaAudioContent(bytes: ...)`) rather than local file paths.
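A minimal web-oriented sketch combining those two points; the URLs are placeholders and `loadModelFromUrl` accepting a plain URL string is an assumption:

```dart
import 'dart:typed_data';

import 'package:llamadart/llamadart.dart';

Future<void> describeImageOnWeb(
  LlamaEngine engine,
  Uint8List pickedImageBytes,
) async {
  // URL-based model and projector assets (placeholder URLs).
  await engine.loadModelFromUrl('https://example.com/models/vision.gguf');
  await engine.loadMultimodalProjector('https://example.com/models/mmproj.gguf');

  final messages = [
    LlamaChatMessage.withContent(
      role: LlamaChatRole.user,
      content: [
        // Bytes rather than a local file path for user-picked browser files.
        LlamaImageContent(bytes: pickedImageBytes),
        LlamaTextContent('Describe this image.'),
      ],
    ),
  ];

  await for (final chunk in engine.create(messages)) {
    print(chunk.choices.first.delta.content ?? '');
  }
}
```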
Model-Specific Notes
Moondream 2 & Phi-2
These models use a tokenizer configuration where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:
- Disables Auto-BOS: Prevents the model from stopping immediately upon generation.
- Manual Templates: Automatically applies the required `Question:` / `Answer:` format if the model metadata is missing a chat template.
- Stop Sequences: Injects `Question:` as a stop sequence to prevent rambling in multi-turn conversations.
Resource Management
Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.
- Explicit Disposal: Always call `await engine.dispose()` when you are finished with an engine instance.
- Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
- Hot Restart Support: In Flutter, placing the engine inside a `Provider` or `State` and calling `dispose()` in the appropriate lifecycle method ensures stability across hot restarts.
@override
void dispose() {
  _engine.dispose();
  super.dispose();
}
Low-Rank Adaptation (LoRA)
llamadart supports applying multiple LoRA adapters dynamically at runtime.
- Dynamic Scaling: Adjust the strength (`scale`) of each adapter on the fly.
- Isolate-Safe: Native adapters are managed in a background isolate to prevent UI jank.
- Efficient: Multiple LoRAs share the memory of a single base model.
Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
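The adapter API surface is not shown in this README, so the method names in the sketch below (`addLoraAdapter`, `setLoraScale`) are hypothetical placeholders that only illustrate the intended runtime flow; check the API reference for the actual names and signatures.

```dart
import 'package:llamadart/llamadart.dart';

Future<void> applyAdapters(LlamaEngine engine) async {
  // Hypothetical method names, used for illustration only.
  // Multiple adapters share the memory of the single loaded base model.
  await engine.addLoraAdapter('adapters/style.gguf', scale: 0.8);
  await engine.addLoraAdapter('adapters/domain.gguf', scale: 0.5);

  // Dynamic scaling: adjust an adapter's strength on the fly.
  await engine.setLoraScale('adapters/style.gguf', 0.3);
}
```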
Testing & Quality
This project maintains a high standard of quality with >=70% line coverage on maintainable `lib/` code (auto-generated files marked with `// coverage:ignore-file` are excluded).
- Multi-Platform Testing: `dart test` runs VM and Chrome-compatible suites automatically.
- Local-Only Scenarios: Slow E2E tests are tagged `local-only` and skipped by default.
- CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.
# Run default test suite (VM + Chrome-compatible tests)
dart test
# Run local-only E2E scenarios
dart test --run-skipped -t local-only
# Run VM tests with coverage
dart test -p vm --coverage=coverage
# Format lcov for maintainable code (respects // coverage:ignore-file)
dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore
# Enforce >=70% threshold
dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.
License
This project is licensed under the MIT License; see the LICENSE file for details.