llamadart


llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It lets you run GGUF LLMs locally across native platforms and web (CPU/WebGPU bridge path).

✨ Features

  • 🚀 High Performance: Powered by llama.cpp kernels.
  • 🛠️ Zero Configuration: Uses Pure Native Assets; no manual CMake or platform project edits.
  • 📱 Cross-Platform: Android, iOS, macOS, Linux, Windows, and web.
  • ⚡ GPU Acceleration:
    • Apple: Metal
    • Android/Linux/Windows: Vulkan by default, with optional target-specific modules
    • Web: WebGPU via bridge runtime (with CPU fallback)
  • 🖼️ Multimodal Support: Vision/audio model runtime support.
  • LoRA Support: Runtime GGUF adapter application.
  • 🔇 Split Logging Control: Dart logs and native logs can be configured independently.

🚀 Start Here (Plugin Users)

1. Add dependency

dependencies:
  llamadart: ^0.6.1

2. Run with defaults

On first dart run / flutter run, llamadart will:

  1. Detect platform/architecture.
  2. Download the matching native runtime bundle from leehack/llamadart-native.
  3. Wire it into your app via native assets.

No manual binary download or C++ build steps are required.

3. Optional: choose backend modules per target (non-Apple)

hooks:
  user_defines:
    llamadart:
      llamadart_native_backends:
        platforms:
          android-arm64: [vulkan] # opencl is opt-in
          linux-x64: [vulkan, cuda]
          windows-x64: [vulkan, cuda]

If a requested module is unavailable for a target, llamadart logs a warning and falls back to target defaults.

4. Minimal first model load

import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('path/to/model.gguf');
    await for (final token in engine.generate('Hello')) {
      print(token);
    }
  } finally {
    await engine.dispose();
  }
}

✅ Platform Defaults and Configurability

| Target | Default runtime backends | Configurable in pubspec.yaml |
|--------|--------------------------|------------------------------|
| android-arm64 / android-x64 | cpu, vulkan | yes |
| linux-arm64 / linux-x64 | cpu, vulkan | yes |
| windows-arm64 / windows-x64 | cpu, vulkan | yes |
| macos-arm64 / macos-x86_64 | cpu, metal | no |
| ios-arm64 / ios simulators | cpu, metal | no |
| web | webgpu, cpu (bridge router) | n/a |

Full module matrix (available modules by target)

Backend module matrix from pinned native tag b8099:

| Target | Available backend modules in bundle |
|--------|-------------------------------------|
| android-arm64 | cpu, vulkan, opencl |
| android-x64 | cpu, vulkan, opencl |
| linux-arm64 | cpu, vulkan, blas |
| linux-x64 | cpu, vulkan, blas, cuda, hip |
| windows-arm64 | cpu, vulkan, blas |
| windows-x64 | cpu, vulkan, blas, cuda |
| macos-arm64 | n/a (single consolidated native lib) |
| macos-x86_64 | n/a (single consolidated native lib) |
| ios-arm64 | n/a (single consolidated native lib) |
| ios-arm64-sim | n/a (single consolidated native lib) |
| ios-x86_64-sim | n/a (single consolidated native lib) |

Recognized backend names for llamadart_native_backends:

  • vulkan
  • cpu
  • opencl
  • cuda
  • blas
  • metal
  • hip

Accepted aliases:

  • vk -> vulkan
  • ocl -> opencl
  • open-cl -> opencl
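
For example, an alias resolves inside the same hook config shown earlier; a sketch (adjust targets and modules to your needs):

hooks:
  user_defines:
    llamadart:
      llamadart_native_backends:
        platforms:
          linux-x64: [vk, blas] # vk resolves to vulkan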

Notes:

  • Module availability depends on the pinned native release bundle and may change when the native tag updates.
  • Configurable targets always keep cpu bundled as a fallback.
  • Android keeps OpenCL available for opt-in, but defaults to Vulkan.
  • KleidiAI and ZenDNN are CPU-path optimizations in llama.cpp, not standalone backend module files.
  • example/chat_app backend settings show runtime-detected backends/devices (what initialized), not only bundled module files.
  • example/chat_app no longer exposes an Auto selector; it lists concrete detected backends.
  • Legacy saved Auto preferences in example/chat_app are auto-migrated at runtime.
  • Apple targets are intentionally non-configurable in this hook path and use consolidated native libraries.
  • The native-assets hook refreshes emitted files each build; if you are upgrading from older cached outputs, run flutter clean once.

If you change llamadart_native_backends, run flutter clean once so stale native-asset outputs do not override new bundle selection.


๐ŸŒ Web Backend Notes (Router)

The default web backend uses WebGpuLlamaBackend as a router for WebGPU and CPU paths.

  • Web mode is currently experimental and depends on an external JS bridge runtime.
  • Bridge API contract: WebGPU bridge contract.
  • Runtime assets are published via:
  • example/chat_app prefers local bridge assets, then falls back to jsDelivr.
  • Browser Cache Storage is used for repeated model loads when useCache is enabled (default).
  • loadMultimodalProjector is supported on web for URL-based model/mmproj assets.
  • supportsVision and supportsAudio reflect loaded projector capabilities.
  • LoRA runtime adapters are not currently supported on web.
  • setLogLevel / setNativeLogLevel changes take effect on next model load.

If your app targets both native and web, gate feature toggles by capability checks.
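
A minimal capability-gating sketch, assuming supportsVision / supportsAudio are engine-level getters as described above (the URLs are placeholders):

import 'package:llamadart/llamadart.dart';

Future<void> setUpWebEngine() async {
  final engine = LlamaEngine(LlamaBackend());
  // On web, model and projector assets are URL-based.
  await engine.loadModelFromUrl('https://example.com/model.gguf');
  await engine.loadMultimodalProjector('https://example.com/mmproj.gguf');

  // Gate UI features on detected capabilities, not platform checks.
  if (engine.supportsVision) {
    // Enable image attachments.
  }
  if (engine.supportsAudio) {
    // Enable audio attachments.
  }
}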


๐Ÿง Linux Runtime Prerequisites

Linux targets may need host runtime dependencies based on selected backends:

  • cpu: no extra GPU runtime dependency.
  • vulkan: Vulkan loader + valid GPU driver/ICD.
  • blas: OpenBLAS runtime (libopenblas.so.0).
  • cuda (linux-x64): NVIDIA driver + compatible CUDA runtime libs.
  • hip (linux-x64): ROCm runtime libs (for example libhipblas.so.2).

Example (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install -y libvulkan1 vulkan-tools libopenblas0

Example (Fedora/RHEL/CentOS):

sudo dnf install -y vulkan-loader vulkan-tools openblas

Example (Arch Linux):

sudo pacman -S --needed vulkan-icd-loader vulkan-tools openblas

Quick verification:

for f in .dart_tool/lib/libggml-*.so; do
  LD_LIBRARY_PATH=.dart_tool/lib ldd "$f" | grep "not found" || true
done

Docker-based Linux link/runtime validation (power users and maintainers)

# 1) Prepare linux-x64 native modules in .dart_tool/lib
docker run --rm --platform linux/amd64 \
  -v "$PWD:/workspace" \
  -v "/absolute/path/to/model.gguf:/models/your.gguf:ro" \
  -w /workspace/example/llamadart_cli \
  ghcr.io/cirruslabs/flutter:stable \
  bash -lc '
    rm -rf .dart_tool /workspace/.dart_tool/lib &&
    dart pub get &&
    dart run bin/llamadart_cli.dart --model /models/your.gguf --no-interactive --predict 1 --gpu-layers 0
  '

# 2) Baseline CPU/Vulkan/BLAS link-check
docker run --rm --platform linux/amd64 \
  -v "$PWD:/workspace" \
  -w /workspace/example/llamadart_cli \
  ghcr.io/cirruslabs/flutter:stable \
  bash -lc '
    apt-get update &&
    apt-get install -y --no-install-recommends libvulkan1 vulkan-tools libopenblas0 &&
    /workspace/scripts/check_native_link_deps.sh .dart_tool/lib \
      libggml-cpu.so libggml-vulkan.so libggml-blas.so
  '

# Optional CUDA module link-check without GPU execution
docker build --platform linux/amd64 \
  -f docker/validation/Dockerfile.cuda-linkcheck \
  -t llamadart-linkcheck-cuda .

docker run --rm --platform linux/amd64 \
  -v "$PWD:/workspace" \
  -w /workspace/example/llamadart_cli \
  llamadart-linkcheck-cuda \
  bash -lc '
    /workspace/scripts/check_native_link_deps.sh .dart_tool/lib \
      libggml-cuda.so libggml-blas.so libggml-vulkan.so
  '

# Optional HIP module link-check without GPU execution
docker build --platform linux/amd64 \
  -f docker/validation/Dockerfile.hip-linkcheck \
  -t llamadart-linkcheck-hip .

docker run --rm --platform linux/amd64 \
  -v "$PWD:/workspace" \
  -w /workspace/example/llamadart_cli \
  llamadart-linkcheck-hip \
  bash -lc '
    export LD_LIBRARY_PATH=".dart_tool/lib:/opt/rocm/lib:/opt/rocm-6.3.0/lib:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH:-}" &&
    /workspace/scripts/check_native_link_deps.sh .dart_tool/lib libggml-hip.so
  '

Notes:

  • Docker can validate module packaging and shared-library resolution.
  • GPU execution still requires host device/runtime passthrough.
  • CUDA validation requires NVIDIA runtime-enabled container execution.
  • HIP validation requires ROCm passthrough.

๐Ÿ—๏ธ Runtime Repositories (Maintainer Context)

llamadart has decoupled runtime ownership: native runtime binaries are built and released separately in leehack/llamadart-native, while this package owns the Dart API.

Core abstractions in this package:

  • LlamaEngine: orchestrates model lifecycle, generation, and templates.
  • ChatSession: stateful helper for chat history and sliding-window context.
  • LlamaBackend: platform-agnostic backend interface with native/web routing.

โš ๏ธ Breaking Changes in 0.6.x

If you are upgrading from 0.5.x, review the changes below.

High-impact changes:

  • Removed legacy custom template-handler/override APIs from ChatTemplateEngine:
    • registerHandler(...), unregisterHandler(...), clearCustomHandlers(...)
    • registerTemplateOverride(...), unregisterTemplateOverride(...), clearTemplateOverrides(...)
  • Removed legacy per-call handler routing:
    • customHandlerId and parse handlerId
  • Render/parse paths no longer silently downgrade to content-only output when a handler/parser fails; failures are surfaced to the caller.
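
If you previously registered template overrides, the per-call customTemplate path (section 3.5) is the replacement; a minimal migration sketch:

// 0.5.x (removed):
// templateEngine.registerTemplateOverride(...);

// 0.6.x: pass the template per call instead.
final result = await engine.chatTemplate(
  [
    const LlamaChatMessage.fromText(
      role: LlamaChatRole.user,
      text: 'hello',
    ),
  ],
  customTemplate: '{{ messages[0]["content"] }}',
);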

๐Ÿ› ๏ธ Usage

1. Simple Usage

The easiest way to get started is by using the default LlamaBackend.

import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects Native or Web backend
  final engine = LlamaEngine(LlamaBackend());

  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}

2. Advanced Usage (ChatSession)

Use ChatSession for most chat applications. It automatically manages conversation history, system prompts, and handles context window limits.

import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());

  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt
    final session = ChatSession(
      engine, 
      systemPrompt: 'You are a helpful assistant.',
    );

    // Send a message
    await for (final chunk in session.create([LlamaTextContent('What is the capital of France?')])) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}

3. Tool Calling

llamadart supports intelligent tool calling where the model can use external functions to help it answer questions.

final tools = [
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
];

final session = ChatSession(engine);

// Pass tools per-request
await for (final chunk in session.create(
  [LlamaTextContent("how's the weather in London?")],
  tools: tools,
)) {
  final delta = chunk.choices.first.delta;
  if (delta.content != null) stdout.write(delta.content);
}

Notes:

  • Built-in template handlers automatically select model-specific tool-call grammar and parser behavior; you usually do not need to set GenerationParams.grammar manually for normal tool use.
  • Some handlers use lazy grammar activation (triggered when a tool-call prefix appears) to match llama.cpp behavior.
  • If you implement a custom handler grammar, prefer Dart raw strings (r'''...''') for GBNF blocks to avoid escaping bugs. A minimal sketch follows.
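
A minimal raw-string sketch (the grammar is illustrative, not a real tool-call grammar):

// Raw strings keep GBNF escapes intact; no Dart-level escaping needed.
final grammar = r'''
root ::= answer
answer ::= "yes" | "no"
''';
// Pass it via GenerationParams.grammar if you need manual control.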

3.5 Template Routing (Strict llama.cpp parity)

Template/render/parse routing is intentionally strict to match llama.cpp:

  • Built-in format detection and built-in handlers are always used.
  • customTemplate is supported per call.
  • Legacy custom handler/override registry APIs were removed.

If you need deterministic template customization, use customTemplate, chatTemplateKwargs, and templateNow:

final result = await engine.chatTemplate(
  [
    const LlamaChatMessage.fromText(
      role: LlamaChatRole.user,
      text: 'hello',
    ),
  ],
  customTemplate: '{{ "CUSTOM:" ~ messages[0]["content"] }}',
  chatTemplateKwargs: {'my_flag': true, 'tenant': 'demo'},
  templateNow: DateTime.utc(2026, 1, 1),
);

print(result.prompt);

3.6 Logging Control

Use separate log levels for Dart and native output when debugging:

import 'package:llamadart/llamadart.dart';

final engine = LlamaEngine(LlamaBackend());

// Dart-side logs (template routing, parser diagnostics, etc.)
await engine.setDartLogLevel(LlamaLogLevel.info);

// Native llama.cpp / ggml logs
await engine.setNativeLogLevel(LlamaLogLevel.warn);

// Convenience: set both at once
await engine.setLogLevel(LlamaLogLevel.none);

4. Multimodal Usage (Vision/Audio)

llamadart supports multimodal models (vision and audio) using LlamaChatMessage.withContent.

import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  
  try {
    await engine.loadModel('vision-model.gguf');
    await engine.loadMultimodalProjector('mmproj.gguf');

    final session = ChatSession(engine);

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.withContent(
        role: LlamaChatRole.user,
        content: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use stateless engine.create for one-off multimodal requests
    final response = engine.create(messages);
    await for (final chunk in response) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}

Web-specific note:

  • Load model/mmproj with URL-based assets (loadModelFromUrl + URL projector).
  • For user-picked browser files, send media as bytes (LlamaImageContent(bytes: ...), LlamaAudioContent(bytes: ...)) rather than local file paths.
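
A minimal browser-file sketch, assuming image bytes already obtained from a file picker (the picker wiring is out of scope here):

import 'dart:typed_data';

import 'package:llamadart/llamadart.dart';

Future<void> askAboutImage(LlamaEngine engine, Uint8List imageBytes) async {
  final messages = [
    LlamaChatMessage.withContent(
      role: LlamaChatRole.user,
      content: [
        // Bytes-based media avoids local file paths, which do not
        // exist for user-picked browser files.
        LlamaImageContent(bytes: imageBytes),
        LlamaTextContent('What is in this image?'),
      ],
    ),
  ];
  await for (final chunk in engine.create(messages)) {
    print(chunk.choices.first.delta.content ?? '');
  }
}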

💡 Model-Specific Notes

Moondream 2 & Phi-2

These models use a unique architecture where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

  • Disables Auto-BOS: Prevents generation from stopping immediately (an auto-inserted BOS token would be read as EOS).
  • Manual Templates: Automatically applies the required Question: / Answer: format if the model metadata is missing a chat template.
  • Stop Sequences: Injects Question: as a stop sequence to prevent rambling in multi-turn conversations.

🧹 Resource Management

Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.

  • Explicit Disposal: Always call await engine.dispose() when you are finished with an engine instance.
  • Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
  • Hot Restart Support: In Flutter, place the engine inside a Provider or State and call dispose() in the matching lifecycle method to ensure stability across Hot Restarts:

@override
void dispose() {
  // Release native model memory and stop background worker isolates.
  _engine.dispose();
  super.dispose();
}

🎨 Low-Rank Adaptation (LoRA)

llamadart supports applying multiple LoRA adapters dynamically at runtime.

  • Dynamic Scaling: Adjust the strength (scale) of each adapter on the fly.
  • Isolate-Safe: Native adapters are managed in a background Isolate to prevent UI jank.
  • Efficient: Multiple LoRAs share the memory of a single base model.

Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
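
A minimal runtime sketch of the flow; note that addLoraAdapter and setLoraScale below are hypothetical names used for illustration only, so consult the package API reference for the real methods:

final engine = LlamaEngine(LlamaBackend());
await engine.loadModel('base-model.gguf');

// Hypothetical calls: apply a GGUF adapter, then adjust its strength
// on the fly. Multiple adapters share the base model's memory.
final adapter = await engine.addLoraAdapter('style-adapter.gguf');
await engine.setLoraScale(adapter, 0.8);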


🧪 Testing & Quality

This project maintains a high standard of quality with >=70% line coverage on maintainable lib/ code (auto-generated files marked with // coverage:ignore-file are excluded).

  • Multi-Platform Testing: dart test runs VM and Chrome-compatible suites automatically.
  • Local-Only Scenarios: Slow E2E tests are tagged local-only and skipped by default.
  • CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.

# Run default test suite (VM + Chrome-compatible tests)
dart test

# Run local-only E2E scenarios
dart test --run-skipped -t local-only

# Run VM tests with coverage
dart test -p vm --coverage=coverage

# Format lcov for maintainable code (respects // coverage:ignore-file)
dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore

# Enforce >=70% threshold
dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70

๐Ÿค Contributing

Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
