llamadart 0.4.0

A Dart/Flutter plugin for llama.cpp - run LLM inference on any platform using GGUF models

llamadart #


llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.

✨ Features #

  • πŸš€ High Performance: Powered by llama.cpp's optimized C++ kernels.
  • πŸ› οΈ Zero Configuration: Uses the modern Pure Native Asset mechanismβ€”no manual build scripts or platform folders required.
  • πŸ“± Cross-Platform: Full support for Android, iOS, macOS, Linux, and Windows.
  • ⚑ GPU Acceleration:
    • Apple: Metal (macOS/iOS)
    • Android/Linux/Windows: Vulkan
  • πŸ–ΌοΈ Multimodal Support: Run vision and audio models (LLaVA, Gemma 3, Qwen2-VL) with integrated media processing.
  • ⏬ Resumable Downloads: Robust background-safe model downloads with parallel chunking and persistence using .meta tracking.
  • LoRA Support: Apply fine-tuned adapters (GGUF) dynamically at runtime.
  • 🌐 Web Support: Run inference in the browser via WASM (powered by wllama v2).
  • πŸ’Ž Dart-First API: Streamlined architecture with decoupled backends.
  • πŸ”‡ Logging Control: Toggle native engine output or use granular filtering on Web.
  • πŸ§ͺ High Coverage: Robust test suite with 80%+ global core coverage.

πŸ—οΈ Architecture #

llamadart 0.3.0+ uses a modern, decoupled architecture designed for flexibility and platform independence:

  • LlamaEngine: The primary high-level orchestrator. It handles model lifecycle, tokenization, chat templating, and manages the inference stream.
  • ChatSession: A stateful wrapper for LlamaEngine that automatically manages conversation history, system prompts, and enforces context window limits (sliding window).
  • LlamaBackend: A platform-agnostic interface that allows swapping implementation details:
    • NativeLlamaBackend: Uses Dart FFI and background Isolates for high-performance desktop/mobile inference.
    • WebLlamaBackend: Uses WebAssembly and the wllama JS library for in-browser inference.
  • LlamaBackendFactory: Automatically selects the appropriate backend for your current platform.
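
A minimal sketch of how these pieces compose, using only the constructors shown in the usage examples below (exactly how LlamaBackend() delegates to LlamaBackendFactory is an internal detail, so treat the comments as assumptions):

import 'package:llamadart/llamadart.dart';

void main() async {
  // LlamaBackend() resolves to NativeLlamaBackend or WebLlamaBackend
  // (presumably via LlamaBackendFactory) depending on the compile target.
  final engine = LlamaEngine(LlamaBackend());

  await engine.loadModel('path/to/model.gguf');
  // ... generate, chat, etc. ...
  await engine.dispose();
}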

πŸš€ Quick Start #

| Platform | Architecture(s) | GPU Backend | Status |
| --- | --- | --- | --- |
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (Device), x86_64 (Sim) | Metal (Device), CPU (Sim) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM | CPU | ✅ Tested |

πŸ“¦ Installation #

Add llamadart to your pubspec.yaml:

dependencies:
  llamadart: ^0.4.0

Zero Setup (Native Assets) #

llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (dart run or flutter run), the package automatically:

  1. Detects your target platform and architecture.
  2. Downloads the appropriate pre-compiled binary from GitHub.
  3. Bundles it seamlessly into your application.

No manual binary downloads, CMake configuration, or platform-specific project changes are needed.


πŸ› οΈ Usage #

1. Simple Usage #

The easiest way to get started is by using the default LlamaBackend.

import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects Native or Web backend
  final engine = LlamaEngine(LlamaBackend());

  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
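
If you want the complete response as a single string instead of handling tokens one by one, you can collect the stream yourself. A small sketch built only on the generate API above (Stream.join is plain Dart):

// Concatenate all streamed tokens into one string.
final response = await engine.generate('The capital of France is').join();
print(response);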

2. Advanced Usage (ChatSession) #

Use ChatSession for most chat applications. It automatically manages conversation history and system prompts, and enforces context window limits.

import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());

  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt and optional tools
    final session = ChatSession(
      engine, 
      systemPrompt: 'You are a helpful assistant.',
      toolRegistry: myToolRegistry, // Optional
    );

    // Just send user text; history and tools are handled automatically
    // The model decides when to use tools or respond directly.
    await for (final token in session.chat('What is the capital of France?')) {
      stdout.write(token);
    }
  } finally {
    await engine.dispose();
  }
}
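
Because the session stores history, a follow-up message can refer back to earlier turns. A sketch continuing the session from the example above:

// Pronouns resolve against the conversation history kept by the session.
await for (final token in session.chat('And what is its population?')) {
  stdout.write(token);
}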

3. Tool Calling #

llamadart supports tool calling: the model can decide to invoke external functions you register and use their results to answer questions.

final registry = ToolRegistry([
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22Β°C and sunny in $location';
    },
  ),
]);

final session = ChatSession(engine, toolRegistry: registry);

// "how's the weather in London?" -> Calls get_weather -> "It is 22Β°C and sunny in London"
await for (final token in session.chat("how's the weather in London?")) {
  stdout.write(token);
}
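
Handlers receive typed parameters and can do arbitrary async work. A sketch of a tool with several parameters, using only the ToolParam.string constructor shown above (other parameter types may exist; check the API reference):

final currencyTool = ToolDefinition(
  name: 'convert_currency',
  description: 'Convert an amount between two currencies',
  parameters: [
    ToolParam.string('amount', description: 'Amount to convert', required: true),
    ToolParam.string('from', description: 'Source currency code', required: true),
    ToolParam.string('to', description: 'Target currency code', required: true),
  ],
  handler: (params) async {
    final amount = params.getRequiredString('amount');
    final from = params.getRequiredString('from');
    final to = params.getRequiredString('to');
    // A real handler would call an exchange-rate service here.
    return 'Converting $amount $from to $to is left to a real implementation';
  },
);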

4. Multimodal Usage (Vision/Audio) #

llamadart supports multimodal models (vision and audio) using LlamaChatMessage.multimodal.

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  
  try {
    await engine.loadModel('vision-model.gguf');
    await engine.loadMultimodalProjector('mmproj.gguf');

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.multimodal(
        role: LlamaChatRole.user,
        parts: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use singleTurn for one-off multimodal requests
    final response = await ChatSession.singleTurn(engine, messages);
    print(response);
  } finally {
    await engine.dispose();
  }
}

πŸ’‘ Model-Specific Notes #

Moondream 2 & Phi-2

These models use an unusual configuration in which the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

  • Disables Auto-BOS: Skips auto-inserting the shared BOS/EOS token at the start of the prompt, which would otherwise cause generation to stop immediately.
  • Manual Templates: Automatically applies the required Question: / Answer: format if the model metadata is missing a chat template.
  • Stop Sequences: Injects Question: as a stop sequence to prevent rambling in multi-turn conversations.

🧹 Resource Management #

Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.

  • Explicit Disposal: Always call await engine.dispose() when you are finished with an engine instance.
  • Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
  • Hot Restart Support: In Flutter, placing the engine inside a Provider or State and calling dispose() in the appropriate lifecycle method ensures stability across Hot Restarts, for example:

// Inside a StatefulWidget's State:
@override
void dispose() {
  // Fire-and-forget: State.dispose() cannot be async.
  _engine.dispose();
  super.dispose();
}

🎨 Low-Rank Adaptation (LoRA) #

llamadart supports applying multiple LoRA adapters dynamically at runtime.

  • Dynamic Scaling: Adjust the strength (scale) of each adapter on the fly.
  • Isolate-Safe: Native adapters are managed in a background Isolate to prevent UI jank.
  • Efficient: Multiple LoRAs share the memory of a single base model.
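
The adapter API itself is not shown in this README, so the following is only a hypothetical sketch: loadLoraAdapter and its scale parameter are illustrative names, not confirmed llamadart API (see the API reference for the real method signatures):

final engine = LlamaEngine(LlamaBackend());
await engine.loadModel('base-model.gguf');

// HYPOTHETICAL: method and parameter names are illustrative guesses.
// Apply a GGUF LoRA adapter at reduced strength.
await engine.loadLoraAdapter('style-adapter.gguf', scale: 0.8);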

Check out our LoRA Training Notebook to learn how to train and convert your own adapters.


πŸ§ͺ Testing & Quality #

This project maintains a high standard of quality with 80%+ global test coverage.

  • Multi-Platform Testing: Run all tests across VM and Chrome automatically.
  • CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.
# Run all tests (VM and Chrome)
dart test

# Run tests with coverage
dart test --coverage=coverage

🀝 Contributing #

Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.

πŸ“œ License #

This project is licensed under the MIT License - see the LICENSE file for details.
