# llamadart
llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.
## Features
- **High Performance**: Powered by llama.cpp's optimized C++ kernels.
- **Zero Configuration**: Uses the modern Pure Native Asset mechanism; no manual build scripts or platform folders required.
- **Cross-Platform**: Full support for Android, iOS, macOS, Linux, and Windows.
- **GPU Acceleration**:
  - Apple: Metal (macOS/iOS)
  - Android/Linux/Windows: Vulkan
- **Multimodal Support**: Run vision and audio models (LLaVA, Gemma 3, Qwen2-VL) with integrated media processing.
- **Resumable Downloads**: Robust, background-safe model downloads with parallel chunking and `.meta`-based persistence tracking.
- **LoRA Support**: Apply fine-tuned adapters (GGUF) dynamically at runtime.
- **Web Support**: Run inference in the browser via WASM (powered by `wllama` v2).
- **Dart-First API**: Streamlined architecture with decoupled backends.
- **Logging Control**: Toggle native engine output or use granular filtering on Web.
- **High Coverage**: Robust test suite with 80%+ global core coverage.
## Architecture
llamadart 0.3.0+ uses a modern, decoupled architecture designed for flexibility and platform independence:
- **LlamaEngine**: The primary high-level orchestrator. It handles the model lifecycle, tokenization, and chat templating, and manages the inference stream.
- **ChatSession**: A stateful wrapper for `LlamaEngine` that automatically manages conversation history and system prompts and enforces context-window limits (sliding window).
- **LlamaBackend**: A platform-agnostic interface that allows swapping implementation details:
  - `NativeLlamaBackend`: Uses Dart FFI and background Isolates for high-performance desktop/mobile inference.
  - `WebLlamaBackend`: Uses WebAssembly and the `wllama` JS library for in-browser inference.
- **LlamaBackendFactory**: Automatically selects the appropriate backend for the current platform (see the sketch below).
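The `LlamaBackend()` calls in the snippets below rely on this factory behavior. If you ever need to pin a backend explicitly, a minimal sketch (whether `NativeLlamaBackend` and `WebLlamaBackend` expose public default constructors is an assumption, not confirmed by this README):

```dart
import 'package:llamadart/llamadart.dart';

// Default: LlamaBackend() resolves to the right backend for this platform
// (NativeLlamaBackend on desktop/mobile, WebLlamaBackend on the web).
final engine = LlamaEngine(LlamaBackend());

// Explicit selection (assumed constructors; check the API docs):
// final native = LlamaEngine(NativeLlamaBackend()); // Dart FFI + Isolates
// final web = LlamaEngine(WebLlamaBackend());       // WASM via wllama
```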
## Quick Start

llamadart runs on all major platforms:

| Platform | Architecture(s) | GPU Backend | Status |
|---|---|---|---|
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (device), x86_64 (simulator) | Metal (device), CPU (simulator) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM | CPU | ✅ Tested |
## Installation
Add llamadart to your `pubspec.yaml`:

```yaml
dependencies:
  llamadart: ^0.4.0
```
### Zero Setup (Native Assets)
llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (`dart run` or `flutter run`), the package automatically:

- Detects your target platform and architecture.
- Downloads the appropriate pre-compiled binary from GitHub.
- Bundles it seamlessly into your application.
No manual binary downloads, CMake configuration, or platform-specific project changes are needed.
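In practice a plain run is all it takes. Note that on some Flutter SDK channels Native Assets may still sit behind a feature flag; whether you need it depends on your SDK version:

```sh
# First run: the build hook detects the target, fetches the matching
# prebuilt binary from GitHub, and bundles it automatically.
flutter run

# Only if your SDK still gates Native Assets behind a flag (SDK-dependent):
flutter config --enable-native-assets
```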
## Usage
### 1. Simple Usage
The easiest way to get started is by using the default `LlamaBackend`.
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects the Native or Web backend
  final engine = LlamaEngine(LlamaBackend());
  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
```
### 2. Advanced Usage (ChatSession)
Use `ChatSession` for most chat applications. It automatically manages conversation history and system prompts, and handles context-window limits.
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt and optional tools
    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful assistant.',
      toolRegistry: myToolRegistry, // Optional
    );

    // Just send user text; history and tools are handled automatically.
    // The model decides when to use tools or respond directly.
    await for (final token in session.chat('What is the capital of France?')) {
      stdout.write(token);
    }
  } finally {
    await engine.dispose();
  }
}
```
### 3. Tool Calling
llamadart supports tool calling, where the model can invoke registered external functions to answer questions.
```dart
final registry = ToolRegistry([
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
]);

final session = ChatSession(engine, toolRegistry: registry);

// "how's the weather in London?" -> calls get_weather -> "It is 22°C and sunny in London"
await for (final token in session.chat("how's the weather in London?")) {
  stdout.write(token);
}
```
### 4. Multimodal Usage (Vision/Audio)
llamadart supports multimodal models (vision and audio) via `LlamaChatMessage.multimodal`.
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  try {
    await engine.loadModel('vision-model.gguf');
    // Load the multimodal projector that pairs with the vision model
    await engine.loadMultimodalProjector('mmproj.gguf');

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.multimodal(
        role: LlamaChatRole.user,
        parts: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use singleTurn for one-off multimodal requests
    final response = await ChatSession.singleTurn(engine, messages);
    print(response);
  } finally {
    await engine.dispose();
  }
}
```
## Model-Specific Notes
### Moondream 2 & Phi-2

These models use a tokenizer configuration where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

- **Disables Auto-BOS**: Prevents the model from stopping immediately upon generation.
- **Manual Templates**: Automatically applies the required `Question:` / `Answer:` format if the model metadata lacks a chat template.
- **Stop Sequences**: Injects `Question:` as a stop sequence to prevent rambling in multi-turn conversations.
## Resource Management
Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.
- **Explicit Disposal**: Always call `await engine.dispose()` when you are finished with an engine instance.
- **Native Stability**: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
- **Hot Restart Support**: In Flutter, placing the engine inside a `Provider` or `State` and calling `dispose()` in the appropriate lifecycle method ensures stability across Hot Restarts:
```dart
@override
void dispose() {
  _engine.dispose();
  super.dispose();
}
```
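For instance, a minimal `StatefulWidget` sketch (the widget scaffolding here is illustrative; only `LlamaEngine`, `LlamaBackend`, and `dispose()` come from this README):

```dart
import 'package:flutter/widgets.dart';
import 'package:llamadart/llamadart.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  // The engine is owned by this State and released in dispose() below.
  final _engine = LlamaEngine(LlamaBackend());

  @override
  void dispose() {
    // Releases native memory and background Isolates, including on Hot Restart.
    _engine.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    // Placeholder UI; a real screen would stream tokens from _engine here.
    return const SizedBox.shrink();
  }
}
```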
## Low-Rank Adaptation (LoRA)
llamadart supports applying multiple LoRA adapters dynamically at runtime.
- **Dynamic Scaling**: Adjust the strength (`scale`) of each adapter on the fly.
- **Isolate-Safe**: Native adapters are managed in a background Isolate to prevent UI jank.
- **Efficient**: Multiple LoRAs share the memory of a single base model.
Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
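The adapter API itself isn't shown in this README; as a rough sketch only (the `loadLoraAdapter` and `setLoraScale` names below are illustrative assumptions, not confirmed llamadart API):

```dart
// Sketch only: loadLoraAdapter and setLoraScale are assumed names,
// not confirmed llamadart API. Check the package docs for the real calls.
final engine = LlamaEngine(LlamaBackend());
await engine.loadModel('base-model.gguf');

// Apply a GGUF adapter on top of the shared base model (assumed call).
final adapter = await engine.loadLoraAdapter('adapter.gguf', scale: 1.0);

// Dynamic scaling: tune the adapter's strength at runtime (assumed call).
await engine.setLoraScale(adapter, 0.5);
```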
## Testing & Quality
This project maintains a high standard of quality with 80%+ global test coverage.
- **Multi-Platform Testing**: All tests run across the Dart VM and Chrome automatically.
- **CI/CD**: Automatic analysis, linting, and cross-platform test execution on every PR.
```sh
# Run all tests (VM and Chrome)
dart test

# Run tests with coverage
dart test --coverage=coverage
```
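To turn the raw coverage output into an LCOV report for standard tooling, the usual `package:coverage` workflow applies (this is generic Dart tooling, not something specific to llamadart):

```sh
# Convert raw coverage data into an LCOV report (standard package:coverage tooling)
dart pub global activate coverage
dart pub global run coverage:format_coverage \
  --lcov --in=coverage --report-on=lib -o coverage/lcov.info
```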
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.
## License
This project is licensed under the MIT License - see the LICENSE file for details.