dart_llama 0.2.0
A Dart package for interfacing with llama.cpp models using FFI
dart_llama #
A Dart package that provides FFI bindings to llama.cpp for running LLaMA models locally.
Features #
- FFI-based bindings - Direct integration with llama.cpp for maximum performance
- Low-level API - Full control over text generation without opinionated abstractions
- Streaming support - Real-time token generation with proper stop sequence handling
- Stop sequences - Configurable stop sequences for controlling generation boundaries
- Configurable parameters - Fine-tune model behavior with temperature, top-p, and repetition settings
- Memory efficient - Support for memory mapping and model locking
- Cross-platform - Works on macOS and Linux (Windows untested)
- Type-safe - Full Dart type safety with proper error handling
- CLI tools - Globally installable `ldcompletion` and `ldchat` commands
Getting Started #
Prerequisites #
- Dart SDK - Version 3.0.0 or higher
- C/C++ Compiler - For building the native wrapper
- CMake - For building llama.cpp
Installation #
Add to your pubspec.yaml:
dependencies:
dart_llama: ^0.2.0
CLI Installation (Global) #
Install the CLI tools globally with a single command:
# Clone the repository
git clone https://github.com/WAMF/dart_llama.git
cd dart_llama
# Build and install globally (builds libraries, installs to ~/.dart_llama/, activates globally)
dart run bin/dart_llama_tool.dart global-install
# Then use from anywhere:
ldcompletion ~/models/model.gguf "Hello world" --stream
ldchat ~/models/model.gguf --stream
Quick Start (Local Development) #
# Clone the repository
git clone https://github.com/WAMF/dart_llama.git
cd dart_llama
# Run setup (builds static libraries, compiles executables, downloads model, runs tests)
dart run bin/dart_llama_tool.dart setup
# Try the compiled executables
./dist/ldcompletion models/gemma-3-1b-it-Q4_K_M.gguf "Hello world" --stream
./dist/ldchat models/gemma-3-1b-it-Q4_K_M.gguf --stream
Build Tool Commands #
The dart_llama_tool provides commands for building and setup:
# Global installation - builds and installs CLI tools globally (recommended)
dart run bin/dart_llama_tool.dart global-install
# Complete setup - builds static libraries, compiles executables, downloads model, runs tests
dart run bin/dart_llama_tool.dart setup
# Build llama.cpp and wrapper libraries (static by default)
dart run bin/dart_llama_tool.dart build
# Compile CLI tools to native executables in dist/
dart run bin/dart_llama_tool.dart compile
# Install library to ~/.dart_llama/ for global CLI usage
dart run bin/dart_llama_tool.dart install-lib
# Regenerate FFI bindings
dart run bin/dart_llama_tool.dart ffigen
# Download the Gemma 3 1B model
dart run bin/dart_llama_tool.dart download-model
# Clean build artifacts and llama.cpp source
dart run bin/dart_llama_tool.dart clean
# Show all commands
dart run bin/dart_llama_tool.dart --help
Dynamic Linking (Development) #
For development, you can use dynamic linking, which is faster to build:
# Setup with dynamic linking (no native executables)
dart run bin/dart_llama_tool.dart setup --dynamic
# Build with dynamic linking
dart run bin/dart_llama_tool.dart build --dynamic
# Run via dart
dart run bin/ldcompletion.dart models/gemma-3-1b-it-Q4_K_M.gguf "Hello" --stream
dart run bin/ldchat.dart models/gemma-3-1b-it-Q4_K_M.gguf --stream
Distribution #
After running setup, the dist/ folder contains everything needed for distribution:
- Native executables (`ldcompletion`, `ldchat`)
- Single wrapper library (`libllama_wrapper.dylib`)
Copy the dist/ folder contents along with your model file (.gguf).
Usage #
Low-Level Text Generation API (Recommended) #
The LlamaModel class provides direct access to text generation without any chat-specific formatting:
import 'package:dart_llama/dart_llama.dart';
void main() async {
// Create configuration
final config = LlamaConfig(
modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf',
contextSize: 2048,
threads: 4,
);
// Initialize the model
final model = LlamaModel(config);
model.initialize();
// Create a generation request
final request = GenerationRequest(
prompt: 'Once upon a time in a galaxy far, far away',
temperature: 0.7,
maxTokens: 256,
);
// Generate text
final response = await model.generate(request);
print(response.text);
// Clean up
model.dispose();
}
Streaming Generation #
// Stream tokens as they are generated
final request = GenerationRequest(
prompt: 'Write a haiku about programming',
temperature: 0.8,
maxTokens: 50,
);
await for (final token in model.generateStream(request)) {
stdout.write(token);
}
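If you also want the complete text after streaming finishes, one simple option is to accumulate the tokens as they arrive:

```dart
final buffer = StringBuffer();
await for (final token in model.generateStream(request)) {
  stdout.write(token);  // show tokens as they arrive
  buffer.write(token);  // keep them for later use
}
final fullText = buffer.toString();
print('\n\nGenerated ${fullText.length} characters');
```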
Building Chat Interfaces #
Different models expect different chat formats. Use the low-level LlamaModel API to implement model-specific formatting:
// Example: Gemma chat format
String buildGemmaPrompt(List<ChatMessage> messages) {
final buffer = StringBuffer();
// Gemma requires <bos> token at the beginning
buffer.write('<bos>');
for (final message in messages) {
if (message.role == 'user') {
buffer
..writeln('<start_of_turn>user')
..writeln(message.content)
..writeln('<end_of_turn>');
} else if (message.role == 'assistant') {
buffer
..writeln('<start_of_turn>model')
..writeln(message.content)
..writeln('<end_of_turn>');
}
}
buffer.write('<start_of_turn>model\n');
return buffer.toString();
}
// Use with LlamaModel and stop sequences
final prompt = buildGemmaPrompt(messages);
final request = GenerationRequest(
prompt: prompt,
stopSequences: ['<end_of_turn>'], // Stop at turn boundaries
);
final response = await model.generate(request);
See example/gemma_chat.dart for a complete Gemma chat implementation.
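As a rough sketch of how these pieces fit together (not the full example/gemma_chat.dart implementation, and assuming ChatMessage is a simple role/content value type), an interactive loop might look like this:

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

Future<void> chatLoop(LlamaModel model) async {
  final history = <ChatMessage>[];

  while (true) {
    stdout.write('You: ');
    final input = stdin.readLineSync();
    if (input == null || input.trim().toLowerCase() == 'exit') break;

    // ChatMessage(role:, content:) is assumed here; adapt to your message type.
    history.add(ChatMessage(role: 'user', content: input));

    // Start each turn from a clean KV cache and resend the full history.
    model.clearContext();
    final request = GenerationRequest(
      prompt: buildGemmaPrompt(history),
      stopSequences: ['<end_of_turn>'],
    );

    final reply = StringBuffer();
    await for (final token in model.generateStream(request)) {
      stdout.write(token);
      reply.write(token);
    }
    stdout.writeln();

    history.add(ChatMessage(role: 'assistant', content: reply.toString()));
  }
}
```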
Configuration #
Model Configuration #
final config = LlamaConfig(
modelPath: 'model.gguf',
// Context and performance
contextSize: 4096, // Maximum context window
batchSize: 2048, // Batch size for processing
threads: 8, // Number of CPU threads
// Memory options
useMmap: true, // Memory-map the model
useMlock: false, // Lock model in RAM
);
final model = LlamaModel(config);
Generation Parameters #
final request = GenerationRequest(
prompt: 'Your prompt here',
// Sampling parameters
temperature: 0.7, // Creativity level (0.0-1.0)
topP: 0.9, // Nucleus sampling threshold
topK: 40, // Top-k sampling
// Generation control
maxTokens: 512, // Maximum tokens to generate
repeatPenalty: 1.1, // Repetition penalty
repeatLastN: 64, // Context for repetition check
seed: -1, // Random seed (-1 for random)
);
Examples #
CLI Tools #
# Text completion
ldcompletion model.gguf "Once upon a time" --stream
# Interactive chat
ldchat models/gemma-3-1b-it-Q4_K_M.gguf --stream
Running Examples Directly #
# Simple completion
dart example/main.dart model.gguf "Once upon a time"
# With streaming
dart example/main.dart model.gguf "Write a poem about" --stream
# Interactive Gemma chat
dart example/gemma_chat.dart models/gemma-3-1b-it-Q4_K_M.gguf
# With custom settings
dart example/gemma_chat.dart model.gguf \
--threads 8 \
--context 4096 \
--temp 0.8 \
--max-tokens 1024 \
--stream
API Documentation #
LlamaModel #
Low-level interface for text generation:
- `LlamaModel(config)` - Create a new model instance
- `initialize()` - Initialize the model and context
- `generate(request, {onToken})` - Generate text with an optional token callback
- `generateStream(request)` - Generate text as a stream of tokens
- `clearContext()` - Clear the KV cache for fresh generation (call before each chat turn)
- `dispose()` - Clean up resources
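For example, the token callback lets you observe output during a blocking generate call. A minimal sketch, assuming onToken is a `void Function(String)` callback (the exact signature is not shown above):

```dart
model.clearContext(); // start this turn from a fresh KV cache

final response = await model.generate(
  GenerationRequest(prompt: 'Explain FFI in one sentence.', maxTokens: 64),
  onToken: (token) => stdout.write(token),
);

stdout.writeln();
print('Generated ${response.generatedTokens} tokens '
    'in ${response.generationTime.inMilliseconds} ms');
```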
LlamaConfig #
Configuration for the LLaMA model:
LlamaConfig({
required String modelPath, // Path to GGUF model file
int contextSize = 2048, // Maximum context window
int batchSize = 2048, // Batch size for processing
int threads = 4, // Number of CPU threads
bool useMmap = true, // Memory-map the model
bool useMlock = false, // Lock model in RAM
})
GenerationRequest #
Parameters for text generation:
GenerationRequest({
required String prompt, // Input text prompt
int maxTokens = 512, // Maximum tokens to generate
double temperature = 0.7, // Creativity (0.0-1.0)
double topP = 0.9, // Nucleus sampling threshold
int topK = 40, // Top-k sampling
double repeatPenalty = 1.1, // Repetition penalty
int repeatLastN = 64, // Context for repetition check
int seed = -1, // Random seed (-1 for random)
List<String> stopSequences = const [],// Stop generation at these sequences
})
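For example, a low temperature, a fixed seed, and a stop sequence can help make output reproducible and bounded (a sketch using only the parameters listed above):

```dart
final request = GenerationRequest(
  prompt: 'List three prime numbers:',
  maxTokens: 32,
  temperature: 0.2,         // low temperature for near-deterministic sampling
  seed: 42,                 // fixed seed instead of -1 (random)
  stopSequences: ['\n\n'],  // stop at the first blank line
);
```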
GenerationResponse #
Response from text generation:
GenerationResponse({
String text, // Generated text
int promptTokens, // Number of prompt tokens
int generatedTokens, // Number of generated tokens
int totalTokens, // Total tokens processed
Duration generationTime, // Time taken to generate
})
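The timing and token counts can be combined into simple throughput metrics, for example:

```dart
final response = await model.generate(request);

final seconds = response.generationTime.inMilliseconds / 1000.0;
final tokensPerSecond =
    seconds > 0 ? response.generatedTokens / seconds : 0.0;

print('Prompt tokens:    ${response.promptTokens}');
print('Generated tokens: ${response.generatedTokens}');
print('Speed:            ${tokensPerSecond.toStringAsFixed(1)} tokens/s');
```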
Platform Support #
| Platform | Status | Architecture |
|---|---|---|
| macOS | Supported | arm64, x86_64 |
| Linux | Supported | x86_64 |
| Windows | Untested | x86_64 |
Troubleshooting #
Library Not Found #
If you get a library loading error:
- For global CLI usage, run `dart run bin/dart_llama_tool.dart install-lib` to install the library to `~/.dart_llama/`
- For local development, ensure `libllama_wrapper.dylib` exists in the project root
- Set the `LLAMA_LIBRARY_PATH` environment variable to the library location
- On Linux, you may need to set `LD_LIBRARY_PATH`: `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.dart_llama`
The library is searched in this order:
1. `LLAMA_LIBRARY_PATH` environment variable
2. Next to the executable
3. Current directory
4. `./dist/` subdirectory
5. `~/.dart_llama/`
Model Loading Issues #
- Verify the model is in GGUF format
- Ensure you have enough RAM (model size + overhead)
- Check model compatibility with your llama.cpp version
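A defensive loading routine can surface these issues early. This is a rough sketch that assumes initialize() throws on failure (error-handling specifics may differ):

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

LlamaModel? tryLoadModel(String modelPath) {
  final file = File(modelPath);
  if (!file.existsSync()) {
    stderr.writeln('Model file not found: $modelPath');
    return null;
  }

  final sizeMb = file.lengthSync() ~/ (1024 * 1024);
  stderr.writeln('Loading ~$sizeMb MB model (expect RAM usage above this)...');

  final model = LlamaModel(LlamaConfig(modelPath: modelPath));
  try {
    model.initialize();
    return model;
  } catch (e) {
    stderr.writeln('Failed to initialize model: $e');
    return null;
  }
}
```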
Performance Tips #
- Use more threads for faster generation
- Enable `useMmap` for faster model loading
- Adjust `batchSize` based on your hardware
- Use quantized models for better performance
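A configuration applying these tips might look like the following; the specific numbers are illustrative assumptions, not measured recommendations:

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

final config = LlamaConfig(
  modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf', // quantized (Q4_K_M) model
  threads: Platform.numberOfProcessors,          // use all available CPU cores
  useMmap: true,                                 // memory-map for faster loading
  batchSize: 512,                                // tune to your hardware
  contextSize: 4096,
);
```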
Contributing #
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Run tests: `dart test`
- Run analysis: `dart analyze`
- Submit a pull request
License #
MIT License - see LICENSE file for details.
Acknowledgments #
- llama.cpp - The underlying inference engine
- The Dart FFI team for excellent documentation
- The open-source AI community