dart_llama

A Dart package that provides FFI bindings to llama.cpp for running LLaMA models locally.

Features

  • FFI-based bindings - Direct integration with llama.cpp for maximum performance
  • Low-level API - Full control over text generation without opinionated abstractions
  • Streaming support - Real-time token generation with proper stop sequence handling
  • Stop sequences - Configurable stop sequences for controlling generation boundaries
  • Configurable parameters - Fine-tune model behavior with temperature, top-p, and repetition settings
  • Memory efficient - Support for memory mapping and model locking
  • Cross-platform - Works on macOS and Linux, with Windows support planned
  • Type-safe - Full Dart type safety with proper error handling

Getting Started

Prerequisites

  1. Dart SDK - Version 3.0.0 or higher
  2. C/C++ Compiler - For building the native wrapper
  3. CMake - For building llama.cpp

Installation

Add to your pubspec.yaml:

dependencies:
  dart_llama: ^0.1.0
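
Then fetch the package:

dart pub get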

Building the Native Library

This package requires building native libraries before use. Follow these steps:

# Clone the repository
git clone https://github.com/WAMF/dart_llama.git
cd dart_llama

# Build both llama.cpp and the wrapper library
./scripts/build_llama.sh

# The script will:
# 1. Clone/update llama.cpp submodule
# 2. Build llama.cpp as a shared library (libllama.dylib/so/dll)
# 3. Build the wrapper library (libllama_wrapper.dylib/so/dll)

# Optional: Download a test model (Gemma 3 1B)
./scripts/download_gemma.sh

Manual Build (Advanced)

If you need to rebuild just the wrapper after making changes:

./scripts/build_wrapper.sh

Or build manually:

# macOS
clang -shared -fPIC -o libllama_wrapper.dylib llama_wrapper.c \
    -I./llama.cpp/include -L. -lllama -std=c11

# Linux
gcc -shared -fPIC -o libllama_wrapper.so llama_wrapper.c \
    -I./llama.cpp/include -L. -lllama -std=c11

# Windows
gcc -shared -o llama_wrapper.dll llama_wrapper.c \
    -I./llama.cpp/include -L. -lllama -std=c11

Usage

The LlamaModel class provides direct access to text generation without any chat-specific formatting:

import 'package:dart_llama/dart_llama.dart';

void main() async {
  // Create configuration
  final config = LlamaConfig(
    modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf',
    contextSize: 2048,
    threads: 4,
  );

  // Initialize the model
  final model = LlamaModel(config);
  model.initialize();

  // Create a generation request
  final request = GenerationRequest(
    prompt: 'Once upon a time in a galaxy far, far away',
    temperature: 0.7,
    maxTokens: 256,
  );

  // Generate text
  final response = await model.generate(request);
  print(response.text);

  // Clean up
  model.dispose();
}

Streaming Generation

// Stream tokens as they are generated
final request = GenerationRequest(
  prompt: 'Write a haiku about programming',
  temperature: 0.8,
  maxTokens: 50,
);

await for (final token in model.generateStream(request)) {
  stdout.write(token);
}
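
If you also need the complete text after streaming, you can accumulate the tokens as they arrive. A minimal sketch, assuming an initialized model as in the Usage section and dart:io for stdout:

import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

Future<String> streamAndCollect(LlamaModel model, GenerationRequest request) async {
  final buffer = StringBuffer();
  await for (final token in model.generateStream(request)) {
    stdout.write(token); // Echo tokens to the console as they arrive
    buffer.write(token); // Keep a copy for later use
  }
  return buffer.toString();
}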

Building Chat Interfaces

Different models expect different chat formats. Use the low-level LlamaModel API to implement model-specific formatting:

// Example: Gemma chat format
String buildGemmaPrompt(List<ChatMessage> messages) {
  final buffer = StringBuffer();
  
  // Gemma requires a <bos> token at the beginning
  buffer.write('<bos>');
  
  for (final message in messages) {
    if (message.role == 'user') {
      buffer
        ..writeln('<start_of_turn>user')
        ..writeln(message.content)
        ..writeln('<end_of_turn>');
    } else if (message.role == 'assistant') {
      buffer
        ..writeln('<start_of_turn>model')
        ..writeln(message.content)
        ..writeln('<end_of_turn>');
    }
  }
  
  buffer.write('<start_of_turn>model\n');
  return buffer.toString();
}

// Use with LlamaModel and stop sequences
final prompt = buildGemmaPrompt(messages);
final request = GenerationRequest(
  prompt: prompt,
  stopSequences: ['<end_of_turn>'], // Stop at turn boundaries
);
final response = await model.generate(request);

See example/gemma_chat.dart for a complete Gemma chat implementation.

Configuration

Model Configuration

final config = LlamaConfig(
  modelPath: 'model.gguf',
  
  // Context and performance
  contextSize: 4096,        // Maximum context window
  batchSize: 2048,         // Batch size for processing
  threads: 8,              // Number of CPU threads
  
  // Memory options
  useMmap: true,           // Memory-map the model
  useMlock: false,         // Lock model in RAM
);

final model = LlamaModel(config);

Generation Parameters

final request = GenerationRequest(
  prompt: 'Your prompt here',
  
  // Sampling parameters
  temperature: 0.7,        // Creativity level (0.0-1.0)
  topP: 0.9,              // Nucleus sampling threshold
  topK: 40,               // Top-k sampling
  
  // Generation control
  maxTokens: 512,          // Maximum tokens to generate
  repeatPenalty: 1.1,      // Repetition penalty
  repeatLastN: 64,         // Context for repetition check
  seed: -1,               // Random seed (-1 for random)
);
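
Stop sequences can also be set on the request to end generation at a boundary of your choosing, for example at the first blank line:

final request = GenerationRequest(
  prompt: 'List three colors:\n1.',
  maxTokens: 64,
  stopSequences: ['\n\n'], // Generation ends when a blank line is produced
);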

Examples

Text Completion

# Simple completion
dart example/completion.dart model.gguf "Once upon a time"

# With streaming
dart example/completion.dart model.gguf "Write a poem about" --stream

Gemma Chat

# Interactive Gemma chat
dart example/gemma_chat.dart models/gemma-3-1b-it-Q4_K_M.gguf

# With custom settings
dart example/gemma_chat.dart model.gguf \
  --threads 8 \
  --context 4096 \
  --temp 0.8 \
  --max-tokens 1024 \
  --stream

API Documentation

LlamaModel

Low-level interface for text generation:

  • LlamaModel(config) - Create a new model instance
  • initialize() - Initialize the model and context
  • generate(request, {onToken}) - Generate text with an optional per-token callback (see the sketch after this list)
  • generateStream(request) - Generate text as a stream of tokens
  • dispose() - Clean up resources
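
A sketch of the full lifecycle, including the per-token callback; the callback is assumed here to receive the token text as a String, so check the API docs for the exact signature:

final model = LlamaModel(LlamaConfig(modelPath: 'model.gguf'));
model.initialize();
try {
  final response = await model.generate(
    GenerationRequest(prompt: 'Hello', maxTokens: 32),
    onToken: (token) => stdout.write(token), // Assumed: void Function(String)
  );
  print('\nDone in ${response.generationTime.inMilliseconds} ms');
} finally {
  model.dispose(); // Always release the native model and context
}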

LlamaConfig

Configuration for the LLaMA model:

LlamaConfig({
  required String modelPath,   // Path to GGUF model file
  int contextSize = 2048,      // Maximum context window
  int batchSize = 2048,        // Batch size for processing
  int threads = 4,             // Number of CPU threads
  bool useMmap = true,         // Memory-map the model
  bool useMlock = false,       // Lock model in RAM
})

GenerationRequest

Parameters for text generation:

GenerationRequest({
  required String prompt,               // Input text prompt
  int maxTokens = 512,                 // Maximum tokens to generate
  double temperature = 0.7,             // Creativity (0.0-1.0)
  double topP = 0.9,                   // Nucleus sampling threshold
  int topK = 40,                       // Top-k sampling
  double repeatPenalty = 1.1,          // Repetition penalty
  int repeatLastN = 64,                // Context for repetition check
  int seed = -1,                       // Random seed (-1 for random)
  List<String> stopSequences = const [], // Stop generation at these sequences
})

GenerationResponse

Response from text generation:

GenerationResponse({
  String text,                         // Generated text
  int promptTokens,                    // Number of prompt tokens
  int generatedTokens,                 // Number of generated tokens
  int totalTokens,                     // Total tokens processed
  Duration generationTime,             // Time taken to generate
})
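
The token counts and timing make it straightforward to compute throughput, for example:

final response = await model.generate(request);
final seconds = response.generationTime.inMilliseconds / 1000;
// Guard against a zero-duration run before dividing.
final tokensPerSecond =
    seconds > 0 ? response.generatedTokens / seconds : 0.0;
print('${response.generatedTokens} tokens in '
    '${seconds.toStringAsFixed(2)} s '
    '(${tokensPerSecond.toStringAsFixed(1)} tok/s)');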

Platform Support

Platform   Status         Architecture
macOS      ✅ Supported   arm64, x86_64
Linux      ✅ Supported   x86_64
Windows    🚧 Planned     x86_64

Troubleshooting

Library Not Found

If you get a library loading error:

  1. Ensure you've run ./scripts/build_llama.sh to build both libraries
  2. Check that both libraries exist in the project root:
    • libllama.dylib (or .so on Linux, .dll on Windows)
    • libllama_wrapper.dylib (or .so on Linux, .dll on Windows)
  3. On Linux, you may need to set LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.
    

Model Loading Issues

  • Verify the model is in GGUF format
  • Ensure you have enough RAM (model size + overhead)
  • Check model compatibility with your llama.cpp version

Performance Tips

  • Use more threads for faster generation
  • Enable useMmap for faster model loading
  • Adjust batchSize based on your hardware (see the sketch after this list)
  • Use quantized models for better performance
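
As an illustration, a configuration tuned for an 8-core machine might look like the sketch below; the right values depend on your hardware and model size:

final fastConfig = LlamaConfig(
  modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf',
  contextSize: 4096,
  batchSize: 512,   // Smaller batches lower memory pressure; raise if RAM allows
  threads: 8,       // Match your number of physical cores
  useMmap: true,    // Memory-map the file for faster loading
  useMlock: false,  // Only lock the model in RAM if you have memory to spare
);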

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Run tests: dart test
  4. Run analysis: dart analyze
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

  • llama.cpp - The underlying inference engine
  • The Dart FFI team for excellent documentation
  • The open-source AI community
