dart_llama 0.2.0
A Dart package for interfacing with llama.cpp models using FFI
dart_llama #
A Dart package that provides FFI bindings to llama.cpp for running LLaMA models locally.
Features #
- FFI-based bindings - Direct integration with llama.cpp for maximum performance
- Low-level API - Full control over text generation without opinionated abstractions
- Streaming support - Real-time token generation with proper stop sequence handling
- Stop sequences - Configurable stop sequences for controlling generation boundaries
- Configurable parameters - Fine-tune model behavior with temperature, top-p, and repetition settings
- Memory efficient - Support for memory mapping and model locking
- Cross-platform - Works on macOS and Linux (Windows untested)
- Type-safe - Full Dart type safety with proper error handling
- CLI tools - Globally installable `ldcompletion` and `ldchat` commands
Getting Started #
Prerequisites #
- Dart SDK - Version 3.0.0 or higher
- C/C++ Compiler - For building the native wrapper
- CMake - For building llama.cpp
Installation #
Add to your pubspec.yaml:
dependencies:
dart_llama: ^0.2.0
CLI Installation (Global) #
Install the CLI tools globally with a single command:
# Clone the repository
git clone https://github.com/WAMF/dart_llama.git
cd dart_llama
# Build and install globally (builds libraries, installs to ~/.dart_llama/, activates globally)
dart run bin/dart_llama_tool.dart global-install
# Then use from anywhere:
ldcompletion ~/models/model.gguf "Hello world" --stream
ldchat ~/models/model.gguf --stream
Quick Start (Local Development) #
# Clone the repository
git clone https://github.com/WAMF/dart_llama.git
cd dart_llama
# Run setup (builds static libraries, compiles executables, downloads model, runs tests)
dart run bin/dart_llama_tool.dart setup
# Try the compiled executables
./dist/ldcompletion models/gemma-3-1b-it-Q4_K_M.gguf "Hello world" --stream
./dist/ldchat models/gemma-3-1b-it-Q4_K_M.gguf --stream
Build Tool Commands #
The dart_llama_tool provides commands for building and setup:
# Global installation - builds and installs CLI tools globally (recommended)
dart run bin/dart_llama_tool.dart global-install
# Complete setup - builds static libraries, compiles executables, downloads model, runs tests
dart run bin/dart_llama_tool.dart setup
# Build llama.cpp and wrapper libraries (static by default)
dart run bin/dart_llama_tool.dart build
# Compile CLI tools to native executables in dist/
dart run bin/dart_llama_tool.dart compile
# Install library to ~/.dart_llama/ for global CLI usage
dart run bin/dart_llama_tool.dart install-lib
# Regenerate FFI bindings
dart run bin/dart_llama_tool.dart ffigen
# Download the Gemma 3 1B model
dart run bin/dart_llama_tool.dart download-model
# Clean build artifacts and llama.cpp source
dart run bin/dart_llama_tool.dart clean
# Show all commands
dart run bin/dart_llama_tool.dart --help
Dynamic Linking (Development) #
For development, you can use dynamic linking, which is faster to build:
# Setup with dynamic linking (no native executables)
dart run bin/dart_llama_tool.dart setup --dynamic
# Build with dynamic linking
dart run bin/dart_llama_tool.dart build --dynamic
# Run via dart
dart run bin/ldcompletion.dart models/gemma-3-1b-it-Q4_K_M.gguf "Hello" --stream
dart run bin/ldchat.dart models/gemma-3-1b-it-Q4_K_M.gguf --stream
Distribution #
After running setup, the dist/ folder contains everything needed for distribution:
- Native executables (`ldcompletion`, `ldchat`)
- Single wrapper library (`libllama_wrapper.dylib`)
Copy the dist/ folder contents along with your model file (.gguf).
Usage #
Low-Level Text Generation API (Recommended) #
The LlamaModel class provides direct access to text generation without any chat-specific formatting:
import 'package:dart_llama/dart_llama.dart';
void main() async {
// Create configuration
final config = LlamaConfig(
modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf',
contextSize: 2048,
threads: 4,
);
// Initialize the model
final model = LlamaModel(config);
model.initialize();
// Create a generation request
final request = GenerationRequest(
prompt: 'Once upon a time in a galaxy far, far away',
temperature: 0.7,
maxTokens: 256,
);
// Generate text
final response = await model.generate(request);
print(response.text);
// Clean up
model.dispose();
}
Streaming Generation #
// Stream tokens as they are generated
final request = GenerationRequest(
prompt: 'Write a haiku about programming',
temperature: 0.8,
maxTokens: 50,
);
await for (final token in model.generateStream(request)) {
stdout.write(token);
}
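If you also want the complete text after streaming finishes, one simple option is to accumulate the tokens as they arrive:

```dart
final buffer = StringBuffer();
await for (final token in model.generateStream(request)) {
  stdout.write(token);  // show tokens as they arrive
  buffer.write(token);  // keep them for later use
}
final fullText = buffer.toString();
print('\n\nGenerated ${fullText.length} characters');
```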
Building Chat Interfaces #
Different models expect different chat formats. Use the low-level LlamaModel API to implement model-specific formatting:
// Example: Gemma chat format
String buildGemmaPrompt(List<ChatMessage> messages) {
final buffer = StringBuffer();
// Gemma requires <bos> token at the beginning
buffer.write('<bos>');
for (final message in messages) {
if (message.role == 'user') {
buffer
..writeln('<start_of_turn>user')
..writeln(message.content)
..writeln('<end_of_turn>');
} else if (message.role == 'assistant') {
buffer
..writeln('<start_of_turn>model')
..writeln(message.content)
..writeln('<end_of_turn>');
}
}
buffer.write('<start_of_turn>model\n');
return buffer.toString();
}
// Use with LlamaModel and stop sequences
final prompt = buildGemmaPrompt(messages);
final request = GenerationRequest(
prompt: prompt,
stopSequences: ['<end_of_turn>'], // Stop at turn boundaries
);
final response = await model.generate(request);
See example/gemma_chat.dart for a complete Gemma chat implementation.
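As a rough sketch of how these pieces fit together (not the full example/gemma_chat.dart implementation, and assuming ChatMessage is a simple role/content value type), an interactive loop might look like this:

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

Future<void> chatLoop(LlamaModel model) async {
  final history = <ChatMessage>[];

  while (true) {
    stdout.write('You: ');
    final input = stdin.readLineSync();
    if (input == null || input.trim().toLowerCase() == 'exit') break;

    // ChatMessage(role:, content:) is assumed here; adapt to your message type.
    history.add(ChatMessage(role: 'user', content: input));

    // Start each turn from a clean KV cache and resend the full history.
    model.clearContext();
    final request = GenerationRequest(
      prompt: buildGemmaPrompt(history),
      stopSequences: ['<end_of_turn>'],
    );

    final reply = StringBuffer();
    await for (final token in model.generateStream(request)) {
      stdout.write(token);
      reply.write(token);
    }
    stdout.writeln();

    history.add(ChatMessage(role: 'assistant', content: reply.toString()));
  }
}
```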
Configuration #
Model Configuration #
final config = LlamaConfig(
modelPath: 'model.gguf',
// Context and performance
contextSize: 4096, // Maximum context window
batchSize: 2048, // Batch size for processing
threads: 8, // Number of CPU threads
// Memory options
useMmap: true, // Memory-map the model
useMlock: false, // Lock model in RAM
);
final model = LlamaModel(config);
Generation Parameters #
final request = GenerationRequest(
prompt: 'Your prompt here',
// Sampling parameters
temperature: 0.7, // Creativity level (0.0-1.0)
topP: 0.9, // Nucleus sampling threshold
topK: 40, // Top-k sampling
// Generation control
maxTokens: 512, // Maximum tokens to generate
repeatPenalty: 1.1, // Repetition penalty
repeatLastN: 64, // Context for repetition check
seed: -1, // Random seed (-1 for random)
);
Examples #
CLI Tools #
# Text completion
ldcompletion model.gguf "Once upon a time" --stream
# Interactive chat
ldchat models/gemma-3-1b-it-Q4_K_M.gguf --stream
Running Examples Directly #
# Simple completion
dart example/main.dart model.gguf "Once upon a time"
# With streaming
dart example/main.dart model.gguf "Write a poem about" --stream
# Interactive Gemma chat
dart example/gemma_chat.dart models/gemma-3-1b-it-Q4_K_M.gguf
# With custom settings
dart example/gemma_chat.dart model.gguf \
--threads 8 \
--context 4096 \
--temp 0.8 \
--max-tokens 1024 \
--stream
API Documentation #
LlamaModel #
Low-level interface for text generation:
- `LlamaModel(config)` - Create a new model instance
- `initialize()` - Initialize the model and context
- `generate(request, {onToken})` - Generate text with an optional token callback
- `generateStream(request)` - Generate text as a stream of tokens
- `clearContext()` - Clear the KV cache for fresh generation (call before each chat turn)
- `dispose()` - Clean up resources
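For example, the token callback lets you observe output during a blocking generate call. A minimal sketch, assuming onToken is a `void Function(String)` callback (the exact signature is not shown above):

```dart
model.clearContext(); // start this turn from a fresh KV cache

final response = await model.generate(
  GenerationRequest(prompt: 'Explain FFI in one sentence.', maxTokens: 64),
  onToken: (token) => stdout.write(token),
);

stdout.writeln();
print('Generated ${response.generatedTokens} tokens '
    'in ${response.generationTime.inMilliseconds} ms');
```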
LlamaConfig #
Configuration for the LLaMA model:
LlamaConfig({
required String modelPath, // Path to GGUF model file
int contextSize = 2048, // Maximum context window
int batchSize = 2048, // Batch size for processing
int threads = 4, // Number of CPU threads
bool useMmap = true, // Memory-map the model
bool useMlock = false, // Lock model in RAM
})
GenerationRequest #
Parameters for text generation:
GenerationRequest({
required String prompt, // Input text prompt
int maxTokens = 512, // Maximum tokens to generate
double temperature = 0.7, // Creativity (0.0-1.0)
double topP = 0.9, // Nucleus sampling threshold
int topK = 40, // Top-k sampling
double repeatPenalty = 1.1, // Repetition penalty
int repeatLastN = 64, // Context for repetition check
int seed = -1, // Random seed (-1 for random)
List<String> stopSequences = const [],// Stop generation at these sequences
})
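For example, a low temperature, a fixed seed, and a stop sequence can help make output reproducible and bounded (a sketch using only the parameters listed above):

```dart
final request = GenerationRequest(
  prompt: 'List three prime numbers:',
  maxTokens: 32,
  temperature: 0.2,         // low temperature for near-deterministic sampling
  seed: 42,                 // fixed seed instead of -1 (random)
  stopSequences: ['\n\n'],  // stop at the first blank line
);
```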
GenerationResponse #
Response from text generation:
GenerationResponse({
String text, // Generated text
int promptTokens, // Number of prompt tokens
int generatedTokens, // Number of generated tokens
int totalTokens, // Total tokens processed
Duration generationTime, // Time taken to generate
})
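The timing and token counts can be combined into simple throughput metrics, for example:

```dart
final response = await model.generate(request);

final seconds = response.generationTime.inMilliseconds / 1000.0;
final tokensPerSecond =
    seconds > 0 ? response.generatedTokens / seconds : 0.0;

print('Prompt tokens:    ${response.promptTokens}');
print('Generated tokens: ${response.generatedTokens}');
print('Speed:            ${tokensPerSecond.toStringAsFixed(1)} tokens/s');
```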
Platform Support #
| Platform | Status | Architecture |
|---|---|---|
| macOS | Supported | arm64, x86_64 |
| Linux | Supported | x86_64 |
| Windows | Untested | x86_64 |
Troubleshooting #
Library Not Found #
If you get a library loading error:
- For global CLI usage, run `dart run bin/dart_llama_tool.dart install-lib` to install the library to `~/.dart_llama/`
- For local development, ensure `libllama_wrapper.dylib` exists in the project root
- Set the `LLAMA_LIBRARY_PATH` environment variable to the library location
- On Linux, you may need to set `LD_LIBRARY_PATH`: `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.dart_llama`
The library is searched in this order:
1. `LLAMA_LIBRARY_PATH` environment variable
2. Next to the executable
3. Current directory
4. `./dist/` subdirectory
5. `~/.dart_llama/`
Model Loading Issues #
- Verify the model is in GGUF format
- Ensure you have enough RAM (model size + overhead)
- Check model compatibility with your llama.cpp version
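A defensive loading routine can surface these issues early. This is a rough sketch that assumes initialize() throws on failure (error-handling specifics may differ):

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

LlamaModel? tryLoadModel(String modelPath) {
  final file = File(modelPath);
  if (!file.existsSync()) {
    stderr.writeln('Model file not found: $modelPath');
    return null;
  }

  final sizeMb = file.lengthSync() ~/ (1024 * 1024);
  stderr.writeln('Loading ~$sizeMb MB model (expect RAM usage above this)...');

  final model = LlamaModel(LlamaConfig(modelPath: modelPath));
  try {
    model.initialize();
    return model;
  } catch (e) {
    stderr.writeln('Failed to initialize model: $e');
    return null;
  }
}
```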
Performance Tips #
- Use more threads for faster generation
- Enable `useMmap` for faster model loading
- Adjust `batchSize` based on your hardware
- Use quantized models for better performance
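A configuration applying these tips might look like the following; the specific numbers are illustrative assumptions, not measured recommendations:

```dart
import 'dart:io';

import 'package:dart_llama/dart_llama.dart';

final config = LlamaConfig(
  modelPath: 'models/gemma-3-1b-it-Q4_K_M.gguf', // quantized (Q4_K_M) model
  threads: Platform.numberOfProcessors,          // use all available CPU cores
  useMmap: true,                                 // memory-map for faster loading
  batchSize: 512,                                // tune to your hardware
  contextSize: 4096,
);
```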
Contributing #
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Run tests: `dart test`
- Run analysis: `dart analyze`
- Submit a pull request
License #
MIT License - see LICENSE file for details.
Acknowledgments #
- llama.cpp - The underlying inference engine
- The Dart FFI team for excellent documentation
- The open-source AI community