# RunAnywhere LlamaCpp Backend
High-performance LLM text generation backend for the RunAnywhere Flutter SDK, powered by llama.cpp.
## Features
| Feature | Description |
|---|---|
| GGUF Model Support | Run any GGUF-quantized model (Q4, Q5, Q8, etc.) |
| Streaming Generation | Token-by-token streaming for real-time UI updates |
| Metal Acceleration | Hardware acceleration on iOS devices |
| NEON Acceleration | ARM NEON optimizations on Android |
| Privacy-First | All processing happens locally on device |
| Memory Efficient | Quantized models reduce memory footprint |
## Installation

Add both the core SDK and this backend to your `pubspec.yaml`:

```yaml
dependencies:
  runanywhere: ^0.15.11
  runanywhere_llamacpp: ^0.15.11
```

Then run:

```bash
flutter pub get
```

> Note: This package requires the core `runanywhere` package. It won't work standalone.
## Platform Support
| Platform | Minimum Version | Acceleration |
|---|---|---|
| iOS | 14.0+ | Metal GPU |
| Android | API 24+ | NEON SIMD |
## Quick Start

### 1. Initialize & Register

```dart
import 'package:flutter/material.dart';
import 'package:runanywhere/runanywhere.dart';
import 'package:runanywhere_llamacpp/runanywhere_llamacpp.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();

  // Initialize SDK
  await RunAnywhere.initialize();

  // Register LlamaCpp backend
  await LlamaCpp.register();

  runApp(MyApp());
}
```
### 2. Add a Model

```dart
LlamaCpp.addModel(
  id: 'smollm2-360m-q8_0',
  name: 'SmolLM2 360M Q8_0',
  url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
  memoryRequirement: 500000000, // ~500MB
);
```
### 3. Download & Load

```dart
// Download the model
await for (final progress in RunAnywhere.downloadModel('smollm2-360m-q8_0')) {
  print('Progress: ${(progress.percentage * 100).toStringAsFixed(1)}%');
  if (progress.state.isCompleted) break;
}

// Load the model
await RunAnywhere.loadModel('smollm2-360m-q8_0');
print('Model loaded: ${RunAnywhere.isModelLoaded}');
```
### 4. Generate Text

```dart
// Simple chat
final response = await RunAnywhere.chat('Hello! How are you?');
print(response);

// Streaming generation
final result = await RunAnywhere.generateStream(
  'Write a short poem about Flutter',
  options: LLMGenerationOptions(maxTokens: 100, temperature: 0.7),
);

await for (final token in result.stream) {
  stdout.write(token); // Real-time output
}

// Get metrics after completion
final metrics = await result.result;
print('\nTokens/sec: ${metrics.tokensPerSecond.toStringAsFixed(1)}');
```
## API Reference

### LlamaCpp Class

#### register()

Register the LlamaCpp backend with the SDK.

```dart
static Future<void> register({int priority = 100})
```

Parameters:

- `priority` – Backend priority (higher = preferred). Default: 100.
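For example, to register the backend with a non-default priority (a minimal sketch; the value 200 is purely illustrative):

```dart
// A higher priority makes this backend preferred over others
// that register with a lower value.
await LlamaCpp.register(priority: 200);
```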
#### addModel()

Add an LLM model to the registry.

```dart
static void addModel({
  required String id,
  required String name,
  required String url,
  int memoryRequirement = 0,
  bool supportsThinking = false,
})
```

Parameters:

- `id` – Unique model identifier
- `name` – Human-readable model name
- `url` – Download URL for the GGUF file
- `memoryRequirement` – Estimated memory usage in bytes
- `supportsThinking` – Whether the model supports thinking tokens (e.g., DeepSeek R1)
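A minimal sketch of registering a reasoning model that emits thinking tokens; the `id`, `name`, and especially the `url` below are placeholders, not a real download link:

```dart
// Register a reasoning model; supportsThinking enables thinking-token handling.
// Replace the URL with the actual GGUF download link for the model you use.
LlamaCpp.addModel(
  id: 'deepseek-r1-distill-1.5b-q4_k_m',
  name: 'DeepSeek R1 Distill 1.5B Q4_K_M',
  url: 'https://example.com/path/to/DeepSeek-R1-Distill-1.5B.Q4_K_M.gguf',
  memoryRequirement: 1500000000, // ~1.5GB, matching the table below
  supportsThinking: true,
);
```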
## Supported Models

Any GGUF model compatible with llama.cpp:

### Recommended Models
| Model | Size | Memory | Use Case |
|---|---|---|---|
| SmolLM2 360M Q8_0 | ~400MB | ~500MB | Fast responses, mobile |
| Qwen2.5 0.5B Q8_0 | ~600MB | ~700MB | Good quality, small |
| Qwen2.5 1.5B Q4_K_M | ~1GB | ~1.2GB | Better quality |
| Phi-3.5-mini Q4_K_M | ~2GB | ~2.5GB | High quality |
| Llama 3.2 1B Q4_K_M | ~800MB | ~1GB | Balanced |
| DeepSeek R1 1.5B Q4_K_M | ~1.2GB | ~1.5GB | Reasoning, thinking |
### Quantization Guide
| Format | Quality | Size | Speed |
|---|---|---|---|
| Q8_0 | Highest | Largest | Slower |
| Q6_K | Very High | Large | Medium |
| Q5_K_M | High | Medium | Medium |
| Q4_K_M | Good | Small | Fast |
| Q4_0 | Lower | Smallest | Fastest |
Tip: For mobile, Q4_K_M or Q5_K_M offer the best quality/size balance.
## Memory Management

### Checking Memory

```dart
// Get available models with their memory requirements
final models = await RunAnywhere.availableModels();
for (final model in models) {
  if (model.downloadSize != null) {
    print('${model.name}: ${(model.downloadSize! / 1e9).toStringAsFixed(1)} GB');
  }
}
```

### Unloading Models

```dart
// Unload to free memory
await RunAnywhere.unloadModel();
```
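When switching models, unloading the current one first avoids holding two sets of weights in memory at once. A minimal sketch using the calls shown above (the model ID is illustrative and assumes it was registered earlier):

```dart
// Free the currently loaded model, then load a different registered model.
if (RunAnywhere.isModelLoaded) {
  await RunAnywhere.unloadModel();
}
await RunAnywhere.loadModel('qwen2.5-0.5b-q8_0');
```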
## Generation Options

```dart
final result = await RunAnywhere.generate(
  'Your prompt here',
  options: LLMGenerationOptions(
    maxTokens: 200,      // Maximum tokens to generate
    temperature: 0.7,    // Randomness (0.0 = deterministic, 1.0 = creative)
    topP: 0.9,           // Nucleus sampling
    systemPrompt: 'You are a helpful assistant.',
  ),
);
```

| Option | Default | Range | Description |
|---|---|---|---|
| `maxTokens` | 100 | 1-4096 | Maximum tokens to generate |
| `temperature` | 0.8 | 0.0-2.0 | Response randomness |
| `topP` | 1.0 | 0.0-1.0 | Nucleus sampling threshold |
| `systemPrompt` | null | - | System prompt prepended to input |
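For fully deterministic output, set `temperature` to 0.0 as noted above; a minimal sketch (the prompt, system prompt, and token limit are illustrative):

```dart
// Deterministic completion: temperature 0.0 removes sampling randomness.
final deterministic = await RunAnywhere.generate(
  'List three GGUF quantization formats.',
  options: LLMGenerationOptions(
    maxTokens: 60,
    temperature: 0.0,
    systemPrompt: 'You are a concise assistant.',
  ),
);
```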
## Troubleshooting

### Model Loading Fails

Symptom: `SDKError.modelLoadFailed`

Solutions:

- Verify the model is fully downloaded (check `model.isDownloaded`, as in the sketch below)
- Ensure sufficient memory is available
- Check that the model format is GGUF (not GGML or safetensors)
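A minimal sketch of the first check, assuming the entries returned by `RunAnywhere.availableModels()` expose the `isDownloaded` flag and an `id` field:

```dart
// Confirm the download completed before attempting to load the model.
final models = await RunAnywhere.availableModels();
final model = models.firstWhere((m) => m.id == 'smollm2-360m-q8_0');
if (model.isDownloaded) {
  await RunAnywhere.loadModel(model.id);
} else {
  print('Model is not fully downloaded yet.');
}
```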
### Slow Generation

Solutions:

- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Use a smaller model
- Reduce `maxTokens`
- On iOS, ensure Metal is available (device not in Low Power Mode)
### Out of Memory

Solutions:

- Unload the current model before loading a new one
- Use a smaller quantization
- Use a smaller model
## Related Packages
- runanywhere — Core SDK (required)
- runanywhere_llamacpp — LLM backend (this package)
- runanywhere_onnx — STT/TTS/VAD backend
## License
This software is licensed under the RunAnywhere License, which is based on Apache 2.0 with additional terms for commercial use. See LICENSE for details.
For commercial licensing inquiries, contact: san@runanywhere.ai