llamafu 0.1.0
A Flutter package for running language models on device with support for completion, instruct mode, tool calling, streaming, constrained generation, LoRA, and multi-modal inputs (images, audio).
Llamafu
Run AI models directly on mobile devices. No cloud. No network round trips. Complete privacy.
Llamafu is a Flutter FFI plugin that brings the power of large language models to your mobile apps. Built on llama.cpp, it delivers high-performance inference with support for text generation, vision, tool calling, and more—all running locally on the device.
Why Llamafu?
| Feature | Benefit |
|---|---|
| 100% On-Device | No API keys, no network calls, works offline |
| Privacy First | Data never leaves the device |
| Low Latency | No round-trip to cloud servers |
| Cost Effective | No per-token API charges |
Features
Core Capabilities
- Text generation with streaming support
- Chat completions with conversation history
- Embeddings generation for semantic search
Advanced AI
- Vision/multimodal (images, audio) with LLaVA, Qwen2-VL
- Tool calling / function calling
- Structured JSON output with schema validation
- Grammar-constrained generation (GBNF)
Customization
- LoRA adapter loading and hot-swapping
- Fine-grained sampling controls (temperature, top-k, top-p, penalties)
- Configurable context size and threading
Platform Support
- Android (API 21+) and iOS (12.0+)
- Optimized native code via FFI
- GPU acceleration where available
Requirements
| Platform | Minimum Version |
|---|---|
| Flutter | 3.10.0+ |
| Dart SDK | 3.1.0+ |
| Android | API 21+ (Android 5.0), NDK 21+ |
| iOS | 12.0+, Xcode 14+ |
Models must be in GGUF format. Quantized versions (such as Q4_K_M or Q8_0) are recommended for mobile.
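Because only GGUF files load, a cheap pre-flight check can catch a wrong or truncated download before the model is handed to the native layer. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes `GGUF`; the helper names `hasGgufMagic` and `looksLikeGguf` are hypothetical and not part of Llamafu, and the path is a placeholder.

```dart
import 'dart:io';
import 'dart:typed_data';

/// Returns true if [header] starts with the 4-byte GGUF magic ("GGUF").
bool hasGgufMagic(List<int> header) {
  const magic = [0x47, 0x47, 0x55, 0x46]; // ASCII "GGUF"
  if (header.length < magic.length) return false;
  for (var i = 0; i < magic.length; i++) {
    if (header[i] != magic[i]) return false;
  }
  return true;
}

/// Reads the first four bytes of the file at [path] and checks the magic.
/// Returns false if the file does not exist.
Future<bool> looksLikeGguf(String path) async {
  final file = File(path);
  if (!await file.exists()) return false;
  final raf = await file.open();
  try {
    final Uint8List header = await raf.read(4);
    return hasGgufMagic(header);
  } finally {
    await raf.close();
  }
}

Future<void> main() async {
  // Placeholder path; replace with a real model location.
  final ok = await looksLikeGguf('/path/to/model.gguf');
  print(ok ? 'Looks like a GGUF file' : 'Not a GGUF file (or missing)');
}
```

A passing check only confirms the file header, not that the model architecture or quantization is supported.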
Installation
flutter pub add llamafu
Or add manually to pubspec.yaml:
dependencies:
llamafu: ^0.1.0
Quick Start
import 'package:llamafu/llamafu.dart';

Future<void> main() async {
  // Load a GGUF model
  final llamafu = await Llamafu.init(
    modelPath: '/path/to/model.gguf',
    threads: 4,
    contextSize: 2048,
  );

  try {
    // Generate text
    final result = await llamafu.complete(
      prompt: 'Explain quantum computing in simple terms:',
      maxTokens: 256,
      temperature: 0.7,
    );
    print(result);
  } finally {
    // Always clean up resources, even if generation throws
    llamafu.close();
  }
}
That's it! The model runs entirely on the device—no internet required.
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation and first steps |
| Model Guide | Choosing and obtaining models |
| High-Level APIs | Chat, LoRA, and multimodal |
| Tool Calling | Function calling and JSON output |
| Performance Guide | Memory and speed optimization |
| API Reference | Complete API documentation |
| Architecture | Technical design (for contributors) |
| Building | Build from source |
| Contributing | Development guidelines |
Usage Examples
Text Generation
final result = await llamafu.complete(
  prompt: 'Write a function to sort an array:',
  maxTokens: 300,
  temperature: 0.7,
  topK: 40,
  topP: 0.9,
  repeatPenalty: 1.1,
);
Multimodal (Vision)
// Load vision model with projection
final llamafu = await Llamafu.init(
  modelPath: '/path/to/llava-model.gguf',
  mmprojPath: '/path/to/mmproj.gguf',
);

final result = await llamafu.multimodalComplete(
  prompt: 'Describe this image:',
  mediaInputs: [
    MediaInput(type: MediaType.image, data: '/path/to/image.jpg'),
  ],
  maxTokens: 200,
);
LoRA Adapters
// Load and apply adapter
final adapter = await llamafu.loadLoraAdapter('/path/to/adapter.gguf');
await llamafu.applyLoraAdapter(adapter, scale: 0.8);

// Generate with adapter
final result = await llamafu.complete(
  prompt: 'Translate to French: Hello',
  maxTokens: 50,
);

// Remove adapter
await llamafu.removeLoraAdapter(adapter);
Structured Output
const jsonGrammar = '''
root ::= object
object ::= "{" ws string ":" ws value "}" ws
string ::= "\\"" [a-zA-Z]+ "\\""
value ::= string | number
number ::= [0-9]+
ws ::= [ ]*
''';

final result = await llamafu.completeWithGrammar(
  prompt: 'Generate user data:',
  grammarStr: jsonGrammar,
  grammarRoot: 'root',
  maxTokens: 100,
);
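A grammar constrains which tokens may be sampled, but generation can still stop at `maxTokens` before the root rule is complete, so it may be worth checking the final string. As a minimal sketch, the regular expression below mirrors the small example grammar above (one key, one string-or-number value); `matchesExampleGrammar` is a hypothetical helper, and it assumes the result is a plain `String` with no trailing newline.

```dart
/// Regular-expression equivalent of the example grammar:
///   "{" ws string ":" ws value "}" ws
/// where string is a quoted run of letters and value is a string or digits.
final RegExp grammarShape =
    RegExp(r'^\{ *"[a-zA-Z]+": *("[a-zA-Z]+"|[0-9]+)\} *$');

/// Returns true if [output] is a complete match for the example grammar.
bool matchesExampleGrammar(String output) => grammarShape.hasMatch(output);

void main() {
  // A complete object and one that was cut off mid-generation.
  for (final sample in ['{"name": "John"}', '{"name": "Jo']) {
    final status = matchesExampleGrammar(sample) ? 'complete' : 'incomplete';
    print('$sample -> $status');
  }
}
```

For larger grammars a regular expression stops being practical; parsing the output with a real JSON decoder is the more general check.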
Tool Calling
// Define tools
final weatherTool = Tool(
  name: 'get_weather',
  description: 'Get weather for a location',
  parameters: {
    'type': 'object',
    'properties': {
      'location': {'type': 'string'},
    },
    'required': ['location'],
  },
);

// Generate tool call
final toolCall = await llamafu.generateToolCall(
  prompt: "What's the weather in Paris?",
  tools: [weatherTool],
);
print(toolCall.name); // "get_weather"
print(toolCall.arguments); // {"location": "Paris"}
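The model only names a tool and proposes arguments; the app still has to run the matching Dart function. One way to wire that up is a lookup table from tool name to handler, sketched below. This assumes the tool name is a `String` and the arguments arrive as a `Map<String, dynamic>`, which the source does not confirm; `ToolHandler`, `getWeather`, `handlers`, and `dispatchToolCall` are hypothetical names, not part of Llamafu.

```dart
/// Signature for a locally implemented tool (hypothetical).
typedef ToolHandler = String Function(Map<String, dynamic> arguments);

/// Stand-in implementation backing the `get_weather` tool.
String getWeather(Map<String, dynamic> arguments) {
  final location = arguments['location'];
  // Model output is untrusted input, so validate before using it.
  if (location is! String || location.isEmpty) {
    throw ArgumentError('get_weather requires a non-empty "location"');
  }
  return 'Weather lookup for $location is not implemented in this sketch';
}

/// Registry mapping tool names to their handlers.
final Map<String, ToolHandler> handlers = {'get_weather': getWeather};

/// Looks up the handler for [name] and runs it with [arguments].
String dispatchToolCall(String name, Map<String, dynamic> arguments) {
  final handler = handlers[name];
  if (handler == null) {
    throw ArgumentError('Model requested unknown tool: $name');
  }
  return handler(arguments);
}

void main() {
  // In an app these two values would come from the generated tool call.
  print(dispatchToolCall('get_weather', {'location': 'Paris'}));
}
```

Rejecting unknown tool names and malformed arguments keeps a hallucinated call from reaching application code.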
JSON Output
// Generate JSON matching a schema
final result = await llamafu.generateJson(
  prompt: 'Extract: John is 25 years old',
  schema: {
    'type': 'object',
    'properties': {
      'name': {'type': 'string'},
      'age': {'type': 'integer'},
    },
    'required': ['name', 'age'],
  },
);
print(result); // {"name": "John", "age": 25}
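If the generated JSON arrives as text, it still needs decoding and a type check before the rest of the app can rely on it. The following sketch uses `dart:convert` for that step and returns a typed record. It assumes the result is a JSON `String` (the source does not state the return type), and `parsePerson` is a hypothetical helper.

```dart
import 'dart:convert';

/// Decodes [raw] and checks the two required fields from the example schema.
/// Throws a FormatException if the shape is not what the schema promised.
({String name, int age}) parsePerson(String raw) {
  final decoded = jsonDecode(raw);
  if (decoded is! Map<String, dynamic>) {
    throw const FormatException('Expected a JSON object');
  }
  final name = decoded['name'];
  final age = decoded['age'];
  if (name is! String) {
    throw const FormatException('"name" must be a string');
  }
  if (age is! int) {
    throw const FormatException('"age" must be an integer');
  }
  return (name: name, age: age);
}

void main() {
  // Stand-in for text produced by the model.
  final person = parsePerson('{"name": "John", "age": 25}');
  print('${person.name} is ${person.age}');
}
```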
Tokenization
// Tokenize
final tokens = await llamafu.tokenize('Hello world');
print('Token count: ${tokens.length}');

// Detokenize
final text = await llamafu.detokenize(tokens);

// Model info
final info = await llamafu.getModelInfo();
print('Vocab: ${info.vocabularySize}');
print('Context: ${info.contextLength}');
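Token counts are mainly useful for budgeting: the prompt and the generated tokens generally share one context window, so a long prompt leaves less room for output. A minimal sketch of that arithmetic follows; `fitsInContext` and `maxNewTokens` are hypothetical helpers, and the numbers are illustrative. In an app, `promptTokens` would be `tokens.length` and `contextSize` the value passed at init.

```dart
/// True if the prompt plus the requested output fits in the context window.
bool fitsInContext({
  required int promptTokens,
  required int maxTokens,
  required int contextSize,
}) =>
    promptTokens + maxTokens <= contextSize;

/// Largest output budget that still fits after the prompt (never negative).
int maxNewTokens({required int promptTokens, required int contextSize}) {
  final remaining = contextSize - promptTokens;
  return remaining > 0 ? remaining : 0;
}

void main() {
  const promptTokens = 1800;
  const contextSize = 2048;

  final ok = fitsInContext(
    promptTokens: promptTokens,
    maxTokens: 256,
    contextSize: contextSize,
  );
  final budget =
      maxNewTokens(promptTokens: promptTokens, contextSize: contextSize);
  print('Fits with maxTokens 256: $ok; room left for output: $budget');
}
```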
Supported Models
Works with any GGUF-format model. Popular choices for mobile:
| Category | Models |
|---|---|
| General | LLaMA 3, Mistral, Phi-3, Qwen2, Gemma 2 |
| Code | Code LLaMA, DeepSeek Coder, StarCoder2 |
| Vision | LLaVA, Qwen2-VL, Moondream |
| Small/Fast | Phi-3 Mini, TinyLlama, Gemma 2B |
Recommended quantizations for mobile: Q4_K_M (best quality/size tradeoff), Q4_0 (fastest), Q8_0 (highest quality).
Find models at Hugging Face or convert your own with llama.cpp.
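To get a feel for what a quantization level means for download size, the weight size can be approximated as parameter count times bits per weight. In the sketch below, the Q4_0 and Q8_0 figures follow from their block layout (32 weights plus a 2-byte scale per block); the Q4_K_M figure is only a rough average and should be treated as an assumption, since K-quants mix block types and vary by model. `bitsPerWeight` and `estimateWeightBytes` are hypothetical names, and real files add metadata and some higher-precision tensors on top.

```dart
/// Approximate bits stored per weight for a few GGUF quantization types.
const Map<String, double> bitsPerWeight = {
  'Q4_0': 4.5, // per 32 weights: 16 bytes of 4-bit values + 2-byte scale
  'Q8_0': 8.5, // per 32 weights: 32 bytes of 8-bit values + 2-byte scale
  'Q4_K_M': 4.8, // rough average only (assumption); varies by model
};

/// Estimates the size in bytes of [parameterCount] weights at [bits] each.
int estimateWeightBytes(int parameterCount, double bits) =>
    (parameterCount * bits / 8).round();

void main() {
  const params = 3000000000; // hypothetical 3B-parameter model
  const bytesPerGib = 1024 * 1024 * 1024;
  for (final entry in bitsPerWeight.entries) {
    final gib = estimateWeightBytes(params, entry.value) / bytesPerGib;
    print('${entry.key}: ~${gib.toStringAsFixed(2)} GiB of weights');
  }
}
```

Runtime memory is higher than this estimate because the context (KV cache) and scratch buffers are allocated in addition to the weights.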
For Developers
Building from Source
# Clone with submodules
git clone --recursive https://github.com/neul-labs/llamafu.git
cd llamafu
# Setup development environment
make setup
# Build and test
make build
make test
Platform-specific builds:
make build-android # Android AAR
make build-ios # iOS framework
make build-local # Local dev with GPU support
Project Structure
llamafu/
├── lib/src/
│ ├── llamafu_base.dart # High-level Dart API
│ └── llamafu_bindings.dart # FFI bindings (dart:ffi)
├── android/src/main/cpp/
│ ├── llamafu.h # C API header
│ └── llamafu.cpp # Native implementation
├── ios/Classes/ # iOS native code
├── llama.cpp/ # llama.cpp submodule (inference engine)
├── test/ # Comprehensive test suite
├── tools/ # Build and test scripts
├── example/ # Example Flutter app
└── docs/ # Documentation
Architecture
┌─────────────────────────────────────────────────────────┐
│ Your Flutter App │
├─────────────────────────────────────────────────────────┤
│ Llamafu Dart API (lib/src/) │
│ High-level, type-safe, async interface │
├─────────────────────────────────────────────────────────┤
│ FFI Bindings (llamafu_bindings.dart) │
│ dart:ffi ↔ Native C bridge │
├─────────────────────────────────────────────────────────┤
│ Native C++ Layer (llamafu.cpp) │
│ RAII, memory safety, validation │
├─────────────────────────────────────────────────────────┤
│ llama.cpp Engine │
│ High-performance inference, GGUF loading │
└─────────────────────────────────────────────────────────┘
Performance Tips
- Use quantized models: Q4_K_M offers the best quality/size tradeoff for mobile
- Right-size context: a smaller context uses less memory (start with 2048, increase if needed)
- Tune threading: `threads: Platform.numberOfProcessors - 1` is a good default
- Clean up resources: always call `close()` when done to free native memory
- Stream responses: use streaming for better perceived performance in chat UIs
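The threading default above evaluates to zero on a single-core device, so it is safer to clamp it. A small sketch, where `defaultThreadCount` is a hypothetical helper rather than part of Llamafu:

```dart
import 'dart:io' show Platform;
import 'dart:math' show max;

/// Applies the "cores minus one" default, but never returns less than 1.
int defaultThreadCount(int availableCores) => max(1, availableCores - 1);

void main() {
  final threads = defaultThreadCount(Platform.numberOfProcessors);
  print('Using $threads inference threads');
}
```

The resulting value can be passed as the `threads` argument shown in the Quick Start.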
Error Handling
try {
  final result = await llamafu.complete(prompt: input, maxTokens: 200);
  print(result);
} on LlamafuException catch (e) {
  switch (e.code) {
    case LlamafuErrorCode.modelLoadFailed:
      print('Could not load model: ${e.message}');
    case LlamafuErrorCode.outOfMemory:
      print('Not enough memory; try a smaller model or context size');
    default:
      print('Error: ${e.message}');
  }
}
Contributing
Contributions are welcome! See docs/contributing.md for guidelines.
# Run the full test suite
make test
# Format and lint
dart format . && dart analyze
Support
- Issues: GitHub Issues
- Docs: docs/
License
MIT License. See LICENSE for details.
Acknowledgments
Built on the shoulders of giants: