# llama_flutter_android

Run GGUF models on Android with llama.cpp - a simple, MIT-licensed Flutter plugin.
## Features
- Android Only - Optimized specifically for Android
- Simple API - Easy-to-use Dart interface with Pigeon type safety
- Token Streaming - Real-time token generation with EventChannel
- Stop Generation - Cancel text generation mid-process
- 18 Parameters - Complete control: temperature, penalties, mirostat, seed, and more
- 7 Chat Templates - ChatML, Llama-2, Alpaca, Vicuna, Phi, Gemma, Zephyr
- Auto-Detection - Chat templates detected from the model filename (see the sketch after this list)
- Latest llama.cpp - Built on October 2025 llama.cpp (no patches needed)
- ARM64 Optimized - NEON and dot product optimizations enabled
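
The template auto-detection presumably keys off substrings in the model filename. A minimal sketch of the idea, assuming simple substring matching (the plugin's real heuristics may differ):

```dart
// Hypothetical illustration of filename-based template detection;
// the plugin's actual matching logic may differ.
String detectTemplate(String modelPath) {
  final name = modelPath.toLowerCase();
  if (name.contains('llama-2') || name.contains('llama2')) return 'llama2';
  if (name.contains('alpaca')) return 'alpaca';
  if (name.contains('vicuna')) return 'vicuna';
  if (name.contains('phi')) return 'phi';
  if (name.contains('gemma')) return 'gemma';
  if (name.contains('zephyr')) return 'zephyr';
  return 'chatml'; // ChatML is the documented default
}
```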
## Local Chat App

The plugin includes a complete example chat application that demonstrates how to integrate it into a real app. The example showcases:
- Load Model from Local Storage - Select and load GGUF models directly from your device storage
- Real-Time Context Updates - Monitor and adjust context usage in real time with visual indicators
- Advanced Parameters - Full control over model parameters including Temperature Control, Top-P, Top-K, and more
- Auto Unload Model - Automatically unloads the model after a period of inactivity to preserve device resources (see the sketch below)
An APK of the example app is available in the example-app GitHub repository for immediate testing.
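
The auto-unload behavior can be reproduced in your own app with an inactivity timer around the plugin's API. A minimal sketch, assuming a Timer-based approach (the example app's actual implementation may differ):

```dart
import 'dart:async';

import 'package:llama_flutter_android/llama_flutter_android.dart';

/// Hypothetical helper: disposes the controller after a period of inactivity.
class AutoUnloader {
  AutoUnloader(this.controller, {this.timeout = const Duration(minutes: 5)});

  final LlamaController controller;
  final Duration timeout;
  Timer? _timer;

  /// Call on every user interaction to reset the countdown.
  void touch() {
    _timer?.cancel();
    _timer = Timer(timeout, () async {
      await controller.dispose(); // frees the native memory held by llama.cpp
    });
  }

  void cancel() => _timer?.cancel();
}
```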
## Requirements
- Flutter 3.24.0+
- Dart SDK 3.3.0+
- Android API 26+ (Android 8.0)
- NDK r27+ (for 16KB page size support)
## Installation

Add to your pubspec.yaml:

```yaml
dependencies:
  llama_flutter_android: ^0.1.1
```
## Quick Start
### Basic Usage
```dart
import 'dart:async';

import 'package:llama_flutter_android/llama_flutter_android.dart';

// Initialize controller
final controller = LlamaController();

// Load model
await controller.loadModel(
  modelPath: '/path/to/model.gguf',
  nThreads: 4,
  contextSize: 2048,
);

// Generate text with streaming
StreamSubscription? subscription;
subscription = controller.generate(
  prompt: 'Write a story about a robot',
  maxTokens: 512,
  temperature: 0.7,
).listen(
  (token) => print(token), // Print each token as it arrives
  onDone: () => print('Generation complete!'),
  onError: (error) => print('Error: $error'),
);

// Stop generation mid-process (critical for UX!)
await controller.stop();
subscription?.cancel();

// Clean up
await controller.dispose();
```
### Chat Mode with Templates
```dart
// Chat with automatic template formatting
controller.generateChat(
  messages: [
    ChatMessage(role: 'system', content: 'You are a helpful assistant'),
    ChatMessage(role: 'user', content: 'Explain quantum computing'),
  ],
  template: 'chatml', // Auto-detected if null
  temperature: 0.7,
  maxTokens: 1000,
).listen((token) => print(token));
```
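
In a chat UI you typically accumulate the streamed tokens into the assistant's reply and append it to the history once generation completes, so the next turn carries the full conversation. A minimal sketch built on the API above (the `ask` helper is illustrative, not part of the plugin):

```dart
import 'dart:async';

final messages = <ChatMessage>[
  ChatMessage(role: 'system', content: 'You are a helpful assistant'),
];

Future<String> ask(LlamaController controller, String userText) {
  messages.add(ChatMessage(role: 'user', content: userText));
  final buffer = StringBuffer();
  final completer = Completer<String>();
  controller.generateChat(messages: messages, maxTokens: 1000).listen(
    buffer.write, // accumulate streamed tokens
    onDone: () {
      final reply = buffer.toString();
      messages.add(ChatMessage(role: 'assistant', content: reply));
      completer.complete(reply);
    },
    onError: completer.completeError,
  );
  return completer.future;
}
```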
### Advanced Parameters
```dart
// Fine-grained control over generation
controller.generate(
  prompt: 'Explain machine learning',
  maxTokens: 1000,

  // Sampling
  temperature: 0.8,      // Creativity (0.0-2.0)
  topP: 0.9,             // Nucleus sampling
  topK: 40,              // Top-K sampling
  minP: 0.05,            // Minimum probability

  // Penalties (reduce repetition)
  repeatPenalty: 1.2,    // Penalize repeated tokens
  frequencyPenalty: 0.5, // Penalize frequent tokens
  presencePenalty: 0.3,  // Penalize token presence
  repeatLastN: 64,       // Penalty window size

  // Reproducibility
  seed: 42,              // Fixed seed for the same output

  // Mirostat (perplexity control)
  mirostat: 2,           // 0=off, 1=v1, 2=v2
  mirostatTau: 5.0,      // Target perplexity
  mirostatEta: 0.1,      // Learning rate
).listen((token) => print(token));

// Stop anytime!
await controller.stop();
```
## Architecture

```
Flutter App (Dart)
         ↓
llama_flutter_android.dart   (user-facing API)
         ↓
Pigeon generated code        (type-safe bridge)
         ↓
LlamaFlutterAndroidPlugin.kt (Kotlin coroutines)
         ↓
InferenceService.kt          (foreground service)
         ↓
jni_wrapper.cpp              (JNI bridge)
         ↓
llama.cpp                    (native inference)
```
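
Method calls cross the Pigeon bridge, while generated tokens flow back through an EventChannel. Conceptually, the Dart side of the streaming path looks something like this (the channel name is a placeholder, not the plugin's actual identifier):

```dart
import 'package:flutter/services.dart';

// Placeholder channel name; the plugin's real identifier differs.
const tokenChannel = EventChannel('llama_flutter_android/tokens');

/// Exposes native token events as a broadcast stream of strings.
Stream<String> tokenStream() =>
    tokenChannel.receiveBroadcastStream().map((event) => event as String);
```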
## API Reference

### LlamaController
The main interface for working with llama.cpp models.
Methods:
- `loadModel()` - Load a GGUF model file
- `generate()` - Generate text with streaming tokens
- `generateChat()` - Generate chat responses with template formatting
- `stop()` - Stop generation mid-process
- `dispose()` - Clean up resources
Parameters:
- Basic: `maxTokens`, `seed`
- Sampling: `temperature`, `topP`, `topK`, `minP`, `typicalP`
- Penalties: `repeatPenalty`, `frequencyPenalty`, `presencePenalty`, `repeatLastN`, `penalizeNl`
- Mirostat: `mirostat`, `mirostatTau`, `mirostatEta`
- Advanced: `tfsZ`, `locallyTypical`
Supported Chat Templates:
- `chatml` - ChatML format (default)
- `llama2` - Llama-2 format
- `alpaca` - Alpaca format
- `vicuna` - Vicuna format
- `phi` - Phi format
- `gemma` - Gemma format
- `zephyr` - Zephyr format
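
For reference, the default `chatml` template wraps each message in `<|im_start|>` / `<|im_end|>` markers and leaves an open assistant turn for the model to complete (standard ChatML; the exact whitespace the plugin emits is an assumption):

```
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Explain quantum computing<|im_end|>
<|im_start|>assistant
```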
## Release Build Notes

If the release build crashes, try these fixes:

- Enable large heap: add `android:largeHeap="true"` to the `<application>` tag in `AndroidManifest.xml`
- Disable minification and shrinking: set `minifyEnabled false` and `shrinkResources false` in the release build settings of `android/app/build.gradle` (see the snippet below)
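
For the second fix, a minimal sketch of the corresponding release block in `android/app/build.gradle` (Groovy DSL; adapt if your project uses the Kotlin DSL):

```groovy
android {
    buildTypes {
        release {
            // Avoid release-mode crashes by disabling code shrinking
            minifyEnabled false
            shrinkResources false
        }
    }
}
```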
## Contributing
Contributions are welcome! Please read CONTRIBUTING.md for details.
## License
MIT License - see LICENSE file for details.
## Credits

- llama.cpp - the native inference engine this plugin is built on
## Support
- 🐛 Issue Tracker
- 💬 Discussions
- 📦 Example App - Complete working example