# llama_flutter_android

Run GGUF models on Android with llama.cpp - a simple, MIT-licensed Flutter plugin.
## Features
- Android Only - Optimized specifically for Android
- Simple API - Easy-to-use Dart interface with Pigeon type safety
- Token Streaming - Real-time token generation with EventChannel
- Stop Generation - Cancel text generation mid-process
- 18 Parameters - Complete control: temperature, penalties, mirostat, seed, and more
- 7 Chat Templates - ChatML, Llama-2, Alpaca, Vicuna, Phi, Gemma, Zephyr
- Auto-Detection - Chat templates detected from the model filename (see the sketch after this list)
- Latest llama.cpp - Built on October 2025 llama.cpp (no patches needed)
- ARM64 Optimized - NEON and dot product optimizations enabled
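
The template auto-detection presumably keys off substrings in the model filename. A minimal sketch of the idea, assuming simple substring matching (the plugin's real heuristics may differ):

```dart
// Hypothetical illustration of filename-based template detection;
// the plugin's actual matching logic may differ.
String detectTemplate(String modelPath) {
  final name = modelPath.toLowerCase();
  if (name.contains('llama-2') || name.contains('llama2')) return 'llama2';
  if (name.contains('alpaca')) return 'alpaca';
  if (name.contains('vicuna')) return 'vicuna';
  if (name.contains('phi')) return 'phi';
  if (name.contains('gemma')) return 'gemma';
  if (name.contains('zephyr')) return 'zephyr';
  return 'chatml'; // ChatML is the documented default
}
```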
## Local Chat App

The plugin includes a complete example chat application that demonstrates how to integrate it into a real app. The example showcases:
- Load Model from Local Storage - Select and load GGUF models directly from your device storage
- Real-Time Context Updates - Monitor and adjust context usage in real time with visual indicators
- Advanced Parameters - Full control over model parameters including Temperature Control, Top-P, Top-K, and more
- Auto Unload Model - Automatically unloads the model after a period of inactivity to preserve device resources (see the sketch below)
An APK of the example app is available in the example-app GitHub repository for immediate testing.
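
The auto-unload behavior can be reproduced in your own app with an inactivity timer around the plugin's API. A minimal sketch, assuming a Timer-based approach (the example app's actual implementation may differ):

```dart
import 'dart:async';

import 'package:llama_flutter_android/llama_flutter_android.dart';

/// Hypothetical helper: disposes the controller after a period of inactivity.
class AutoUnloader {
  AutoUnloader(this.controller, {this.timeout = const Duration(minutes: 5)});

  final LlamaController controller;
  final Duration timeout;
  Timer? _timer;

  /// Call on every user interaction to reset the countdown.
  void touch() {
    _timer?.cancel();
    _timer = Timer(timeout, () async {
      await controller.dispose(); // frees the native memory held by llama.cpp
    });
  }

  void cancel() => _timer?.cancel();
}
```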
## Requirements
- Flutter 3.24.0+
- Dart SDK 3.3.0+
- Android API 26+ (Android 8.0)
- NDK r27+ (for 16KB page size support)
## Installation

Add to your pubspec.yaml:

```yaml
dependencies:
  llama_flutter_android: ^0.1.1
```
## Quick Start
### Basic Usage
```dart
import 'dart:async';

import 'package:llama_flutter_android/llama_flutter_android.dart';

// Initialize controller
final controller = LlamaController();

// Load model
await controller.loadModel(
  modelPath: '/path/to/model.gguf',
  nThreads: 4,
  contextSize: 2048,
);

// Generate text with streaming
StreamSubscription? subscription;
subscription = controller.generate(
  prompt: 'Write a story about a robot',
  maxTokens: 512,
  temperature: 0.7,
).listen(
  (token) => print(token), // Print each token as it arrives
  onDone: () => print('Generation complete!'),
  onError: (error) => print('Error: $error'),
);

// Stop generation mid-process (critical for UX!)
await controller.stop();
subscription?.cancel();

// Clean up
await controller.dispose();
```
### Chat Mode with Templates
```dart
// Chat with automatic template formatting
controller.generateChat(
  messages: [
    ChatMessage(role: 'system', content: 'You are a helpful assistant'),
    ChatMessage(role: 'user', content: 'Explain quantum computing'),
  ],
  template: 'chatml', // Auto-detected if null
  temperature: 0.7,
  maxTokens: 1000,
).listen((token) => print(token));
```
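
In a chat UI you typically accumulate the streamed tokens into the assistant's reply and append it to the history once generation completes, so the next turn carries the full conversation. A minimal sketch built on the API above (the `ask` helper is illustrative, not part of the plugin):

```dart
import 'dart:async';

final messages = <ChatMessage>[
  ChatMessage(role: 'system', content: 'You are a helpful assistant'),
];

Future<String> ask(LlamaController controller, String userText) {
  messages.add(ChatMessage(role: 'user', content: userText));
  final buffer = StringBuffer();
  final completer = Completer<String>();
  controller.generateChat(messages: messages, maxTokens: 1000).listen(
    buffer.write, // accumulate streamed tokens
    onDone: () {
      final reply = buffer.toString();
      messages.add(ChatMessage(role: 'assistant', content: reply));
      completer.complete(reply);
    },
    onError: completer.completeError,
  );
  return completer.future;
}
```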
### Advanced Parameters
```dart
// Fine-grained control over generation
controller.generate(
  prompt: 'Explain machine learning',
  maxTokens: 1000,

  // Sampling
  temperature: 0.8,      // Creativity (0.0-2.0)
  topP: 0.9,             // Nucleus sampling
  topK: 40,              // Top-K sampling
  minP: 0.05,            // Minimum probability

  // Penalties (reduce repetition)
  repeatPenalty: 1.2,    // Penalize repeated tokens
  frequencyPenalty: 0.5, // Penalize frequent tokens
  presencePenalty: 0.3,  // Penalize token presence
  repeatLastN: 64,       // Penalty window size

  // Reproducibility
  seed: 42,              // Fixed seed for the same output

  // Mirostat (perplexity control)
  mirostat: 2,           // 0=off, 1=v1, 2=v2
  mirostatTau: 5.0,      // Target perplexity
  mirostatEta: 0.1,      // Learning rate
).listen((token) => print(token));

// Stop anytime!
await controller.stop();
```
## Architecture

```
Flutter App (Dart)
         ↓
llama_flutter_android.dart   (user-facing API)
         ↓
Pigeon generated code        (type-safe bridge)
         ↓
LlamaFlutterAndroidPlugin.kt (Kotlin coroutines)
         ↓
InferenceService.kt          (foreground service)
         ↓
jni_wrapper.cpp              (JNI bridge)
         ↓
llama.cpp                    (native inference)
```
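
Method calls cross the Pigeon bridge, while generated tokens flow back through an EventChannel. Conceptually, the Dart side of the streaming path looks something like this (the channel name is a placeholder, not the plugin's actual identifier):

```dart
import 'package:flutter/services.dart';

// Placeholder channel name; the plugin's real identifier differs.
const tokenChannel = EventChannel('llama_flutter_android/tokens');

/// Exposes native token events as a broadcast stream of strings.
Stream<String> tokenStream() =>
    tokenChannel.receiveBroadcastStream().map((event) => event as String);
```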
## API Reference

### LlamaController
The main interface for working with llama.cpp models.
Methods:
- `loadModel()` - Load a GGUF model file
- `generate()` - Generate text with streaming tokens
- `generateChat()` - Generate chat responses with template formatting
- `stop()` - Stop generation mid-process
- `dispose()` - Clean up resources
Parameters:
- Basic: `maxTokens`, `seed`
- Sampling: `temperature`, `topP`, `topK`, `minP`, `typicalP`
- Penalties: `repeatPenalty`, `frequencyPenalty`, `presencePenalty`, `repeatLastN`, `penalizeNl`
- Mirostat: `mirostat`, `mirostatTau`, `mirostatEta`
- Advanced: `tfsZ`, `locallyTypical`
Supported Chat Templates:
- `chatml` - ChatML format (default)
- `llama2` - Llama-2 format
- `alpaca` - Alpaca format
- `vicuna` - Vicuna format
- `phi` - Phi format
- `gemma` - Gemma format
- `zephyr` - Zephyr format
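
For reference, the default `chatml` template wraps each message in `<|im_start|>` / `<|im_end|>` markers and leaves an open assistant turn for the model to complete (standard ChatML; the exact whitespace the plugin emits is an assumption):

```
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Explain quantum computing<|im_end|>
<|im_start|>assistant
```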
## Release Build Notes

If the release build crashes, try these fixes:

- Enable large heap: add `android:largeHeap="true"` to the `<application>` tag in `AndroidManifest.xml`
- Disable minification and shrinking: set `minifyEnabled false` and `shrinkResources false` in the release build settings of `android/app/build.gradle` (see the snippet below)
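
For the second fix, a minimal sketch of the corresponding release block in `android/app/build.gradle` (Groovy DSL; adapt if your project uses the Kotlin DSL):

```groovy
android {
    buildTypes {
        release {
            // Avoid release-mode crashes by disabling code shrinking
            minifyEnabled false
            shrinkResources false
        }
    }
}
```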
## Contributing
Contributions are welcome! Please read CONTRIBUTING.md for details.
## License
MIT License - see LICENSE file for details.
## Credits

- llama.cpp - the native inference engine this plugin is built on
## Support
- 🐛 Issue Tracker
- 💬 Discussions
- 📦 Example App - Complete working example