flutter_mind_local 0.2.1
On-device LLM inference via llama.cpp. Drop LocalEngine into flutter_mind's FlutterMindClient — same API as every other engine.
flutter_mind_local
On-device LLM inference for Flutter — powered by llama.cpp.
No API key. No internet. No server. Runs entirely on the user's device.
A companion package to flutter_mind that implements
LocalEngine, a drop-in AiEngine for running quantized .gguf models locally.
LocalEngine plugs directly into flutter_mind's FlutterMindClient — the same client
you'd use with GeminiEngine or any other engine. No separate API to learn: swap engines,
keep the same send/stream/countTokens calls.
Features
- Offline inference: the model runs fully on-device after the first download
- Any GGUF model: Qwen, Llama 3, Gemma, Phi, Mistral, DeepSeek, and more
- Non-blocking: model loading and inference run on background isolates
- Lifecycle events: know exactly when the model is loading, ready, or thinking
- Auto chat template: detects the correct prompt format from .gguf metadata
- Android + iOS + macOS: single package, no per-platform boilerplate (only Android is tested so far; see Platform notes for current status)
Installation

    dependencies:
      flutter_mind_local: ^0.2.1

Then import the package:

    import 'package:flutter_mind_local/flutter_mind_local.dart';

One import gives you LocalEngine, LocalConfig, LocalModelType, all lifecycle
events, and the shared AiEngine interface from flutter_mind.
Getting a model
Download any quantized GGUF model from HuggingFace. Good starting points for mobile:
| Model | Size | HuggingFace |
|---|---|---|
| Qwen 2.5 0.5B Q4 | ~400 MB | Qwen/Qwen2.5-0.5B-Instruct-GGUF |
| Qwen 2.5 1.5B Q4 | ~1 GB | Qwen/Qwen2.5-1.5B-Instruct-GGUF |
| Llama 3.2 1B Q4 | ~700 MB | meta-llama/Llama-3.2-1B-Instruct-GGUF |
| Gemma 3 1B Q4 | ~700 MB | google/gemma-3-1b-it-GGUF |
Save the .gguf file to the device's app documents directory, then pass its
absolute path to LocalConfig.modelPath.
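
Before handing a path to LocalConfig.modelPath, it can help to confirm that the file is really there and looks like a GGUF file. The sketch below is not part of flutter_mind_local; `modelPathIn` and `looksLikeGguf` are hypothetical helper names. It relies only on dart:io and on the fact that GGUF files start with the four ASCII bytes "GGUF". In a Flutter app the documents directory would typically come from the path_provider package (`getApplicationDocumentsDirectory()`), which is assumed here and not shown.

```dart
import 'dart:io';

/// Hypothetical helper: builds the absolute path of [fileName] inside
/// [documentsDir].
String modelPathIn(Directory documentsDir, String fileName) =>
    '${documentsDir.path}${Platform.pathSeparator}$fileName';

/// Hypothetical helper: returns true if [path] exists and begins with the
/// 4-byte GGUF magic. Only the header is checked, not the rest of the file.
Future<bool> looksLikeGguf(String path) async {
  final file = File(path);
  if (!await file.exists()) return false;

  final raf = await file.open(); // read-only by default
  try {
    final header = await raf.read(4);
    const magic = [0x47, 0x47, 0x55, 0x46]; // 'G', 'G', 'U', 'F'
    if (header.length != magic.length) return false;
    for (var i = 0; i < magic.length; i++) {
      if (header[i] != magic[i]) return false;
    }
    return true;
  } finally {
    await raf.close();
  }
}
```

A check like this catches an interrupted download or a wrong file name early, before the engine tries to load the model on the first send() call.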
Quick start
With FlutterMindClient (recommended)
The same client used for every flutter_mind engine — input validation, beforeSend
hooks, and the singleton pattern all come for free, and swapping to a cloud engine
later is a one-line change.

    import 'package:flutter_mind/flutter_mind.dart';
    import 'package:flutter_mind_local/flutter_mind_local.dart';

    final ai = FlutterMindClient(
      engine: LocalEngine(
        config: LocalConfig(
          modelPath: '/data/user/0/com.example.app/files/qwen.gguf',
        ),
      ),
    );

    // The model loads on the first send() call; no explicit init is needed.
    final response = await ai.send(userMessage: 'Hello! Who are you?');
    print(response.text);

    // Streaming, token by token
    ai.stream(userMessage: 'Tell me a story').listen((chunk) => print(chunk));

Direct LocalEngine usage
Skip FlutterMindClient if you don't need validation/hooks and want the engine's
AiEngine API directly:

    import 'package:flutter_mind_local/flutter_mind_local.dart';

    final engine = LocalEngine(
      config: LocalConfig(
        modelPath: '/data/user/0/com.example.app/files/qwen.gguf',
      ),
    );

    final response = await engine.send(userMessage: 'Hello! Who are you?');
    print(response.text);

    // Always dispose when done to free RAM.
    engine.dispose();

Configuration
All parameters except modelPath are optional — sensible defaults are applied automatically.

    LocalConfig(
      // Required
      modelPath: '/path/to/model.gguf',

      // Recommended
      systemPrompt: Prompt(role: 'helpful assistant'),
      modelType: LocalModelType.qwen, // or .auto to detect from metadata
      stopSequences: ['<|im_end|>'], // model-specific stop tokens

      // Generation quality
      temperature: 0.7, // 0.0 = deterministic, 2.0 = very creative
      maxOutputTokens: 512, // max tokens to generate per response
      topP: 0.9, // nucleus sampling threshold
      topK: 40, // top-K sampling pool
      repeatPenalty: 1.1, // penalise repeated tokens; 1.0 = off

      // Memory & performance
      contextSize: 2048, // KV-cache size in tokens; limits conversation memory
      threads: 4, // CPU threads; 4 is a safe default for mobile

      // Reproducibility
      seed: -1, // -1 = random, any other value = reproducible output

      // Events
      onEvent: (event) => print(event),
    )

Model types
Set modelType to get the correct chat template applied automatically.
Use LocalModelType.auto (default) to detect from .gguf metadata.
| Value | Models |
|---|---|
| LocalModelType.auto | Auto-detect from metadata (recommended) |
| LocalModelType.qwen | Qwen 2, 2.5 family |
| LocalModelType.llama3 | Llama 3, 3.2 family |
| LocalModelType.gemma | Gemma 1, 2, 3 family |
| LocalModelType.phi | Phi 2, 3, 4 family |
| LocalModelType.mistral | Mistral family |
| LocalModelType.deepSeek | DeepSeek family |
Lifecycle events
Pass onEvent to LocalConfig to track what the engine is doing:

    LocalConfig(
      modelPath: '/path/to/model.gguf',
      onEvent: (event) => switch (event) {
        ModelLoadStarted() => setState(() => _loading = true),
        ModelReady(:final loadTime) => setState(() => _loading = false),
        ModelFailed(:final error) => setState(() => _error = error),
        InferenceStarted(:final userMessage) => setState(() => _thinking = true),
        InferenceCompleted(:final inferenceTime) => setState(() => _thinking = false),
        InferenceFailed(:final error) => setState(() => _error = error),
        ContextCleared() => debugPrint('memory reset'),
        ModelDisposed() => null,
      },
    )

| Event | When it fires |
|---|---|
| ModelLoadStarted | send() called for the first time; model begins loading |
| ModelReady | Model fully loaded, includes loadTime |
| ModelFailed | Model failed to load, includes error |
| InferenceStarted | Generation begins, includes userMessage |
| InferenceCompleted | Response ready, includes inferenceTime |
| InferenceFailed | Generation failed, includes error |
| ContextCleared | KV-cache was reset due to context overflow |
| ModelDisposed | dispose() was called, model unloaded from RAM |
Streaming
stream() generates real token-by-token output — each chunk is sent to your listener
as soon as the model produces it, not all at once after the full response finishes:

    engine.stream(userMessage: 'Tell me a story').listen((chunk) {
      setState(() => displayText += chunk);
    });

This runs on a dedicated long-lived background isolate (separate from send()'s
one-shot isolate), so the UI thread is never blocked while tokens stream in. Only
one send() or stream() call can run at a time per LocalEngine instance — a
second call waits for the first to finish before starting.
Known limitation: if a stop sequence spans more than one token (e.g. a chat
template's end-of-turn marker gets tokenized as two pieces), a partial fragment of
it can appear in the stream before the full match is detected and generation stops.
This doesn't affect send(), which only ever returns the fully-trimmed final string.
In practice this is rare — most chat templates' stop strings tokenize as a single token.
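
If a partial stop marker showing up in the UI matters for your app, one possible workaround is to filter the stream on the app side. The sketch below is an illustration, not something the package provides: `hideStopSequence` is a hypothetical name, and it assumes the chunks are plain strings, as in the examples above. It holds back any trailing text that could still turn out to be the start of the stop sequence, and ends the stream once the full sequence has appeared.

```dart
import 'dart:async';

/// Hypothetical helper: forwards [chunks], but withholds any trailing text
/// that could be the beginning of [stop]. Ends as soon as [stop] is seen.
Stream<String> hideStopSequence(Stream<String> chunks, String stop) async* {
  var pending = ''; // text received but not yet safe to show

  await for (final chunk in chunks) {
    pending += chunk;

    final hit = pending.indexOf(stop);
    if (hit >= 0) {
      // Full stop sequence found: emit what precedes it, then finish.
      if (hit > 0) yield pending.substring(0, hit);
      return;
    }

    // Find the longest suffix of `pending` that is a proper prefix of `stop`.
    var keep = 0;
    final maxKeep =
        pending.length < stop.length - 1 ? pending.length : stop.length - 1;
    for (var n = maxKeep; n > 0; n--) {
      if (stop.startsWith(pending.substring(pending.length - n))) {
        keep = n;
        break;
      }
    }

    // Everything before that suffix can no longer be part of the marker.
    final safe = pending.substring(0, pending.length - keep);
    if (safe.isNotEmpty) yield safe;
    pending = pending.substring(pending.length - keep);
  }

  // Source ended without the stop sequence: flush what was held back.
  if (pending.isNotEmpty) yield pending;
}
```

Under those assumptions it could wrap the engine's output as `hideStopSequence(engine.stream(userMessage: 'Tell me a story'), '<|im_end|>')`. The trade-off is a small delay: text that merely resembles the start of the marker is shown one chunk later.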
Conversation history
Pass previous messages to maintain context across turns:

    final history = <ChatMessage>[];

    final response = await engine.send(
      userMessage: 'What did I just say?',
      history: history,
      maxHistoryMessages: 20, // keep only the last 20 messages to avoid context overflow
    );

    // Record both sides of the turn so the next call can see them.
    history.add(ChatMessage.user('What did I just say?'));
    history.add(ChatMessage.model(response.text));

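
To make the effect of a message cap concrete, here is a minimal sketch of trimming a history list to its most recent entries. It is an assumption about what a limit like maxHistoryMessages does, not the package's implementation; `Msg` is a hypothetical stand-in for ChatMessage and `lastMessages` is a hypothetical helper.

```dart
/// Hypothetical stand-in for a chat message; the real ChatMessage type
/// comes from flutter_mind.
typedef Msg = ({String role, String text});

/// Hypothetical helper: returns a new list with at most the last
/// [maxMessages] entries of [history], oldest first.
List<Msg> lastMessages(List<Msg> history, int maxMessages) {
  if (maxMessages <= 0) return const [];
  if (history.length <= maxMessages) return List.of(history);
  return history.sublist(history.length - maxMessages);
}
```

Trimming the list you keep in your own state the same way stops it from growing without bound over a long conversation.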
Per-call config override
Override any config field for a specific call without changing the engine default:

    final creative = await engine.send(
      userMessage: 'Write me a poem.',
      config: LocalConfig(
        modelPath: engine.defaultConfig.modelPath,
        temperature: 1.4,
        maxOutputTokens: 200,
      ),
    );

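Thread safety
LocalEngine processes one call at a time and is not designed for parallel use. If two
send() or stream() calls are made concurrently on the same instance, the second waits
for the first to finish before starting. Use one engine instance per screen, or put a
shared instance behind a queue so the order of requests is explicit.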
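
One way to put a shared engine behind a queue is a small future chain. The sketch below is a generic pattern under that assumption, not an API of flutter_mind_local; `SerialQueue` is a hypothetical class name.

```dart
import 'dart:async';

/// Hypothetical helper: runs submitted async tasks strictly one after
/// another, in the order they were added.
class SerialQueue {
  Future<void> _tail = Future<void>.value();

  /// Schedules [task] to start once every earlier task has completed.
  /// The returned future completes with the task's own result or error.
  Future<T> add<T>(Future<T> Function() task) {
    final result = _tail.then((_) => task());
    // Swallow errors on the internal chain only, so one failed task
    // does not block the tasks queued after it.
    _tail = result.then<void>((_) {}, onError: (_) {});
    return result;
  }
}
```

With this in place, every caller would go through something like `queue.add(() => engine.send(userMessage: text))`, which keeps the ordering decision in one spot in your own code.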
Platform notes
| Platform | Build system | Status |
|---|---|---|
| Android | CMake + FetchContent | ✅ Tested on a real device |
| iOS | Swift Package Manager | 🚧 Build files exist, untested — not yet enabled in pubspec.yaml |
| macOS | Swift Package Manager | 🚧 Build files exist, untested — not yet enabled in pubspec.yaml |
| Linux | n/a | Not yet supported |
| Windows | n/a | Not yet supported |
llama.cpp is downloaded and compiled from source at build time — no pre-built binaries, no manual setup required.