flutter_mind_local 0.2.1

Platform: Android

On-device LLM inference via llama.cpp. Drop LocalEngine into flutter_mind's FlutterMindClient — same API as every other engine.

flutter_mind_local #

On-device LLM inference for Flutter — powered by llama.cpp.
No API key. No internet. No server. Runs entirely on the user's device.

A companion package to flutter_mind that implements LocalEngine, a drop-in AiEngine for running quantized .gguf models locally.

LocalEngine plugs directly into flutter_mind's FlutterMindClient — the same client you'd use with GeminiEngine or any other engine. No separate API to learn: swap engines, keep the same send/stream/countTokens calls.


Features #

  • Offline inference — model runs fully on-device after the first download
  • Any GGUF model — Qwen, Llama 3, Gemma, Phi, Mistral, DeepSeek, and more
  • Non-blocking — model loading and inference run on background isolates
  • Lifecycle events — know exactly when the model is loading, ready, or thinking
  • Auto chat template — detects the correct prompt format from .gguf metadata
  • Android today, iOS + macOS planned (build files exist but are untested) — single package, no per-platform boilerplate

Installation #

Add the package to your pubspec.yaml:

dependencies:
  flutter_mind_local: ^0.2.1

Then import it:

import 'package:flutter_mind_local/flutter_mind_local.dart';

One import gives you LocalEngine, LocalConfig, LocalModelType, all lifecycle events, and the shared AiEngine interface from flutter_mind.


Getting a model #

Download any quantized GGUF model from HuggingFace. Good starting points for mobile:

| Model | Size | HuggingFace |
| --- | --- | --- |
| Qwen 2.5 0.5B Q4 | ~400 MB | Qwen/Qwen2.5-0.5B-Instruct-GGUF |
| Qwen 2.5 1.5B Q4 | ~1 GB | Qwen/Qwen2.5-1.5B-Instruct-GGUF |
| Llama 3.2 1B Q4 | ~700 MB | meta-llama/Llama-3.2-1B-Instruct-GGUF |
| Gemma 3 1B Q4 | ~700 MB | google/gemma-3-1b-it-GGUF |

Save the .gguf file to the device's app documents directory, then pass its absolute path to LocalConfig.modelPath.
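
A partial or failed download is a common reason for a model not loading, so it can help to sanity-check the file before handing its path to LocalConfig. The sketch below is a hypothetical helper (looksLikeGguf is not part of flutter_mind_local) that only relies on the fact that GGUF files begin with the four ASCII bytes "GGUF". It does not prove the rest of the file is intact.

```dart
import 'dart:io';

/// Returns true when [path] is an existing file whose first four bytes are
/// the GGUF magic ("GGUF" in ASCII).
///
/// Hypothetical helper for illustration; not part of flutter_mind_local.
bool looksLikeGguf(String path) {
  final file = File(path);
  if (!file.existsSync()) return false;

  final raf = file.openSync(); // opens read-only by default
  try {
    final header = raf.readSync(4); // may return fewer bytes for tiny files
    return header.length == 4 && String.fromCharCodes(header) == 'GGUF';
  } finally {
    raf.closeSync();
  }
}
```

A caller would typically run this once after the download finishes and only then pass the absolute path to LocalConfig.modelPath.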


Quick start #

FlutterMindClient is the same client used for every flutter_mind engine, so input validation, beforeSend hooks, and the singleton pattern all come for free, and swapping to a cloud engine later is a one-line change.

import 'package:flutter_mind/flutter_mind.dart';
import 'package:flutter_mind_local/flutter_mind_local.dart';

final ai = FlutterMindClient(
  engine: LocalEngine(
    config: LocalConfig(
      modelPath: '/data/user/0/com.example.app/files/qwen.gguf',
    ),
  ),
);

// The model loads on the first send() call — no explicit init needed.
final response = await ai.send(userMessage: 'Hello! Who are you?');
print(response.text);

// Streaming, token by token
ai.stream(userMessage: 'Tell me a story').listen((chunk) => print(chunk));

Direct LocalEngine usage #

Skip FlutterMindClient if you don't need validation/hooks and want the engine's AiEngine API directly:

import 'package:flutter_mind_local/flutter_mind_local.dart';

final engine = LocalEngine(
  config: LocalConfig(
    modelPath: '/data/user/0/com.example.app/files/qwen.gguf',
  ),
);

final response = await engine.send(userMessage: 'Hello! Who are you?');
print(response.text);

// Always dispose when done to free RAM.
engine.dispose();

Configuration #

All parameters except modelPath are optional — sensible defaults are applied automatically.

LocalConfig(
  // Required
  modelPath: '/path/to/model.gguf',

  // Recommended
  systemPrompt: Prompt(role: 'helpful assistant'),
  modelType: LocalModelType.qwen,      // or .auto to detect from metadata
  stopSequences: ['<|im_end|>'],        // model-specific stop tokens

  // Generation quality
  temperature: 0.7,       // 0.0 = deterministic, 2.0 = very creative
  maxOutputTokens: 512,   // max tokens to generate per response
  topP: 0.9,              // nucleus sampling threshold
  topK: 40,               // top-K sampling pool

  // Memory & performance
  contextSize: 2048,      // KV-cache size in tokens — limits conversation memory
  threads: 4,             // CPU threads — 4 is a safe default for mobile

  // Reproducibility
  seed: -1,               // -1 = random, any other value = reproducible output

  // Repetition control
  repeatPenalty: 1.1,     // penalise repeated tokens — 1.0 = off

  // Events
  onEvent: (event) => print(event),
)
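
The relationship between contextSize and maxOutputTokens can be made concrete with a little arithmetic. The sketch below assumes, as is generally true for llama.cpp, that the prompt (system prompt, history, new message) and the generated reply share one context window. promptTokenBudget is a hypothetical helper, not a package API, and how LocalEngine itself accounts for tokens may differ.

```dart
/// Tokens left for the system prompt, history, and new user message once
/// room for the reply has been reserved.
///
/// Hypothetical helper; assumes prompt and reply share the same context
/// window.
int promptTokenBudget({
  required int contextSize,
  required int maxOutputTokens,
}) {
  if (maxOutputTokens >= contextSize) {
    throw ArgumentError('maxOutputTokens must be smaller than contextSize');
  }
  return contextSize - maxOutputTokens;
}
```

With the values shown above (contextSize: 2048, maxOutputTokens: 512), this leaves 1536 tokens for everything sent to the model, which is why long histories need trimming.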

Model types #

Set modelType to get the correct chat template applied automatically. Use LocalModelType.auto (default) to detect from .gguf metadata.

| Value | Models |
| --- | --- |
| LocalModelType.auto | Auto-detect from metadata (recommended) |
| LocalModelType.qwen | Qwen 2, 2.5 family |
| LocalModelType.llama3 | Llama 3, 3.2 family |
| LocalModelType.gemma | Gemma 1, 2, 3 family |
| LocalModelType.phi | Phi 2, 3, 4 family |
| LocalModelType.mistral | Mistral family |
| LocalModelType.deepSeek | DeepSeek family |
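
To show what a chat template actually does, here is a rough sketch of the ChatML layout used by Qwen 2.x instruct models. It is an illustration only: formatChatMl is a hypothetical function, and the package's real template handling (including multi-turn history) may differ. It also shows why '<|im_end|>' is the natural stop sequence for this family.

```dart
/// Formats a single-turn prompt in ChatML.
///
/// Illustration only; not the package's implementation.
String formatChatMl({required String system, required String user}) {
  return '<|im_start|>system\n$system<|im_end|>\n'
      '<|im_start|>user\n$user<|im_end|>\n'
      // Leaving the assistant turn open prompts the model to write the reply
      // and close it with <|im_end|>.
      '<|im_start|>assistant\n';
}
```

Other families (Llama 3, Gemma, Mistral, and so on) wrap the same roles in different marker strings, so picking the wrong template usually degrades output quality.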

Lifecycle events #

Pass onEvent to LocalConfig to track what the engine is doing:

LocalConfig(
  modelPath: '/path/to/model.gguf',
  onEvent: (event) => switch (event) {
    ModelLoadStarted()                        => setState(() => _loading = true),
    ModelReady(:final loadTime)               => setState(() => _loading = false),
    ModelFailed(:final error)                 => setState(() => _error = error),
    InferenceStarted(:final userMessage)      => setState(() => _thinking = true),
    InferenceCompleted(:final inferenceTime)  => setState(() => _thinking = false),
    InferenceFailed(:final error)             => setState(() => _error = error),
    ContextCleared()                          => debugPrint('memory reset'),
    ModelDisposed()                           => null,
  },
)

| Event | When it fires |
| --- | --- |
| ModelLoadStarted | send() called for the first time; model begins loading |
| ModelReady | Model fully loaded, includes loadTime |
| ModelFailed | Model failed to load, includes error |
| InferenceStarted | Generation begins, includes userMessage |
| InferenceCompleted | Response ready, includes inferenceTime |
| InferenceFailed | Generation failed, includes error |
| ContextCleared | KV-cache was reset due to context overflow |
| ModelDisposed | dispose() was called, model unloaded from RAM |
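
The switch in the example above has no default branch, which suggests the event types form a sealed hierarchy (an assumption, not something this page states). The sketch below uses hypothetical stand-in classes (DemoEvent and its subclasses, deliberately not the package's real types) to show how a Dart 3 sealed class makes such a switch exhaustive, so the compiler flags any event you forget to handle.

```dart
/// Stand-in event hierarchy for illustration; these are NOT the package's
/// classes.
sealed class DemoEvent {}

class DemoLoadStarted extends DemoEvent {}

class DemoReady extends DemoEvent {
  DemoReady(this.loadTime);
  final Duration loadTime;
}

class DemoFailed extends DemoEvent {
  DemoFailed(this.error);
  final Object error;
}

/// Maps an event to a status line. No default case is needed because
/// [DemoEvent] is sealed and every subtype is covered.
String statusLabel(DemoEvent event) => switch (event) {
      DemoLoadStarted() => 'Loading model...',
      DemoReady(:final loadTime) => 'Ready in ${loadTime.inMilliseconds} ms',
      DemoFailed(:final error) => 'Failed: $error',
    };
```

Keeping the mapping in a pure function like this makes it easy to unit-test the UI state separately from the engine.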

Streaming #

stream() generates real token-by-token output — each chunk is sent to your listener as soon as the model produces it, not all at once after the full response finishes:

engine.stream(userMessage: 'Tell me a story').listen((chunk) {
  setState(() => displayText += chunk);
});

This runs on a dedicated long-lived background isolate (separate from send()'s one-shot isolate), so the UI thread is never blocked while tokens stream in. Only one send() or stream() call can run at a time per LocalEngine instance — a second call waits for the first to finish before starting.
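
For readers unfamiliar with the one-shot isolate pattern mentioned above, Dart's Isolate.run is the standard way to express it: spawn an isolate, run one function there, return the result, and exit. The sketch below is a generic illustration, not the package's code; the word-counting workload is only a stand-in for a blocking native call.

```dart
import 'dart:isolate';

/// CPU-bound stand-in for a blocking call such as native inference.
int countWords(String text) =>
    text.split(RegExp(r'\s+')).where((w) => w.isNotEmpty).length;

/// Runs [countWords] on a short-lived background isolate so the calling
/// isolate (the UI thread in a Flutter app) stays responsive.
Future<int> countWordsInBackground(String text) {
  return Isolate.run(() => countWords(text));
}
```

A long-lived isolate, as described for stream(), instead stays alive and sends many messages back over a port, which suits token-by-token output.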

Known limitation: if a stop sequence spans more than one token (e.g. a chat template's end-of-turn marker gets tokenized as two pieces), a partial fragment of it can appear in the stream before the full match is detected and generation stops. This doesn't affect send(), which only ever returns the fully-trimmed final string. In practice this is rare — most chat templates' stop strings tokenize as a single token.
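
If stray stop-sequence fragments matter for your UI, one possible workaround is to filter the stream on the caller's side. The sketch below is a hypothetical helper (holdBackStop is not part of the package) and assumes stream() yields String chunks, as the example above suggests. It withholds any trailing text that could be the start of the stop sequence until the next chunk settles the question.

```dart
import 'dart:math' show min;

/// Re-emits [chunks] but holds back trailing text that could be the start of
/// [stop], and ends the stream once the full stop sequence appears.
///
/// Hypothetical helper; not part of flutter_mind_local.
Stream<String> holdBackStop(Stream<String> chunks, String stop) async* {
  if (stop.isEmpty) {
    yield* chunks; // nothing to filter
    return;
  }
  var pending = '';
  await for (final chunk in chunks) {
    pending += chunk;

    final hit = pending.indexOf(stop);
    if (hit >= 0) {
      if (hit > 0) yield pending.substring(0, hit);
      return; // stop reached: drop it and anything after it
    }

    // Find the longest suffix of `pending` that is a proper prefix of `stop`.
    var keep = 0;
    for (var n = min(pending.length, stop.length - 1); n > 0; n--) {
      if (stop.startsWith(pending.substring(pending.length - n))) {
        keep = n;
        break;
      }
    }
    final safe = pending.substring(0, pending.length - keep);
    if (safe.isNotEmpty) yield safe;
    pending = pending.substring(pending.length - keep);
  }
  if (pending.isNotEmpty) yield pending; // stream ended without a match
}
```

Usage would look like holdBackStop(engine.stream(userMessage: 'Tell me a story'), '<|im_end|>').listen(...). The trade-off is a small delay whenever the output happens to end with a character that also begins the stop sequence.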


Conversation history #

Pass previous messages to maintain context across turns:

// Messages from earlier turns; empty on the first call.
final history = <ChatMessage>[];

final response = await engine.send(
  userMessage: 'What did I just say?',
  history: history,
  maxHistoryMessages: 20, // cap on past messages sent, to avoid context overflow
);

// Record this turn so the next call can see it.
history.add(ChatMessage.user('What did I just say?'));
history.add(ChatMessage.model(response.text));
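
One way to picture what a cap like maxHistoryMessages does is a simple "keep the newest N" window. The sketch below is a generic, hypothetical helper (lastMessages is not a package API), and the package's actual trimming rules may differ, for example in how it treats the system prompt.

```dart
/// Returns the newest [max] items of [history], or all of them when the list
/// is already short enough. The input list is not modified.
///
/// Hypothetical helper for illustration.
List<T> lastMessages<T>(List<T> history, int max) {
  if (max <= 0) return <T>[];
  final start = history.length > max ? history.length - max : 0;
  return history.sublist(start);
}
```

Trimming your own list this way also keeps app memory bounded in long sessions, independent of what is sent to the model.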

Per-call config override #

Override any config field for a specific call without changing the engine default:

final creative = await engine.send(
  userMessage: 'Write me a poem.',
  config: LocalConfig(
    modelPath: engine.defaultConfig.modelPath,
    temperature: 1.4,
    maxOutputTokens: 200,
  ),
);

Thread safety #

LocalEngine is not thread-safe and does not run requests in parallel. If two send() calls are made concurrently, the second waits for the first to finish before starting. Use one engine instance per screen, or isolate it behind a queue.
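
If you do put the engine behind a queue, a chained-Future queue is one minimal way to do it. The sketch below is a hypothetical helper (SerialQueue is not part of the package) that runs tasks strictly in submission order and keeps going after a task fails.

```dart
/// Runs async tasks one after another, in the order they were added.
///
/// Hypothetical helper; not part of flutter_mind_local.
class SerialQueue {
  Future<void> _tail = Future.value();

  /// Schedules [task] to start once every previously added task has finished.
  Future<T> add<T>(Future<T> Function() task) {
    final result = _tail.then((_) => task());
    // Swallow errors on the internal chain so one failed task does not block
    // later ones; the caller still sees the error through `result`.
    _tail = result.then<void>((_) {}, onError: (_) {});
    return result;
  }
}
```

A screen could then call queue.add(() => engine.send(userMessage: '...')) from anywhere, and the app, rather than the engine, decides the order in which requests run.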


Platform notes #

| Platform | Build system | Status |
| --- | --- | --- |
| Android | CMake + FetchContent | ✅ Tested on a real device |
| iOS | Swift Package Manager | 🚧 Build files exist but untested; not yet enabled in pubspec.yaml |
| macOS | Swift Package Manager | 🚧 Build files exist but untested; not yet enabled in pubspec.yaml |
| Linux | - | Not yet supported |
| Windows | - | Not yet supported |

llama.cpp is downloaded and compiled from source at build time — no pre-built binaries, no manual setup required.

License: MIT

Dependencies: ffi, flutter, flutter_mind, meta
