mt_llmkit 0.0.1-beta.1

A Flutter plugin for running Large Language Models locally on Android and iOS via llama.cpp, with real-time streaming inference and performance metrics.

mt_llmkit #

A Flutter plugin for running Large Language Models (LLMs) locally on Android and iOS using llamadart (which wraps llama.cpp). It also provides a unified interface for cloud AI chat providers (OpenAI, Gemini, Claude, Mistral) and a fully local RAG (Retrieval-Augmented Generation) pipeline.


Table of Contents #

  1. Installation
  2. Local LLM Inference (GGUF)
  3. Vision (Multimodal)
  4. Cloud API Providers
  5. Local RAG Pipeline

Installation #

Add to your pubspec.yaml:

dependencies:
  mt_llmkit: ^0.0.1-beta.1

Then run:

flutter pub get

Import the library:

import 'package:mt_llmkit/llmcpp.dart';

Local LLM Inference (GGUF) #

Run quantized GGUF models entirely on-device — no internet connection required.

Quick start #

final model = LocalModel(
  config: LlmConfig(temp: 0.7, nCtx: 2048, nGpuLayers: 4),
);

await model.loadModel('/path/to/model.gguf');

// Stream tokens as they are generated (stdout requires `import 'dart:io';`)
await for (final token in model.sendPrompt('What is Flutter?')) {
  stdout.write(token);
}

// Dispose only after the stream has completed
model.dispose();

Backends #

LocalModel supports two backends controlled by the backend parameter:

| Backend | Class | When to use |
| --- | --- | --- |
| ModelBackend.isolate (default) | LlmModelIsolated | Production. Runs in a Dart Isolate, so there is no UI jank. Required when loading multiple models (e.g. RAG). |
| ModelBackend.inProcess | LlmModelStandard | Lighter startup cost. Supports clean() to reset context without reloading the model. |

// Isolate backend (default)
final isolateModel = LocalModel(backend: ModelBackend.isolate);

// In-process backend
final inProcessModel = LocalModel(backend: ModelBackend.inProcess);
inProcessModel.clean(); // reset context (only available with inProcess)

Note: clean() throws UnsupportedError when called on the isolate backend.

Configuration #

All parameters are optional; sensible defaults are applied automatically.

final config = LlmConfig(
  nGpuLayers: 4,    // GPU layers offloaded (default: 64)
  nCtx: 2048,       // context window in tokens (default: 8192)
  nBatch: 512,      // batch size (default: 4096)
  nPredict: 1024,   // max tokens to generate (default: 8192)
  nThreads: 4,      // CPU threads (default: 6)
  temp: 0.7,        // temperature (default: 0.72)
  topK: 40,         // top-K sampling (default: 64)
  topP: 0.9,        // top-P sampling (default: 0.95)
  penaltyRepeat: 1.1, // repetition penalty (default: 1.1)
);

Generation methods #

Three methods are available on LocalModel (and any LlmInterface implementation):

| Method | Return type | Description |
| --- | --- | --- |
| sendPrompt(prompt) | Stream<String> | Raw token stream. Lowest overhead. |
| sendPromptComplete(prompt) | Future<String> | Waits for the full response and returns it as a single string. |
| sendPromptStream(prompt) | Stream<StreamingChunk> | Recommended. Token stream with live performance metrics. |

// 1. Raw token stream
model.sendPrompt('Hello').listen(stdout.write);

// 2. Full response at once
final response = await model.sendPromptComplete('Hello');
print(response);

// 3. Streaming with live metrics (recommended)
model.sendPromptStream('Hello').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    final m = chunk.metrics!;
    print('\n--- ${m.tokensGenerated} tokens, ${m.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});
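
The relationship between the raw token stream and the single-string result can be illustrated without the plugin. The following is a minimal sketch: `collectTokens` is a hypothetical helper, and the fixed `Stream.fromIterable` stands in for `model.sendPrompt(...)`. It is not how the library implements `sendPromptComplete`, only a plain-Dart picture of joining a token stream.

```dart
/// Collects a token stream into one string.
/// `collectTokens` is a hypothetical helper, not part of mt_llmkit.
Future<String> collectTokens(Stream<String> tokens) => tokens.join();

Future<void> main() async {
  // Stand-in for model.sendPrompt(...): a fixed stream of tokens.
  final tokens = Stream<String>.fromIterable(
    ['Flutter', ' is', ' a', ' UI', ' toolkit.'],
  );

  final text = await collectTokens(tokens);
  print(text); // Flutter is a UI toolkit.
}
```

Joining is convenient for short answers; for long generations, consuming the stream incrementally keeps the UI responsive.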

StreamingChunk fields:

| Field | Type | Description |
| --- | --- | --- |
| text | String | The generated text fragment. |
| isFinal | bool | true on the last chunk of the response. |
| metrics | PerformanceMetrics? | Available on every chunk; most useful on the final one. |

PerformanceMetrics fields: tokensGenerated, durationMs, tokensPerSecond, msPerToken.

Prompt format #

Override the model's built-in chat template by passing a raw GGUF/Jinja template string via LlmConfig.chatTemplate. When chatTemplate is null (the default), the template embedded in the model file is used automatically.

final config = LlmConfig(
  chatTemplate: '<|user|>\n{prompt}<|end|>\n<|assistant|>\n', // custom override
);

Performance metrics #

PerformanceMetrics is updated incrementally with every StreamingChunk:

model.sendPromptStream('Explain Dart isolates in detail.').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.metrics != null) {
    final m = chunk.metrics!;
    // Update UI progress indicator
    print('${m.tokensGenerated} tokens | ${m.tokensPerSecond.toStringAsFixed(2)} t/s');
  }
});
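
To make the metric fields concrete, here is a small sketch of how the derived values could relate to the raw counters. Only the four field names come from this README; the `MetricsSketch` class and its formulas are assumptions for illustration, not the plugin's actual implementation.

```dart
/// Hypothetical stand-in for PerformanceMetrics. The field names mirror the
/// README; the formulas below are an assumption.
class MetricsSketch {
  const MetricsSketch({
    required this.tokensGenerated,
    required this.durationMs,
  });

  final int tokensGenerated;
  final int durationMs;

  /// Tokens per second; 0.0 when no time has elapsed yet.
  double get tokensPerSecond =>
      durationMs == 0 ? 0.0 : tokensGenerated / (durationMs / 1000);

  /// Average milliseconds per token; 0.0 before the first token.
  double get msPerToken =>
      tokensGenerated == 0 ? 0.0 : durationMs / tokensGenerated;
}

void main() {
  const m = MetricsSketch(tokensGenerated: 120, durationMs: 4000);
  print(m.tokensPerSecond.toStringAsFixed(1)); // 30.0
  print(m.msPerToken.toStringAsFixed(1)); // 33.3
}
```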

Vision (Multimodal) #

LocalModel supports multimodal vision models (LLaVA, Gemma 3, Qwen VL, SmolVLM, etc.) that can analyse images alongside a text prompt. Vision requires two GGUF files: the main language model and a multimodal projector (mmproj-*.gguf).

Quick start (vision) #

final model = LocalModel(
  config: LlmConfig(
    mmprojPath: '/path/to/mmproj-model-f16-4B.gguf',
    nGpuLayers: 4,
    nCtx: 4096,
    nPredict: 512,
    temp: 0.3,
  ),
);

await model.loadModel('/path/to/gemma-3-4b-it-q4_0.gguf');

final image = LlamaImageContent(path: '/path/to/photo.jpg');

model.sendPromptStream(
  'Describe what you see in this image. <image>',
  images: [image],
).listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n--- ${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});

Important: The prompt must contain one <image> placeholder per image passed in the list.
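
Because a mismatch between placeholders and images is easy to introduce, it can help to check the prompt before sending it. The sketch below is plain Dart and does not call the plugin; `countImagePlaceholders` and `checkImagePrompt` are hypothetical helper names, and how the library itself reacts to a mismatch is not stated here.

```dart
/// Counts non-overlapping occurrences of the <image> placeholder.
int countImagePlaceholders(String prompt) =>
    '<image>'.allMatches(prompt).length;

/// Throws an ArgumentError when placeholders and images do not match.
/// Hypothetical pre-flight check, not part of mt_llmkit.
void checkImagePrompt(String prompt, int imageCount) {
  final placeholders = countImagePlaceholders(prompt);
  if (placeholders != imageCount) {
    throw ArgumentError(
      'Prompt has $placeholders <image> placeholder(s) '
      'but $imageCount image(s) were passed.',
    );
  }
}

void main() {
  checkImagePrompt('Describe this. <image>', 1); // passes silently

  try {
    checkImagePrompt('Compare <image> and <image>.', 1);
  } on ArgumentError catch (e) {
    print(e.message);
  }
}
```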

Supported models #

Any vision GGUF model that uses the libmtmd multimodal projection layer is supported. Tested models:

| Model | Notes |
| --- | --- |
| Gemma 3 (4B, 12B, 27B) | Recommended. Good accuracy, available in Q4 quantisation. |
| Qwen 2.5 VL | Strong OCR and document understanding. |
| LLaVA 1.5 / 1.6 | Classic CLIP-based architecture. |
| SmolVLM | Compact, fast, good for mobile devices. |

Each model has a corresponding mmproj-*.gguf file available on Hugging Face alongside the main model.

Image input #

LlamaImageContent is created by providing the image file path:

final image = LlamaImageContent(path: '/path/to/photo.jpg');

Supported formats: JPEG, PNG, and any format supported by the underlying libmtmd library.

Generation methods (vision) #

All three standard generation methods accept an optional images parameter:

| Method | Return type | Description |
| --- | --- | --- |
| sendPrompt(prompt, images: images) | Stream<String> | Raw token stream. |
| sendPromptComplete(prompt, images: images) | Future<String> | Full response as a single string. |
| sendPromptStream(prompt, images: images) | Stream<StreamingChunk> | Recommended. Streaming with live performance metrics. |

All three methods throw UnsupportedError if LlmConfig.mmprojPath was not set.

// Full response at once
final response = await model.sendPromptComplete(
  'What objects are visible in this photo? <image>',
  images: [LlamaImageContent(path: '/path/to/photo.jpg')],
);
print(response);

Cloud API Providers #

mt_llmkit includes a unified AIChatProvider interface for four cloud LLM providers. All providers share the same API surface, making it easy to swap backends.

Supported providers #

| Provider | Enum value | Default model |
| --- | --- | --- |
| OpenAI | AIChatProviderType.openai | gpt-4o-mini |
| Google Gemini | AIChatProviderType.gemini | gemini-1.5-flash |
| Anthropic Claude | AIChatProviderType.claude | claude-haiku-4-5-20251001 |
| Mistral AI | AIChatProviderType.mistral | mistral-small-latest |

Basic usage #

Use AIChatProviderFactory to create a provider without importing the concrete class:

// Create and initialize in one step
final provider = await AIChatProviderFactory.createAndInitialize(
  AIChatProviderType.openai,
  {'apiKey': 'sk-...'},
);

final response = await provider.sendMessage('What is Flutter?');
print(response.message.content);
print('Tokens used: ${response.inputTokens} in / ${response.outputTokens} out');

await provider.dispose();

Or manage the lifecycle manually:

final provider = AIChatProviderFactory.create(AIChatProviderType.gemini);
await provider.initialize({'apiKey': 'AIza...'});

final response = await provider.sendMessage('Hello!');
print(response.message.content);

await provider.dispose();

Multi-turn conversations #

Build conversation history with ChatMessage:

final history = <ChatMessage>[
  ChatMessage.system('You are a concise assistant. Reply in three sentences max.'),
  ChatMessage.user('What is a Dart isolate?'),
  ChatMessage.assistant('A Dart isolate is an independent thread of execution...'),
];

// Continue the conversation
final r = await provider.sendMessage('Give me a code example.', history: history);
print(r.message.content);

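// Append the user turn and the reply so the history stays complete
// (assumes sendMessage does not modify the list passed as history)
history
  ..add(ChatMessage.user('Give me a code example.'))
  ..add(r.message);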

For full control, pass a complete message list directly:

final response = await provider.sendChatMessages([
  ChatMessage.system('You are a poet.'),
  ChatMessage.user('Write a haiku about Flutter.'),
]);
print(response.message.content);

Streaming #

All providers support token streaming via sendMessageStream:

await for (final token in provider.sendMessageStream('Tell me a story.')) {
  stdout.write(token);
}

With conversation history:

final history = [ChatMessage.system('Reply only in Spanish.')];

await for (final token in provider.sendMessageStream('Hello!', history: history)) {
  stdout.write(token);
}

Provider-specific config #

OpenAI

await provider.initialize({
  'apiKey': 'sk-...',
  'model': 'gpt-4o',               // optional, default: gpt-4o-mini
  'baseUrl': 'https://...',        // optional, for Azure OpenAI or proxies
});

Google Gemini

await provider.initialize({
  'apiKey': 'AIza...',
  'model': 'gemini-1.5-pro',       // optional
});

Anthropic Claude

await provider.initialize({
  'apiKey': 'sk-ant-...',
  'model': 'claude-opus-4-6',      // optional
});

Mistral AI

await provider.initialize({
  'apiKey': '...',
  'model': 'mistral-large-latest', // optional
});

Error handling #

All providers throw typed exceptions from chat_exceptions.dart:

| Exception | Cause |
| --- | --- |
| APIKeyException | Invalid or missing API key (HTTP 401/403) |
| NetworkException | Transport error (timeout, DNS, connection reset) |
| RateLimitException | Quota exceeded (HTTP 429); contains retryAfter |
| AIChatException | Base class for any other API error |

Network and rate-limit errors are automatically retried up to 3 times with exponential back-off.

try {
  final response = await provider.sendMessage('Hello');
  print(response.message.content);
} on APIKeyException catch (e) {
  print('Check your API key: $e');
} on RateLimitException catch (e) {
  print('Rate limited. Retry after ${e.retryAfter?.inSeconds}s');
} on AIChatException catch (e) {
  print('API error: $e');
}
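
The automatic retry with exponential back-off mentioned above can be pictured with a short, plugin-independent sketch. Everything here is an assumption for illustration: `TransientError` stands in for errors such as NetworkException or RateLimitException, and the 100 ms starting delay and doubling factor are made up, since the library's actual schedule is not documented in this README.

```dart
/// Hypothetical stand-in for a transient failure such as a timeout.
class TransientError implements Exception {}

/// Runs [action], retrying up to [maxRetries] times on TransientError and
/// doubling the delay after each failed attempt.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxRetries = 3,
  Duration initialDelay = const Duration(milliseconds: 100),
}) async {
  var delay = initialDelay;
  var attempt = 0;
  while (true) {
    try {
      return await action();
    } on TransientError {
      if (attempt >= maxRetries) rethrow; // retries exhausted
      attempt++;
      await Future<void>.delayed(delay);
      delay *= 2; // 100 ms, 200 ms, 400 ms, ...
    }
  }
}

Future<void> main() async {
  var calls = 0;
  final result = await retryWithBackoff(() async {
    calls++;
    if (calls < 3) throw TransientError(); // fail twice, then succeed
    return 'ok';
  });
  print('$result after $calls calls'); // ok after 3 calls
}
```

Since the library already retries internally, application code normally only needs the try/catch shown earlier; a wrapper like this is relevant mainly when a different retry policy is wanted.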

Local RAG Pipeline #

RagEngine provides a fully on-device Retrieval-Augmented Generation pipeline. Documents are chunked, embedded with a local embedding model, stored in an in-memory vector store, and retrieved at query time to ground the generation model's response.

How it works #

Ingestion:  Document → TextChunker → chunks → EmbeddingModel → VectorStore
Query:      question → EmbeddingModel → VectorStore.search() → prompt + context → GenerationModel

Both the embedding model and the generation model run inside a single worker isolate. This avoids the threading issues that arise when multiple native model instances share global state, and llamadart handles it transparently.
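
The chunking step of the ingestion flow can be sketched in a few lines of plain Dart. This is not the plugin's TextChunker: `chunkText`, the character-based splitting, and the `chunkSize` and `overlap` parameters are all assumptions chosen to show the general idea of overlapping chunks.

```dart
/// Splits [text] into chunks of at most [chunkSize] characters, where each
/// chunk repeats the last [overlap] characters of the previous one.
/// Hypothetical helper; the real TextChunker may split differently.
List<String> chunkText(String text, {int chunkSize = 200, int overlap = 40}) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw ArgumentError('Require chunkSize > 0 and 0 <= overlap < chunkSize.');
  }
  final chunks = <String>[];
  final step = chunkSize - overlap;
  for (var start = 0; start < text.length; start += step) {
    final end =
        start + chunkSize < text.length ? start + chunkSize : text.length;
    chunks.add(text.substring(start, end));
    if (end == text.length) break; // last chunk reached
  }
  return chunks;
}

void main() {
  final chunks = chunkText('abcdefghij', chunkSize: 4, overlap: 1);
  print(chunks); // [abcd, defg, ghij]
}
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice.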

Quick start #

You need two GGUF models:

  • A generation model (e.g. Llama 3, Mistral) for producing answers.
  • An embedding model (e.g. nomic-embed-text) for vectorising text.

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  genConfig: LlmConfig(temp: 0.3, nCtx: 4096, nGpuLayers: 4),
);

await rag.initialize();

Document ingestion #

Create a Document from text or PDF content and ingest it into the vector store:

// From plain text
final doc = Document.fromText(
  'Flutter is Google\'s UI toolkit for building natively compiled applications...',
  source: 'flutter_intro.txt',
);

// From extracted PDF text (use a PDF parser to extract the string first)
final pdfDoc = Document.fromPdf(
  extractedText,
  source: '/path/to/manual.pdf',
  pageCount: 42,
);

// Ingest — stream progress for UI updates
await for (final progress in rag.ingestDocument(doc)) {
  print('${progress.embeddedChunks}/${progress.totalChunks} — ${progress.currentPreview}');
  // progress.fraction gives 0.0–1.0 for a progress bar
}

print('Indexed: ${rag.indexedSize} chunks across ${rag.documentIds.length} documents');

Manage the index:

// Remove a single document (all its chunks)
await rag.removeDocument(doc.id);

// Clear everything
await rag.clearIndex();

Querying #

// Stream the generated answer
await for (final chunk in rag.query('What is Flutter?')) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s');
  }
}

Optional query parameters:

rag.query(
  'Explain the rendering pipeline.',
  topK: 3,              // number of context chunks (default: 5)
  minSimilarity: 0.35,  // minimum cosine similarity 0.0–1.0 (default: 0.25)
)
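
To show what topK and minSimilarity control, here is a minimal sketch of cosine-similarity retrieval over an in-memory map. It is an illustration only: `cosineSimilarity`, `search`, and the map-based store are hypothetical and do not reflect the plugin's internal VectorStore, although the default values mirror the ones listed above.

```dart
import 'dart:math' as math;

/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
double cosineSimilarity(List<double> a, List<double> b) {
  assert(a.length == b.length);
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Returns up to [topK] (id, similarity) pairs scoring at least
/// [minSimilarity], best match first.
List<MapEntry<String, double>> search(
  Map<String, List<double>> store,
  List<double> query, {
  int topK = 5,
  double minSimilarity = 0.25,
}) {
  final scored = store.entries
      .map((e) => MapEntry(e.key, cosineSimilarity(e.value, query)))
      .where((e) => e.value >= minSimilarity)
      .toList()
    ..sort((x, y) => y.value.compareTo(x.value));
  return scored.take(topK).toList();
}

void main() {
  final store = {
    'a': [1.0, 0.0],
    'b': [0.6, 0.8],
    'c': [0.0, 1.0], // orthogonal to the query, filtered out
  };
  final hits = search(store, [1.0, 0.0], topK: 2);
  print(hits.map((h) => '${h.key}: ${h.value.toStringAsFixed(2)}').join(', '));
  // a: 1.00, b: 0.60
}
```

Raising minSimilarity trades recall for precision: fewer, more relevant chunks reach the prompt, which also leaves more of the context window for the answer.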

Retrieve relevant chunks without generating an answer (useful for inspection):

final results = await rag.findRelevant('rendering pipeline', topK: 3);
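for (final r in results) {
  final text = r.chunk.text;
  // Truncate safely: substring(0, 80) throws a RangeError on shorter chunks
  final preview = text.length > 80 ? text.substring(0, 80) : text;
  print('${(r.similarity * 100).toStringAsFixed(0)}% — $preview');
}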

Index persistence #

Pass indexPath to automatically save and restore the vector index between sessions:

// getApplicationDocumentsDirectory() is provided by the path_provider package
final dir = await getApplicationDocumentsDirectory();

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  indexPath: '${dir.path}/rag_index.json',  // auto-save on every change
);

await rag.initialize(); // loads existing index if the file exists

When indexPath is null, the store is in-memory only and data is lost when the app restarts.
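
The save-and-restore behaviour can be pictured with a small JSON round trip. This is a sketch under stated assumptions: `saveIndex`, `loadIndex`, and the id-to-vector file layout are hypothetical, and the actual format the plugin writes to rag_index.json is not documented here.

```dart
import 'dart:convert';
import 'dart:io';

/// Saves id -> embedding vectors as JSON (hypothetical file layout).
Future<void> saveIndex(String path, Map<String, List<double>> index) async {
  await File(path).writeAsString(jsonEncode(index));
}

/// Loads the index, or returns an empty one if the file does not exist yet.
Future<Map<String, List<double>>> loadIndex(String path) async {
  final file = File(path);
  if (!await file.exists()) return {}; // first run: start empty
  final raw = jsonDecode(await file.readAsString()) as Map<String, dynamic>;
  return raw.map((id, vector) => MapEntry(
        id,
        (vector as List).map((v) => (v as num).toDouble()).toList(),
      ));
}

Future<void> main() async {
  // A temporary directory keeps the demo self-contained.
  final dir = await Directory.systemTemp.createTemp('rag_index_demo');
  final path = '${dir.path}/rag_index.json';

  await saveIndex(path, {'chunk-1': [0.1, 0.2, 0.3]});
  final restored = await loadIndex(path);
  print(restored); // {chunk-1: [0.1, 0.2, 0.3]}

  await dir.delete(recursive: true);
}
```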

Advanced: custom prompt template #

The default template instructs the model to answer only from the provided context. Override it if you need different behaviour:

final rag = RagEngine(
  genModelPath: '...',
  embedModelPath: '...',
  promptTemplate:
    'You are a helpful assistant. Use the context below to answer.\n\n'
    'CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:',
);

The template must contain {context} and {question} placeholders.
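
How the two placeholders are filled can be sketched as follows. `fillTemplate` is a hypothetical helper written for illustration; the plugin's own substitution and validation logic may differ. The single-pass replacement avoids re-substituting a placeholder that happens to appear inside the retrieved context.

```dart
/// Fills a RAG prompt template; throws if a required placeholder is missing.
/// Hypothetical helper, not part of mt_llmkit.
String fillTemplate(
  String template, {
  required String context,
  required String question,
}) {
  final values = {'{context}': context, '{question}': question};
  for (final placeholder in values.keys) {
    if (!template.contains(placeholder)) {
      throw ArgumentError('Template is missing the $placeholder placeholder.');
    }
  }
  // Replace both placeholders in one pass over the template.
  return template.replaceAllMapped(
    RegExp(r'\{context\}|\{question\}'),
    (m) => values[m[0]]!,
  );
}

void main() {
  const template = 'CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:';
  print(fillTemplate(
    template,
    context: 'Flutter is a UI toolkit.',
    question: 'What is Flutter?',
  ));
}
```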

Cleanup #

rag.dispose(); // releases the worker isolate and both models; safe to call twice

Publisher

mobitouch.net (verified publisher)

Topics

#llm #llama-cpp #ai #on-device-ai #flutter-plugin

License

MIT (license)

Dependencies

flutter, flutter_web_plugins, http, llamadart, plugin_platform_interface, web
