mt_llmkit

A Flutter plugin for running Large Language Models (LLMs) locally on Android and iOS using llamadart (which wraps llama.cpp). It also provides a unified interface for cloud AI chat providers (OpenAI, Gemini, Claude, Mistral) and a fully local RAG (Retrieval-Augmented Generation) pipeline.


Table of Contents

  1. Installation
  2. Local LLM Inference (GGUF)
  3. Vision (Multimodal)
  4. Cloud API Providers
  5. Local RAG Pipeline

Installation

Add to your pubspec.yaml:

dependencies:
  mt_llmkit: ^0.0.1-beta.1

Then run:

flutter pub get

Import the library:

import 'package:mt_llmkit/mt_llmkit.dart';

Local LLM Inference (GGUF)

Run quantized GGUF models entirely on-device — no internet connection required.

Quick start

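final model = LocalModel(
  config: LlmConfig(temp: 0.7, nCtx: 2048, nGpuLayers: 4),
);

await model.loadModel('/path/to/model.gguf');

// Stream tokens as they are generated (stdout requires `import 'dart:io';`).
// Awaiting the stream ensures generation has finished before dispose() runs.
await for (final token in model.sendPrompt('What is Flutter?')) {
  stdout.write(token);
}

// Release the model once the response is complete
model.dispose();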

Backends

LocalModel supports two backends controlled by the backend parameter:

| Backend | Class | When to use |
| --- | --- | --- |
| ModelBackend.isolate (default) | LlmModelIsolated | Production. Runs in a Dart Isolate, so no UI jank. Required when loading multiple models (e.g. RAG). |
| ModelBackend.inProcess | LlmModelStandard | Lighter startup cost. Supports clean() to reset context without reloading the model. |

// Isolate backend (default)
final isolateModel = LocalModel(backend: ModelBackend.isolate);

// In-process backend
final inProcessModel = LocalModel(backend: ModelBackend.inProcess);
inProcessModel.clean(); // reset context; only available with inProcess

Note: clean() throws UnsupportedError when called on the isolate backend.

Configuration

All parameters are optional; sensible defaults are applied automatically.

final config = LlmConfig(
  nGpuLayers: 4,    // GPU layers offloaded (default: 64)
  nCtx: 2048,       // context window in tokens (default: 8192)
  nBatch: 512,      // batch size (default: 4096)
  nPredict: 1024,   // max tokens to generate (default: 8192)
  nThreads: 4,      // CPU threads (default: 6)
  temp: 0.7,        // temperature (default: 0.72)
  topK: 40,         // top-K sampling (default: 64)
  topP: 0.9,        // top-P sampling (default: 0.95)
  penaltyRepeat: 1.1, // repetition penalty (default: 1.1)
);

Generation methods

Three methods are available on LocalModel (and any LlmInterface implementation):

| Method | Return type | Description |
| --- | --- | --- |
| sendPrompt(prompt) | Stream<String> | Raw token stream. Lowest overhead. |
| sendPromptComplete(prompt) | Future<String> | Waits for the full response and returns it as a single string. |
| sendPromptStream(prompt) | Stream<StreamingChunk> | Recommended. Token stream with live performance metrics. |

// 1. Raw token stream
model.sendPrompt('Hello').listen(stdout.write);

// 2. Full response at once
final response = await model.sendPromptComplete('Hello');
print(response);

// 3. Streaming with live metrics (recommended)
model.sendPromptStream('Hello').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    final m = chunk.metrics!;
    print('\n--- ${m.tokensGenerated} tokens, ${m.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});

StreamingChunk fields:

| Field | Type | Description |
| --- | --- | --- |
| text | String | The generated text fragment. |
| isFinal | bool | true on the last chunk of the response. |
| metrics | PerformanceMetrics? | Available on every chunk; most useful on the final one. |

PerformanceMetrics fields: tokensGenerated, durationMs, tokensPerSecond, msPerToken.
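
The four fields are related by simple arithmetic. The sketch below shows one plausible way the two derived rates could follow from the two raw counters. MetricsSketch is a hypothetical stand-in written for illustration, not the package's PerformanceMetrics class, and the zero-guards are an assumption.

```dart
/// Hypothetical stand-in for illustration; not the package's PerformanceMetrics.
class MetricsSketch {
  MetricsSketch({required this.tokensGenerated, required this.durationMs});

  final int tokensGenerated;
  final int durationMs;

  /// Tokens per second; 0 when no time has elapsed yet.
  double get tokensPerSecond =>
      durationMs == 0 ? 0.0 : tokensGenerated / (durationMs / 1000);

  /// Average milliseconds per token; 0 when nothing has been generated yet.
  double get msPerToken =>
      tokensGenerated == 0 ? 0.0 : durationMs / tokensGenerated;
}

void main() {
  final m = MetricsSketch(tokensGenerated: 120, durationMs: 4000);
  print('${m.tokensPerSecond.toStringAsFixed(1)} t/s'); // 30.0 t/s
  print('${m.msPerToken.toStringAsFixed(1)} ms/token'); // 33.3 ms/token
}
```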

Prompt format

Override the model's built-in chat template by passing a raw GGUF/Jinja template string via LlmConfig.chatTemplate. When chatTemplate is null (the default), the template embedded in the model file is used automatically.

final config = LlmConfig(
  chatTemplate: '<|user|>\n{prompt}<|end|>\n<|assistant|>\n', // custom override
);

Performance metrics

PerformanceMetrics is updated incrementally with every StreamingChunk:

model.sendPromptStream('Explain Dart isolates in detail.').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.metrics != null) {
    final m = chunk.metrics!;
    // Update UI progress indicator
    print('${m.tokensGenerated} tokens | ${m.tokensPerSecond.toStringAsFixed(2)} t/s');
  }
});

Vision (Multimodal)

LocalModel supports multimodal vision models (LLaVA, Gemma 3, Qwen VL, SmolVLM, etc.) that can analyse images alongside a text prompt. Vision requires two GGUF files: the main language model and a multimodal projector (mmproj-*.gguf).

Quick start (vision)

final model = LocalModel(
  config: LlmConfig(
    mmprojPath: '/path/to/mmproj-model-f16-4B.gguf',
    nGpuLayers: 4,
    nCtx: 4096,
    nPredict: 512,
    temp: 0.3,
  ),
);

await model.loadModel('/path/to/gemma-3-4b-it-q4_0.gguf');

final image = LlamaImageContent(path: '/path/to/photo.jpg');

model.sendPromptStream(
  'Describe what you see in this image. <image>',
  images: [image],
).listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n--- ${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});

Important: The prompt must contain one <image> placeholder per image passed in the list.
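
A mismatch between placeholders and images is easy to introduce when prompts are built dynamically. The helper below is a minimal sketch of a pre-flight check an app could run before calling the model. Both function names are hypothetical and not part of mt_llmkit, and how the package itself reacts to a mismatch is not covered here.

```dart
/// Counts literal `<image>` placeholders in [prompt].
int countImagePlaceholders(String prompt) =>
    '<image>'.allMatches(prompt).length;

/// Throws an ArgumentError when the number of placeholders in [prompt]
/// differs from [imageCount]. Hypothetical helper, not a package API.
void checkImagePlaceholders(String prompt, int imageCount) {
  final found = countImagePlaceholders(prompt);
  if (found != imageCount) {
    throw ArgumentError(
      'Prompt has $found <image> placeholder(s) but '
      '$imageCount image(s) were passed.',
    );
  }
}

void main() {
  // Two placeholders, two images: no error.
  checkImagePlaceholders('Compare <image> with <image>.', 2);

  // No placeholder, one image: the check fails.
  try {
    checkImagePlaceholders('Describe this picture.', 1);
  } on ArgumentError catch (e) {
    print(e.message);
  }
}
```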

Supported models

Any vision GGUF model that uses the libmtmd multimodal projection layer is supported. Tested models:

| Model | Notes |
| --- | --- |
| Gemma 3 (4B, 12B, 27B) | Recommended. Good accuracy, available in Q4 quantisation. |
| Qwen 2.5 VL | Strong OCR and document understanding. |
| LLaVA 1.5 / 1.6 | Classic CLIP-based architecture. |
| SmolVLM | Compact, fast, good for mobile devices. |

Each model has a corresponding mmproj-*.gguf file available on Hugging Face alongside the main model.

Image input

LlamaImageContent is created by providing the image file path:

final image = LlamaImageContent(path: '/path/to/photo.jpg');

Supported formats: JPEG, PNG, and any format supported by the underlying libmtmd library.

Generation methods (vision)

All three standard generation methods accept an optional images parameter:

| Method | Return type | Description |
| --- | --- | --- |
| sendPrompt(prompt, images: images) | Stream<String> | Raw token stream. |
| sendPromptComplete(prompt, images: images) | Future<String> | Full response as a single string. |
| sendPromptStream(prompt, images: images) | Stream<StreamingChunk> | Recommended. Streaming with live performance metrics. |

All three methods throw UnsupportedError if images are passed but LlmConfig.mmprojPath was not set.

// Full response at once
final response = await model.sendPromptComplete(
  'What objects are visible in this photo? <image>',
  images: [LlamaImageContent(path: '/path/to/photo.jpg')],
);
print(response);

Cloud API Providers

mt_llmkit includes a unified AIChatProvider interface for four cloud LLM providers. All providers share the same API surface, making it easy to swap backends.

Supported providers

| Provider | Enum value | Default model |
| --- | --- | --- |
| OpenAI | AIChatProviderType.openai | gpt-4o-mini |
| Google Gemini | AIChatProviderType.gemini | gemini-1.5-flash |
| Anthropic Claude | AIChatProviderType.claude | claude-haiku-4-5-20251001 |
| Mistral AI | AIChatProviderType.mistral | mistral-small-latest |

Basic usage

Use AIChatProviderFactory to create a provider without importing the concrete class:

// Create and initialize in one step
final provider = await AIChatProviderFactory.createAndInitialize(
  AIChatProviderType.openai,
  {'apiKey': 'sk-...'},
);

final response = await provider.sendMessage('What is Flutter?');
print(response.message.content);
print('Tokens used: ${response.inputTokens} in / ${response.outputTokens} out');

await provider.dispose();

Or manage the lifecycle manually:

final provider = AIChatProviderFactory.create(AIChatProviderType.gemini);
await provider.initialize({'apiKey': 'AIza...'});

final response = await provider.sendMessage('Hello!');
print(response.message.content);

await provider.dispose();

Multi-turn conversations

Build conversation history with ChatMessage:

final history = <ChatMessage>[
  ChatMessage.system('You are a concise assistant. Reply in three sentences max.'),
  ChatMessage.user('What is a Dart isolate?'),
  ChatMessage.assistant('A Dart isolate is an independent thread of execution...'),
];

// Continue the conversation
final r = await provider.sendMessage('Give me a code example.', history: history);
print(r.message.content);

// Append the reply to keep history growing
history.add(r.message);

For full control, pass a complete message list directly:

final response = await provider.sendChatMessages([
  ChatMessage.system('You are a poet.'),
  ChatMessage.user('Write a haiku about Flutter.'),
]);
print(response.message.content);

Streaming

All providers support token streaming via sendMessageStream:

await for (final token in provider.sendMessageStream('Tell me a story.')) {
  stdout.write(token);
}

With conversation history:

final history = [ChatMessage.system('Reply only in Spanish.')];

await for (final token in provider.sendMessageStream('Hello!', history: history)) {
  stdout.write(token);
}
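
A token stream yields fragments only, so an app that wants to keep the full reply (for example, to append it to the conversation history) has to accumulate the tokens itself. The sketch below shows one way to do that with plain Dart. collectTokens is a hypothetical helper, and Stream.fromIterable stands in for a provider stream.

```dart
import 'dart:async';

/// Forwards every token to [onToken] as it arrives and returns the
/// concatenated text once the stream closes. Hypothetical helper.
Future<String> collectTokens(
  Stream<String> tokens, {
  void Function(String token)? onToken,
}) async {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    onToken?.call(token);
    buffer.write(token);
  }
  return buffer.toString();
}

Future<void> main() async {
  // Stand-in for provider.sendMessageStream(...).
  final fakeStream = Stream.fromIterable(['Hola', ', ', 'mundo', '!']);
  final reply = await collectTokens(fakeStream);
  print(reply); // Hola, mundo!
}
```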

Provider-specific config

OpenAI

await provider.initialize({
  'apiKey': 'sk-...',
  'model': 'gpt-4o',               // optional, default: gpt-4o-mini
  'baseUrl': 'https://...',        // optional, for Azure OpenAI or proxies
});

Google Gemini

await provider.initialize({
  'apiKey': 'AIza...',
  'model': 'gemini-1.5-pro',       // optional
});

Anthropic Claude

await provider.initialize({
  'apiKey': 'sk-ant-...',
  'model': 'claude-opus-4-6',      // optional
});

Mistral AI

await provider.initialize({
  'apiKey': '...',
  'model': 'mistral-large-latest', // optional
});

Error handling

All providers throw typed exceptions from chat_exceptions.dart:

| Exception | Cause |
| --- | --- |
| APIKeyException | Invalid or missing API key (HTTP 401/403) |
| NetworkException | Transport error (timeout, DNS, connection reset) |
| RateLimitException | Quota exceeded (HTTP 429); contains retryAfter |
| AIChatException | Base class for any other API error |

Network and rate-limit errors are automatically retried up to 3 times with exponential back-off.

try {
  final response = await provider.sendMessage('Hello');
  print(response.message.content);
} on APIKeyException catch (e) {
  print('Check your API key: $e');
} on RateLimitException catch (e) {
  print('Rate limited. Retry after ${e.retryAfter?.inSeconds}s');
} on AIChatException catch (e) {
  print('API error: $e');
}
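
The automatic retry described above follows a common pattern, which the sketch below illustrates in plain Dart. This is not the package's implementation. retryWithBackoff, its parameters, the 500 ms starting delay, and the doubling factor are all assumptions chosen for illustration, with TimeoutException standing in for a transport error.

```dart
import 'dart:async';

/// Runs [action], retrying up to [maxRetries] times when [shouldRetry]
/// accepts the error. The delay doubles after each failed attempt.
/// Hypothetical helper; names and defaults are illustrative only.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxRetries = 3,
  Duration initialDelay = const Duration(milliseconds: 500),
  bool Function(Object error)? shouldRetry,
}) async {
  var delay = initialDelay;
  for (var attempt = 0; ; attempt++) {
    try {
      return await action();
    } catch (error) {
      final retryable = shouldRetry?.call(error) ?? true;
      // Give up on non-retryable errors or once the retry budget is spent.
      if (!retryable || attempt >= maxRetries) rethrow;
      await Future<void>.delayed(delay);
      delay *= 2;
    }
  }
}

Future<void> main() async {
  var calls = 0;
  final result = await retryWithBackoff<String>(
    () async {
      calls++;
      if (calls < 3) throw TimeoutException('simulated network timeout');
      return 'ok';
    },
    initialDelay: const Duration(milliseconds: 10),
    shouldRetry: (e) => e is TimeoutException,
  );
  print('$result after $calls attempts'); // ok after 3 attempts
}
```

Errors that the predicate rejects (an invalid API key, for instance) are rethrown immediately, since retrying them cannot succeed.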

Local RAG Pipeline

RagEngine provides a fully on-device Retrieval-Augmented Generation pipeline. Documents are chunked, embedded with a local embedding model, stored in an in-memory vector store, and retrieved at query time to ground the generation model's response.

How it works

Ingestion:  Document → TextChunker → chunks → EmbeddingModel → VectorStore
Query:      question → EmbeddingModel → VectorStore.search() → prompt + context → GenerationModel

Both the embedding model and the generation model run inside a single worker isolate. This avoids the threading issues that arise when multiple native model instances share global state, and llamadart handles it transparently.
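
To make the first ingestion step concrete, the sketch below splits text into fixed-size chunks with a small overlap, so that sentences cut at a boundary still appear whole in one chunk. This is only an illustration of the general technique: the strategy, the chunkText name, and the default sizes are assumptions, and the package's TextChunker may work differently.

```dart
/// Splits [text] into chunks of at most [chunkSize] characters, with
/// [overlap] characters shared between consecutive chunks.
/// Hypothetical helper; not the package's TextChunker.
List<String> chunkText(String text, {int chunkSize = 200, int overlap = 40}) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw ArgumentError('Require chunkSize > 0 and 0 <= overlap < chunkSize.');
  }
  final chunks = <String>[];
  final step = chunkSize - overlap;
  for (var start = 0; start < text.length; start += step) {
    final end =
        start + chunkSize < text.length ? start + chunkSize : text.length;
    chunks.add(text.substring(start, end));
    if (end == text.length) break; // the last chunk reached the end of the text
  }
  return chunks;
}

void main() {
  final chunks = chunkText('abcdefghij', chunkSize: 4, overlap: 1);
  print(chunks); // [abcd, defg, ghij]
}
```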

Quick start

You need two GGUF models:

  • A generation model (e.g. Llama 3, Mistral) for producing answers.
  • An embedding model (e.g. nomic-embed-text) for vectorising text.

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  genConfig: LlmConfig(temp: 0.3, nCtx: 4096, nGpuLayers: 4),
);

await rag.initialize();

Document ingestion

Create a Document from text or PDF content and ingest it into the vector store:

// From plain text
final doc = Document.fromText(
  'Flutter is Google\'s UI toolkit for building natively compiled applications...',
  source: 'flutter_intro.txt',
);

// From extracted PDF text (use a PDF parser to extract the string first)
final pdfDoc = Document.fromPdf(
  extractedText,
  source: '/path/to/manual.pdf',
  pageCount: 42,
);

// Ingest — stream progress for UI updates
await for (final progress in rag.ingestDocument(doc)) {
  print('${progress.embeddedChunks}/${progress.totalChunks} — ${progress.currentPreview}');
  // progress.fraction gives 0.0–1.0 for a progress bar
}

print('Indexed: ${rag.indexedSize} chunks across ${rag.documentIds.length} documents');

Manage the index:

// Remove a single document (all its chunks)
await rag.removeDocument(doc.id);

// Clear everything
await rag.clearIndex();

Querying

// Stream the generated answer
await for (final chunk in rag.query('What is Flutter?')) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s');
  }
}

Optional query parameters:

await for (final chunk in rag.query(
  'Explain the rendering pipeline.',
  topK: 3,              // number of context chunks (default: 5)
  minSimilarity: 0.35,  // minimum cosine similarity 0.0–1.0 (default: 0.25)
)) {
  stdout.write(chunk.text);
}

Retrieve relevant chunks without generating an answer (useful for inspection):

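final results = await rag.findRelevant('rendering pipeline', topK: 3);
for (final r in results) {
  final text = r.chunk.text;
  // Clamp the preview: substring(0, 80) throws a RangeError on shorter chunks
  final preview = text.length > 80 ? text.substring(0, 80) : text;
  print('${(r.similarity * 100).toStringAsFixed(0)}% - $preview');
}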
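
The topK and minSimilarity parameters can be understood through a small in-memory search. The sketch below scores every stored vector against the query by cosine similarity, drops those below the threshold, and keeps the best topK. It is a minimal illustration under assumed names (cosineSimilarity, search, a plain Map as the store), and the package's vector store may be implemented differently.

```dart
import 'dart:math' as math;

/// Cosine similarity of two equal-length vectors; 0 for a zero vector.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have equal length.');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Returns up to [topK] entries of [index] that score at least
/// [minSimilarity] against [query], best match first. Hypothetical helper.
List<MapEntry<String, double>> search(
  Map<String, List<double>> index,
  List<double> query, {
  int topK = 5,
  double minSimilarity = 0.25,
}) {
  final scored = [
    for (final e in index.entries)
      MapEntry(e.key, cosineSimilarity(e.value, query)),
  ].where((e) => e.value >= minSimilarity).toList()
    ..sort((x, y) => y.value.compareTo(x.value));
  return scored.take(topK).toList();
}

void main() {
  final index = {
    'widgets': [1.0, 0.0],
    'rendering': [0.6, 0.8],
    'unrelated': [-1.0, 0.0],
  };
  for (final hit in search(index, [0.0, 1.0], topK: 2, minSimilarity: 0.1)) {
    print('${hit.key}: ${hit.value.toStringAsFixed(2)}'); // rendering: 0.80
  }
}
```

Raising minSimilarity trades recall for precision: fewer, more relevant chunks reach the prompt, at the risk of passing no context at all.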

Index persistence

Pass indexPath to automatically save and restore the vector index between sessions:

// getApplicationDocumentsDirectory() comes from the path_provider package
final dir = await getApplicationDocumentsDirectory();

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  indexPath: '${dir.path}/rag_index.json',  // auto-save on every change
);

await rag.initialize(); // loads existing index if the file exists

When indexPath is null, the store is in-memory only and data is lost when the app restarts.
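
The save-and-restore cycle can be pictured as a JSON round trip. The sketch below writes a map of chunk IDs to embedding vectors to disk and reads it back, returning an empty index when the file does not exist. The file layout, saveIndex, and loadIndex are assumptions for illustration only; the actual format RagEngine writes is not described here.

```dart
import 'dart:convert';
import 'dart:io';

/// Writes [index] (chunk ID -> embedding) to [path] as JSON.
/// Hypothetical helper; the layout is not the package's real format.
Future<void> saveIndex(String path, Map<String, List<double>> index) =>
    File(path).writeAsString(jsonEncode(index));

/// Reads an index written by [saveIndex]; returns an empty map when the
/// file does not exist yet.
Future<Map<String, List<double>>> loadIndex(String path) async {
  final file = File(path);
  if (!await file.exists()) return {};
  final decoded =
      jsonDecode(await file.readAsString()) as Map<String, dynamic>;
  return decoded.map(
    (id, vector) => MapEntry(
      id,
      [for (final v in vector as List<dynamic>) (v as num).toDouble()],
    ),
  );
}

Future<void> main() async {
  final dir = await Directory.systemTemp.createTemp('rag_index_sketch');
  final path = '${dir.path}/rag_index.json';

  await saveIndex(path, {
    'chunk-1': [0.1, 0.2, 0.3],
  });
  final restored = await loadIndex(path);
  print(restored); // {chunk-1: [0.1, 0.2, 0.3]}

  await dir.delete(recursive: true);
}
```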

Advanced: custom prompt template

The default template instructs the model to answer only from the provided context. Override it if you need different behaviour:

final rag = RagEngine(
  genModelPath: '...',
  embedModelPath: '...',
  promptTemplate:
    'You are a helpful assistant. Use the context below to answer.\n\n'
    'CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:',
);

The template must contain {context} and {question} placeholders.
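
As a rough illustration of how such a template could be filled, the sketch below checks for both placeholders and substitutes them in a single pass, so placeholder-like text inside a retrieved chunk is not substituted again. buildRagPrompt is a hypothetical helper, and joining chunks with blank lines is an assumption. The package's own substitution logic may differ.

```dart
/// Fills `{context}` and `{question}` in [template]. Throws an
/// ArgumentError when either placeholder is missing. Hypothetical helper.
String buildRagPrompt(
  String template,
  List<String> contextChunks,
  String question,
) {
  for (final placeholder in const ['{context}', '{question}']) {
    if (!template.contains(placeholder)) {
      throw ArgumentError('Template is missing the $placeholder placeholder.');
    }
  }
  final context = contextChunks.join('\n\n');
  // A single pass over the template: inserted text is never re-scanned.
  return template.replaceAllMapped(
    RegExp(r'\{(context|question)\}'),
    (m) => m[1] == 'context' ? context : question,
  );
}

void main() {
  const template = 'CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:';
  print(buildRagPrompt(
    template,
    ['Flutter is a UI toolkit.'],
    'What is Flutter?',
  ));
}
```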

Cleanup

rag.dispose(); // releases the worker isolate and both models; safe to call twice
