mt_llmkit

Flutter Dart Platform Stability License

A Flutter plugin for running Large Language Models (LLMs) locally on Android and iOS using llamadart (which wraps llama.cpp). Also provides a unified interface for cloud AI chat providers (OpenAI, Gemini, Claude, Mistral) and a fully local RAG (Retrieval-Augmented Generation) pipeline.

Installation
Local LLM Inference (GGUF)
Vision (Multimodal)
Cloud API Providers
Local RAG Pipeline

Installation

Add to your pubspec.yaml:

dependencies:
  mt_llmkit: ^0.0.1-beta.1

Then run:

flutter pub get

Import the library:

import 'package:mt_llmkit/llmcpp.dart';

Local LLM Inference (GGUF)

Run quantized GGUF models entirely on-device — no internet connection required.

Quick start

final model = LocalModel(
  config: LlmConfig(temp: 0.7, nCtx: 2048, nGpuLayers: 4),
);

await model.loadModel('/path/to/model.gguf');

// Stream tokens as they are generated
model.sendPrompt('What is Flutter?').listen((token) {
  stdout.write(token);
});

model.dispose();

Backends

LocalModel supports two backends controlled by the backend parameter:

Backend	Class	When to use
`ModelBackend.isolate` (default)	`LlmModelIsolated`	Production. Runs in a Dart Isolate — no UI jank. Required when loading multiple models (e.g. RAG).
`ModelBackend.inProcess`	`LlmModelStandard`	Lighter startup cost. Supports `clean()` to reset context without reloading the model.

// Isolate backend (default)
final model = LocalModel(backend: ModelBackend.isolate);

// In-process backend
final model = LocalModel(backend: ModelBackend.inProcess);
model.clean(); // reset context — only available with inProcess

Note: clean() throws UnsupportedError when called on the isolate backend.

Configuration

All parameters are optional; sensible defaults are applied automatically.

final config = LlmConfig(
  nGpuLayers: 4,    // GPU layers offloaded (default: 64)
  nCtx: 2048,       // context window in tokens (default: 8192)
  nBatch: 512,      // batch size (default: 4096)
  nPredict: 1024,   // max tokens to generate (default: 8192)
  nThreads: 4,      // CPU threads (default: 6)
  temp: 0.7,        // temperature (default: 0.72)
  topK: 40,         // top-K sampling (default: 64)
  topP: 0.9,        // top-P sampling (default: 0.95)
  penaltyRepeat: 1.1, // repetition penalty (default: 1.1)
);

Generation methods

Three methods are available on LocalModel (and any LlmInterface implementation):

Method	Return type	Description
`sendPrompt(prompt)`	`Stream<String>`	Raw token stream. Lowest overhead.
`sendPromptComplete(prompt)`	`Future<String>`	Waits for the full response and returns it as a single string.
`sendPromptStream(prompt)`	`Stream<StreamingChunk>`	Recommended. Token stream with live performance metrics.

// 1. Raw token stream
model.sendPrompt('Hello').listen(stdout.write);

// 2. Full response at once
final response = await model.sendPromptComplete('Hello');
print(response);

// 3. Streaming with live metrics (recommended)
model.sendPromptStream('Hello').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    final m = chunk.metrics!;
    print('\n--- ${m.tokensGenerated} tokens, ${m.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});

StreamingChunk fields:

Field	Type	Description
`text`	`String`	The generated text fragment.
`isFinal`	`bool`	`true` on the last chunk of the response.
`metrics`	`PerformanceMetrics?`	Available on every chunk; most useful on the final one.

PerformanceMetrics fields: tokensGenerated, durationMs, tokensPerSecond, msPerToken.

Prompt format

Override the model's built-in chat template by passing a raw GGUF/Jinja template string via LlmConfig.chatTemplate. When chatTemplate is null (the default), the template embedded in the model file is used automatically.

final config = LlmConfig(
  chatTemplate: '<|user|>\n{prompt}<|end|>\n<|assistant|>\n', // custom override
);

Performance metrics

PerformanceMetrics is updated incrementally with every StreamingChunk:

model.sendPromptStream('Explain Dart isolates in detail.').listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.metrics != null) {
    final m = chunk.metrics!;
    // Update UI progress indicator
    print('${m.tokensGenerated} tokens | ${m.tokensPerSecond.toStringAsFixed(2)} t/s');
  }
});

Vision (Multimodal)

LocalModel supports multimodal vision models (LLaVA, Gemma 3, Qwen VL, SmolVLM, etc.) that can analyse images alongside a text prompt. Vision requires two GGUF files: the main language model and a multimodal projector (mmproj-*.gguf).

Quick start (vision) {#quick-start-vision}

final model = LocalModel(
  config: LlmConfig(
    mmprojPath: '/path/to/mmproj-model-f16-4B.gguf',
    nGpuLayers: 4,
    nCtx: 4096,
    nPredict: 512,
    temp: 0.3,
  ),
);

await model.loadModel('/path/to/gemma-3-4b-it-q4_0.gguf');

final image = LlamaImageContent(path: '/path/to/photo.jpg');

model.sendPromptStream(
  'Describe what you see in this image. <image>',
  images: [image],
).listen((chunk) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n--- ${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s ---');
  }
});

Important: The prompt must contain one <image> placeholder per image passed in the list.

Supported models

Any vision GGUF model that uses the libmtmd multimodal projection layer is supported. Tested models:

Model	Notes
Gemma 3 (4B, 12B, 27B)	Recommended. Good accuracy, available in Q4 quantisation.
Qwen 2.5 VL	Strong OCR and document understanding.
LLaVA 1.5 / 1.6	Classic CLIP-based architecture.
SmolVLM	Compact, fast, good for mobile devices.

Each model has a corresponding mmproj-*.gguf file available on Hugging Face alongside the main model.

Image input

LlamaImageContent is created by providing the image file path:

final image = LlamaImageContent(path: '/path/to/photo.jpg');

Supported formats: JPEG, PNG, and any format supported by the underlying libmtmd library.

Generation methods (vision) {#generation-methods-1}

All three standard generation methods accept an optional images parameter:

Method	Return type	Description
`sendPrompt(prompt, images: images)`	`Stream<String>`	Raw token stream.
`sendPromptComplete(prompt, images: images)`	`Future<String>`	Full response as a single string.
`sendPromptStream(prompt, images: images)`	`Stream<StreamingChunk>`	Recommended. Streaming with live performance metrics.

All three methods throw UnsupportedError if LlmConfig.mmprojPath was not set.

// Full response at once
final response = await model.sendPromptComplete(
  'What objects are visible in this photo? <image>',
  images: [LlamaImageContent(path: '/path/to/photo.jpg')],
);
print(response);

Cloud API Providers

mt_llmkit includes a unified AIChatProvider interface for four cloud LLM providers. All providers share the same API surface, making it easy to swap backends.

Supported providers

Provider	Enum value	Default model
OpenAI	`AIChatProviderType.openai`	`gpt-4o-mini`
Google Gemini	`AIChatProviderType.gemini`	`gemini-1.5-flash`
Anthropic Claude	`AIChatProviderType.claude`	`claude-haiku-4-5-20251001`
Mistral AI	`AIChatProviderType.mistral`	`mistral-small-latest`

Basic usage

Use AIChatProviderFactory to create a provider without importing the concrete class:

// Create and initialize in one step
final provider = await AIChatProviderFactory.createAndInitialize(
  AIChatProviderType.openai,
  {'apiKey': 'sk-...'},
);

final response = await provider.sendMessage('What is Flutter?');
print(response.message.content);
print('Tokens used: ${response.inputTokens} in / ${response.outputTokens} out');

await provider.dispose();

Or manage the lifecycle manually:

final provider = AIChatProviderFactory.create(AIChatProviderType.gemini);
await provider.initialize({'apiKey': 'AIza...'});

final response = await provider.sendMessage('Hello!');
print(response.message.content);

await provider.dispose();

Multi-turn conversations

Build conversation history with ChatMessage:

final history = <ChatMessage>[
  ChatMessage.system('You are a concise assistant. Reply in three sentences max.'),
  ChatMessage.user('What is a Dart isolate?'),
  ChatMessage.assistant('A Dart isolate is an independent thread of execution...'),
];

// Continue the conversation
final r = await provider.sendMessage('Give me a code example.', history: history);
print(r.message.content);

// Append the reply to keep history growing
history.add(r.message);

For full control, pass a complete message list directly:

final response = await provider.sendChatMessages([
  ChatMessage.system('You are a poet.'),
  ChatMessage.user('Write a haiku about Flutter.'),
]);
print(response.message.content);

Streaming

All providers support token streaming via sendMessageStream:

await for (final token in provider.sendMessageStream('Tell me a story.')) {
  stdout.write(token);
}

With conversation history:

final history = [ChatMessage.system('Reply only in Spanish.')];

await for (final token in provider.sendMessageStream('Hello!', history: history)) {
  stdout.write(token);
}

Provider-specific config

OpenAI

await provider.initialize({
  'apiKey': 'sk-...',
  'model': 'gpt-4o',               // optional, default: gpt-4o-mini
  'baseUrl': 'https://...',        // optional, for Azure OpenAI or proxies
});

Google Gemini

await provider.initialize({
  'apiKey': 'AIza...',
  'model': 'gemini-1.5-pro',       // optional
});

Anthropic Claude

await provider.initialize({
  'apiKey': 'sk-ant-...',
  'model': 'claude-opus-4-6',      // optional
});

Mistral AI

await provider.initialize({
  'apiKey': '...',
  'model': 'mistral-large-latest', // optional
});

Error handling

All providers throw typed exceptions from chat_exceptions.dart:

Exception	Cause
`APIKeyException`	Invalid or missing API key (HTTP 401/403)
`NetworkException`	Transport error (timeout, DNS, connection reset)
`RateLimitException`	Quota exceeded (HTTP 429); contains `retryAfter`
`AIChatException`	Base class for any other API error

Network and rate-limit errors are automatically retried up to 3 times with exponential back-off.

try {
  final response = await provider.sendMessage('Hello');
  print(response.message.content);
} on APIKeyException catch (e) {
  print('Check your API key: $e');
} on RateLimitException catch (e) {
  print('Rate limited. Retry after ${e.retryAfter?.inSeconds}s');
} on AIChatException catch (e) {
  print('API error: $e');
}

Local RAG Pipeline

RagEngine provides a fully on-device Retrieval-Augmented Generation pipeline. Documents are chunked, embedded with a local embedding model, stored in an in-memory vector store, and retrieved at query time to ground the generation model's response.

How it works

Ingestion:  Document → TextChunker → chunks → EmbeddingModel → VectorStore
Query:      question → EmbeddingModel → VectorStore.search() → prompt + context → GenerationModel

Both the embedding model and the generation model run inside a single worker isolate, avoiding threading issues that arise when multiple native model instances share global state — this is handled transparently by llamadart.

Quick start

You need two GGUF models:

A generation model (e.g. Llama 3, Mistral) for producing answers.
An embedding model (e.g. nomic-embed-text) for vectorising text.

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  genConfig: LlmConfig(temp: 0.3, nCtx: 4096, nGpuLayers: 4),
);

await rag.initialize();

Document ingestion

Create a Document from text or PDF content and ingest it into the vector store:

// From plain text
final doc = Document.fromText(
  'Flutter is Google\'s UI toolkit for building natively compiled applications...',
  source: 'flutter_intro.txt',
);

// From extracted PDF text (use a PDF parser to extract the string first)
final pdfDoc = Document.fromPdf(
  extractedText,
  source: '/path/to/manual.pdf',
  pageCount: 42,
);

// Ingest — stream progress for UI updates
await for (final progress in rag.ingestDocument(doc)) {
  print('${progress.embeddedChunks}/${progress.totalChunks} — ${progress.currentPreview}');
  // progress.fraction gives 0.0–1.0 for a progress bar
}

print('Indexed: ${rag.indexedSize} chunks across ${rag.documentIds.length} documents');

Manage the index:

// Remove a single document (all its chunks)
await rag.removeDocument(doc.id);

// Clear everything
await rag.clearIndex();

Querying

// Stream the generated answer
await for (final chunk in rag.query('What is Flutter?')) {
  stdout.write(chunk.text);

  if (chunk.isFinal && chunk.metrics != null) {
    print('\n${chunk.metrics!.tokensPerSecond.toStringAsFixed(1)} t/s');
  }
}

Optional query parameters:

rag.query(
  'Explain the rendering pipeline.',
  topK: 3,              // number of context chunks (default: 5)
  minSimilarity: 0.35,  // minimum cosine similarity 0.0–1.0 (default: 0.25)
)

Retrieve relevant chunks without generating an answer (useful for inspection):

final results = await rag.findRelevant('rendering pipeline', topK: 3);
for (final r in results) {
  print('${(r.similarity * 100).toStringAsFixed(0)}% — ${r.chunk.text.substring(0, 80)}');
}

Index persistence

Pass indexPath to automatically save and restore the vector index between sessions:

final dir = await getApplicationDocumentsDirectory();

final rag = RagEngine(
  genModelPath: '/path/to/llama.gguf',
  embedModelPath: '/path/to/nomic-embed.gguf',
  indexPath: '${dir.path}/rag_index.json',  // auto-save on every change
);

await rag.initialize(); // loads existing index if the file exists

When indexPath is null, the store is in-memory only and data is lost when the app restarts.

Advanced: custom prompt template

The default template instructs the model to answer only from the provided context. Override it if you need different behaviour:

final rag = RagEngine(
  genModelPath: '...',
  embedModelPath: '...',
  promptTemplate:
    'You are a helpful assistant. Use the context below to answer.\n\n'
    'CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:',
);

The template must contain {context} and {question} placeholders.

Cleanup

rag.dispose(); // releases the worker isolate and both models; safe to call twice

mt_llmkit

Table of Contents

Installation

Local LLM Inference (GGUF)

Quick start

Backends

Configuration

Generation methods

Prompt format

Performance metrics

Vision (Multimodal)

Quick start (vision) {#quick-start-vision}

Supported models

Image input

Generation methods (vision) {#generation-methods-1}

Cloud API Providers

Supported providers

Basic usage

Multi-turn conversations

Streaming

Provider-specific config

OpenAI

Google Gemini

Anthropic Claude

Mistral AI

Error handling

Local RAG Pipeline

How it works

Quick start

Document ingestion

Querying

Index persistence

Advanced: custom prompt template

Cleanup

Libraries

mt_llmkit package