flutter_litert_lm 0.1.0
flutter_litert_lm #
Flutter plugin for Google's LiteRT-LM — run Large Language Models on-device from your Flutter app, no network, no API keys, no per-token bills.
Supports Gemma, Qwen, Phi, DeepSeek and any other model published in the
litert-community HuggingFace
organization, with hardware acceleration via the device's GPU (OpenCL) or
NPU.
Why on-device? #
- Private — prompts never leave the user's phone.
- Offline — works on a plane, in a tunnel, in a covered market.
- Zero recurring cost — no API calls, no token bills, no rate limits.
- Low latency — first-token latency is local round-trip time, not internet round-trip time.
Features #
- Streaming chat with token-by-token delta delivery
- Multi-turn conversations with system instructions and history
- Multimodal inputs (text, images, audio)
- Tool / function calling for agentic workflows
- CPU, GPU (OpenCL), and NPU backends
- Sampler controls: temperature, topK, topP
- Resource-safe lifecycle: Engine.dispose() and Conversation.dispose()
- Ships R8/Proguard keep rules so release builds don't break
- Manifest-merged <uses-native-library> entries so GPU works on Android 12+ out of the box
Platform support #
| Platform | Status |
|---|---|
| Android | ✅ Stable (com.google.ai.edge.litertlm:litertlm-android:0.10.0) |
| iOS | 🚧 Stub — waiting on Google's LiteRT-LM Swift SDK |
Minimum Android API: 24 (Android 7.0). The shipped AAR contains JNI
binaries for arm64-v8a and x86_64.
Installation #
dependencies:
  flutter_litert_lm: ^0.1.0
Then run:
flutter pub get
Quick start #
import 'dart:io'; // for stdout in the streaming example below
import 'package:flutter_litert_lm/flutter_litert_lm.dart';
// 1. Load a model into a new engine.
final engine = await LiteLmEngine.create(
  LiteLmEngineConfig(
    modelPath: '/storage/.../model.litertlm',
    backend: LiteLmBackend.gpu, // or .cpu / .npu
  ),
);
// 2. Start a conversation.
final conversation = await engine.createConversation(
  LiteLmConversationConfig(
    systemInstruction: 'You are a helpful assistant. Be concise.',
    samplerConfig: const LiteLmSamplerConfig(
      temperature: 0.7,
      topK: 40,
      topP: 0.95,
    ),
  ),
);
// 3a. Get the full reply at once...
final reply = await conversation.sendMessage('What is Flutter?');
print(reply.text);
// 3b. ...or stream tokens as they arrive.
conversation.sendMessageStream('Tell me a story.').listen((delta) {
  stdout.write(delta.text); // each event is the new tokens, not a snapshot
});
// 4. Always release native resources when you're done.
await conversation.dispose();
await engine.dispose();
Streaming chat #
sendMessageStream returns a Stream<LiteLmMessage>. Each event carries
only the new tokens since the previous emission, not a snapshot of the full
response — accumulate as you go:
final buffer = StringBuffer();
await for (final delta in conversation.sendMessageStream('Hello!')) {
  buffer.write(delta.text);
  print(buffer); // partial reply so far
}
The example app shows how to wire this into a UI with a typing indicator and live token-per-second readout.
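The example app's stats readout can be approximated with a Stopwatch; a minimal sketch, assuming only the streaming API shown above. Since a per-event token count isn't exposed, stream events are counted as a rough proxy for tokens:

```dart
import 'dart:io';

import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Rough inference stats for one streamed reply. Counts stream events
/// as a proxy for tokens, since per-event token counts aren't exposed.
Future<void> chatWithStats(
    LiteLmConversation conversation, String prompt) async {
  final sw = Stopwatch()..start();
  Duration? ttft; // time to first token
  var events = 0;

  await for (final delta in conversation.sendMessageStream(prompt)) {
    ttft ??= sw.elapsed; // the first delta marks TTFT
    events++;
    stdout.write(delta.text);
  }
  sw.stop();

  final perSecond = events * 1000 / sw.elapsedMilliseconds;
  print('\nTTFT: ${ttft?.inMilliseconds} ms, '
      '~${perSecond.toStringAsFixed(1)} events/s, total ${sw.elapsed}');
}
```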
Multimodal #
final reply = await conversation.sendMultimodalMessage([
  LiteLmContent.text('Describe what you see.'),
  LiteLmContent.imageFile('/storage/.../photo.jpg'),
]);
print(reply.text);
LiteLmContent factories: text, imageFile, imageBytes, audioFile,
audioBytes, toolResponse.
Tool calling #
final conversation = await engine.createConversation(
  LiteLmConversationConfig(
    tools: [
      LiteLmTool(
        name: 'get_weather',
        description: 'Get current weather for a city',
        parameters: {
          'type': 'object',
          'properties': {
            'city': {'type': 'string', 'description': 'City name'},
          },
          'required': ['city'],
        },
      ),
    ],
  ),
);
final reply = await conversation.sendMessage('Weather in Tokyo?');
if (reply.toolCalls.isNotEmpty) {
  final call = reply.toolCalls.first;
  // Run the tool yourself, then feed the result back:
  final followUp = await conversation.sendToolResponse(
    call.name,
    '{"temperature": 22, "condition": "sunny"}',
  );
  print(followUp.text);
}
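The "run the tool yourself" step can be sketched as a small dispatcher that loops until the model stops requesting tools. The loop and the local get_weather handler are assumptions for illustration, not part of the plugin:

```dart
import 'dart:convert';

import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Hypothetical dispatcher: maps tool names to local handlers and feeds
/// each JSON result back until the model returns a plain-text answer.
Future<LiteLmMessage> resolveToolCalls(
  LiteLmConversation conversation,
  LiteLmMessage reply,
) async {
  var current = reply;
  while (current.toolCalls.isNotEmpty) {
    final call = current.toolCalls.first;
    // Illustrative local handlers -- replace with your real tools.
    final result = switch (call.name) {
      'get_weather' => {'temperature': 22, 'condition': 'sunny'},
      _ => {'error': 'unknown tool: ${call.name}'},
    };
    current =
        await conversation.sendToolResponse(call.name, jsonEncode(result));
  }
  return current; // final natural-language reply
}
```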
Backends #
| Backend | When to use it |
|---|---|
| cpu | Always works, including emulators. Slowest. |
| gpu | Real devices with OpenCL (Adreno, Mali, Tensor). Much faster. |
| npu | Devices with a vendor NPU runtime AND a model variant compiled for that chip (Snapdragon HTP, MediaTek APU). Fastest, but model-specific. |
The example app lets you switch backends at runtime — useful for benchmarking.
⚠️ Emulator note: Android emulators ship with no libOpenCL.so, so the gpu backend cannot initialize there. Use cpu on the emulator and gpu on real hardware.
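A practical pattern the table suggests is trying gpu first and falling back to cpu. A sketch, under the assumption that a failed backend initialization throws from LiteLmEngine.create:

```dart
import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Try GPU first, then fall back to CPU (e.g. on emulators that lack
/// libOpenCL.so). Assumes a failed init throws; catch more narrowly
/// in production once you know the concrete exception type.
Future<LiteLmEngine> createEngineWithFallback(String modelPath) async {
  for (final backend in [LiteLmBackend.gpu, LiteLmBackend.cpu]) {
    try {
      return await LiteLmEngine.create(
        LiteLmEngineConfig(modelPath: modelPath, backend: backend),
      );
    } catch (e) {
      print('Backend $backend failed: $e');
    }
  }
  throw StateError('No backend could load $modelPath');
}
```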
Getting models #
The litert-community HuggingFace
organization publishes ready-to-run .litertlm files. The example app's
models.dart curates the open-license,
non-gated subset:
| Model | Size | License | Notes |
|---|---|---|---|
| Qwen 3 0.6B | 586 MB | Apache-2.0 | Smallest general-purpose chat |
| Qwen 2.5 1.5B Instruct (q8) | 1.49 GB | Apache-2.0 | Balanced quality / size |
| DeepSeek R1 Distill Qwen 1.5B | 1.71 GB | MIT | Reasoning / chain-of-thought |
| Gemma 3n E2B Instruct | 2.46 GB | Apache-2.0 | Google flagship, ungated |
| Gemma 3n E4B Instruct | 3.40 GB | Apache-2.0 | Highest quality, ~5 GB RAM at load |
Models in the litert-community/Gemma3-* repos are gated under the Gemma
license and require a HuggingFace token to download — visit the model's HF
page first to accept the terms.
You can drop a .litertlm file anywhere readable by your app and pass the
absolute path as modelPath. For app-private storage, use
path_provider to resolve a
writable directory.
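Resolving an app-private modelPath with path_provider might look like the sketch below. The models/ subdirectory and file name are illustrative, and the download itself (e.g. with package:http or a background downloader) is left out:

```dart
import 'dart:io';

import 'package:path_provider/path_provider.dart';

/// Resolve a writable, app-private location for a model file.
/// Pass the returned path as modelPath once the file has been
/// downloaded there (download step not shown).
Future<String> modelPathFor(String fileName) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/models/$fileName');
  await file.parent.create(recursive: true); // ensure models/ exists
  return file.path;
}
```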
API reference #
LiteLmEngine #
| Member | Description |
|---|---|
| LiteLmEngine.create(config) | Load a model and initialize the engine |
| engine.createConversation([cfg]) | Open a new conversation |
| engine.countTokens(text) | Tokenize and count (currently returns -1 — the upstream API doesn't expose this yet) |
| engine.dispose() | Release native handles |
LiteLmConversation #
| Member | Description |
|---|---|
| sendMessage(text, {extraContext}) | Full reply, awaited |
| sendMultimodalMessage(contents, {extraContext}) | Mixed text + image + audio |
| sendMessageStream(text, {extraContext}) | Stream<LiteLmMessage> of token deltas |
| sendToolResponse(name, result, {extraContext}) | Reply to a tool call |
| dispose() | Release the conversation |
Configuration types #
- LiteLmEngineConfig — modelPath, backend, cacheDir, visionBackend, audioBackend
- LiteLmConversationConfig — systemInstruction, initialMessages, samplerConfig, tools, automaticToolCalling
- LiteLmSamplerConfig — temperature, topK, topP
- LiteLmBackend — cpu, gpu, npu
Example app #
A full reference implementation lives in example/ and includes:
- Curated model picker with on-device download + progress
- Backend selector (CPU / GPU / NPU)
- Streaming chat with typing indicator
- Per-response inference stats (tokens, tok/s, TTFT, total duration)
cd example
flutter run --release
Troubleshooting #
NoSuchMethodError: Lcom/google/ai/edge/litertlm/SamplerConfig;.getTopK()I
in release builds — R8 stripped the JNI surface. The plugin already ships
keep rules in consumer-rules.pro that AGP merges into your app
automatically; if you somehow still hit this, copy those rules into your
own proguard-rules.pro.
Cannot find OpenCL library on this device on real Android phones with
GPU backend — your app's manifest is missing <uses-native-library>. The
plugin ships these entries and AGP merges them into your manifest, so this
should be automatic. If you've disabled manifest merging, add to your app's
<application>:
<uses-native-library android:name="libOpenCL.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
App is killed silently after loading a large model — out-of-memory. Modern phones often have 6–8 GB RAM but a third of that is used by the system. Models bigger than ~2.5 GB on disk can blow past what's available once loaded. Try a smaller model (Qwen 3 0.6B, Gemma 3 1B) or close background apps.
Streaming feels choppy — make sure you're using sendMessageStream, not
sendMessage. The latter blocks until the entire response is generated.
Contributing #
Contributions are welcome! Please see CONTRIBUTING.md for the workflow. By contributing you agree that your work will be released under this project's Apache 2.0 license.
License #
Licensed under the Apache License 2.0 — see LICENSE for the full text. The same license as upstream LiteRT-LM.
Acknowledgements #
- Google AI Edge for the LiteRT-LM runtime and the litert-community model collection.
- HuggingFace for hosting the model artifacts.
- The Flutter team for the plugin platform interface.