flutter_litert_lm
Flutter plugin for Google's LiteRT-LM — run Large Language Models on-device from your Flutter app, no network, no API keys, no per-token bills.
Supports Gemma, Qwen, Phi, DeepSeek and any other model published in the
litert-community HuggingFace
organization, with hardware acceleration via the device's GPU (OpenCL) or
NPU.
Why on-device?
- Private — prompts never leave the user's phone.
- Offline — works on a plane, in a tunnel, in a covered market.
- Zero recurring cost — no API calls, no token bills, no rate limits.
- Low latency — first-token latency is local round-trip time, not internet round-trip time.
Features
- Streaming chat with token-by-token delta delivery
- Multi-turn conversations with system instructions and history
- Multimodal inputs (text, images, audio)
- Tool / function calling for agentic workflows
- CPU, GPU (OpenCL), and NPU backends
- Sampler controls: temperature, topK, topP
- Resource-safe lifecycle: Engine.dispose() and Conversation.dispose()
- Ships R8/Proguard keep rules so release builds don't break
- Manifest-merged <uses-native-library> entries so GPU works on Android 12+ out of the box
Platform support
| Platform | Status | Backends |
|---|---|---|
| Android | Stable | CPU (XNNPACK), GPU (OpenCL), NPU (Qualcomm HTP, MediaTek APU) |
| iOS | Beta | CPU (XNNPACK) only |
| iOS Sim arm64 | Beta | CPU (XNNPACK) only |
Minimum Android API 24 (Android 7.0). Minimum iOS 13.0. iOS ships arm64 slices only (no Intel Mac simulator).
iOS notes
Google's LiteRT-LM ships no prebuilt iOS runtime, so the plugin pulls the C++ runtime at install time and compiles it into an XCFramework on the developer's Mac. One-time setup:
# 1. install Bazelisk (it will pick up Bazel 7.6.1 automatically)
brew install bazelisk git-lfs
# 2. clone your app, then from the plugin checkout:
bash scripts/build_ios_frameworks.sh
# 3. wait 30-60 minutes on the first run (Bazel downloads ~2 GB of deps
# and compiles TFLite, protobuf, abseil, etc). Subsequent runs are
# cached — flutter build ios after that takes seconds.
Details on what the script does and how to wire it into CI are in
ios/Frameworks/README.md.
iOS backend limitations: only the CPU (XNNPACK) backend is wired up. The LiteRT-LM Metal GPU and WebGPU accelerators exist as separate dylibs for macOS but are not shipped for iOS yet — see upstream issue google-ai-edge/LiteRT-LM#1050. The picker in the example app auto-hides GPU/NPU on iOS.
The iOS build also uses whatever sampler is baked into the model's own
metadata (kTopK and kGreedy aren't implemented in the C API shipped with
the current runtime), so the Dart-side topK / topP / temperature
knobs are ignored on iOS for now.
Installation
dependencies:
flutter_litert_lm: ^0.2.0
Then run:
flutter pub get
Quick start
import 'dart:io'; // provides stdout, used in step 3b
import 'package:flutter_litert_lm/flutter_litert_lm.dart';
// 1. Load a model into a new engine.
final engine = await LiteLmEngine.create(
LiteLmEngineConfig(
modelPath: '/storage/.../model.litertlm',
backend: LiteLmBackend.gpu, // or .cpu / .npu
),
);
// 2. Start a conversation.
final conversation = await engine.createConversation(
LiteLmConversationConfig(
systemInstruction: 'You are a helpful assistant. Be concise.',
samplerConfig: const LiteLmSamplerConfig(
temperature: 0.7,
topK: 40,
topP: 0.95,
),
),
);
// 3a. Get the full reply at once...
final reply = await conversation.sendMessage('What is Flutter?');
print(reply.text);
// 3b. ...or stream tokens as they arrive.
conversation.sendMessageStream('Tell me a story.').listen((delta) {
stdout.write(delta.text); // each event is the new tokens, not a snapshot
});
// 4. Always release native resources when you're done.
await conversation.dispose();
await engine.dispose();
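Step 4 above only runs if nothing before it throws. One way to guarantee cleanup is a small try/finally wrapper. The following is a minimal sketch in plain Dart; `withResource` is a hypothetical helper name and is not part of the plugin.

```dart
/// Creates a resource, runs [body] with it, and always calls [dispose],
/// even when [body] throws. Hypothetical helper, not part of
/// flutter_litert_lm.
Future<R> withResource<T, R>(
  Future<T> Function() create,
  Future<R> Function(T resource) body,
  Future<void> Function(T resource) dispose,
) async {
  final resource = await create();
  try {
    return await body(resource);
  } finally {
    // Runs on both the success path and the error path.
    await dispose(resource);
  }
}

// Possible use with the plugin (assumes `config` is a LiteLmEngineConfig):
//
//   final text = await withResource(
//     () => LiteLmEngine.create(config),
//     (engine) async { /* create a conversation, send messages */ },
//     (engine) => engine.dispose(),
//   );
```

The same wrapper can be nested to cover the conversation as well, so that the conversation is disposed before the engine.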
Streaming chat
sendMessageStream returns a Stream<LiteLmMessage>. Each event carries
only the new tokens since the previous emission, not a snapshot of the full
response — accumulate as you go:
final buffer = StringBuffer();
await for (final delta in conversation.sendMessageStream('Hello!')) {
buffer.write(delta.text);
print(buffer); // partial reply so far
}
The example app shows how to wire this into a UI with a typing indicator and live token-per-second readout.
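If a widget needs the full text so far rather than individual deltas, the accumulation loop above can be wrapped in a stream transformer. This is a sketch in plain Dart; `snapshots` is a hypothetical name, and it takes a `Stream<String>` so it does not depend on the plugin's message type.

```dart
/// Turns a stream of text deltas into a stream of running snapshots.
/// Each emitted string is the concatenation of all deltas seen so far.
Stream<String> snapshots(Stream<String> deltas) async* {
  final buffer = StringBuffer();
  await for (final delta in deltas) {
    buffer.write(delta);
    yield buffer.toString();
  }
}

// Possible use with the plugin (assumes `text` on each event is a
// non-null String, as the examples in this README suggest):
//
//   final partials = snapshots(
//     conversation.sendMessageStream('Hello!').map((m) => m.text),
//   );
```

A stream of snapshots is convenient to hand to a StreamBuilder, because each event already contains everything the UI needs to render.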
Multimodal
final reply = await conversation.sendMultimodalMessage([
LiteLmContent.text('Describe what you see.'),
LiteLmContent.imageFile('/storage/.../photo.jpg'),
]);
print(reply.text);
LiteLmContent factories: text, imageFile, imageBytes, audioFile,
audioBytes, toolResponse.
Tool calling
final conversation = await engine.createConversation(
LiteLmConversationConfig(
tools: [
LiteLmTool(
name: 'get_weather',
description: 'Get current weather for a city',
parameters: {
'type': 'object',
'properties': {
'city': {'type': 'string', 'description': 'City name'},
},
'required': ['city'],
},
),
],
),
);
final reply = await conversation.sendMessage('Weather in Tokyo?');
if (reply.toolCalls.isNotEmpty) {
final call = reply.toolCalls.first;
// Run the tool yourself, then feed the result back:
final followUp = await conversation.sendToolResponse(
call.name,
'{"temperature": 22, "condition": "sunny"}', // JSON-encoded tool result
);
print(followUp.text);
}
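With more than one tool, the "run the tool yourself" step is often easier to manage through a lookup table. The sketch below is plain Dart and only builds the JSON string that would be passed to sendToolResponse. `ToolHandler` and `runTool` are hypothetical names, and the `args` map is an assumption: this README does not show how a tool call exposes its arguments.

```dart
import 'dart:convert';

/// Signature for a locally implemented tool (hypothetical type).
typedef ToolHandler = Future<Map<String, Object?>> Function(
    Map<String, Object?> args);

/// Looks up [name] in [handlers], runs it, and returns a JSON string.
/// Unknown tools and handler failures are reported as JSON errors rather
/// than thrown, so the model can see what went wrong.
Future<String> runTool(
  Map<String, ToolHandler> handlers,
  String name,
  Map<String, Object?> args,
) async {
  final handler = handlers[name];
  if (handler == null) {
    return jsonEncode({'error': 'unknown tool: $name'});
  }
  try {
    return jsonEncode(await handler(args));
  } catch (e) {
    return jsonEncode({'error': e.toString()});
  }
}
```

The returned string can then be forwarded with conversation.sendToolResponse(call.name, result), as in the example above.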
Backends
| Backend | Android | iOS |
|---|---|---|
| cpu | Always works (including emulator) | Supported |
| gpu | Real devices with OpenCL | Not yet |
| npu | Devices with vendor NPU runtime | Not yet |
The example app lets you switch backends at runtime on Android — useful for benchmarking. On iOS the backend selector is locked to CPU.
Android emulator note: emulators ship with no libOpenCL.so, so the gpu backend cannot initialize there. Use cpu on the emulator and gpu on real hardware.
iOS note: only CPU is supported. GPU (Metal) requires the libLiteRtMetalAccelerator.dylib accelerator plugin, which Google has not shipped for iOS yet. Tracked upstream in LiteRT-LM issue #1050.
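Since backend availability depends on the device, an app may want to try the fastest backend first and fall back to cpu. The helper below sketches that pattern in plain Dart. `firstThatWorks` is a hypothetical name, and the sketch assumes that engine creation throws when a backend cannot initialize, which this README does not state explicitly.

```dart
/// Tries each candidate in order and returns the result of the first one
/// for which [create] succeeds. Rethrows the last error if all fail.
Future<T> firstThatWorks<B, T>(
  List<B> candidates,
  Future<T> Function(B candidate) create,
) async {
  if (candidates.isEmpty) {
    throw ArgumentError.value(candidates, 'candidates', 'must not be empty');
  }
  Object? lastError;
  StackTrace? lastStack;
  for (final candidate in candidates) {
    try {
      return await create(candidate);
    } catch (error, stack) {
      // Remember the failure and move on to the next candidate.
      lastError = error;
      lastStack = stack;
    }
  }
  Error.throwWithStackTrace(lastError!, lastStack!);
}

// Possible use with the plugin (assumes `path` points to a model file):
//
//   final engine = await firstThatWorks(
//     [LiteLmBackend.gpu, LiteLmBackend.cpu],
//     (b) => LiteLmEngine.create(
//       LiteLmEngineConfig(modelPath: path, backend: b),
//     ),
//   );
```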
Getting models
The litert-community HuggingFace
organization publishes ready-to-run .litertlm files. The example app's
models.dart curates the open-license,
non-gated subset:
| Model | Size | License | Notes |
|---|---|---|---|
| Qwen 3 0.6B | 586 MB | Apache-2.0 | Smallest general-purpose chat |
| Qwen 2.5 1.5B Instruct (q8) | 1.49 GB | Apache-2.0 | Balanced quality / size |
| DeepSeek R1 Distill Qwen 1.5B | 1.71 GB | MIT | Reasoning / chain-of-thought |
| Gemma 4 E2B Instruct | 2.46 GB | Apache-2.0 | Google flagship, ungated |
| Gemma 4 E4B Instruct | 3.40 GB | Apache-2.0 | Highest quality, ~5 GB RAM at load |
Models in the litert-community/Gemma3-* repos are gated under the Gemma
license and require a HuggingFace token to download — visit the model's HF
page first to accept the terms.
You can drop a .litertlm file anywhere readable by your app and pass the
absolute path as modelPath. For app-private storage, use
path_provider to resolve a
writable directory.
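Before handing a path to the engine, it can help to run a cheap sanity check so that a missing or half-downloaded file fails with a readable message. The following is a sketch using only dart:io. `checkModelPath` is a hypothetical helper, and the extension check is an assumption based on the file naming used in this README, not a documented requirement of the plugin.

```dart
import 'dart:io';

/// Returns null when [path] looks usable as a modelPath, otherwise a short
/// human-readable reason. Hypothetical helper, not part of the plugin.
String? checkModelPath(String path) {
  final file = File(path);
  if (!file.isAbsolute) return 'model path must be absolute';
  if (!path.toLowerCase().endsWith('.litertlm')) {
    return 'expected a .litertlm file';
  }
  if (!file.existsSync()) return 'file does not exist';
  if (file.lengthSync() == 0) return 'file is empty';
  return null;
}
```

A non-null result can be shown to the user directly, for example to prompt a re-download.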
API reference
LiteLmEngine
| Member | Description |
|---|---|
| LiteLmEngine.create(config) | Load a model and initialize the engine |
| engine.createConversation([cfg]) | Open a new conversation |
| engine.countTokens(text) | Tokenize and count (currently returns -1; the upstream API doesn't expose this yet) |
| engine.dispose() | Release native handles |
LiteLmConversation
| Member | Description |
|---|---|
| sendMessage(text, {extraContext}) | Full reply, awaited |
| sendMultimodalMessage(contents, {extraContext}) | Mixed text + image + audio |
| sendMessageStream(text, {extraContext}) | Stream<LiteLmMessage> of token deltas |
| sendToolResponse(name, result, {extraContext}) | Reply to a tool call |
| dispose() | Release the conversation |
Configuration types
- LiteLmEngineConfig: modelPath, backend, cacheDir, visionBackend, audioBackend
- LiteLmConversationConfig: systemInstruction, initialMessages, samplerConfig, tools, automaticToolCalling
- LiteLmSamplerConfig: temperature, topK, topP
- LiteLmBackend: cpu, gpu, npu
Example app
A full reference implementation lives in example/ and includes:
- Curated model picker with on-device download + progress
- Backend selector (CPU / GPU / NPU)
- Streaming chat with typing indicator
- Per-response inference stats (tokens, tok/s, TTFT, total duration)
cd example
flutter run --release
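The per-response stats listed above can be approximated for any streamed reply with a stopwatch. The sketch below is plain Dart and is not taken from the example app; `StreamStats` and `measure` are hypothetical names. It counts stream events, which only approximates the token count, since one event may carry more than one token.

```dart
/// Timing summary for one streamed reply (hypothetical type).
class StreamStats {
  StreamStats(this.chunks, this.timeToFirstChunk, this.total);

  final int chunks;
  final Duration? timeToFirstChunk; // null if the stream emitted nothing
  final Duration total;

  /// Chunks per second over the whole reply; 0 when no time elapsed.
  double get chunksPerSecond => total.inMicroseconds == 0
      ? 0.0
      : chunks * Duration.microsecondsPerSecond / total.inMicroseconds;
}

/// Drains [stream], forwarding each event to [onChunk], and records the
/// time to the first event and the total duration.
Future<StreamStats> measure<T>(
  Stream<T> stream, {
  void Function(T chunk)? onChunk,
}) async {
  final watch = Stopwatch()..start();
  Duration? first;
  var chunks = 0;
  await for (final chunk in stream) {
    first ??= watch.elapsed;
    chunks++;
    onChunk?.call(chunk);
  }
  watch.stop();
  return StreamStats(chunks, first, watch.elapsed);
}
```

Passing conversation.sendMessageStream(...) as the stream and appending to a buffer in onChunk would yield both the reply and its timing in one pass.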
Troubleshooting
NoSuchMethodError: Lcom/google/ai/edge/litertlm/SamplerConfig;.getTopK()I
in release builds — R8 stripped the JNI surface. The plugin already ships
keep rules in consumer-rules.pro that AGP merges into your app
automatically; if you somehow still hit this, copy those rules into your
own proguard-rules.pro.
Cannot find OpenCL library on this device on real Android phones with
GPU backend — your app's manifest is missing <uses-native-library>. The
plugin ships these entries and AGP merges them into your manifest, so this
should be automatic. If you've disabled manifest merging, add to your app's
<application>:
<uses-native-library android:name="libOpenCL.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
App is killed silently after loading a large model — out-of-memory. Modern phones often have 6–8 GB RAM but a third of that is used by the system. Models bigger than ~2.5 GB on disk can blow past what's available once loaded. Try a smaller model (Qwen 3 0.6B, Gemma 3 1B) or close background apps.
Streaming feels choppy — make sure you're using sendMessageStream, not
sendMessage. The latter blocks until the entire response is generated.
iOS-specific
Build input file cannot be found: .../LiteRTLM.xcframework during
flutter build ios — the vendored XCFramework hasn't been built yet.
Run bash scripts/build_ios_frameworks.sh from the plugin checkout
before your first iOS build. First run is 30-60 minutes, subsequent runs
are cached by Bazel.
engine_create returned NULL ... NOT_FOUND: Engine type not found —
the linker has dropped the LiteRT-LM engine factory static constructors.
The podspec already passes -all_load to the pod target to force every
.o file from libc_engine.a to be linked in; if you've customized
pod_target_xcconfig in your own build, make sure -all_load is still
in OTHER_LDFLAGS.
UNIMPLEMENTED: Sampler type: 1 not implemented yet (or 3) — the iOS
C API in the current LiteRT-LM runtime doesn't implement the kTopK /
kGreedy samplers. The plugin passes a NULL session_config on iOS so
the engine falls back to the sampler baked into the model metadata,
which always works.
iOS simulator can't build for x86_64: the shipped XCFrameworks only
contain arm64 slices (Apple Silicon devices + arm64 simulators). The
plugin's podspec and the example app's Podfile both add x86_64 to
EXCLUDED_ARCHS[sdk=iphonesimulator*]; if you copy the podspec into
your own app without the Podfile hook, add the same setting to
your Runner target manually.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for the workflow. By contributing you agree that your work will be released under this project's Apache 2.0 license.
License
Licensed under the Apache License 2.0; see LICENSE for the full text. This is the same license used by upstream LiteRT-LM.
Acknowledgements
- Google AI Edge for the LiteRT-LM runtime and the litert-community model collection.
- HuggingFace for hosting the model artifacts.
- The Flutter team for the plugin platform interface.