flutter_litert_lm 0.1.0
flutter_litert_lm #
Flutter plugin for Google's LiteRT-LM — run Large Language Models on-device from your Flutter app, no network, no API keys, no per-token bills.
Supports Gemma, Qwen, Phi, DeepSeek and any other model published in the
litert-community HuggingFace
organization, with hardware acceleration via the device's GPU (OpenCL) or
NPU.
Why on-device? #
- Private — prompts never leave the user's phone.
- Offline — works on a plane, in a tunnel, in a covered market.
- Zero recurring cost — no API calls, no token bills, no rate limits.
- Low latency — first-token latency is local round-trip time, not internet round-trip time.
Features #
- Streaming chat with token-by-token delta delivery
- Multi-turn conversations with system instructions and history
- Multimodal inputs (text, images, audio)
- Tool / function calling for agentic workflows
- CPU, GPU (OpenCL), and NPU backends
- Sampler controls: temperature, topK, topP
- Resource-safe lifecycle: Engine.dispose() and Conversation.dispose()
- Ships R8/Proguard keep rules so release builds don't break
- Manifest-merged <uses-native-library> entries so GPU works on Android 12+ out of the box
Platform support #
| Platform | Status |
|---|---|
| Android | ✅ Stable (com.google.ai.edge.litertlm:litertlm-android:0.10.0) |
| iOS | 🚧 Stub — waiting on Google's LiteRT-LM Swift SDK |
Minimum Android API: 24 (Android 7.0). The shipped AAR contains JNI
binaries for arm64-v8a and x86_64.
Installation #
dependencies:
  flutter_litert_lm: ^0.1.0
Then run:
flutter pub get
Quick start #
import 'dart:io'; // for stdout in the streaming example below
import 'package:flutter_litert_lm/flutter_litert_lm.dart';
// 1. Load a model into a new engine.
final engine = await LiteLmEngine.create(
  LiteLmEngineConfig(
    modelPath: '/storage/.../model.litertlm',
    backend: LiteLmBackend.gpu, // or .cpu / .npu
  ),
);
// 2. Start a conversation.
final conversation = await engine.createConversation(
  LiteLmConversationConfig(
    systemInstruction: 'You are a helpful assistant. Be concise.',
    samplerConfig: const LiteLmSamplerConfig(
      temperature: 0.7,
      topK: 40,
      topP: 0.95,
    ),
  ),
);
// 3a. Get the full reply at once...
final reply = await conversation.sendMessage('What is Flutter?');
print(reply.text);
// 3b. ...or stream tokens as they arrive.
conversation.sendMessageStream('Tell me a story.').listen((delta) {
  stdout.write(delta.text); // each event is the new tokens, not a snapshot
});
// 4. Always release native resources when you're done.
await conversation.dispose();
await engine.dispose();
Streaming chat #
sendMessageStream returns a Stream<LiteLmMessage>. Each event carries
only the new tokens since the previous emission, not a snapshot of the full
response — accumulate as you go:
final buffer = StringBuffer();
await for (final delta in conversation.sendMessageStream('Hello!')) {
  buffer.write(delta.text);
  print(buffer); // partial reply so far
}
The example app shows how to wire this into a UI with a typing indicator and live token-per-second readout.
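The example app's stats readout can be approximated with a Stopwatch; a minimal sketch, assuming only the streaming API shown above. Since a per-event token count isn't exposed, stream events are counted as a rough proxy for tokens:

```dart
import 'dart:io';

import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Rough inference stats for one streamed reply. Counts stream events
/// as a proxy for tokens, since per-event token counts aren't exposed.
Future<void> chatWithStats(
    LiteLmConversation conversation, String prompt) async {
  final sw = Stopwatch()..start();
  Duration? ttft; // time to first token
  var events = 0;

  await for (final delta in conversation.sendMessageStream(prompt)) {
    ttft ??= sw.elapsed; // the first delta marks TTFT
    events++;
    stdout.write(delta.text);
  }
  sw.stop();

  final perSecond = events * 1000 / sw.elapsedMilliseconds;
  print('\nTTFT: ${ttft?.inMilliseconds} ms, '
      '~${perSecond.toStringAsFixed(1)} events/s, total ${sw.elapsed}');
}
```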
Multimodal #
final reply = await conversation.sendMultimodalMessage([
  LiteLmContent.text('Describe what you see.'),
  LiteLmContent.imageFile('/storage/.../photo.jpg'),
]);
print(reply.text);
LiteLmContent factories: text, imageFile, imageBytes, audioFile,
audioBytes, toolResponse.
Tool calling #
final conversation = await engine.createConversation(
  LiteLmConversationConfig(
    tools: [
      LiteLmTool(
        name: 'get_weather',
        description: 'Get current weather for a city',
        parameters: {
          'type': 'object',
          'properties': {
            'city': {'type': 'string', 'description': 'City name'},
          },
          'required': ['city'],
        },
      ),
    ],
  ),
);
final reply = await conversation.sendMessage('Weather in Tokyo?');
if (reply.toolCalls.isNotEmpty) {
  final call = reply.toolCalls.first;
  // Run the tool yourself, then feed the result back:
  final followUp = await conversation.sendToolResponse(
    call.name,
    '{"temperature": 22, "condition": "sunny"}',
  );
  print(followUp.text);
}
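The "run the tool yourself" step can be sketched as a small dispatcher that loops until the model stops requesting tools. The loop and the local get_weather handler are assumptions for illustration, not part of the plugin:

```dart
import 'dart:convert';

import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Hypothetical dispatcher: maps tool names to local handlers and feeds
/// each JSON result back until the model returns a plain-text answer.
Future<LiteLmMessage> resolveToolCalls(
  LiteLmConversation conversation,
  LiteLmMessage reply,
) async {
  var current = reply;
  while (current.toolCalls.isNotEmpty) {
    final call = current.toolCalls.first;
    // Illustrative local handlers -- replace with your real tools.
    final result = switch (call.name) {
      'get_weather' => {'temperature': 22, 'condition': 'sunny'},
      _ => {'error': 'unknown tool: ${call.name}'},
    };
    current =
        await conversation.sendToolResponse(call.name, jsonEncode(result));
  }
  return current; // final natural-language reply
}
```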
Backends #
| Backend | When to use it |
|---|---|
| cpu | Always works, including emulators. Slowest. |
| gpu | Real devices with OpenCL (Adreno, Mali, Tensor). Much faster. |
| npu | Devices with a vendor NPU runtime AND a model variant compiled for that chip (Snapdragon HTP, MediaTek APU). Fastest, but model-specific. |
The example app lets you switch backends at runtime — useful for benchmarking.
⚠️ Emulator note: Android emulators ship with no libOpenCL.so, so the gpu backend cannot initialize there. Use cpu on the emulator and gpu on real hardware.
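A practical pattern the table suggests is trying gpu first and falling back to cpu. A sketch, under the assumption that a failed backend initialization throws from LiteLmEngine.create:

```dart
import 'package:flutter_litert_lm/flutter_litert_lm.dart';

/// Try GPU first, then fall back to CPU (e.g. on emulators that lack
/// libOpenCL.so). Assumes a failed init throws; catch more narrowly
/// in production once you know the concrete exception type.
Future<LiteLmEngine> createEngineWithFallback(String modelPath) async {
  for (final backend in [LiteLmBackend.gpu, LiteLmBackend.cpu]) {
    try {
      return await LiteLmEngine.create(
        LiteLmEngineConfig(modelPath: modelPath, backend: backend),
      );
    } catch (e) {
      print('Backend $backend failed: $e');
    }
  }
  throw StateError('No backend could load $modelPath');
}
```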
Getting models #
The litert-community HuggingFace
organization publishes ready-to-run .litertlm files. The example app's
models.dart curates the open-license,
non-gated subset:
| Model | Size | License | Notes |
|---|---|---|---|
| Qwen 3 0.6B | 586 MB | Apache-2.0 | Smallest general-purpose chat |
| Qwen 2.5 1.5B Instruct (q8) | 1.49 GB | Apache-2.0 | Balanced quality / size |
| DeepSeek R1 Distill Qwen 1.5B | 1.71 GB | MIT | Reasoning / chain-of-thought |
| Gemma 3n E2B Instruct | 2.46 GB | Apache-2.0 | Google flagship, ungated |
| Gemma 3n E4B Instruct | 3.40 GB | Apache-2.0 | Highest quality, ~5 GB RAM at load |
Models in the litert-community/Gemma3-* repos are gated under the Gemma
license and require a HuggingFace token to download — visit the model's HF
page first to accept the terms.
You can drop a .litertlm file anywhere readable by your app and pass the
absolute path as modelPath. For app-private storage, use
path_provider to resolve a
writable directory.
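Resolving an app-private modelPath with path_provider might look like the sketch below. The models/ subdirectory and file name are illustrative, and the download itself (e.g. with package:http or a background downloader) is left out:

```dart
import 'dart:io';

import 'package:path_provider/path_provider.dart';

/// Resolve a writable, app-private location for a model file.
/// Pass the returned path as modelPath once the file has been
/// downloaded there (download step not shown).
Future<String> modelPathFor(String fileName) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/models/$fileName');
  await file.parent.create(recursive: true); // ensure models/ exists
  return file.path;
}
```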
API reference #
LiteLmEngine #
| Member | Description |
|---|---|
| LiteLmEngine.create(config) | Load a model and initialize the engine |
| engine.createConversation([cfg]) | Open a new conversation |
| engine.countTokens(text) | Tokenize and count (currently returns -1 — the upstream API doesn't expose this yet) |
| engine.dispose() | Release native handles |
LiteLmConversation #
| Member | Description |
|---|---|
| sendMessage(text, {extraContext}) | Full reply, awaited |
| sendMultimodalMessage(contents, {extraContext}) | Mixed text + image + audio |
| sendMessageStream(text, {extraContext}) | Stream<LiteLmMessage> of token deltas |
| sendToolResponse(name, result, {extraContext}) | Reply to a tool call |
| dispose() | Release the conversation |
Configuration types #
- LiteLmEngineConfig — modelPath, backend, cacheDir, visionBackend, audioBackend
- LiteLmConversationConfig — systemInstruction, initialMessages, samplerConfig, tools, automaticToolCalling
- LiteLmSamplerConfig — temperature, topK, topP
- LiteLmBackend — cpu, gpu, npu
Example app #
A full reference implementation lives in example/ and includes:
- Curated model picker with on-device download + progress
- Backend selector (CPU / GPU / NPU)
- Streaming chat with typing indicator
- Per-response inference stats (tokens, tok/s, TTFT, total duration)
cd example
flutter run --release
Troubleshooting #
NoSuchMethodError: Lcom/google/ai/edge/litertlm/SamplerConfig;.getTopK()I
in release builds — R8 stripped the JNI surface. The plugin already ships
keep rules in consumer-rules.pro that AGP merges into your app
automatically; if you somehow still hit this, copy those rules into your
own proguard-rules.pro.
Cannot find OpenCL library on this device on real Android phones with
GPU backend — your app's manifest is missing <uses-native-library>. The
plugin ships these entries and AGP merges them into your manifest, so this
should be automatic. If you've disabled manifest merging, add to your app's
<application>:
<uses-native-library android:name="libOpenCL.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
<uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
App is killed silently after loading a large model — out-of-memory. Modern phones often have 6–8 GB RAM but a third of that is used by the system. Models bigger than ~2.5 GB on disk can blow past what's available once loaded. Try a smaller model (Qwen 3 0.6B, Gemma 3 1B) or close background apps.
Streaming feels choppy — make sure you're using sendMessageStream, not
sendMessage. The latter blocks until the entire response is generated.
Contributing #
Contributions are welcome! Please see CONTRIBUTING.md for the workflow. By contributing you agree that your work will be released under this project's Apache 2.0 license.
License #
Licensed under the Apache License 2.0 — see LICENSE for the full text. The same license as upstream LiteRT-LM.
Acknowledgements #
- Google AI Edge for the LiteRT-LM runtime and the litert-community model collection.
- HuggingFace for hosting the model artifacts.
- The Flutter team for the plugin platform interface.