onde_inference 1.1.0
On-device LLM inference for Flutter & Dart. Run Qwen 2.5 models locally with Metal on iOS and macOS, CPU on Android and desktop. No cloud, no API key.
Onde Inference
Run LLMs on-device from Flutter and Dart with Onde Inference. Metal on iOS and macOS, CPU everywhere else.
Rust SDK · Swift SDK · Kotlin Multiplatform SDK · React Native SDK · Website
Run Qwen 2.5 models directly inside your Flutter app. The model downloads from Hugging Face on first launch, then everything runs locally. No server, no API key, and no user data leaves the device. On an iPhone 15 Pro, Metal reaches around 15 tok/s. Android, Linux, and Windows run on CPU, so they are slower but still useful for fully local inference.
You get multi-turn chat, streaming, one-shot generation, configurable sampling, and structured tool call metadata in one package.
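To put the roughly 15 tok/s figure in perspective, a back-of-the-envelope estimate can help when picking a token budget. The sketch below is plain Dart with no Onde dependency. `estimateGenerationTime` is a hypothetical helper, and it ignores model load and prompt processing time, so treat the result as a lower bound rather than a measurement.

```dart
/// Rough time needed to generate [tokens] at a steady [tokensPerSecond].
/// Hypothetical helper, not part of onde_inference.
Duration estimateGenerationTime(int tokens, double tokensPerSecond) {
  if (tokens < 0 || tokensPerSecond <= 0) {
    throw ArgumentError('tokens must be >= 0 and tokensPerSecond must be > 0');
  }
  final seconds = tokens / tokensPerSecond;
  return Duration(milliseconds: (seconds * 1000).round());
}

void main() {
  // 256 tokens at the ~15 tok/s quoted above for an iPhone 15 Pro.
  final estimate = estimateGenerationTime(256, 15);
  print('About ${estimate.inSeconds} s for 256 tokens');
}
```

CPU-only platforms are described as slower, so the same budget will take longer there. The source gives no CPU throughput figure, so none is assumed here.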
Platform support
| Platform | Backend | Default model | Notes |
|---|---|---|---|
| iOS 13+ | Metal | Qwen 2.5 Coder 1.5B (~941 MB) | CocoaPods and Swift Package Manager plugin manifests are included |
| macOS 10.15+ | Metal | Qwen 2.5 Coder 3B (~1.93 GB) | CocoaPods and Swift Package Manager plugin manifests are included |
| Android API 21+ | CPU | Qwen 2.5 Coder 1.5B (~941 MB) | arm64-v8a, armeabi-v7a, x86_64, x86 |
| Linux x86_64 | CPU | Qwen 2.5 Coder 3B (~1.93 GB) | CUDA possible, see docs |
| Windows x86_64 | CPU | Qwen 2.5 Coder 3B (~1.93 GB) | CUDA possible, see docs |
Web is not supported. On-device inference needs native system access that browsers do not expose.
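If an app needs to show the user what to expect before a download starts, the table above can be mirrored as a small lookup. This is only a sketch: `PlatformDefaults` and `defaultsFor` are hypothetical names, the data is copied by hand from the table, and the package may well expose this information another way. The keys are the strings that `Platform.operatingSystem` from `dart:io` reports on native targets.

```dart
/// One row of the platform table. Hypothetical type, not part of onde_inference.
class PlatformDefaults {
  const PlatformDefaults(this.backend, this.defaultModel);
  final String backend;
  final String defaultModel;
}

/// Rows keyed by the values `Platform.operatingSystem` reports in dart:io.
const _table = <String, PlatformDefaults>{
  'ios': PlatformDefaults('Metal', 'Qwen 2.5 Coder 1.5B'),
  'macos': PlatformDefaults('Metal', 'Qwen 2.5 Coder 3B'),
  'android': PlatformDefaults('CPU', 'Qwen 2.5 Coder 1.5B'),
  'linux': PlatformDefaults('CPU', 'Qwen 2.5 Coder 3B'),
  'windows': PlatformDefaults('CPU', 'Qwen 2.5 Coder 3B'),
};

/// Returns null for anything not in the table (for example web or Fuchsia).
PlatformDefaults? defaultsFor(String operatingSystem) =>
    _table[operatingSystem];

void main() {
  for (final os in ['ios', 'linux', 'fuchsia']) {
    final row = defaultsFor(os);
    print(row == null
        ? '$os: not supported'
        : '$os: ${row.backend}, ${row.defaultModel}');
  }
}
```

Taking the operating system as a string keeps the lookup free of `dart:io`, so it can be unit-tested anywhere.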
Quick start
dependencies:
  onde_inference: ^1.0.2
The inference engine is written in Rust and connected to Dart through flutter_rust_bridge. You need a working Rust toolchain. The first build is usually slow because it compiles the full native dependency tree.
Initialize
Call this once at startup before creating any OndeChatEngine:
import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';
Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}
Load a model
final engine = OndeChatEngine();
final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);
print('Model loaded in ${elapsed.toStringAsFixed(1)} s');
For production, you can load the model assigned to your Onde app from the dashboard:
final assignedElapsed = await engine.loadAssignedModel(
  appId: 'your-app-id',
  appSecret: 'your-app-secret',
  systemPrompt: 'You are a helpful assistant.',
);
print('Assigned model loaded in ${assignedElapsed.toStringAsFixed(1)} s');
Chat
final result = await engine.sendMessage(
  message: 'What is Rust ownership?',
);
print(result.text);
print(result.durationDisplay);
print(result.toolCalls);
Stream
final buffer = StringBuffer();
await for (final chunk in engine.streamMessage(message: 'Tell me a short story.')) {
  buffer.write(chunk.delta);
  if (chunk.done) break;
}
print(buffer.toString());
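In a UI you usually want each partial result as it arrives, not only the final string. The accumulation loop above can be wrapped in a small generic helper. This is a sketch: `collectDeltas` is a hypothetical name, and because the concrete chunk type is not shown in this README, the helper takes accessor callbacks instead of assuming one. The demo uses a stand-in stream of Dart 3 records so it runs without a model.

```dart
/// Concatenates streamed text pieces and reports each partial result.
/// [delta] and [done] read the fields of whatever chunk type the stream
/// yields. Hypothetical helper, not part of onde_inference.
Future<String> collectDeltas<T>(
  Stream<T> chunks, {
  required String Function(T chunk) delta,
  required bool Function(T chunk) done,
  void Function(String partial)? onPartial,
}) async {
  final buffer = StringBuffer();
  await for (final chunk in chunks) {
    buffer.write(delta(chunk));
    onPartial?.call(buffer.toString());
    if (done(chunk)) break; // Stop at the final chunk, as in the loop above.
  }
  return buffer.toString();
}

Future<void> main() async {
  // Stand-in stream of (delta, done) records, used in place of a real engine.
  final fake = Stream.fromIterable([
    ('Once ', false),
    ('upon ', false),
    ('a time.', true),
  ]);
  final text = await collectDeltas<(String, bool)>(
    fake,
    delta: (c) => c.$1,
    done: (c) => c.$2,
    onPartial: (partial) => print(partial),
  );
  print('Final: $text');
}
```

With a loaded engine, the call would presumably be `collectDeltas(engine.streamMessage(message: '...'), delta: (c) => c.delta, done: (c) => c.done, onPartial: ...)`, with `onPartial` updating widget state.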
Status and history
final info = await engine.info();
print(info.status);
print(info.modelName);
print(info.approxMemory);
print(info.historyLength);

final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

final removed = await engine.clearHistoryCount();
print('Cleared $removed messages.');
One-shot generation
This runs inference without modifying conversation history.
final result = await engine.generate(
  messages: [
    ChatMessage(role: ChatRole.system, content: 'Output valid JSON only.'),
    ChatMessage(role: ChatRole.user, content: 'Name: Alice, Age: 30'),
  ],
  sampling: OndeInference.deterministicSamplingConfig(),
);
print(result.text);
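Even with a system prompt like the one above, a small model may wrap its JSON in extra text, so it is worth parsing defensively before using the result. The sketch below uses only `dart:convert`. `tryParseJsonObject` is a hypothetical helper, and its heuristic (take everything from the first `{` to the last `}`) is an assumption that will fail if the surrounding prose itself contains braces.

```dart
import 'dart:convert';

/// Extracts and decodes the first-to-last-brace span of [text].
/// Returns null when no JSON object can be decoded.
/// Hypothetical helper, not part of onde_inference.
Map<String, dynamic>? tryParseJsonObject(String text) {
  final start = text.indexOf('{');
  final end = text.lastIndexOf('}');
  if (start < 0 || end <= start) return null;
  try {
    final decoded = jsonDecode(text.substring(start, end + 1));
    return decoded is Map<String, dynamic> ? decoded : null;
  } on FormatException {
    return null; // The span looked like JSON but was malformed.
  }
}

void main() {
  // In an app, the input would be result.text from engine.generate(...).
  final parsed = tryParseJsonObject('Sure: {"name": "Alice", "age": 30}');
  print(parsed?['name']);
}
```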
Unload
await engine.unloadModel();
Model selection
final config = OndeInference.defaultModelConfig();
final small = OndeInference.qwen2515bConfig();
final medium = OndeInference.qwen253bConfig();
final coder = OndeInference.qwen25Coder3bConfig();

await engine.loadGgufModel(
  config: coder,
  systemPrompt: 'You are an expert software engineer.',
);
| Model | Size | Good for |
|---|---|---|
| Qwen 2.5 1.5B Instruct Q4_K_M | ~941 MB | iOS, tvOS, Android |
| Qwen 2.5 3B Instruct Q4_K_M | ~1.93 GB | macOS, Linux, Windows |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB | Code on mobile |
| Qwen 2.5 Coder 3B Instruct Q4_K_M | ~1.93 GB | Code on desktop |
Sampling
All sampling fields are optional. null means "use the engine default".
final sampling = SamplingConfig(
  temperature: 0.7,
  topP: 0.95,
  topK: BigInt.from(40),
  maxTokens: BigInt.from(256),
);
await engine.setSampling(sampling: sampling);
Presets:
OndeInference.defaultSamplingConfig();
OndeInference.deterministicSamplingConfig();
OndeInference.mobileSamplingConfig();
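For readers new to these knobs, the toy function below illustrates what `temperature` and `topK` conventionally mean in LLM sampling: top-k keeps only the k highest-scoring tokens, and temperature rescales the scores before they are turned into probabilities. This is not Onde's implementation, only a generic sketch of the technique. `filteredProbabilities` is a hypothetical name, and top-p is left out for brevity.

```dart
import 'dart:math' as math;

/// Softmax over [logits] after top-k filtering and temperature scaling.
/// Tokens outside the top k get probability 0. Illustrative only.
List<double> filteredProbabilities(
  List<double> logits, {
  double temperature = 1.0,
  int? topK,
}) {
  if (logits.isEmpty) return const [];
  if (temperature <= 0) {
    throw ArgumentError.value(temperature, 'temperature', 'must be > 0');
  }
  if (topK != null && topK < 1) {
    throw ArgumentError.value(topK, 'topK', 'must be >= 1');
  }
  // Token indices ordered from highest to lowest logit.
  final order = List<int>.generate(logits.length, (i) => i)
    ..sort((a, b) => logits[b].compareTo(logits[a]));
  final keep = order.take(topK ?? logits.length).toSet();
  // Subtract the max logit before exponentiating for numerical stability.
  final maxLogit = logits[order.first];
  final weights = [
    for (var i = 0; i < logits.length; i++)
      keep.contains(i) ? math.exp((logits[i] - maxLogit) / temperature) : 0.0,
  ];
  final total = weights.reduce((a, b) => a + b);
  return [for (final w in weights) w / total];
}

void main() {
  final logits = [2.0, 1.0, 0.5, -1.0];
  // Lower temperature and a small k concentrate probability on top tokens.
  print(filteredProbabilities(logits, temperature: 0.7, topK: 2));
  // Higher temperature with no top-k spreads probability more evenly.
  print(filteredProbabilities(logits, temperature: 1.5));
}
```

Lower temperatures and smaller `topK` values make output more repeatable, which is presumably the idea behind the deterministic preset.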
Error handling
The generated bridge throws OndeError values directly:
try {
  await engine.loadDefaultModel();
} on OndeError catch (e) {
  debugPrint('Inference error: $e');
}
Common causes include calling sendMessage before loading a model, having no internet on first run while the model still needs to download, or running out of memory.
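Of the causes listed above, a missing connection during the first-run download is the one most likely to be transient, so retrying the load with a growing delay is a reasonable option. The helper below is a generic sketch in plain Dart. `retry` and its parameters are hypothetical, and this README does not say how to tell a network failure from an out-of-memory failure, so the `retryIf` predicate is left to the caller.

```dart
/// Runs [action], retrying up to [maxAttempts] times with a doubling delay.
/// [retryIf] decides whether an error is worth retrying (default: always).
/// Hypothetical helper, not part of onde_inference.
Future<T> retry<T>(
  Future<T> Function() action, {
  int maxAttempts = 3,
  Duration initialDelay = const Duration(seconds: 2),
  bool Function(Object error)? retryIf,
}) async {
  var delay = initialDelay;
  for (var attempt = 1; ; attempt++) {
    try {
      return await action();
    } catch (error) {
      final retryable = retryIf?.call(error) ?? true;
      // Give up on the last attempt or when the caller says not to retry.
      if (attempt >= maxAttempts || !retryable) rethrow;
      await Future<void>.delayed(delay);
      delay *= 2;
    }
  }
}

Future<void> main() async {
  // Simulated action that fails twice before succeeding.
  var calls = 0;
  final value = await retry(
    () async {
      calls++;
      if (calls < 3) throw StateError('simulated download failure');
      return 'loaded';
    },
    initialDelay: const Duration(milliseconds: 10),
  );
  print('$value after $calls attempts');
}
```

With the engine, the call could look like `await retry(() => engine.loadDefaultModel(), retryIf: (e) => e is OndeError);`. Retrying will not help when the device is out of memory, so keep `maxAttempts` small.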
Sandboxed app setup (iOS / macOS / Android)
On iOS, macOS, and Android, configure the Hugging Face cache directory before loading a model. On Apple platforms, Onde first tries the shared App Group container (group.com.ondeinference.apps) and falls back to your provided directory.
import 'dart:io' show Platform;
import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  String? fallbackDir;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallbackDir = dir.path;
  }
  await OndeInference.setupCacheDir(fallbackDir: fallbackDir);

  runApp(const MyApp());
}
Example app
A full Flutter example lives in example/. It demonstrates:
- OndeChatEngine() lifecycle
- assigned-model loading with dashboard credentials
- streaming chat UI
- sampling preset switching
- cache directory setup for sandboxed platforms
Run it locally from sdk/dart/example/.
Contributing
The source lives at github.com/ondeinference/onde:
- Rust core: src/
- Dart bridge crate: sdk/dart/rust/
- Dart library: sdk/dart/lib/
- Example app: sdk/dart/example/
Open an issue before sending large PRs.
License
Onde is dual-licensed under MIT and Apache 2.0. You can use either one.
© 2026 Splitfire AB
Copyright
© 2026 Onde Inference (Splitfire AB).