onde_inference 0.1.1
On-device LLM inference SDK for Flutter & Dart. Runs Qwen 2.5 models locally with Metal (Apple silicon) and CPU acceleration — no cloud, no data leaving the device. Powered by the Onde Rust engine and [...]
Onde Inference
On-device LLM inference for Flutter & Dart — optimized for Apple silicon.
Rust SDK · Swift SDK · Website
Features #
- 🚀 On-device inference — models run entirely on the local CPU or GPU; no network request is ever made during inference
- ⚡ Metal acceleration on iOS and macOS (Apple silicon) for fast token generation
- 💬 Multi-turn chat with automatic conversation history management
- 🌊 Streaming token delivery via Dart Stream<StreamChunk> — display tokens as they are generated
- 🤖 Qwen 2.5 1.5B and 3B GGUF Q4_K_M models, downloaded from HuggingFace Hub on first use and cached locally
- 🎛️ Configurable sampling — temperature, top-p, top-k, min-p, max tokens, frequency/presence penalties
- 📱 Platform-aware defaults — automatically selects the 1.5B model on mobile and the 3B model on desktop
- 🦀 Rust core — the inference engine is written in Rust for safety, performance, and zero-overhead FFI
Platform support #
| Platform | GPU backend | Default model | Notes |
|---|---|---|---|
| iOS 13+ | Metal | Qwen 2.5 1.5B (~941 MB) | Simulator uses aarch64-apple-ios-sim |
| macOS 10.15+ | Metal | Qwen 2.5 3B (~1.93 GB) | Apple silicon & Intel supported |
| Android (API 21+) | CPU | Qwen 2.5 1.5B (~941 MB) | arm64-v8a, armeabi-v7a, x86_64, x86 |
| Linux (x86_64) | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA builds possible — see docs |
| Windows (x86_64) | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA builds possible — see docs |
Web is not supported. On-device LLM inference requires native system access that is not available in a browser sandbox.
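Because web is unsupported, a multi-platform app should guard engine startup. A minimal sketch using Flutter's kIsWeb constant; the helper name tryInitOnde and the boolean fallback contract are this example's own choices, not part of the package:

```dart
import 'package:flutter/foundation.dart' show kIsWeb;
import 'package:onde_inference/onde_inference.dart';

/// Initialise the native engine only on supported platforms.
/// Returns false on web so the caller can disable the chat feature
/// or fall back to a remote API (the fallback strategy is up to you).
Future<bool> tryInitOnde() async {
  if (kIsWeb) return false; // no native engine in a browser sandbox
  await OndeInference.init();
  return true;
}
```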
Quick start #
Add the dependency #
dependencies:
  onde_inference: ^0.1.1
Note: The native inference engine is written in Rust and compiled automatically during the Flutter build. A working Rust toolchain is required. The first build compiles the full dependency tree (~5–10 minutes cold, <1 minute incremental).
Initialize #
Call OndeInference.init() once at application startup, before creating any OndeChatEngine:
import 'package:flutter/material.dart';
import 'package:onde_inference/onde_inference.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}
Load a model #
// Create the engine; no model is loaded yet.
final engine = await OndeChatEngine.create();

// Load the platform-appropriate default model.
// On iOS / Android → Qwen 2.5 1.5B (~941 MB)
// On macOS / Linux / Windows → Qwen 2.5 3B (~1.93 GB)
final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);
print('Model loaded in ${elapsed.toStringAsFixed(1)} s');
Chat #
final result = await engine.sendMessage("What is Rust's ownership model?");
print(result.text);
print('Generated in ${result.durationDisplay}');
Stream #
final buffer = StringBuffer();
await for (final chunk in engine.streamMessage('Tell me a short story.')) {
  buffer.write(chunk.delta);
  // Update your UI with the partial text on each chunk.
  setState(() => _displayText = buffer.toString());
  if (chunk.done) break;
}
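If the user should be able to stop a reply mid-generation, the same stream can be consumed through listen() instead of await for. A sketch under one stated assumption: the package does not document cancellation semantics here, so treat _sub?.cancel() as stopping UI updates, not necessarily engine-side generation:

```dart
import 'dart:async';

StreamSubscription<StreamChunk>? _sub;
final _buffer = StringBuffer();

void startStreaming(OndeChatEngine engine, String prompt) {
  _buffer.clear();
  _sub = engine.streamMessage(prompt).listen((chunk) {
    _buffer.write(chunk.delta);
    setState(() => _displayText = _buffer.toString());
  });
}

// Stop listening, e.g. from a "Stop" button in the UI.
void stopStreaming() => _sub?.cancel();
```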
Check engine status #
final info = await engine.info();
print(info.status); // EngineStatus.ready
print(info.modelName); // "Qwen 2.5 3B"
print(info.approxMemory); // "~1.93 GB"
print(info.historyLength); // number of turns in the conversation
Manage conversation history #
// Retrieve the full history.
final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}
// Clear history (keeps the model loaded).
final removed = await engine.clearHistory();
print('Cleared $removed messages.');
// Seed history from a saved session without running inference.
await engine.pushHistory(ChatMessage.user('Hello from last session!'));
await engine.pushHistory(ChatMessage.assistant('Hi! How can I help today?'));
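Combining history() and pushHistory() gives simple session persistence. A sketch with two assumptions of its own: the {role, content} map shape is this example's convention, not a package format, and msg.role is assumed to stringify to 'user' or 'assistant':

```dart
// Save: flatten the current conversation to plain maps
// (serialise them with jsonEncode, a database, etc. as you prefer).
Future<List<Map<String, String>>> saveSession(OndeChatEngine engine) async {
  final history = await engine.history();
  return [
    for (final msg in history) {'role': '${msg.role}', 'content': msg.content},
  ];
}

// Restore: replay the saved turns without running inference.
Future<void> restoreSession(
    OndeChatEngine engine, List<Map<String, String>> saved) async {
  for (final m in saved) {
    await engine.pushHistory(m['role'] == 'user'
        ? ChatMessage.user(m['content']!)
        : ChatMessage.assistant(m['content']!));
  }
}
```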
One-shot generation (does not affect history) #
// Useful for prompt enhancement, classification, summarisation, etc.
final result = await engine.generate(
  [
    ChatMessage.system('You are a JSON formatter. Output only valid JSON.'),
    ChatMessage.user('Name: Alice, Age: 30, City: Stockholm'),
  ],
  sampling: SamplingConfig.deterministic(),
);
print(result.text);
Unload the model #
// Release GPU / CPU memory when inference is no longer needed.
await engine.unloadModel();
Model selection #
Use OndeInference static helpers to pick a specific model:
// Platform-aware default (recommended).
final config = OndeInference.defaultModelConfig();
// Force a specific model regardless of platform.
final small = OndeInference.qwen251_5bConfig(); // ~941 MB
final medium = OndeInference.qwen253bConfig(); // ~1.93 GB
final coder = OndeInference.qwen25Coder3bConfig(); // ~1.93 GB, code-tuned
await engine.loadGgufModel(
  medium,
  systemPrompt: 'You are an expert software engineer.',
);
Supported models #
| Model | Size | Best for |
|---|---|---|
| Qwen 2.5 1.5B Instruct Q4_K_M | ~941 MB | iOS, tvOS, Android |
| Qwen 2.5 3B Instruct Q4_K_M | ~1.93 GB | macOS, Linux, Windows |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB | Code generation on mobile |
| Qwen 2.5 Coder 3B Instruct Q4_K_M | ~1.93 GB | Code generation on desktop |
Sampling #
// All fields are optional — null means "use the engine default".
final sampling = SamplingConfig(
  temperature: 0.7, // Higher = more creative, lower = more focused
  topP: 0.95, // Nucleus sampling cutoff
  topK: 40, // Top-k token limit
  maxTokens: 256, // Maximum reply length in tokens
);
await engine.setSampling(sampling);
// Or use a preset:
await engine.setSampling(SamplingConfig.deterministic()); // greedy, temp=0.0
await engine.setSampling(SamplingConfig.mobile()); // temp=0.7, max 128 tokens
await engine.setSampling(SamplingConfig.defaultConfig()); // temp=0.7, max 512 tokens
Error handling #
All OndeChatEngine methods throw OndeException on failure:
try {
  await engine.loadDefaultModel();
} on OndeException catch (e) {
  debugPrint('Inference error: ${e.message}');
}
Common causes:
- No model loaded — calling sendMessage before loadDefaultModel / loadGgufModel
- Download failure — check internet connectivity on first run (model files are fetched from HuggingFace Hub)
- Out of memory — the 3B model requires ~2 GB of free RAM; use the 1.5B model on constrained devices
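Since the first loadDefaultModel call downloads the model, it is the call most likely to fail on a flaky connection. A small retry wrapper can smooth this over; the attempt count and linear backoff are this example's own suggestion, not package behaviour:

```dart
/// Retry the initial model load a few times before surfacing the error.
Future<void> loadWithRetry(OndeChatEngine engine, {int attempts = 3}) async {
  for (var attempt = 1; attempt <= attempts; attempt++) {
    try {
      await engine.loadDefaultModel();
      return; // loaded successfully
    } on OndeException {
      if (attempt == attempts) rethrow; // give up after the last attempt
      await Future.delayed(Duration(seconds: 2 * attempt)); // simple backoff
    }
  }
}
```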
Sandboxed app setup (iOS / macOS) #
import 'dart:io' show Platform;
import 'package:flutter/material.dart';
import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  // On iOS and Android, provide the app's own support directory as a
  // fallback in case the shared container is unavailable.
  String? fallback;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallback = dir.path;
  }
  await OndeInference.setupCacheDir(fallbackDir: fallback);

  runApp(const MyApp());
}
On iOS and macOS, setupCacheDir() first tries the App Group shared container (group.com.ondeinference.apps) so all Onde-powered apps share downloaded models. If it is unavailable, the SDK falls back to the app's private directory.
Contributing #
Contributions are welcome! The project is hosted at github.com/ondeinference/onde.
- Rust source: onde/src/
- Dart bridge Rust crate: onde/sdk/dart/rust/
- Dart library: onde/sdk/dart/lib/
- Example app: onde/sdk/dart/example/
Please open an issue before submitting a pull request for significant changes.
License #
MIT © Splitfire AB — see LICENSE.