Onde Inference

On-device LLM inference for Flutter & Dart, optimized for Apple silicon.


Rust SDK · Swift SDK · Website


Features

  • 🚀 On-device inference: models run entirely on the local CPU or GPU; no network request is ever made during inference
  • ⚡ Metal acceleration on iOS and macOS (Apple silicon) for fast token generation
  • 💬 Multi-turn chat with automatic conversation history management
  • 🌊 Streaming token delivery via Dart Stream<StreamChunk>: display tokens as they are generated
  • 🤖 Qwen 2.5 1.5B and 3B GGUF Q4_K_M models, downloaded from HuggingFace Hub on first use and cached locally
  • 🎛️ Configurable sampling: temperature, top-p, top-k, min-p, max tokens, frequency/presence penalties
  • 📱 Platform-aware defaults: automatically selects the 1.5B model on mobile and the 3B model on desktop
  • 🦀 Rust core: the inference engine is written in Rust for safety, performance, and zero-overhead FFI

Platform support

| Platform          | GPU backend | Default model            | Notes                                |
|-------------------|-------------|--------------------------|--------------------------------------|
| iOS 13+           | Metal       | Qwen 2.5 1.5B (~941 MB)  | Simulator uses aarch64-apple-ios-sim |
| macOS 10.15+      | Metal       | Qwen 2.5 3B (~1.93 GB)   | Apple silicon & Intel supported      |
| Android (API 21+) | CPU         | Qwen 2.5 1.5B (~941 MB)  | arm64-v8a, armeabi-v7a, x86_64, x86  |
| Linux (x86_64)    | CPU         | Qwen 2.5 3B (~1.93 GB)   | CUDA builds possible; see docs       |
| Windows (x86_64)  | CPU         | Qwen 2.5 3B (~1.93 GB)   | CUDA builds possible; see docs       |

Web is not supported. On-device LLM inference requires native system access that is not available in a browser sandbox.
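If an app targets web alongside native platforms, one way to handle this (a sketch using Flutter's kIsWeb; the helper name is illustrative, not part of the SDK) is to skip initialization on web builds:

```dart
import 'package:flutter/foundation.dart' show kIsWeb;
import 'package:onde_inference/onde_inference.dart';

/// Initialize on-device inference only where it is supported.
Future<void> initInferenceIfSupported() async {
  // On-device inference needs native system access, so skip it on web.
  if (kIsWeb) return;
  await OndeInference.init();
}
```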


Quick start

Add the dependency

dependencies:
  onde_inference: ^0.1.0

Note: The native inference engine is written in Rust and compiled automatically during the Flutter build. A working Rust toolchain is required. The first build compiles the full dependency tree (~5-10 minutes cold, under 1 minute incremental).
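To confirm the prerequisite before the first build, a quick check like this may help (plain shell; rustup.rs is the official Rust installer site):

```shell
# Check that a Rust toolchain is on PATH; the Flutter build
# invokes cargo to compile the native engine.
if command -v cargo >/dev/null 2>&1; then
  cargo --version
else
  echo "Rust not found - install it via https://rustup.rs"
fi
```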

Initialize

Call OndeInference.init() once at application startup, before creating any OndeChatEngine:

import 'package:onde_inference/onde_inference.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}

Load a model

// Create the engine (no model is loaded yet).
final engine = await OndeChatEngine.create();

// Load the platform-appropriate default model.
// On iOS / Android → Qwen 2.5 1.5B (~941 MB)
// On macOS / Linux / Windows → Qwen 2.5 3B (~1.93 GB)
final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);
print('Model loaded in ${elapsed.toStringAsFixed(1)} s');

Chat

final result = await engine.sendMessage("What is Rust's ownership model?");
print(result.text);
print('Generated in ${result.durationDisplay}');

Stream

final buffer = StringBuffer();

await for (final chunk in engine.streamMessage('Tell me a short story.')) {
  buffer.write(chunk.delta);

  // Update your UI with the partial text on each chunk.
  setState(() => _displayText = buffer.toString());

  if (chunk.done) break;
}

Check engine status

final info = await engine.info();

print(info.status);        // EngineStatus.ready
print(info.modelName);     // "Qwen 2.5 3B"
print(info.approxMemory);  // "~1.93 GB"
print(info.historyLength); // number of turns in the conversation

Manage conversation history

// Retrieve the full history.
final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

// Clear history (keeps the model loaded).
final removed = await engine.clearHistory();
print('Cleared $removed messages.');

// Seed history from a saved session without running inference.
await engine.pushHistory(ChatMessage.user('Hello from last session!'));
await engine.pushHistory(ChatMessage.assistant('Hi! How can I help today?'));

One-shot generation (does not affect history)

// Useful for prompt enhancement, classification, summarisation, etc.
final result = await engine.generate(
  [
    ChatMessage.system('You are a JSON formatter. Output only valid JSON.'),
    ChatMessage.user('Name: Alice, Age: 30, City: Stockholm'),
  ],
  sampling: SamplingConfig.deterministic(),
);
print(result.text);

Unload the model

// Release GPU / CPU memory when inference is no longer needed.
await engine.unloadModel();

Model selection

Use OndeInference static helpers to pick a specific model:

// Platform-aware default (recommended).
final config = OndeInference.defaultModelConfig();

// Force a specific model regardless of platform.
final small  = OndeInference.qwen251_5bConfig();   // ~941 MB
final medium = OndeInference.qwen253bConfig();      // ~1.93 GB
final coder  = OndeInference.qwen25Coder3bConfig(); // ~1.93 GB, code-tuned

await engine.loadGgufModel(
  medium,
  systemPrompt: 'You are an expert software engineer.',
);

Supported models

| Model                               | Size     | Best for                   |
|-------------------------------------|----------|----------------------------|
| Qwen 2.5 1.5B Instruct Q4_K_M       | ~941 MB  | iOS, tvOS, Android         |
| Qwen 2.5 3B Instruct Q4_K_M         | ~1.93 GB | macOS, Linux, Windows      |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB  | Code generation on mobile  |
| Qwen 2.5 Coder 3B Instruct Q4_K_M   | ~1.93 GB | Code generation on desktop |

Sampling

// All fields are optional; null means "use the engine default".
final sampling = SamplingConfig(
  temperature: 0.7,    // Higher = more creative, lower = more focused
  topP: 0.95,          // Nucleus sampling cutoff
  topK: 40,            // Top-k token limit
  maxTokens: 256,      // Maximum reply length in tokens
);

await engine.setSampling(sampling);

// Or use a preset:
await engine.setSampling(SamplingConfig.deterministic()); // greedy, temp=0.0
await engine.setSampling(SamplingConfig.mobile());        // temp=0.7, max 128 tokens
await engine.setSampling(SamplingConfig.defaultConfig()); // temp=0.7, max 512 tokens

Error handling

All OndeChatEngine methods throw OndeException on failure:

try {
  await engine.loadDefaultModel();
} on OndeException catch (e) {
  debugPrint('Inference error: ${e.message}');
}

Common causes:

  • No model loaded: calling sendMessage before loadDefaultModel / loadGgufModel
  • Download failure: check internet connectivity on first run (model files are fetched from HuggingFace Hub)
  • Out of memory: the 3B model requires ~2 GB of free RAM; use the 1.5B model on constrained devices
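For the out-of-memory case, one pattern is to retry with the smaller model (a sketch built from the API shown above; the fallback helper itself is not part of the SDK):

```dart
import 'package:onde_inference/onde_inference.dart';

/// Try the 3B model first, falling back to the 1.5B model if
/// loading fails (e.g. on memory-constrained devices).
Future<void> loadWithFallback(OndeChatEngine engine) async {
  try {
    await engine.loadGgufModel(OndeInference.qwen253bConfig());
  } on OndeException {
    await engine.loadGgufModel(OndeInference.qwen251_5bConfig());
  }
}
```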

Sandboxed app setup (iOS / macOS)

import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  // Resolve shared App Group container (iOS/macOS) or private sandbox (Android).
  String? fallback;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallback = dir.path;
  }
  await OndeInference.setupCacheDir(fallbackDir: fallback);

  runApp(const MyApp());
}

On iOS and macOS, setupCacheDir() first tries the App Group shared container (group.com.ondeinference.apps) so all Onde-powered apps share downloaded models. If unavailable, it falls back to the app's private directory.


Contributing

Contributions are welcome! The project is hosted at github.com/ondeinference/onde.

  • Rust source: onde/src/
  • Dart bridge Rust crate: onde/sdk/dart/rust/
  • Dart library: onde/sdk/dart/lib/
  • Example app: onde/sdk/dart/example/

Please open an issue before submitting a pull request for significant changes.


License

MIT © Splitfire AB; see LICENSE.
