
On-device LLM inference SDK for Flutter & Dart. Runs Qwen 2.5 models locally with Metal (Apple silicon) and CPU acceleration — no cloud, no data leaving the device. Powered by the Onde Rust engine and mistral.rs.

Onde Inference


On-device LLM inference for Flutter & Dart — optimized for Apple silicon.

pub.dev Website App Store

Rust SDK · Swift SDK · Website


Features #

  • 🚀 On-device inference — models run entirely on the local CPU or GPU; no network request is ever made during inference
  • ⚡ Metal acceleration on iOS and macOS (Apple silicon) for fast token generation
  • 💬 Multi-turn chat with automatic conversation history management
  • 🌊 Streaming token delivery via Dart Stream<StreamChunk> — display tokens as they are generated
  • 🤖 Qwen 2.5 1.5B and 3B GGUF Q4_K_M models, downloaded from HuggingFace Hub on first use and cached locally
  • 🎛️ Configurable sampling — temperature, top-p, top-k, min-p, max tokens, frequency/presence penalties
  • 📱 Platform-aware defaults — automatically selects the 1.5B model on mobile and the 3B model on desktop
  • 🦀 Rust core — the inference engine is written in Rust for safety, performance, and zero-overhead FFI

Platform support #

| Platform | GPU backend | Default model | Notes |
|---|---|---|---|
| iOS 13+ | Metal | Qwen 2.5 1.5B (~941 MB) | Simulator uses `aarch64-apple-ios-sim` |
| macOS 10.15+ | Metal | Qwen 2.5 3B (~1.93 GB) | Apple silicon & Intel supported |
| Android (API 21+) | CPU | Qwen 2.5 1.5B (~941 MB) | arm64-v8a, armeabi-v7a, x86_64, x86 |
| Linux (x86_64) | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA builds possible — see docs |
| Windows (x86_64) | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA builds possible — see docs |

Web is not supported. On-device LLM inference requires native system access that is not available in a browser sandbox.
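If your app also ships a web build, you can gate the feature at runtime with Flutter's `kIsWeb` constant. A minimal sketch (the helper name is hypothetical; only `OndeInference.init()` comes from this package):

```dart
import 'package:flutter/foundation.dart';
import 'package:onde_inference/onde_inference.dart';

/// Hypothetical helper: initialize the engine only on supported platforms.
Future<void> initInferenceIfSupported() async {
  if (kIsWeb) {
    // Browser sandbox: no native engine available, so skip initialization
    // and let the app fall back to a disabled or cloud-backed state.
    debugPrint('onde_inference is unavailable on web builds.');
    return;
  }
  await OndeInference.init();
}
```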


Quick start #

Add the dependency #

dependencies:
  onde_inference: ^0.1.1

Note: The native inference engine is written in Rust and compiled automatically during the Flutter build. A working Rust toolchain is required. The first build compiles the full dependency tree (~5–10 minutes cold, <1 minute incremental).
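To confirm a toolchain is present before the first build, a quick check from the terminal (reporting only; rustup.rs is the standard installer if it's missing):

```shell
# Verify the Rust toolchain the build step needs is on PATH.
if command -v rustc >/dev/null 2>&1; then
  rustc --version
  cargo --version
else
  echo "Rust toolchain not found - install it from https://rustup.rs"
fi
```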

Initialize #

Call OndeInference.init() once at application startup, before creating any OndeChatEngine:

import 'package:onde_inference/onde_inference.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}

Load a model #

// Create the engine (no model is loaded yet).
final engine = await OndeChatEngine.create();

// Load the platform-appropriate default model.
// On iOS / Android → Qwen 2.5 1.5B (~941 MB)
// On macOS / Linux / Windows → Qwen 2.5 3B (~1.93 GB)
final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);
print('Model loaded in ${elapsed.toStringAsFixed(1)} s');

Chat #

final result = await engine.sendMessage("What is Rust's ownership model?");
print(result.text);
print('Generated in ${result.durationDisplay}');

Stream #

final buffer = StringBuffer();

await for (final chunk in engine.streamMessage('Tell me a short story.')) {
  buffer.write(chunk.delta);

  // Update your UI with the partial text on each chunk.
  setState(() => _displayText = buffer.toString());

  if (chunk.done) break;
}

Check engine status #

final info = await engine.info();

print(info.status);        // EngineStatus.ready
print(info.modelName);     // "Qwen 2.5 3B"
print(info.approxMemory);  // "~1.93 GB"
print(info.historyLength); // number of turns in the conversation

Manage conversation history #

// Retrieve the full history.
final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

// Clear history (keeps the model loaded).
final removed = await engine.clearHistory();
print('Cleared $removed messages.');

// Seed history from a saved session without running inference.
await engine.pushHistory(ChatMessage.user('Hello from last session!'));
await engine.pushHistory(ChatMessage.assistant('Hi! How can I help today?'));
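Building on `history()` and `pushHistory`, a session can be round-tripped through JSON. This is a sketch, not part of the SDK: the storage layer is up to you, and the exact runtime shape of `msg.role` is an assumption — adjust the role check to match what your version of the package returns.

```dart
import 'dart:convert';
import 'package:onde_inference/onde_inference.dart';

/// Serialize the current conversation to a JSON string.
Future<String> exportSession(OndeChatEngine engine) async {
  final history = await engine.history();
  return jsonEncode([
    for (final msg in history)
      {'role': '${msg.role}', 'content': msg.content},
  ]);
}

/// Re-seed a fresh engine from a previously exported session.
Future<void> restoreSession(OndeChatEngine engine, String json) async {
  for (final entry in jsonDecode(json) as List) {
    final content = entry['content'] as String;
    // Assumption: the stringified role contains "user" for user turns.
    await engine.pushHistory(
      (entry['role'] as String).contains('user')
          ? ChatMessage.user(content)
          : ChatMessage.assistant(content),
    );
  }
}
```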

One-shot generation (does not affect history) #

// Useful for prompt enhancement, classification, summarization, etc.
final result = await engine.generate(
  [
    ChatMessage.system('You are a JSON formatter. Output only valid JSON.'),
    ChatMessage.user('Name: Alice, Age: 30, City: Stockholm'),
  ],
  sampling: SamplingConfig.deterministic(),
);
print(result.text);

Unload the model #

// Release GPU / CPU memory when inference is no longer needed.
await engine.unloadModel();

Model selection #

Use OndeInference static helpers to pick a specific model:

// Platform-aware default (recommended).
final config = OndeInference.defaultModelConfig();

// Force a specific model regardless of platform.
final small  = OndeInference.qwen251_5bConfig();   // ~941 MB
final medium = OndeInference.qwen253bConfig();      // ~1.93 GB
final coder  = OndeInference.qwen25Coder3bConfig(); // ~1.93 GB, code-tuned

await engine.loadGgufModel(
  medium,
  systemPrompt: 'You are an expert software engineer.',
);

Supported models #

| Model | Size | Best for |
|---|---|---|
| Qwen 2.5 1.5B Instruct Q4_K_M | ~941 MB | iOS, tvOS, Android |
| Qwen 2.5 3B Instruct Q4_K_M | ~1.93 GB | macOS, Linux, Windows |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB | Code generation on mobile |
| Qwen 2.5 Coder 3B Instruct Q4_K_M | ~1.93 GB | Code generation on desktop |

Sampling #

// All fields are optional — null means "use the engine default".
final sampling = SamplingConfig(
  temperature: 0.7,    // Higher = more creative, lower = more focused
  topP: 0.95,          // Nucleus sampling cutoff
  topK: 40,            // Top-k token limit
  maxTokens: 256,      // Maximum reply length in tokens
);

await engine.setSampling(sampling);

// Or use a preset:
await engine.setSampling(SamplingConfig.deterministic()); // greedy, temp=0.0
await engine.setSampling(SamplingConfig.mobile());        // temp=0.7, max 128 tokens
await engine.setSampling(SamplingConfig.defaultConfig()); // temp=0.7, max 512 tokens

Error handling #

All OndeChatEngine methods throw OndeException on failure:

try {
  await engine.loadDefaultModel();
} on OndeException catch (e) {
  debugPrint('Inference error: ${e.message}');
}

Common causes:

  • No model loaded — calling sendMessage before loadDefaultModel / loadGgufModel
  • Download failure — check internet connectivity on first run (model files are fetched from HuggingFace Hub)
  • Out of memory — the 3B model requires ~2 GB of free RAM; use the 1.5B model on constrained devices
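The first failure mode can be avoided defensively. A sketch of a wrapper (the helper name is hypothetical; `info()`, `EngineStatus.ready`, `loadDefaultModel`, `sendMessage`, and `OndeException` are used as shown earlier in this README):

```dart
import 'package:flutter/foundation.dart';
import 'package:onde_inference/onde_inference.dart';

/// Hypothetical wrapper guarding against the common failure modes above.
Future<String?> safeSend(OndeChatEngine engine, String prompt) async {
  final info = await engine.info();
  if (info.status != EngineStatus.ready) {
    // No model loaded yet - load the platform default first.
    await engine.loadDefaultModel();
  }
  try {
    final result = await engine.sendMessage(prompt);
    return result.text;
  } on OndeException catch (e) {
    // Download failures and out-of-memory errors also surface here;
    // report them to the caller instead of crashing the UI.
    debugPrint('Inference error: ${e.message}');
    return null;
  }
}
```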

Sandboxed app setup (iOS / macOS) #

import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  // Resolve shared App Group container (iOS/macOS) or private sandbox (Android).
  String? fallback;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallback = dir.path;
  }
  await OndeInference.setupCacheDir(fallbackDir: fallback);

  runApp(const MyApp());
}

On iOS and macOS, setupCacheDir() first tries the App Group shared container (group.com.ondeinference.apps) so all Onde-powered apps share downloaded models. If unavailable, it falls back to the app's private directory.
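To opt into the shared container, each app must declare the App Group in its entitlements file. A sketch of the relevant entry — the group identifier comes from the paragraph above, while file names vary per project:

```xml
<!-- e.g. ios/Runner/Runner.entitlements or macos/Runner/*.entitlements -->
<key>com.apple.security.application-groups</key>
<array>
    <string>group.com.ondeinference.apps</string>
</array>
```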


Contributing #

Contributions are welcome! The project is hosted at github.com/ondeinference/onde.

  • Rust source: onde/src/
  • Dart bridge Rust crate: onde/sdk/dart/rust/
  • Dart library: onde/sdk/dart/lib/
  • Example app: onde/sdk/dart/example/

Please open an issue before submitting a pull request for significant changes.


License #

MIT © Splitfire AB — see LICENSE.
