Onde Inference

On-device LLM inference for Flutter & Dart — Metal on iOS and macOS, CPU everywhere else.


Rust SDK · Swift SDK · React Native SDK · Website


Run Qwen 2.5 models inside your Flutter app. The model downloads from HuggingFace on first launch, then everything runs locally — no server, no API key, nothing leaves the device. Metal gives you ~15 tok/s on an iPhone 15 Pro; Android and desktop run on CPU: slower, but it works.

Multi-turn chat, streaming, one-shot generation, configurable sampling — the full API is one import away.

Platform support #

| Platform | Backend | Default model | Notes |
| --- | --- | --- | --- |
| iOS 13+ | Metal | Qwen 2.5 1.5B (~941 MB) | Simulator uses aarch64-apple-ios-sim |
| macOS 10.15+ | Metal | Qwen 2.5 3B (~1.93 GB) | Apple silicon and Intel |
| Android API 21+ | CPU | Qwen 2.5 1.5B (~941 MB) | arm64-v8a, armeabi-v7a, x86_64, x86 |
| Linux x86_64 | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA possible, see docs |
| Windows x86_64 | CPU | Qwen 2.5 3B (~1.93 GB) | CUDA possible, see docs |

Web is not supported. On-device inference needs native system access that browsers don't expose.
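
If your app also targets web, guard initialization at runtime. A minimal sketch using Flutter's kIsWeb constant; the fallback behavior is up to you:

import 'package:flutter/foundation.dart' show kIsWeb;
import 'package:onde_inference/onde_inference.dart';

Future<void> initInferenceIfSupported() async {
  if (kIsWeb) {
    // No native inference in the browser: disable the feature
    // or route requests to a remote API instead.
    return;
  }
  await OndeInference.init();
}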


Quick start #

dependencies:
  onde_inference: ^1.0.0

The inference engine is written in Rust and bridged via flutter_rust_bridge, so you need a working Rust toolchain. The first build is slow (~5–10 min, compiling the full dependency tree); incremental builds take under a minute.

Initialize #

Call once at startup, before anything else:

import 'package:flutter/material.dart';
import 'package:onde_inference/onde_inference.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}

Load a model #

final engine = await OndeChatEngine.create();

// Picks the right model for the device:
//   iOS / Android → Qwen 2.5 1.5B (~941 MB)
//   macOS / Linux / Windows → Qwen 2.5 3B (~1.93 GB)
final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);
print('Model loaded in ${elapsed.toStringAsFixed(1)} s');
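
Loading takes a while on first run (the download) and a few seconds after that, so you will usually want to show progress in the UI. A sketch with a FutureBuilder; it assumes loadDefaultModel returns the elapsed seconds as a double, consistent with the snippet above, and the widget names are illustrative:

import 'package:flutter/material.dart';
import 'package:onde_inference/onde_inference.dart';

class ModelGate extends StatefulWidget {
  const ModelGate({super.key, required this.engine, required this.child});
  final OndeChatEngine engine;
  final Widget child;

  @override
  State<ModelGate> createState() => _ModelGateState();
}

class _ModelGateState extends State<ModelGate> {
  // Created once, so rebuilds don't reload the model.
  late final Future<double> _load = widget.engine.loadDefaultModel(
    systemPrompt: 'You are a helpful assistant.',
  );

  @override
  Widget build(BuildContext context) {
    return FutureBuilder<double>(
      future: _load,
      builder: (context, snapshot) {
        if (snapshot.hasError) {
          return Center(child: Text('Model load failed: ${snapshot.error}'));
        }
        if (!snapshot.hasData) {
          return const Center(child: CircularProgressIndicator());
        }
        return widget.child; // the chat UI
      },
    );
  }
}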

Chat #

final result = await engine.sendMessage("What is Rust's ownership model?");
print(result.text);
print('Generated in ${result.durationDisplay}');
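
Because the engine keeps the conversation history between calls (see History below), follow-up messages can refer back to earlier turns:

await engine.sendMessage('My name is Alice.');
final followUp = await engine.sendMessage('What is my name?');
print(followUp.text); // the model sees the earlier turn in context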

Stream #

final buffer = StringBuffer();

await for (final chunk in engine.streamMessage('Tell me a short story.')) {
  buffer.write(chunk.delta);
  setState(() => _displayText = buffer.toString());
  if (chunk.done) break;
}
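
streamMessage returns a plain Dart Stream, so you can also drive it with a StreamSubscription and cancel when the widget is disposed. A sketch; whether cancelling also stops generation on the native side isn't documented here, so treat that as an open question:

import 'dart:async';

// Inside a State class that holds an `engine` field:
StreamSubscription? _sub;

void _startStory() {
  _sub = engine.streamMessage('Tell me a short story.').listen(
    (chunk) => setState(() => _displayText += chunk.delta),
    onError: (Object e) => debugPrint('Stream error: $e'),
  );
}

@override
void dispose() {
  _sub?.cancel(); // stop listening when the widget goes away
  super.dispose();
}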

Engine status #

final info = await engine.info();

print(info.status);        // EngineStatus.ready
print(info.modelName);     // "Qwen 2.5 3B"
print(info.approxMemory);  // "~1.93 GB"
print(info.historyLength); // number of turns so far

History #

final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

// Clear history but keep the model loaded.
final removed = await engine.clearHistory();
print('Cleared $removed messages.');

// Seed from a saved session — no inference runs.
await engine.pushHistory(ChatMessage.user('Hello from last session!'));
await engine.pushHistory(ChatMessage.assistant('Hi! How can I help today?'));
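
pushHistory makes it easy to persist a session and restore it on the next launch. A rough sketch round-tripping through JSON; it assumes msg.role stringifies to something containing 'user', so verify against the actual ChatMessage type before relying on it:

import 'dart:convert';

Future<String> saveSession(OndeChatEngine engine) async {
  final history = await engine.history();
  return jsonEncode([
    for (final msg in history)
      {'role': msg.role.toString(), 'content': msg.content},
  ]);
}

Future<void> restoreSession(OndeChatEngine engine, String json) async {
  for (final entry in jsonDecode(json) as List) {
    final content = entry['content'] as String;
    // Assumption: user roles stringify to something containing 'user'.
    await engine.pushHistory((entry['role'] as String).contains('user')
        ? ChatMessage.user(content)
        : ChatMessage.assistant(content));
  }
}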

One-shot generation #

Runs inference without touching conversation history. Good for prompt enhancement, classification, formatting.

final result = await engine.generate(
  [
    ChatMessage.system('You are a JSON formatter. Output only valid JSON.'),
    ChatMessage.user('Name: Alice, Age: 30, City: Stockholm'),
  ],
  sampling: SamplingConfig.deterministic(),
);
print(result.text);
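
Small models sometimes emit invalid JSON even when told not to, so parse defensively:

import 'dart:convert';

try {
  final data = jsonDecode(result.text) as Map<String, dynamic>;
  print(data['name']); // e.g. "Alice"
} on FormatException {
  // Invalid JSON from the model: retry, or fall back to a default.
}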

Unload #

await engine.unloadModel();
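
To swap models at runtime, unload first, then load a different config (see Model selection below). Whether loadGgufModel requires a prior unload isn't stated here, so this sketch plays it safe:

await engine.unloadModel();
await engine.loadGgufModel(
  OndeInference.qwen25Coder3bConfig(),
  systemPrompt: 'You are an expert software engineer.',
);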

Model selection #

// Platform-aware default (recommended).
final config = OndeInference.defaultModelConfig();

// Force a specific model.
final small  = OndeInference.qwen251_5bConfig();   // ~941 MB
final medium = OndeInference.qwen253bConfig();      // ~1.93 GB
final coder  = OndeInference.qwen25Coder3bConfig(); // ~1.93 GB, code-tuned

await engine.loadGgufModel(
  medium,
  systemPrompt: 'You are an expert software engineer.',
);

| Model | Size | Good for |
| --- | --- | --- |
| Qwen 2.5 1.5B Instruct Q4_K_M | ~941 MB | iOS, tvOS, Android |
| Qwen 2.5 3B Instruct Q4_K_M | ~1.93 GB | macOS, Linux, Windows |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB | Code on mobile |
| Qwen 2.5 Coder 3B Instruct Q4_K_M | ~1.93 GB | Code on desktop |
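
To force the code-tuned variant only where memory allows, you can branch on platform yourself. A sketch using just the constructors shown above (a Coder 1.5B constructor isn't listed here, so mobile falls back to the general 1.5B model):

import 'dart:io' show Platform;

final onMobile = Platform.isIOS || Platform.isAndroid;
final config = onMobile
    ? OndeInference.qwen251_5bConfig()     // fits phone memory budgets
    : OndeInference.qwen25Coder3bConfig(); // code-tuned, desktop-sized

await engine.loadGgufModel(config, systemPrompt: 'You are a coding assistant.');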

Sampling #

All fields are optional. null means "use the engine default".

final sampling = SamplingConfig(
  temperature: 0.7,
  topP: 0.95,
  topK: 40,
  maxTokens: 256,
);

await engine.setSampling(sampling);
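
Since null fields fall back to engine defaults, you can override a single knob and leave the rest alone:

// Only temperature changes; topP, topK, and maxTokens keep engine defaults.
await engine.setSampling(SamplingConfig(temperature: 0.2));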

Presets:

SamplingConfig.defaultConfig()   // temp=0.7, max 512 tokens
SamplingConfig.deterministic()   // greedy, temp=0.0
SamplingConfig.mobile()          // temp=0.7, max 128 tokens

Error handling #

All engine methods throw OndeException on failure:

try {
  await engine.loadDefaultModel();
} on OndeException catch (e) {
  debugPrint('Inference error: ${e.message}');
}

Common causes: calling sendMessage before loading a model, no internet on first run (the model needs to download), or out of memory (the 3B model needs ~2 GB free — use 1.5B on constrained devices).
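
A cheap guard against the first cause: check the engine status (from the info() call above) before sending:

Future<String> sendSafely(OndeChatEngine engine, String prompt) async {
  final info = await engine.info();
  if (info.status != EngineStatus.ready) {
    throw StateError('Load a model before chatting (status: ${info.status}).');
  }
  return (await engine.sendMessage(prompt)).text;
}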


Sandboxed app setup (iOS / macOS) #

On iOS and sandboxed macOS, the default HuggingFace cache path is outside the app container. Call setupCacheDir() once at startup to point it somewhere accessible:

import 'dart:io' show Platform;

import 'package:flutter/material.dart';
import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  String? fallback;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallback = dir.path;
  }
  await OndeInference.setupCacheDir(fallbackDir: fallback);

  runApp(const MyApp());
}

This first tries the App Group shared container (group.com.ondeinference.apps) so all Onde-powered apps share downloaded models, and falls back to the app's private directory if the App Group isn't configured.


Contributing #

Source lives at github.com/ondeinference/onde:

  • Rust core: src/
  • Dart bridge crate: sdk/dart/rust/
  • Dart library: sdk/dart/lib/
  • Example app: sdk/dart/example/

Open an issue before sending large PRs.

License #

Dual-licensed under MIT and Apache 2.0. Pick whichever works for you.

© 2026 Splitfire AB

