onde_inference 1.1.1

On-device LLM inference for Flutter & Dart. Run Qwen 2.5 models locally with Metal on iOS and macOS, CPU on Android and desktop. No cloud, no API key.

Onde Inference

Run LLMs on-device from Flutter and Dart with Onde Inference. Metal on iOS and macOS, CPU everywhere else.

Rust SDK · Swift SDK · Kotlin Multiplatform SDK · React Native SDK · Website


Run Qwen 2.5 models directly inside your Flutter app. The model downloads from Hugging Face on first launch, then everything runs locally. No server, no API key, and no user data leaves the device. On an iPhone 15 Pro, Metal reaches around 15 tok/s. Android, Linux, and Windows run on CPU, so they are slower but still useful for fully local inference.
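
To put the ~15 tok/s figure in context, here is a rough back-of-the-envelope sketch for estimating how long a reply takes to generate. The `estimateSeconds` helper is hypothetical and not part of the package. It ignores model load time and prompt processing, so treat the result as a lower bound.

```dart
/// Hypothetical helper (not part of onde_inference): estimated seconds to
/// generate [tokens] tokens at a steady [tokensPerSecond] decode rate.
double estimateSeconds(int tokens, double tokensPerSecond) {
  if (tokensPerSecond <= 0) {
    throw ArgumentError.value(
      tokensPerSecond,
      'tokensPerSecond',
      'must be positive',
    );
  }
  return tokens / tokensPerSecond;
}
```

For example, `estimateSeconds(256, 15)` gives roughly 17 seconds for a 256-token reply at the iPhone 15 Pro rate quoted above. CPU-only platforms will take longer.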

You get multi-turn chat, streaming, one-shot generation, configurable sampling, and structured tool call metadata in one package.

Platform support #

| Platform | Backend | Default model | Notes |
| --- | --- | --- | --- |
| iOS 13+ | Metal | Qwen 2.5 Coder 1.5B (~941 MB) | CocoaPods and Swift Package Manager plugin manifests are included |
| macOS 10.15+ | Metal | Qwen 2.5 Coder 3B (~1.93 GB) | CocoaPods and Swift Package Manager plugin manifests are included |
| Android API 21+ | CPU | Qwen 2.5 Coder 1.5B (~941 MB) | arm64-v8a, armeabi-v7a, x86_64, x86 |
| Linux x86_64 | CPU | Qwen 2.5 Coder 3B (~1.93 GB) | CUDA possible, see docs |
| Windows x86_64 | CPU | Qwen 2.5 Coder 3B (~1.93 GB) | CUDA possible, see docs |

Web is not supported. On-device inference needs native system access that browsers do not expose.
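
If your app needs to show the user which backend and default model to expect, the table above can be mirrored as a small lookup. This is only an illustrative sketch: `PlatformDefaults`, `platformDefaults`, and `defaultsFor` are hypothetical names, not part of the package, and the keys assume the strings returned by `Platform.operatingSystem` in `dart:io`. It assumes Dart 3 for record types.

```dart
/// Hypothetical record type mirroring two columns of the support table.
typedef PlatformDefaults = ({String backend, String model});

/// Values copied from the platform support table. Keys match the strings
/// returned by `Platform.operatingSystem` in dart:io.
const Map<String, PlatformDefaults> platformDefaults = {
  'ios': (backend: 'Metal', model: 'Qwen 2.5 Coder 1.5B'),
  'macos': (backend: 'Metal', model: 'Qwen 2.5 Coder 3B'),
  'android': (backend: 'CPU', model: 'Qwen 2.5 Coder 1.5B'),
  'linux': (backend: 'CPU', model: 'Qwen 2.5 Coder 3B'),
  'windows': (backend: 'CPU', model: 'Qwen 2.5 Coder 3B'),
};

/// Returns the documented defaults, or null for an unsupported platform.
PlatformDefaults? defaultsFor(String operatingSystem) =>
    platformDefaults[operatingSystem];
```

In an app you would call `defaultsFor(Platform.operatingSystem)` and treat a null result as "not supported". `dart:io` is itself unavailable on the web, which is consistent with the note above.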


Quick start #

dependencies:
  onde_inference: ^1.1.1

The inference engine is written in Rust and connected to Dart through flutter_rust_bridge. You need a working Rust toolchain. The first build is usually slow because it compiles the full native dependency tree.

Initialize #

Call OndeInference.init() once at startup, before creating any OndeChatEngine:

import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}

Load a model #

final engine = OndeChatEngine();

final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);

print('Model loaded in ${elapsed.toStringAsFixed(1)} s');

For production, you can load the model assigned to your Onde app from the dashboard:

final assignedElapsed = await engine.loadAssignedModel(
  appId: 'your-app-id',
  appSecret: 'your-app-secret',
  systemPrompt: 'You are a helpful assistant.',
);

print('Assigned model loaded in ${assignedElapsed.toStringAsFixed(1)} s');

Chat #

final result = await engine.sendMessage(
  message: 'What is Rust ownership?',
);

print(result.text);
print(result.durationDisplay);
print(result.toolCalls);

Stream #

final buffer = StringBuffer();

await for (final chunk in engine.streamMessage(message: 'Tell me a short story.')) {
  buffer.write(chunk.delta);
  if (chunk.done) break;
}

print(buffer.toString());
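
In a UI you usually want to render partial text as it arrives rather than wait for the full buffer. The following sketch wraps the same loop in a reusable helper. `StreamChunk` and `collectStream` are hypothetical names. Only the `delta` and `done` fields come from the snippet above, and the record shape is an assumption made so the helper stays independent of the package's chunk type. It assumes Dart 3.

```dart
/// Hypothetical stand-in for the streamed chunk: only `delta` and `done`
/// are taken from the README snippet; the record shape is an assumption.
typedef StreamChunk = ({String delta, bool done});

/// Accumulates streamed deltas and reports the text so far after each chunk.
/// Completes with the full text when a chunk with `done == true` arrives or
/// the stream closes, whichever happens first.
Future<String> collectStream(
  Stream<StreamChunk> chunks, {
  void Function(String textSoFar)? onUpdate,
}) async {
  final buffer = StringBuffer();
  await for (final chunk in chunks) {
    buffer.write(chunk.delta);
    onUpdate?.call(buffer.toString());
    if (chunk.done) break; // Ignore anything after the final chunk.
  }
  return buffer.toString();
}
```

One way to use it would be to map the engine's chunks into records, for example `engine.streamMessage(message: '...').map((c) => (delta: c.delta, done: c.done))`, and pass a callback that calls `setState` with the partial text.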

Status and history #

final info = await engine.info();
print(info.status);
print(info.modelName);
print(info.approxMemory);
print(info.historyLength);

final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

final removed = await engine.clearHistoryCount();
print('Cleared $removed messages.');

One-shot generation #

generate() runs inference on the messages you pass in without modifying the engine's conversation history.

final result = await engine.generate(
  messages: [
    ChatMessage(role: ChatRole.system, content: 'Output valid JSON only.'),
    ChatMessage(role: ChatRole.user, content: 'Name: Alice, Age: 30'),
  ],
  sampling: OndeInference.deterministicSamplingConfig(),
);

print(result.text);
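
A system prompt such as 'Output valid JSON only.' does not guarantee that the model returns well-formed JSON, so it is worth parsing defensively. A minimal sketch using `dart:convert` follows. `tryParseJsonObject` is a hypothetical helper, not part of the package.

```dart
import 'dart:convert';

/// Hypothetical helper (not part of onde_inference): returns the decoded
/// JSON object, or null when [text] is not a well-formed JSON object.
Map<String, dynamic>? tryParseJsonObject(String text) {
  try {
    final decoded = jsonDecode(text.trim());
    // Arrays, strings, and numbers are valid JSON but not objects.
    return decoded is Map<String, dynamic> ? decoded : null;
  } on FormatException {
    // The model produced prose or malformed JSON.
    return null;
  }
}
```

You could pass `result.text` to this helper and fall back to a retry or an error message when it returns null.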

Unload #

await engine.unloadModel();

Model selection #

final config = OndeInference.defaultModelConfig();
final small = OndeInference.qwen2515bConfig();
final medium = OndeInference.qwen253bConfig();
final coder = OndeInference.qwen25Coder3bConfig();

await engine.loadGgufModel(
  config: coder,
  systemPrompt: 'You are an expert software engineer.',
);

| Model | Size | Good for |
| --- | --- | --- |
| Qwen 2.5 1.5B Instruct Q4_K_M | ~941 MB | iOS, tvOS, Android |
| Qwen 2.5 3B Instruct Q4_K_M | ~1.93 GB | macOS, Linux, Windows |
| Qwen 2.5 Coder 1.5B Instruct Q4_K_M | ~941 MB | Code on mobile |
| Qwen 2.5 Coder 3B Instruct Q4_K_M | ~1.93 GB | Code on desktop |

Sampling #

All sampling fields are optional. null means "use the engine default".

final sampling = SamplingConfig(
  temperature: 0.7,
  topP: 0.95,
  topK: BigInt.from(40),
  maxTokens: BigInt.from(256),
);

await engine.setSampling(sampling: sampling);

Presets:

OndeInference.defaultSamplingConfig();
OndeInference.deterministicSamplingConfig();
OndeInference.mobileSamplingConfig();

Error handling #

The generated bridge throws OndeError values directly:

try {
  await engine.loadDefaultModel();
} on OndeError catch (e) {
  debugPrint('Inference error: $e');
}

Common causes include calling sendMessage before loading a model, having no internet on first run while the model still needs to download, or running out of memory.
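
Some of these failures are transient, such as a dropped connection during the first-run download, and may succeed on a second attempt. The sketch below shows one possible retry wrapper with linear backoff. `retry` and its parameters are hypothetical and not part of the package. The source does not say how to tell a network failure from an out-of-memory error, so the `shouldRetry` predicate is left to the caller.

```dart
/// Hypothetical helper (not part of onde_inference): runs [action] and
/// retries up to [maxAttempts] times in total while [shouldRetry] accepts
/// the thrown error. The last error is rethrown when retries run out.
Future<T> retry<T>(
  Future<T> Function() action, {
  int maxAttempts = 3,
  Duration delay = const Duration(seconds: 2),
  bool Function(Object error)? shouldRetry,
}) async {
  for (var attempt = 1; ; attempt++) {
    try {
      return await action();
    } catch (error) {
      // With no predicate, every error is treated as retryable.
      final retryable = shouldRetry?.call(error) ?? true;
      if (attempt >= maxAttempts || !retryable) rethrow;
      // Linear backoff: wait delay, then 2 * delay, and so on.
      await Future<void>.delayed(delay * attempt);
    }
  }
}
```

A call might look like `await retry(() => engine.loadDefaultModel(), shouldRetry: (e) => e is OndeError)`. Retrying will not help with an out-of-memory failure, so keep `maxAttempts` small.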


Sandboxed app setup (iOS / macOS / Android) #

On iOS, macOS, and Android, configure the Hugging Face cache directory before loading a model. On Apple platforms, Onde first tries the shared App Group container (group.com.ondeinference.apps) and falls back to your provided directory.

import 'dart:io' show Platform;

import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

  String? fallbackDir;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallbackDir = dir.path;
  }

  await OndeInference.setupCacheDir(fallbackDir: fallbackDir);
  runApp(const MyApp());
}

Example app #

A full Flutter example lives in example/. It demonstrates:

  • OndeChatEngine() lifecycle
  • assigned-model loading with dashboard credentials
  • streaming chat UI
  • sampling preset switching
  • cache directory setup for sandboxed platforms

Run it locally from sdk/dart/example/.

Contributing #

The source lives at github.com/ondeinference/onde:

  • Rust core: src/
  • Dart bridge crate: sdk/dart/rust/
  • Dart library: sdk/dart/lib/
  • Example app: sdk/dart/example/

Open an issue before sending large PRs.

License #

Onde is dual-licensed under MIT and Apache 2.0. You can use either one.

© 2026 Splitfire AB


