Onde Inference

Run LLMs on-device from Flutter and Dart with Onde Inference. Metal on iOS and macOS, CPU everywhere else.

Rust SDK · Swift SDK · Kotlin Multiplatform SDK · React Native SDK · Website

Run Qwen 2.5 models directly inside your Flutter app. The model downloads from Hugging Face on first launch, then everything runs locally. No server, no API key, and no user data leaves the device. On an iPhone 15 Pro, Metal reaches around 15 tok/s. Android, Linux, and Windows run on CPU, so they are slower but still useful for fully local inference.
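To put the throughput figure in perspective, the arithmetic below estimates how long a reply of a given length takes at a given decode rate. It is a back-of-the-envelope sketch: `estimateSeconds` is a hypothetical helper, not part of the package, and it ignores model load time and prompt processing.

```dart
/// Rough decode time for [tokens] output tokens at [tokensPerSecond].
/// Hypothetical helper for illustration; not part of onde_inference.
double estimateSeconds(int tokens, double tokensPerSecond) {
  if (tokensPerSecond <= 0) {
    throw ArgumentError.value(
        tokensPerSecond, 'tokensPerSecond', 'must be positive');
  }
  return tokens / tokensPerSecond;
}

// Example: a 256-token reply at the ~15 tok/s quoted for an iPhone 15 Pro
// gives estimateSeconds(256, 15), roughly 17 seconds of generation.
```

On CPU-only platforms the rate is lower, so the same reply takes proportionally longer. Capping the reply length is the simplest way to bound the wait.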

You get multi-turn chat, streaming, one-shot generation, configurable sampling, and structured tool call metadata in one package.

Platform support #

Platform	Backend	Default model	Notes
iOS 13+	Metal	Qwen 2.5 Coder 1.5B (~941 MB)	CocoaPods and Swift Package Manager plugin manifests are included
macOS 10.15+	Metal	Qwen 2.5 Coder 3B (~1.93 GB)	CocoaPods and Swift Package Manager plugin manifests are included
Android API 21+	CPU	Qwen 2.5 Coder 1.5B (~941 MB)	arm64-v8a, armeabi-v7a, x86_64, x86
Linux x86_64	CPU	Qwen 2.5 Coder 3B (~1.93 GB)	CUDA possible, see docs
Windows x86_64	CPU	Qwen 2.5 Coder 3B (~1.93 GB)	CUDA possible, see docs

Web is not supported. On-device inference needs native system access that browsers do not expose.

Quick start #

dependencies:
  onde_inference: ^1.0.2

The inference engine is written in Rust and connected to Dart through flutter_rust_bridge. You need a working Rust toolchain. The first build is usually slow because it compiles the full native dependency tree.

Initialize #

Call this once at startup before creating any OndeChatEngine:

import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();
  runApp(const MyApp());
}

Load a model #

final engine = OndeChatEngine();

final elapsed = await engine.loadDefaultModel(
  systemPrompt: 'You are a helpful assistant.',
);

print('Model loaded in ${elapsed.toStringAsFixed(1)} s');

For production, you can load the model that is assigned to your app in the Onde dashboard:

final assignedElapsed = await engine.loadAssignedModel(
  appId: 'your-app-id',
  appSecret: 'your-app-secret',
  systemPrompt: 'You are a helpful assistant.',
);

print('Assigned model loaded in ${assignedElapsed.toStringAsFixed(1)} s');
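Hardcoding the app ID and secret as string literals puts them in source control. One option, sketched below, is to inject them at build time with `--dart-define` and read them through `String.fromEnvironment`. The define names `ONDE_APP_ID` and `ONDE_APP_SECRET` and the `requireCredentials` helper are assumptions made for this example, not names the package defines.

```dart
// ONDE_APP_ID and ONDE_APP_SECRET are hypothetical define names chosen for
// this sketch. Supply them at build time, for example:
//   flutter run --dart-define=ONDE_APP_ID=<id> --dart-define=ONDE_APP_SECRET=<secret>
// String.fromEnvironment yields an empty string when a define is absent.
const String ondeAppId = String.fromEnvironment('ONDE_APP_ID');
const String ondeAppSecret = String.fromEnvironment('ONDE_APP_SECRET');

/// Throws early with a readable message when a credential is missing,
/// instead of letting the model load fail later.
/// Hypothetical helper; not part of onde_inference.
void requireCredentials({required String appId, required String appSecret}) {
  if (appId.isEmpty || appSecret.isEmpty) {
    throw StateError(
        'Missing Onde credentials: pass them with --dart-define.');
  }
}
```

Call `requireCredentials(appId: ondeAppId, appSecret: ondeAppSecret)` before `engine.loadAssignedModel`, then pass the same constants as `appId` and `appSecret`. The values are still compiled into the binary, so this keeps them out of the repository but does not hide them from someone who inspects the app.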

Chat #

final result = await engine.sendMessage(
  message: 'What is Rust ownership?',
);

print(result.text);
print(result.durationDisplay);
print(result.toolCalls);

Stream #

final buffer = StringBuffer();

await for (final chunk in engine.streamMessage(message: 'Tell me a short story.')) {
  buffer.write(chunk.delta);
  if (chunk.done) break;
}

print(buffer.toString());
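In a chat UI you usually want to show the partial reply as it grows rather than print it once at the end. The sketch below wraps the loop above in a reusable helper that reports each partial string through a callback. `collectStream`, `deltaOf`, `isDone`, and `onPartial` are hypothetical names. The helper is generic so that it does not assume the name of the chunk class, which is not shown in this README.

```dart
/// Accumulates streamed text and reports each partial result.
/// Hypothetical helper; not part of onde_inference.
Future<String> collectStream<T>(
  Stream<T> chunks, {
  required String Function(T chunk) deltaOf,
  required bool Function(T chunk) isDone,
  void Function(String partial)? onPartial,
}) async {
  final buffer = StringBuffer();
  await for (final chunk in chunks) {
    buffer.write(deltaOf(chunk));
    // Hand the text so far to the caller, e.g. to update a widget.
    onPartial?.call(buffer.toString());
    // Leaving the loop cancels the underlying stream subscription.
    if (isDone(chunk)) break;
  }
  return buffer.toString();
}
```

With the engine this would look like `collectStream(engine.streamMessage(message: '...'), deltaOf: (c) => c.delta, isDone: (c) => c.done, onPartial: (text) => ...)`, where the callback updates whatever state your widget renders.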

Status and history #

final info = await engine.info();
print(info.status);
print(info.modelName);
print(info.approxMemory);
print(info.historyLength);

final history = await engine.history();
for (final msg in history) {
  print('${msg.role}: ${msg.content}');
}

final removed = await engine.clearHistoryCount();
print('Cleared $removed messages.');

One-shot generation #

This runs inference without modifying conversation history.

final result = await engine.generate(
  messages: [
    ChatMessage(role: ChatRole.system, content: 'Output valid JSON only.'),
    ChatMessage(role: ChatRole.user, content: 'Name: Alice, Age: 30'),
  ],
  sampling: OndeInference.deterministicSamplingConfig(),
);

print(result.text);
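A system prompt that asks for JSON does not guarantee the reply is valid JSON, and small models sometimes add surrounding prose. A defensive parse step, sketched here with `dart:convert`, keeps a malformed reply from crashing the app. `tryParseJsonObject` is a hypothetical helper name.

```dart
import 'dart:convert';

/// Returns the decoded object, or null when [text] is not a JSON object.
/// Hypothetical helper; not part of onde_inference.
Map<String, dynamic>? tryParseJsonObject(String text) {
  try {
    final decoded = jsonDecode(text.trim());
    // Arrays, numbers, and strings are valid JSON but not what we asked for.
    return decoded is Map<String, dynamic> ? decoded : null;
  } on FormatException {
    // jsonDecode throws FormatException on malformed input.
    return null;
  }
}
```

Pass `result.text` to the helper and treat `null` as a signal to retry the request or fall back to a default.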

Unload #

await engine.unloadModel();

Model selection #

final config = OndeInference.defaultModelConfig();
final small = OndeInference.qwen2515bConfig();
final medium = OndeInference.qwen253bConfig();
final coder = OndeInference.qwen25Coder3bConfig();

await engine.loadGgufModel(
  config: coder,
  systemPrompt: 'You are an expert software engineer.',
);

Model	Size	Good for
Qwen 2.5 1.5B Instruct Q4_K_M	~941 MB	iOS, tvOS, Android
Qwen 2.5 3B Instruct Q4_K_M	~1.93 GB	macOS, Linux, Windows
Qwen 2.5 Coder 1.5B Instruct Q4_K_M	~941 MB	Code on mobile
Qwen 2.5 Coder 3B Instruct Q4_K_M	~1.93 GB	Code on desktop
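The platform table lists the Coder variants as defaults, so choosing a general Instruct model by platform is a manual step. The sketch below turns the "Good for" column into a lookup on `Platform.operatingSystem`. `ModelTier`, `tierForOs`, and `currentTier` are hypothetical names introduced for this example.

```dart
import 'dart:io' show Platform;

/// Size tiers from the model table: small is 1.5B (~941 MB),
/// medium is 3B (~1.93 GB). Hypothetical enum; not part of onde_inference.
enum ModelTier { small, medium }

/// Maps a `Platform.operatingSystem` value to the tier the table suggests.
ModelTier tierForOs(String operatingSystem) {
  switch (operatingSystem) {
    case 'ios':
    case 'android':
      return ModelTier.small;
    case 'macos':
    case 'linux':
    case 'windows':
      return ModelTier.medium;
    default:
      throw UnsupportedError('No model tier for $operatingSystem');
  }
}

/// Tier for the platform the app is currently running on.
ModelTier currentTier() => tierForOs(Platform.operatingSystem);
```

Use `OndeInference.qwen2515bConfig()` for `ModelTier.small` and `OndeInference.qwen253bConfig()` for `ModelTier.medium`, then pass the result to `engine.loadGgufModel`.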

Sampling #

All sampling fields are optional. null means "use the engine default".

final sampling = SamplingConfig(
  temperature: 0.7,
  topP: 0.95,
  topK: BigInt.from(40),
  maxTokens: BigInt.from(256),
);

await engine.setSampling(sampling: sampling);

Presets:

OndeInference.defaultSamplingConfig();
OndeInference.deterministicSamplingConfig();
OndeInference.mobileSamplingConfig();
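For readers unfamiliar with these fields, the sketch below shows what temperature, top-k, and top-p conventionally do to a token distribution. It is a generic illustration of the standard techniques, not Onde's engine code, and the order in which a particular engine applies the filters may differ. `softmax` and `filterCandidates` are hypothetical names.

```dart
import 'dart:math' as math;

/// Temperature-scaled softmax: a lower temperature sharpens the
/// distribution, a higher one flattens it.
List<double> softmax(List<double> logits, double temperature) {
  if (temperature <= 0) {
    throw ArgumentError.value(temperature, 'temperature', 'must be positive');
  }
  final scaled = logits.map((l) => l / temperature).toList();
  final maxLogit = scaled.reduce(math.max); // subtract max for stability
  final exps = scaled.map((l) => math.exp(l - maxLogit)).toList();
  final sum = exps.reduce((a, b) => a + b);
  return exps.map((e) => e / sum).toList();
}

/// Indices of the tokens that survive top-k, then top-p (nucleus) filtering.
List<int> filterCandidates(
  List<double> probs, {
  required int topK,
  required double topP,
}) {
  // Sort token indices by probability, highest first.
  final order = List<int>.generate(probs.length, (i) => i)
    ..sort((a, b) => probs[b].compareTo(probs[a]));
  final kept = <int>[];
  var cumulative = 0.0;
  for (final i in order.take(topK)) {
    kept.add(i);
    cumulative += probs[i];
    // Stop once the kept tokens cover at least topP of the mass.
    if (cumulative >= topP) break;
  }
  return kept;
}
```

The next token is then drawn from the kept candidates after renormalizing their probabilities, which is why tighter `topK` and `topP` values make output more predictable.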

Error handling #

The generated bridge throws OndeError values directly:

try {
  await engine.loadDefaultModel();
} on OndeError catch (e) {
  debugPrint('Inference error: $e');
}

Common causes include calling sendMessage before loading a model, having no internet on first run while the model still needs to download, or running out of memory.
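A first-run download that fails because the device is briefly offline may succeed on a later attempt. The sketch below is a small retry helper with a doubling delay. `retry`, `maxAttempts`, and `shouldRetry` are hypothetical names, and whether `OndeError` exposes enough detail to tell a network failure from an out-of-memory failure is not shown in this README, so the caller supplies the predicate.

```dart
/// Runs [action], retrying while [shouldRetry] accepts the thrown error and
/// fewer than [maxAttempts] attempts have been made. Waits [delay] before
/// the second attempt and doubles the wait after each further failure.
/// Hypothetical helper; not part of onde_inference.
Future<T> retry<T>(
  Future<T> Function() action, {
  required bool Function(Object error) shouldRetry,
  int maxAttempts = 3,
  Duration delay = const Duration(seconds: 2),
}) async {
  var wait = delay;
  for (var attempt = 1;; attempt++) {
    try {
      return await action();
    } catch (error) {
      // Give up on the last attempt or on errors the caller rejects.
      if (attempt >= maxAttempts || !shouldRetry(error)) rethrow;
      await Future<void>.delayed(wait);
      wait *= 2;
    }
  }
}
```

For example, `await retry(() => engine.loadDefaultModel(), shouldRetry: (e) => e is OndeError)`. Retrying does not help when the device is out of memory, so keep `maxAttempts` small and surface the final error to the user.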

Sandboxed app setup (iOS / macOS / Android) #

On iOS, macOS, and Android, configure the Hugging Face cache directory before loading a model. On Apple platforms, Onde first tries the shared App Group container (group.com.ondeinference.apps) and falls back to your provided directory.

import 'dart:io' show Platform;

import 'package:flutter/widgets.dart';
import 'package:onde_inference/onde_inference.dart';
import 'package:path_provider/path_provider.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await OndeInference.init();

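  // On iOS and Android, resolve a writable directory inside the app sandbox.
  // On other platforms fallbackDir stays null.
  String? fallbackDir;
  if (Platform.isIOS || Platform.isAndroid) {
    final dir = await getApplicationSupportDirectory();
    fallbackDir = dir.path;
  }

  // Configure the cache before any model is loaded.
  await OndeInference.setupCacheDir(fallbackDir: fallbackDir);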
  runApp(const MyApp());
}

Example app #

A full Flutter example lives in example/. It demonstrates:

OndeChatEngine() lifecycle
assigned-model loading with dashboard credentials
streaming chat UI
sampling preset switching
cache directory setup for sandboxed platforms

To run it locally, work from sdk/dart/example/ in the repository.

Contributing #

The source lives at github.com/ondeinference/onde:

Rust core: src/
Dart bridge crate: sdk/dart/rust/
Dart library: sdk/dart/lib/
Example app: sdk/dart/example/

Open an issue before sending large PRs.

License #

Onde is dual-licensed under MIT and Apache 2.0. You can use either one.

onde_inference 1.1.1