# RunAnywhere LlamaCpp Backend
High-performance LLM text generation backend for the RunAnywhere Flutter SDK, powered by llama.cpp.
## Features
| Feature | Description |
|---|---|
| GGUF Model Support | Run any GGUF-quantized model (Q4, Q5, Q8, etc.) |
| Streaming Generation | Token-by-token streaming for real-time UI updates |
| Metal Acceleration | Hardware acceleration on iOS devices |
| NEON Acceleration | ARM NEON optimizations on Android |
| Privacy-First | All processing happens locally on device |
| Memory Efficient | Quantized models reduce memory footprint |
## Installation
Add both the core SDK and this backend to your pubspec.yaml:
dependencies:
  runanywhere: ^0.15.11
  runanywhere_llamacpp: ^0.16.0
Then run:
flutter pub get
Note: This package requires the core `runanywhere` package. It won't work standalone.
## Platform Support
| Platform | Minimum Version | Acceleration |
|---|---|---|
| iOS | 14.0+ | Metal GPU |
| Android | API 24+ | NEON SIMD |
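Since inference is only wired up for these two platforms, a guard like the following can keep registration out of unsupported builds. This is a minimal sketch using the standard dart:io Platform checks; the SDK itself may already handle unsupported platforms gracefully.

import 'dart:io' show Platform;

// Only register the llama.cpp backend on the platforms listed above.
if (Platform.isIOS || Platform.isAndroid) {
  await LlamaCpp.register();
}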
## Quick Start
### 1. Initialize & Register
import 'package:flutter/material.dart';
import 'package:runanywhere/runanywhere.dart';
import 'package:runanywhere_llamacpp/runanywhere_llamacpp.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();

  // Initialize the SDK
  await RunAnywhere.initialize();

  // Register the LlamaCpp backend
  await LlamaCpp.register();

  runApp(MyApp());
}
### 2. Add a Model
LlamaCpp.addModel(
  id: 'smollm2-360m-q8_0',
  name: 'SmolLM2 360M Q8_0',
  url: 'https://huggingface.co/prithivMLmods/SmolLM2-360M-GGUF/resolve/main/SmolLM2-360M.Q8_0.gguf',
  memoryRequirement: 500000000, // ~500MB
);
### 3. Download & Load
// Download the model
await for (final progress in RunAnywhere.downloadModel('smollm2-360m-q8_0')) {
  print('Progress: ${(progress.percentage * 100).toStringAsFixed(1)}%');
  if (progress.state.isCompleted) break;
}

// Load the model
await RunAnywhere.loadModel('smollm2-360m-q8_0');
print('Model loaded: ${RunAnywhere.isModelLoaded}');
### 4. Generate Text
// Simple chat
final response = await RunAnywhere.chat('Hello! How are you?');
print(response);

// Streaming generation
final result = await RunAnywhere.generateStream(
  'Write a short poem about Flutter',
  options: LLMGenerationOptions(maxTokens: 100, temperature: 0.7),
);

await for (final token in result.stream) {
  stdout.write(token); // Real-time output (stdout comes from dart:io)
}

// Get metrics after completion
final metrics = await result.result;
print('\nTokens/sec: ${metrics.tokensPerSecond.toStringAsFixed(1)}');
## API Reference
### LlamaCpp Class
#### register()
Register the LlamaCpp backend with the SDK.
static Future<void> register({int priority = 100})
Parameters:
- `priority` – Backend priority (higher = preferred). Default: 100.
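For example, if several LLM backends are registered, a higher priority makes this one the preferred choice. The value 200 below is illustrative, not a recommended setting:

// Prefer the llama.cpp backend over other registered LLM backends.
await LlamaCpp.register(priority: 200);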
#### addModel()
Add an LLM model to the registry.
static void addModel({
  required String id,
  required String name,
  required String url,
  int memoryRequirement = 0,
  bool supportsThinking = false,
})
Parameters:
- `id` – Unique model identifier
- `name` – Human-readable model name
- `url` – Download URL for the GGUF file
- `memoryRequirement` – Estimated memory usage in bytes
- `supportsThinking` – Whether the model supports thinking tokens (e.g., DeepSeek R1)
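As a sketch, a reasoning model that emits thinking tokens would be registered with `supportsThinking: true`. The id, name, and URL below are placeholders, not a verified download:

// Hypothetical entry for a reasoning model; replace the URL with a real GGUF file.
LlamaCpp.addModel(
  id: 'deepseek-r1-distill-1.5b-q4_k_m',
  name: 'DeepSeek R1 Distill 1.5B Q4_K_M',
  url: 'https://example.com/deepseek-r1-distill-1.5b-q4_k_m.gguf',
  memoryRequirement: 1500000000, // ~1.5 GB
  supportsThinking: true,
);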
## Supported Models
Any GGUF model compatible with llama.cpp can be used.
### Recommended Models
| Model | Size | Memory | Use Case |
|---|---|---|---|
| SmolLM2 360M Q8_0 | ~400MB | ~500MB | Fast responses, mobile |
| Qwen2.5 0.5B Q8_0 | ~600MB | ~700MB | Good quality, small |
| Qwen2.5 1.5B Q4_K_M | ~1GB | ~1.2GB | Better quality |
| Phi-3.5-mini Q4_K_M | ~2GB | ~2.5GB | High quality |
| Llama 3.2 1B Q4_K_M | ~800MB | ~1GB | Balanced |
| DeepSeek R1 1.5B Q4_K_M | ~1.2GB | ~1.5GB | Reasoning, thinking |
### Quantization Guide
| Format | Quality | Size | Speed |
|---|---|---|---|
| Q8_0 | Highest | Largest | Slower |
| Q6_K | Very High | Large | Medium |
| Q5_K_M | High | Medium | Medium |
| Q4_K_M | Good | Small | Fast |
| Q4_0 | Lower | Smallest | Fastest |
Tip: For mobile, Q4_K_M or Q5_K_M offers the best quality/size balance.
## Memory Management
### Checking Memory
// Get available models with their memory requirements
final models = await RunAnywhere.availableModels();
for (final model in models) {
  if (model.downloadSize != null) {
    print('${model.name}: ${(model.downloadSize! / 1e9).toStringAsFixed(1)} GB');
  }
}
### Unloading Models
// Unload to free memory
await RunAnywhere.unloadModel();
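One common pattern is to release the model when the app is backgrounded. The observer below is a sketch built on standard Flutter lifecycle callbacks; it assumes unloadModel() is safe to call from such a callback:

import 'package:flutter/widgets.dart';
import 'package:runanywhere/runanywhere.dart';

// Frees the loaded model when the app goes to the background.
class ModelLifecycleObserver with WidgetsBindingObserver {
  @override
  void didChangeAppLifecycleState(AppLifecycleState state) {
    if (state == AppLifecycleState.paused) {
      RunAnywhere.unloadModel();
    }
  }
}

// Register once, e.g. in main():
// WidgetsBinding.instance.addObserver(ModelLifecycleObserver());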
## Generation Options
final result = await RunAnywhere.generate(
  'Your prompt here',
  options: LLMGenerationOptions(
    maxTokens: 200,       // Maximum tokens to generate
    temperature: 0.7,     // Randomness (0.0 = deterministic, 1.0 = creative)
    topP: 0.9,            // Nucleus sampling
    systemPrompt: 'You are a helpful assistant.',
  ),
);
| Option | Default | Range | Description |
|---|---|---|---|
| `maxTokens` | 100 | 1-4096 | Maximum tokens to generate |
| `temperature` | 0.8 | 0.0-2.0 | Response randomness |
| `topP` | 1.0 | 0.0-1.0 | Nucleus sampling threshold |
| `systemPrompt` | null | - | System prompt prepended to input |
## Troubleshooting
### Model Loading Fails
Symptom: SDKError.modelLoadFailed
Solutions:
- Verify the model is fully downloaded (check `model.isDownloaded`; see the sketch below)
- Ensure sufficient memory is available
- Check that the model format is GGUF (not GGML or safetensors)
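A pre-flight check along these lines can rule out the first two causes. It assumes the entries returned by availableModels() expose id and isDownloaded fields; the exact field names may differ in your SDK version:

Future<void> loadWithCheck(String modelId) async {
  final models = await RunAnywhere.availableModels();
  final model = models.firstWhere((m) => m.id == modelId);

  if (!model.isDownloaded) {
    // Download (or resume) before attempting to load.
    await for (final progress in RunAnywhere.downloadModel(modelId)) {
      if (progress.state.isCompleted) break;
    }
  }

  await RunAnywhere.loadModel(modelId);
}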
### Slow Generation
Solutions:
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Use a smaller model
- Reduce `maxTokens`
- On iOS, ensure Metal is available (device not in Low Power Mode)
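To check whether a change actually helps, compare the reported throughput before and after, using the same streaming APIs shown in the Quick Start:

final result = await RunAnywhere.generateStream(
  'Summarize Flutter in two sentences.',
  options: LLMGenerationOptions(maxTokens: 64, temperature: 0.7),
);
await for (final _ in result.stream) {} // drain the stream
final metrics = await result.result;
print('Throughput: ${metrics.tokensPerSecond.toStringAsFixed(1)} tokens/sec');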
### Out of Memory
Solutions:
- Unload the current model before loading a new one (see the sketch below)
- Use smaller quantization
- Use a smaller model
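When switching models, unload first so both never sit in memory at once. The model id below is the one registered earlier in this README:

if (RunAnywhere.isModelLoaded) {
  await RunAnywhere.unloadModel();
}
await RunAnywhere.loadModel('smollm2-360m-q8_0');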
## Related Packages
- runanywhere — Core SDK (required)
- runanywhere_llamacpp — LLM backend (this package)
- runanywhere_onnx — STT/TTS/VAD backend
## License
This software is licensed under the RunAnywhere License, which is based on Apache 2.0 with additional terms for commercial use. See LICENSE for details.
For commercial licensing inquiries, contact: san@runanywhere.ai