# Edge-Veda
A managed on-device AI runtime for Flutter — text generation, vision, speech-to-text, embeddings, RAG, and function calling, all running locally with zero cloud dependencies.
~14,000 LOC | 43 C API functions | 31 Dart SDK files | 0 cloud dependencies
## Why Edge-Veda Exists

Modern on-device AI demos break down quickly in real-world usage:
- Thermal throttling collapses throughput
- Memory spikes cause silent crashes
- Sessions longer than ~60 seconds become unstable
- Developers have no visibility into runtime behavior
Edge-Veda exists to make on-device AI predictable, observable, and sustainable — not just runnable.
## What Edge-Veda Is
A supervised on-device AI runtime that:
- Runs text, vision, speech, and embedding models fully on device
- Keeps models alive across long sessions via persistent worker isolates
- Enforces compute budget contracts (p95 latency, battery, thermal, memory)
- Auto-calibrates to each device's actual performance via adaptive profiles
- Provides structured output and function calling for agent-style apps
- Enables on-device RAG with embeddings, vector search, and retrieval-augmented generation
- Detects model uncertainty via confidence scoring and signals cloud handoff
- Is private by default (no network calls during inference)
## Installation

```yaml
dependencies:
  edge_veda: ^1.3.0
```
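Then import the package (the path below follows the standard Dart package-import convention and assumes a single entry-point library):

```dart
import 'package:edge_veda/edge_veda.dart';
```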
## Text Generation

```dart
final edgeVeda = EdgeVeda();
await edgeVeda.init(EdgeVedaConfig(
  modelPath: modelPath,
  contextLength: 2048,
  useGpu: true,
));

// Streaming
await for (final chunk in edgeVeda.generateStream('Explain recursion briefly')) {
  stdout.write(chunk.token);
}

// Blocking
final response = await edgeVeda.generate('Hello from on-device AI');
print(response.text);
```
## Multi-Turn Conversation

```dart
final session = ChatSession(
  edgeVeda: edgeVeda,
  preset: SystemPromptPreset.coder,
);

await for (final chunk in session.sendStream('Write hello world in Python')) {
  stdout.write(chunk.token);
}

// Model remembers the conversation
await for (final chunk in session.sendStream('Now convert it to Rust')) {
  stdout.write(chunk.token);
}

print('Turns: ${session.turnCount}');
print('Context: ${(session.contextUsage * 100).toInt()}%');

// Start fresh (model stays loaded)
session.reset();
```
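Because `ChatSession` exposes `contextUsage`, one simple policy for long-running sessions (a sketch using only the calls shown above) is to reset before the context window fills:

```dart
// Reset before the context window fills; the model itself stays loaded.
if (session.contextUsage > 0.8) {
  session.reset();
}
```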
## Function Calling

```dart
final tools = [
  ToolDefinition(
    name: 'get_weather',
    description: 'Get current weather for a city',
    parameters: {
      'type': 'object',
      'properties': {
        'city': {'type': 'string', 'description': 'City name'},
      },
      'required': ['city'],
    },
  ),
];

// Model selects and invokes tools, multi-round chains
final response = await session.sendWithTools(
  'What\'s the weather in Tokyo?',
  tools: tools,
  toolHandler: (call) async {
    // Execute the tool and return result
    return ToolResult.success(call.id, {'temp': '22°C', 'condition': 'Sunny'});
  },
);
```
## Structured JSON Output

```dart
// Grammar-constrained generation ensures valid JSON
final result = await session.sendStructured(
  'Extract the person\'s name and age from: "John is 30 years old"',
  schema: {
    'type': 'object',
    'properties': {
      'name': {'type': 'string'},
      'age': {'type': 'integer'},
    },
    'required': ['name', 'age'],
  },
);

// result is guaranteed-valid JSON matching the schema
```
## Speech-to-Text (Whisper)

```dart
final whisperWorker = WhisperWorker();
await whisperWorker.spawn();
await whisperWorker.initWhisper(
  modelPath: whisperModelPath,
  numThreads: 4,
);

// Streaming transcription from microphone
final whisperSession = WhisperSession(worker: whisperWorker);
whisperSession.transcriptionStream.listen((segments) {
  for (final segment in segments) {
    print(segment.text);
  }
});

// Feed audio chunks (16kHz mono Float32)
whisperSession.addAudioChunk(audioData);

await whisperWorker.dispose();
```
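Microphone plugins commonly deliver 16-bit PCM, so a conversion step is usually needed before feeding audio in. A minimal sketch (plain Dart, not part of the SDK; it assumes the capture stream is already 16 kHz mono):

```dart
import 'dart:typed_data';

/// Converts little-endian 16-bit PCM bytes to normalized Float32 samples in [-1, 1].
Float32List pcm16ToFloat32(Uint8List pcm16) {
  final samples = pcm16.buffer.asInt16List(
    pcm16.offsetInBytes,
    pcm16.lengthInBytes ~/ 2,
  );
  final out = Float32List(samples.length);
  for (var i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768.0;
  }
  return out;
}
```

The resulting `Float32List` can then be passed to `addAudioChunk`.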
## Continuous Vision Inference

```dart
final visionWorker = VisionWorker();
await visionWorker.spawn();
await visionWorker.initVision(
  modelPath: vlmModelPath,
  mmprojPath: mmprojPath,
  numThreads: 4,
  contextSize: 2048,
  useGpu: true,
);

// Process camera frames — model stays loaded across all calls
final result = await visionWorker.describeFrame(
  rgbBytes, width, height,
  prompt: 'Describe what you see.',
  maxTokens: 100,
);
print(result.description);

await visionWorker.dispose();
```
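`describeFrame` takes tightly packed RGB bytes. If your camera plugin delivers BGRA8888 frames (common on iOS), a repacking helper might look like this (illustrative sketch, not part of the SDK; assumes no row padding):

```dart
import 'dart:typed_data';

/// Repacks a BGRA8888 frame into tightly packed RGB bytes (3 bytes per pixel).
Uint8List bgraToRgb(Uint8List bgra, int width, int height) {
  final rgb = Uint8List(width * height * 3);
  for (var i = 0, j = 0; j < rgb.length; i += 4, j += 3) {
    rgb[j] = bgra[i + 2];     // R
    rgb[j + 1] = bgra[i + 1]; // G
    rgb[j + 2] = bgra[i];     // B
  }
  return rgb;
}
```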
## Text Embeddings

```dart
// Generate embeddings with any GGUF embedding model
final result = await edgeVeda.embed('The quick brown fox');
print('Dimensions: ${result.dimensions}');
print('Vector: ${result.embedding.take(5)}...');
```
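Embedding vectors are usually compared rather than read directly. A minimal cosine-similarity helper (plain Dart, not part of the SDK; assumes `embedding` is a `List<double>`) shows how two results from `embed()` can be scored against each other:

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors (1.0 = same direction).
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}
```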
## Confidence Scoring & Cloud Handoff

```dart
// Enable confidence tracking — zero overhead when disabled
final response = await edgeVeda.generate(
  'Explain quantum computing',
  options: GenerateOptions(confidenceThreshold: 0.3),
);

print('Confidence: ${response.avgConfidence}');
if (response.needsCloudHandoff) {
  // Model is uncertain — route to cloud API
  print('Low confidence, falling back to cloud');
}
```
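One way to wire the handoff signal into an app (a sketch built only on the calls shown above; the cloud fallback is supplied by the caller and is not part of the SDK):

```dart
/// Tries on-device generation first and defers to a caller-supplied cloud
/// fallback when the model reports low confidence.
Future<String> generateWithFallback(
  EdgeVeda edgeVeda,
  String prompt,
  Future<String> Function(String prompt) cloudFallback,
) async {
  final response = await edgeVeda.generate(
    prompt,
    options: GenerateOptions(confidenceThreshold: 0.3),
  );
  if (!response.needsCloudHandoff) return response.text;
  return cloudFallback(prompt);
}
```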
## On-Device RAG

```dart
// 1. Build a knowledge base
final index = VectorIndex(dimensions: 768);
final docs = ['Flutter is a UI framework', 'Dart is a language', ...];
for (final doc in docs) {
  final emb = await edgeVeda.embed(doc);
  index.add(doc, emb.embedding, metadata: {'source': 'docs'});
}

// Save index to disk
await index.save(indexPath);

// 2. Query with RAG pipeline
final rag = RagPipeline(
  edgeVeda: edgeVeda,
  index: index,
  config: RagConfig(topK: 3, minScore: 0.5),
);

final answer = await rag.query('What is Flutter?');
print(answer.text); // Answer grounded in your documents
```
## Compute Budget Contracts

Declare runtime guarantees. The Scheduler enforces them.

```dart
final scheduler = Scheduler(telemetry: TelemetryService());

// Auto-calibrates to this device's actual performance
scheduler.setBudget(EdgeVedaBudget.adaptive(BudgetProfile.balanced));
scheduler.registerWorkload(WorkloadId.vision, priority: WorkloadPriority.high);
scheduler.start();

// React to violations
scheduler.onBudgetViolation.listen((v) {
  print('${v.constraint}: ${v.currentValue} > ${v.budgetValue}');
});

// After warm-up (~40s), inspect measured baseline
final baseline = scheduler.measuredBaseline;
final resolved = scheduler.resolvedBudget;
```
| Profile | p95 Multiplier | Battery | Thermal | Use Case |
|---|---|---|---|---|
| Conservative | 2.0x | 0.6x (strict) | Floor 1 | Background workloads |
| Balanced | 1.5x | 1.0x (match) | Floor 2 | Default for most apps |
| Performance | 1.1x | 1.5x (generous) | Allow 3 | Latency-sensitive apps |
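As a rough illustration of how a profile resolves against the measured baseline (plain arithmetic, assuming the p95 multiplier scales the device's measured p95 latency; this is not SDK API):

```dart
// Illustrative arithmetic only, under the assumption that the p95 multiplier
// scales the measured baseline p95 (values here are hypothetical).
const baselineP95Ms = 1500.0;
const p95Multiplier = {'conservative': 2.0, 'balanced': 1.5, 'performance': 1.1};
final resolvedP95Ms = {
  for (final entry in p95Multiplier.entries) entry.key: baselineP95Ms * entry.value,
};
// => {conservative: 3000.0, balanced: 2250.0, performance: 1650.0}
```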
## Architecture

```
Flutter App (Dart)
|
+-- ChatSession ---------- Chat templates, context summarization, tool calling
|
+-- EdgeVeda ------------- generate(), generateStream(), embed()
|
+-- StreamingWorker ------ Persistent isolate, keeps text model loaded
+-- VisionWorker --------- Persistent isolate, keeps VLM loaded (~600MB)
+-- WhisperWorker -------- Persistent isolate, keeps Whisper model loaded
|
+-- RagPipeline ---------- embed → search → inject context → generate
+-- VectorIndex ---------- Pure Dart HNSW vector search (local_hnsw)
|
+-- Scheduler ------------ Central budget enforcer, priority-based degradation
+-- EdgeVedaBudget ------- Declarative constraints (p95, battery, thermal, memory)
+-- RuntimePolicy -------- Thermal/battery/memory QoS with hysteresis
+-- TelemetryService ----- iOS thermal, battery, memory polling
+-- FrameQueue ----------- Drop-newest backpressure for camera frames
+-- PerfTrace ------------ JSONL flight recorder for offline analysis
|
+-- FFI Bindings --------- 43 C functions via DynamicLibrary.process()
|
XCFramework (libedge_veda_full.a, ~8MB)
+-- engine.cpp ----------- Text inference + embeddings + confidence (wraps llama.cpp)
+-- vision_engine.cpp ---- Vision inference (wraps libmtmd)
+-- whisper_engine.cpp --- Speech-to-text (wraps whisper.cpp)
+-- memory_guard.cpp ----- Cross-platform RSS monitoring, pressure callbacks
+-- llama.cpp b7952 ------ Metal GPU, ARM NEON, GGUF models (unmodified)
+-- whisper.cpp v1.8.3 --- Shared ggml with llama.cpp (unmodified)
```
Key design constraint: Dart FFI is synchronous — calling native code directly would freeze the UI. All inference runs in background isolates. Native pointers never cross isolate boundaries. The StreamingWorker, VisionWorker, and WhisperWorker maintain persistent contexts so models load once and stay in memory across the entire session.
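The worker pattern itself can be sketched in plain Dart. The sketch below is a simplified illustration of the approach, not the SDK's actual implementation: the spawned isolate owns the native model handle for its whole lifetime, and only plain Dart messages ever cross the boundary.

```dart
import 'dart:isolate';

Future<SendPort> spawnInferenceWorker() async {
  final ready = ReceivePort();
  await Isolate.spawn(_workerMain, ready.sendPort);
  // The worker sends back the port it listens on for requests.
  return await ready.first as SendPort;
}

void _workerMain(SendPort mainPort) {
  final commands = ReceivePort();
  mainPort.send(commands.sendPort);
  // A real worker loads the model once here (via FFI) and keeps it resident,
  // serving every later request without reloading.
  commands.listen((message) {
    final request = message as Map;
    final prompt = request['prompt'] as String;
    final replyTo = request['replyTo'] as SendPort;
    // Synchronous FFI inference would run here, off the UI isolate.
    replyTo.send('response for: $prompt');
  });
}
```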
## Runtime Supervision
Edge-Veda continuously monitors device thermal state, available memory, and battery level, then dynamically adjusts quality of service:
| QoS Level | FPS | Resolution | Tokens | Trigger |
|---|---|---|---|---|
| Full | 2 | 640px | 100 | No pressure |
| Reduced | 1 | 480px | 75 | Thermal warning, battery <15%, memory <200MB |
| Minimal | 1 | 320px | 50 | Thermal serious, battery <5%, memory <100MB |
| Paused | 0 | -- | 0 | Thermal critical, memory <50MB |
Escalation is immediate. Restoration requires cooldown (60s per level) to prevent oscillation.
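The asymmetric escalate-fast, recover-slow behavior can be sketched as a small hysteresis state machine (an illustration of the idea, not the SDK's RuntimePolicy):

```dart
/// Escalates immediately under pressure; steps back toward full quality only
/// one level at a time, and only after a 60-second cooldown.
class QosGovernor {
  static const cooldown = Duration(seconds: 60);
  int _level = 0; // 0 = Full, 1 = Reduced, 2 = Minimal, 3 = Paused
  DateTime _lastChange = DateTime.now();

  int get level => _level;

  void report(int pressureLevel) {
    final now = DateTime.now();
    if (pressureLevel > _level) {
      _level = pressureLevel; // escalate immediately
      _lastChange = now;
    } else if (pressureLevel < _level &&
        now.difference(_lastChange) >= cooldown) {
      _level -= 1; // restore one level per cooldown window
      _lastChange = now;
    }
  }
}
```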
## Performance (Vision Soak Test)

Validated on a physical iPhone running continuous vision inference:
| Metric | Value |
|---|---|
| Sustained runtime | 12.6 minutes |
| Frames processed | 254 |
| p50 latency | 1,412 ms |
| p95 latency | 2,283 ms |
| p99 latency | 2,597 ms |
| Model reloads | 0 |
| Crashes | 0 |
| Memory stability | No growth over session |
| Thermal handling | Graceful pause and resume |
## Supported Models
Pre-configured in ModelRegistry with download URLs and SHA-256 checksums:
### Text Generation
| Model | Size | Quantization | Use Case |
|---|---|---|---|
| Llama 3.2 1B Instruct | 668 MB | Q4_K_M | General chat, instruction following |
| Phi 3.5 Mini Instruct | 2.3 GB | Q4_K_M | Reasoning, longer context |
| Gemma 2 2B Instruct | 1.6 GB | Q4_K_M | General purpose |
| TinyLlama 1.1B Chat | 669 MB | Q4_K_M | Lightweight, fast inference |
### Vision
| Model | Size | Quantization | Use Case |
|---|---|---|---|
| SmolVLM2 500M | 417 MB + 190 MB mmproj | Q8_0 / F16 | Image description, visual Q&A |
### Speech-to-Text
| Model | Size | Quantization | Use Case |
|---|---|---|---|
| Whisper Tiny | ~75 MB | GGUF | Fast transcription, lower accuracy |
| Whisper Base | ~142 MB | GGUF | Balanced speed and accuracy |
### Embeddings
| Model | Size | Quantization | Use Case |
|---|---|---|---|
| nomic-embed-text v1.5 | ~70 MB | Q4_K_M | Semantic search, RAG (768 dimensions) |
Any GGUF model compatible with llama.cpp or whisper.cpp can be loaded by file path.
## Platform Status
| Platform | GPU | Status |
|---|---|---|
| iOS (device) | Metal | Fully validated on-device |
| iOS (simulator) | CPU | Working (Metal stubs) |
| Android | CPU | Scaffolded, validation pending |
## Documentation
## License
Built on llama.cpp and whisper.cpp by Georgi Gerganov and contributors.