edge_veda 1.0.0

On-device LLM inference SDK for Flutter. Run Llama, Phi, and other language models locally with Metal GPU acceleration on iOS devices.

Edge Veda SDK for Flutter #

On-device AI inference for Flutter applications. Run LLMs directly on mobile devices with hardware acceleration. Speech-to-Text and Text-to-Speech are planned (see Roadmap) and are not part of this release.

Features #

  • On-Device LLM Inference: Run large language models locally with llama.cpp
  • Hardware Acceleration: Metal (iOS), Vulkan (Android) for optimal performance
  • Streaming Support: Real-time token-by-token generation
  • Model Management: Download, verify, and cache models automatically
  • Memory Safe: Configurable memory limits with watchdog protection
  • Privacy First: 100% on-device processing, zero data transmission
  • Offline Ready: Works without internet connectivity

Performance #

  • Latency: Sub-200ms time-to-first-token on modern devices
  • Throughput: >15 tokens/sec for 1B parameter models
  • Memory: Optimized for 4GB devices with 1.5GB safe limit
  • Battery: GPU acceleration for efficient long-form generation
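
To put these figures in context, the sketch below estimates how long a response takes from the two numbers quoted above. It is a rough back-of-the-envelope model, not a benchmark: the function name and its parameters are illustrative and not part of the SDK, and the defaults (200 ms to first token, 15 tokens/sec) are simply the bounds listed in this section.

```dart
/// Rough estimate of wall-clock seconds needed to generate [tokens] tokens.
/// Assumes a fixed time-to-first-token followed by a steady decode rate.
/// Hypothetical helper for illustration; not part of edge_veda.
double estimateGenerationSeconds({
  required int tokens,
  double timeToFirstTokenMs = 200,
  double tokensPerSecond = 15,
}) {
  if (tokens < 0) {
    throw ArgumentError.value(tokens, 'tokens', 'must be non-negative');
  }
  if (tokensPerSecond <= 0) {
    throw ArgumentError.value(
        tokensPerSecond, 'tokensPerSecond', 'must be positive');
  }
  return timeToFirstTokenMs / 1000 + tokens / tokensPerSecond;
}

void main() {
  // 0.2 s to first token + 256 / 15 s of decoding.
  final seconds = estimateGenerationSeconds(tokens: 256);
  print('${seconds.toStringAsFixed(1)} s'); // 17.3 s
}
```

Since the quoted figures are bounds ("sub-200ms", ">15 tokens/sec"), the result is best read as an upper estimate for a 1B model on a device that meets them.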

Installation #

Add to your pubspec.yaml:

dependencies:
  edge_veda: ^1.0.0

Run:

flutter pub get

Platform Requirements #

iOS #

  • iOS 13.0+
  • Metal-compatible device (iPhone 6s or later)
  • Xcode 14.0+

Android #

  • Android API 24+ (Android 7.0)
  • ARM64 or ARMv7 device
  • Vulkan 1.0+ support (optional but recommended)

Quick Start #

1. Download a Model #

import 'package:edge_veda/edge_veda.dart';

final modelManager = ModelManager();

// Subscribe to progress before starting the download so no events are missed
final progressSubscription = modelManager.downloadProgress.listen((progress) {
  print('Downloading: ${progress.progressPercent}%');
});

// Download Llama 3.2 1B (recommended for most use cases)
final modelPath = await modelManager.downloadModel(
  ModelRegistry.llama32_1b,
);

// Stop listening once the download has finished
await progressSubscription.cancel();

2. Initialize Edge Veda #

final edgeVeda = EdgeVeda();

await edgeVeda.init(EdgeVedaConfig(
  modelPath: modelPath,
  useGpu: true,              // Enable hardware acceleration
  numThreads: 4,             // CPU threads for inference
  contextLength: 2048,       // Max context window
  maxMemoryMb: 1536,         // Memory safety limit
  verbose: true,             // Enable logging
));

3. Generate Text #

Single-Response Generation (awaits the full completion):

final response = await edgeVeda.generate(
  'What is the capital of France?',
  options: GenerateOptions(
    maxTokens: 100,
    temperature: 0.7,
    topP: 0.9,
    systemPrompt: 'You are a helpful assistant.',
  ),
);

print(response.text);
print('Tokens/sec: ${response.tokensPerSecond}');

Streaming Generation:

final stream = edgeVeda.generateStream(
  'Tell me a story about a robot',
  options: GenerateOptions(
    maxTokens: 256,
    temperature: 0.8,
  ),
);

await for (final chunk in stream) {
  if (!chunk.isFinal) {
    print(chunk.token); // Print each token as it arrives
  }
}
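
In a chat UI you usually want the accumulated text rather than individual tokens. The following is a minimal sketch of that pattern in plain Dart. The `Chunk` class is a hypothetical stand-in that models only the two fields used in the snippet above (`token` and `isFinal`), and the fake stream replaces `generateStream` so the example runs without a model.

```dart
/// Hypothetical stand-in for the SDK's stream chunk. Only the fields used
/// in the README snippet (token, isFinal) are modeled here.
class Chunk {
  const Chunk(this.token, {this.isFinal = false});
  final String token;
  final bool isFinal;
}

/// Concatenates every non-final token of [stream] into a single string.
Future<String> collectText(Stream<Chunk> stream) async {
  final buffer = StringBuffer();
  await for (final chunk in stream) {
    if (!chunk.isFinal) buffer.write(chunk.token);
  }
  return buffer.toString();
}

Future<void> main() async {
  // A fake token stream standing in for a real generation call.
  final fake = Stream.fromIterable(const [
    Chunk('Once'),
    Chunk(' upon'),
    Chunk(' a time'),
    Chunk('', isFinal: true),
  ]);
  print(await collectText(fake)); // Once upon a time
}
```

A `StringBuffer` avoids rebuilding the whole string on every token, which matters for long generations.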

4. Clean Up #

await edgeVeda.dispose();
modelManager.dispose();

Available Models #

Llama 3.2 1B Instruct #

  • Size: 668 MB
  • Speed: Very fast
  • Quality: Excellent for most tasks
  • Use Case: General chat, Q&A, summarization
ModelRegistry.llama32_1b

Phi 3.5 Mini Instruct #

  • Size: 2.3 GB
  • Speed: Fast
  • Quality: Superior reasoning
  • Use Case: Complex reasoning, coding, math
ModelRegistry.phi35_mini

Gemma 2 2B Instruct #

  • Size: 1.6 GB
  • Speed: Fast
  • Quality: High quality
  • Use Case: Versatile general-purpose
ModelRegistry.gemma2_2b

TinyLlama 1.1B Chat #

  • Size: 669 MB
  • Speed: Ultra fast
  • Quality: Good for simple tasks
  • Use Case: Resource-constrained devices
ModelRegistry.tinyLlama
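
One way to choose among these models at runtime is to filter by available storage. The sketch below hard-codes the download sizes listed above (converting GB to MB at 1 GB = 1000 MB, which is an assumption about how the sizes were rounded). The map and the function are illustrative only and do not query `ModelRegistry`.

```dart
/// Download sizes in MB, copied from the model list above.
/// GB values are converted at 1 GB = 1000 MB (an assumption).
const modelSizesMb = <String, int>{
  'llama32_1b': 668,
  'phi35_mini': 2300,
  'gemma2_2b': 1600,
  'tinyLlama': 669,
};

/// Names of models whose download fits in [freeStorageMb], smallest first.
/// Hypothetical helper for illustration; not part of edge_veda.
List<String> modelsThatFit(int freeStorageMb) {
  final names = modelSizesMb.keys
      .where((name) => modelSizesMb[name]! <= freeStorageMb)
      .toList();
  names.sort((a, b) => modelSizesMb[a]!.compareTo(modelSizesMb[b]!));
  return names;
}

void main() {
  print(modelsThatFit(1000)); // [llama32_1b, tinyLlama]
  print(modelsThatFit(2000)); // [llama32_1b, tinyLlama, gemma2_2b]
}
```

Storage is only one constraint. A model also has to fit in RAM at load time, so treat this as a first filter rather than a guarantee.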

Configuration Options #

EdgeVedaConfig #

EdgeVedaConfig(
  modelPath: '/path/to/model.gguf',  // Required
  numThreads: 4,                      // Default: 4
  contextLength: 2048,                // Default: 2048
  useGpu: true,                       // Default: true
  maxMemoryMb: 1536,                  // Default: 1536
  verbose: false,                     // Default: false
)

GenerateOptions #

GenerateOptions(
  systemPrompt: null,                 // Optional system context
  maxTokens: 512,                     // Default: 512
  temperature: 0.7,                   // Default: 0.7 (0.0-1.0)
  topP: 0.9,                         // Default: 0.9
  topK: 40,                          // Default: 40
  repeatPenalty: 1.1,                // Default: 1.1
  stopSequences: [],                 // Optional stop strings
  jsonMode: false,                   // Default: false
)
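
To build intuition for `topP`, here is a toy illustration of top-p (nucleus) filtering on a made-up four-token distribution. This shows the general technique only; it is not the SDK's or llama.cpp's sampler code, and the token probabilities are invented for the example.

```dart
/// Keeps the smallest set of highest-probability tokens whose cumulative
/// probability reaches [topP], then renormalizes the kept values to sum to 1.
Map<String, double> topPFilter(Map<String, double> probs, double topP) {
  // Sort candidates from most to least likely.
  final entries = probs.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  final kept = <String, double>{};
  var cumulative = 0.0;
  for (final entry in entries) {
    kept[entry.key] = entry.value;
    cumulative += entry.value;
    if (cumulative >= topP) break;
  }
  return kept.map((token, p) => MapEntry(token, p / cumulative));
}

void main() {
  final filtered = topPFilter(
    {'Paris': 0.5, 'Lyon': 0.25, 'Nice': 0.125, 'Rome': 0.125},
    0.75,
  );
  // Paris and Lyon together reach 0.75, so the two tail tokens are dropped.
  print(filtered.keys.toList()); // [Paris, Lyon]
}
```

A lower `topP` trims more of the low-probability tail and makes output more focused. A value of 1.0 keeps every candidate.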

Model Management #

Check if Model is Downloaded #

final isDownloaded = await modelManager.isModelDownloaded('llama-3.2-1b-instruct-q4');

List Downloaded Models #

final models = await modelManager.getDownloadedModels();
print('Downloaded models: $models');

Get Total Storage Usage #

final totalBytes = await modelManager.getTotalModelsSize();
print('Storage used: ${(totalBytes / (1024 * 1024)).toStringAsFixed(1)} MB');
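
For a settings screen, a unit-aware formatter is often more readable than a fixed MB value. The helper below is a small plain-Dart sketch. `formatBytes` is a hypothetical name and not part of the SDK, and it assumes binary units (1 KB = 1024 B).

```dart
/// Formats a byte count using binary units (1 KB = 1024 B).
/// Hypothetical helper for illustration; not part of edge_veda.
String formatBytes(int bytes) {
  const units = ['B', 'KB', 'MB', 'GB', 'TB'];
  var value = bytes.toDouble();
  var unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit++;
  }
  // Whole numbers for plain bytes, one decimal place otherwise.
  return '${value.toStringAsFixed(unit == 0 ? 0 : 1)} ${units[unit]}';
}

void main() {
  print(formatBytes(512)); // 512 B
  print(formatBytes(700448768)); // 668.0 MB
  print(formatBytes(1610612736)); // 1.5 GB
}
```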

Delete a Model #

await modelManager.deleteModel('llama-3.2-1b-instruct-q4');

Clear All Models #

await modelManager.clearAllModels();

Error Handling #

try {
  await edgeVeda.init(config);
} on InitializationException catch (e) {
  print('Init failed: ${e.message}');
} on ModelLoadException catch (e) {
  print('Model load failed: ${e.message}');
} on MemoryException catch (e) {
  print('Out of memory: ${e.message}');
} on EdgeVedaException catch (e) {
  print('Edge Veda error: ${e.message}');
}

Memory Management #

Monitor memory usage to prevent crashes:

// Get current memory usage
final memoryBytes = edgeVeda.getMemoryUsage();
final memoryMb = edgeVeda.getMemoryUsageMb();

// Check if limit exceeded
if (edgeVeda.isMemoryLimitExceeded()) {
  print('Warning: Memory limit exceeded!');
  // Consider disposing and reinitializing
}
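
If you want an early warning before the limit is actually exceeded, you can classify the reported usage yourself. This is a minimal sketch that takes a usage value in MB (such as the one returned above) as a plain number. The enum, the function, and the 85% warning threshold are assumptions chosen for illustration, not SDK behavior.

```dart
enum MemoryState { ok, nearLimit, exceeded }

/// Classifies [usedMb] against [limitMb]. The default limit matches the
/// maxMemoryMb default of 1536; [warnFraction] is an arbitrary choice.
/// Hypothetical helper for illustration; not part of edge_veda.
MemoryState classifyMemory(
  double usedMb, {
  double limitMb = 1536,
  double warnFraction = 0.85,
}) {
  if (usedMb > limitMb) return MemoryState.exceeded;
  if (usedMb >= limitMb * warnFraction) return MemoryState.nearLimit;
  return MemoryState.ok;
}

void main() {
  print(classifyMemory(900)); // MemoryState.ok
  print(classifyMemory(1400)); // MemoryState.nearLimit
  print(classifyMemory(1600)); // MemoryState.exceeded
}
```

A `nearLimit` result is a reasonable point to shorten the conversation history or postpone a new generation.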

Best Practices #

  1. Initialize Once: Initialize EdgeVeda once per app session, reuse the instance
  2. Memory Monitoring: Check memory usage periodically, especially on low-end devices
  3. Model Selection: Start with Llama 3.2 1B for best balance of speed and quality
  4. GPU Acceleration: Always enable useGpu: true unless testing CPU-only
  5. Context Management: Keep context length at 2048 or lower for optimal performance
  6. Error Handling: Always wrap operations in try-catch blocks
  7. Resource Cleanup: Call dispose() when done to free native memory

Example App #

See the example directory for a complete chat application demonstrating:

  • Model downloading with progress tracking
  • SDK initialization
  • Streaming text generation
  • Memory monitoring
  • Error handling

Run the example:

cd example
flutter run

Architecture #

Edge Veda uses a layered architecture:

┌─────────────────────────────────┐
│     Flutter Application         │
├─────────────────────────────────┤
│     edge_veda.dart (Public API) │
├─────────────────────────────────┤
│  Dart FFI Bindings              │
├─────────────────────────────────┤
│  Native C++ Core (llama.cpp)    │
├─────────────────────────────────┤
│  Hardware Acceleration          │
│  Metal (iOS) / Vulkan (Android) │
└─────────────────────────────────┘

Limitations #

  • Model Format: Only GGUF format supported
  • Platforms: iOS and Android only (Web/Desktop coming soon)
  • Model Size: Limited by device storage and RAM
  • Context Length: Maximum 32K tokens (recommended: 2048)
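
In decoder-style models the prompt and the generated tokens typically share one context window, so a long prompt leaves less room for output. The sketch below caps a requested `maxTokens` accordingly. It assumes you already know the prompt's token count. The function is hypothetical, and how the SDK itself behaves when the window overflows is not specified here.

```dart
/// Caps [requestedMaxTokens] so prompt plus output fits in [contextLength].
/// Returns 0 when the prompt alone already fills the window.
/// Hypothetical helper for illustration; not part of edge_veda.
int clampMaxTokens({
  required int promptTokens,
  required int requestedMaxTokens,
  int contextLength = 2048,
}) {
  final remaining = contextLength - promptTokens;
  if (remaining <= 0) return 0;
  return requestedMaxTokens < remaining ? requestedMaxTokens : remaining;
}

void main() {
  // 2048 - 1800 leaves room for only 248 generated tokens.
  print(clampMaxTokens(promptTokens: 1800, requestedMaxTokens: 512)); // 248
  print(clampMaxTokens(promptTokens: 100, requestedMaxTokens: 512)); // 512
}
```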

Troubleshooting #

iOS Build Issues #

cd ios
pod install
cd ..
flutter clean
flutter build ios

Android Build Issues #

flutter clean
cd android
./gradlew clean
cd ..
flutter build apk

Model Download Fails #

  • Check internet connectivity
  • Verify sufficient storage space
  • Try again (downloads are resumable)
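
"Try again" can be automated with a small retry wrapper. The following is a generic plain-Dart sketch using exponential backoff. `retryWithBackoff` is a hypothetical helper, not an SDK function, and the simulated failure stands in for a real download call.

```dart
/// Runs [operation], retrying on Exception up to [maxAttempts] times and
/// doubling the delay after each failure.
/// Hypothetical helper for illustration; not part of edge_veda.
Future<T> retryWithBackoff<T>(
  Future<T> Function() operation, {
  int maxAttempts = 3,
  Duration initialDelay = const Duration(seconds: 1),
}) async {
  var delay = initialDelay;
  var attempt = 0;
  while (true) {
    attempt++;
    try {
      return await operation();
    } on Exception {
      // Give up and surface the last error once attempts are exhausted.
      if (attempt >= maxAttempts) rethrow;
    }
    await Future<void>.delayed(delay);
    delay *= 2;
  }
}

Future<void> main() async {
  var calls = 0;
  final result = await retryWithBackoff(() async {
    calls++;
    if (calls < 3) throw Exception('simulated network failure');
    return '/models/example.gguf'; // placeholder path
  }, initialDelay: const Duration(milliseconds: 10));
  print('$result after $calls attempts');
  // /models/example.gguf after 3 attempts
}
```

In an app, the closure would wrap the download call from the Quick Start. Because downloads are resumable, a retry should continue rather than start over.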

Out of Memory #

  • Reduce contextLength
  • Lower maxMemoryMb so the memory watchdog triggers earlier
  • Use a smaller model (TinyLlama)
  • Close other apps

Contributing #

Contributions are welcome! Please see CONTRIBUTING.md.

License #

MIT License - see LICENSE

Roadmap #

  • ❌ Flutter Web support (WASM + WebGPU)
  • ❌ Speech-to-Text (Whisper)
  • ❌ Text-to-Speech (Kokoro-82M)
  • ❌ Voice Activity Detection
  • ❌ Prompt caching
  • ❌ LoRA adapter support
  • ❌ Custom model fine-tuning
Dependencies #

crypto, ffi, flutter, http, path, path_provider
