# Flutter Gemma
The plugin supports not only Gemma but also other models. Here is the full list of supported models: Gemma 2B & Gemma 7B, Gemma-2 2B, Gemma-3 1B, Gemma 3 270M, Gemma 3 Nano 2B, Gemma 3 Nano 4B, TinyLlama 1.1B, Hammer 2.1 0.5B, Llama 3.2 1B, Phi-2, Phi-3, Phi-4, DeepSeek, Qwen2.5-1.5B-Instruct, Falcon-RW-1B, StableLM-3B.
*Note: Currently, the flutter_gemma plugin supports Gemma-3, Gemma 3 270M, Gemma 3 Nano (with multimodal vision support), TinyLlama, Hammer 2.1, Llama 3.2, Phi-4, DeepSeek and Qwen2.5.*
Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
 
Bring the power of Google's lightweight Gemma language models directly to your Flutter applications. With Flutter Gemma, you can seamlessly incorporate advanced AI capabilities into your iOS and Android apps, all without relying on external servers.
 
Features
- Local Execution: Run Gemma models directly on user devices for enhanced privacy and offline functionality.
- Platform Support: Compatible with iOS, Android, and Web platforms.
- Multimodal Support: Text + Image input with Gemma 3 Nano vision models
- Function Calling: Enable your models to call external functions and integrate with other services (supported by select models)
- Thinking Mode: View the reasoning process of DeepSeek models
- Stop Generation: Cancel text generation mid-process on Android devices
- Backend Switching: Choose between CPU and GPU backends for each model individually in the example app
- Advanced Model Filtering: Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
- Model Sorting: Sort models alphabetically, by size, or use default order in the example app
- LoRA Support: Efficient fine-tuning and integration of LoRA (Low-Rank Adaptation) weights for tailored AI behavior.
- Enhanced Downloads: Smart retry logic and ETag handling for reliable model downloads from HuggingFace CDN
- Download Reliability: Automatic resume/restart logic for interrupted downloads with exponential backoff
- Model Replace Policy: Configurable model replacement system (keep/replace) with automatic model switching
- Text Embeddings: Generate vector embeddings from text using EmbeddingGemma and Gecko models
- Unified Model Management: Single system for managing both inference and embedding models with automatic validation
Model File Types
Flutter Gemma supports three types of model files:
- .task files: MediaPipe-optimized format with built-in chat templates
- .litertlm files: LiteRT-LM format optimized for web platform compatibility
- .bin/.tflite files: Standard format requiring manual chat template formatting
The plugin automatically detects the file type and applies appropriate formatting.
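The detection rule described above can be pictured as a simple extension check. The following is a minimal sketch only: `ModelFileKind`, `classifyModelFile`, and `needsManualChatTemplate` are hypothetical names invented for illustration and are not part of the flutter_gemma API, and the plugin's real detection logic may differ.

```dart
// Hypothetical helper names for illustration; not part of flutter_gemma.
enum ModelFileKind { task, litertlm, raw, unknown }

/// Classifies a model path by its file extension (case-insensitive).
ModelFileKind classifyModelFile(String path) {
  final lower = path.toLowerCase();
  if (lower.endsWith('.task')) return ModelFileKind.task;
  if (lower.endsWith('.litertlm')) return ModelFileKind.litertlm;
  if (lower.endsWith('.bin') || lower.endsWith('.tflite')) {
    return ModelFileKind.raw;
  }
  return ModelFileKind.unknown;
}

/// Only raw .bin/.tflite files need the caller to apply a chat template.
bool needsManualChatTemplate(ModelFileKind kind) => kind == ModelFileKind.raw;

void main() {
  for (final name in ['gemma-3-270m-it.task', 'model.litertlm', 'model.bin']) {
    final kind = classifyModelFile(name);
    print('$name -> ${kind.name}, '
        'manual template: ${needsManualChatTemplate(kind)}');
  }
}
```

A check like this can be handy in your own UI, for example to warn users before they pick a file type the target platform handles poorly.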
Model Capabilities
The example app offers a curated list of models, each suited for different tasks. Here's a breakdown of the models available and their capabilities:
| Model Family | Best For | Function Calling | Thinking Mode | Vision | Languages | Size | 
|---|---|---|---|---|---|---|
| Gemma 3 Nano | On-device multimodal chat and image analysis. | โ | โ | โ | Multilingual | 3-6GB | 
| DeepSeek R1 | High-performance reasoning and code generation. | โ | โ | โ | Multilingual | 1.7GB | 
| Qwen 2.5 | Strong multilingual chat and instruction following. | โ | โ | โ | Multilingual | 1.6GB | 
| Hammer 2.1 | Lightweight action model for tool usage. | โ | โ | โ | Multilingual | 0.5GB | 
| Gemma 3 1B | Balanced and efficient text generation. | โ | โ | โ | Multilingual | 0.5GB | 
| Gemma 3 270M | Ideal for fine-tuning (LoRA) for specific tasks | โ | โ | โ | Multilingual | 0.3GB | 
| TinyLlama 1.1B | Extremely compact, general-purpose chat. | โ | โ | โ | English-focused | 1.2GB | 
| Llama 3.2 1B | Efficient instruction following | โ | โ | โ | Multilingual | 1.1GB | 
Installation
- Add flutter_gemma to your pubspec.yaml:
dependencies:
  flutter_gemma: latest_version
- Run flutter pub get to install.
Quick Start
1. Install a Model (One Time)
import 'package:flutter_gemma/flutter_gemma.dart';
// Install model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
).fromNetwork(
  'https://huggingface.co/google/gemma-3-2b-it/resolve/main/gemma-3-2b-it-gpu-int8.task',
  token: 'your_hf_token',
).withProgress((progress) {
  print('Downloading: $progress%'); // progress is an int (0-100)
}).install();
2. Create and Use Model (Multiple Times)
// Create model with specific configuration
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);
// Use model
final chat = await model.createChat();
await chat.addQueryChunk(Message.text(
  text: 'Explain quantum computing',
  isUser: true,
));
final response = await chat.generateChatResponse();
// Cleanup
await chat.close();
await model.close();
3. Multiple Instances from Same Model
// Install once
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
// Create multiple instances
final quickModel = await FlutterGemma.getActiveModel(maxTokens: 512);
final deepModel = await FlutterGemma.getActiveModel(maxTokens: 4096);
// Both use the SAME model file!
Installation Sources
// Network
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork('https://example.com/model.task', token: 'optional')
  .install();
// Flutter assets
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromAsset('assets/models/model.task')
  .install();
// Native bundle
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromBundled('model.task')
  .install();
// External file
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromFile('/path/to/model.task')
  .install();
Modern API vs Legacy API
Modern API (Recommended)
Benefits:
- Cleaner, more intuitive
- Type-safe ModelSource
- Automatic active model management
- Install once, create many instances
Usage:
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
final model = await FlutterGemma.getActiveModel(maxTokens: 2048);
Legacy API
Still works but requires manual ModelType specification:
final model = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,  // Must specify every time
  maxTokens: 2048,
);
Initialize Flutter Gemma
Add to your main.dart:
import 'package:flutter_gemma/core/api/flutter_gemma.dart';
void main() {
  WidgetsFlutterBinding.ensureInitialized();
  // Optional: Initialize with HuggingFace token for gated models
  FlutterGemma.initialize(
    huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
    maxDownloadRetries: 10,
  );
  runApp(MyApp());
}
Next Steps:
- Authentication Setup - Configure tokens for gated models
- Model Sources - Learn about different model sources
- Platform Support - Web vs Mobile differences
- Migration Guide - Upgrade from Legacy API
- Legacy API Documentation - For backwards compatibility
HuggingFace Authentication
Many models require authentication to download from HuggingFace. Never commit tokens to version control.
Recommended: config.json Pattern
This is the most secure way to handle tokens in development and production.
Step 1: Create config template file config.json.example:
{
  "HUGGINGFACE_TOKEN": ""
}
Step 2: Copy and add your token:
cp config.json.example config.json
# Edit config.json and add your token from https://huggingface.co/settings/tokens
Step 3: Add to .gitignore:
# Never commit tokens!
config.json
Step 4: Run with config:
flutter run --dart-define-from-file=config.json
Step 5: Access in code:
void main() {
  WidgetsFlutterBinding.ensureInitialized();
  // Read from environment (populated by --dart-define-from-file)
  const token = String.fromEnvironment('HUGGINGFACE_TOKEN');
  // Initialize with token (optional if all models are public)
  FlutterGemma.initialize(
    huggingFaceToken: token.isNotEmpty ? token : null,
  );
  runApp(MyApp());
}
Alternative: Environment Variables
export HUGGINGFACE_TOKEN=hf_your_token_here
flutter run --dart-define=HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN
Alternative: Per-Download Token
// Pass token directly for specific downloads
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/gemma-3n-E2B-it-int4.task',
    token: 'hf_your_token_here',  // Not recommended - use config.json
  )
  .install();
Which Models Require Authentication?
Common gated models:
- Gemma 3 Nano (E2B, E4B) - google/ repos are gated
- Gemma 3 1B - litert-community/ requires access
- Gemma 3 270M - litert-community/ requires access
- EmbeddingGemma - litert-community/ requires access
Public models (no auth needed):
- DeepSeek, Qwen2.5, TinyLlama - Public repos
Get your token: https://huggingface.co/settings/tokens
Grant access to gated repos: Visit the model page → "Request Access" button
Model Sources
Flutter Gemma supports multiple model sources with different capabilities:
| Source Type | Platform | Progress | Resume | Authentication | Use Case | 
|---|---|---|---|---|---|
| NetworkSource | All | Detailed | Yes | Supported | HuggingFace, CDNs, private servers | 
| AssetSource | All | End only | No | N/A | Models bundled in app assets | 
| BundledSource | All | End only | No | N/A | Native platform resources | 
| FileSource | Mobile only | End only | No | N/A | User-selected files (file picker) | 
NetworkSource - Internet Downloads
Downloads models from HTTP/HTTPS URLs with full progress tracking and authentication.
Features:
- Progress tracking (0-100%)
- Resume after interruption (ETag support)
- HuggingFace authentication
- Smart retry logic with exponential backoff
- Background downloads on mobile
- Cancellable downloads with CancelToken
Example:
// Public model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork('https://example.com/model.bin')
  .withProgress((progress) => print('$progress%'))
  .install();
// Private model with authentication
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/model.task',
    token: 'hf_...',  // Or use FlutterGemma.initialize(huggingFaceToken: ...)
  )
  .withProgress((progress) => setState(() => _progress = progress))
  .install();
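To make the "retry logic with exponential backoff" feature concrete, here is a minimal pure-Dart sketch of the general technique. It is not the plugin's implementation: `backoffDelay` and `retryWithBackoff` are hypothetical helpers, and the base delay, cap, and retry count are assumed values chosen only for the demonstration.

```dart
import 'dart:math';

/// Delay before retry number [attempt] (0-based): base * 2^attempt, capped.
/// Hypothetical helper; the plugin's actual timings are not documented here.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration cap = const Duration(seconds: 60),
}) {
  final exponent = min(attempt, 20); // guard against integer overflow
  final ms = base.inMilliseconds * (1 << exponent);
  return Duration(milliseconds: min(ms, cap.inMilliseconds));
}

/// Runs [action], retrying up to [maxRetries] times on any exception.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxRetries = 3,
  Duration base = const Duration(milliseconds: 10),
}) async {
  for (var attempt = 0;; attempt++) {
    try {
      return await action();
    } catch (_) {
      if (attempt >= maxRetries) rethrow;
      await Future<void>.delayed(backoffDelay(attempt, base: base));
    }
  }
}

Future<void> main() async {
  var calls = 0;
  final result = await retryWithBackoff(() async {
    calls++;
    if (calls < 3) throw StateError('simulated network failure');
    return 'done';
  });
  print('$result after $calls attempts'); // done after 3 attempts
}
```

Doubling the wait after each failure keeps early retries fast while avoiding hammering a server that is struggling; the cap bounds the worst-case wait.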
Cancelling Downloads:
Use CancelToken to cancel downloads in progress:
import 'package:flutter_gemma/core/model_management/cancel_token.dart';
// Create cancel token
final cancelToken = CancelToken();
// Start download with cancel token
final future = FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(url)
  .withCancelToken(cancelToken)  // Pass cancel token via builder
  .withProgress((progress) => print('Progress: $progress%'))
  .install();
// Cancel download from another part of your code
// (e.g., user pressed cancel button)
cancelToken.cancel('User cancelled download');
// Handle cancellation
try {
  await future;
  print('Download completed');
} catch (e) {
  if (CancelToken.isCancel(e)) {
    print('Download was cancelled by user');
  } else {
    print('Download failed: $e');
  }
}
// Check if cancelled
if (cancelToken.isCancelled) {
  print('Reason: ${cancelToken.cancelReason}');
}
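For readers curious how cooperative cancellation of this kind works in general, the sketch below simulates a chunked download that checks a token between chunks. `SketchCancelToken`, `SketchCancelled`, and `fakeDownload` are hypothetical stand-ins written for this example; they are not the plugin's `CancelToken` or its exception type, and the real implementation may work differently.

```dart
// Hypothetical stand-ins for illustration; NOT the plugin's CancelToken.
class SketchCancelToken {
  bool _cancelled = false;
  String? _reason;

  bool get isCancelled => _cancelled;
  String? get reason => _reason;

  void cancel([String? reason]) {
    _cancelled = true;
    _reason = reason;
  }
}

class SketchCancelled implements Exception {
  SketchCancelled(this.reason);
  final String? reason;
}

/// Simulates a chunked download that checks the token before each chunk.
Future<int> fakeDownload(int chunks, SketchCancelToken token) async {
  var received = 0;
  for (var i = 0; i < chunks; i++) {
    if (token.isCancelled) throw SketchCancelled(token.reason);
    await Future<void>.delayed(const Duration(milliseconds: 5));
    received++;
  }
  return received;
}

Future<void> main() async {
  final token = SketchCancelToken();
  final future = fakeDownload(100, token);
  token.cancel('User cancelled download');
  try {
    await future;
    print('completed');
  } on SketchCancelled catch (e) {
    print('cancelled: ${e.reason}');
  }
}
```

The key design point is that cancellation is observed at safe checkpoints rather than by forcibly killing work, so partially written state can be cleaned up.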
CancelToken Features:
- Non-breaking: Optional parameter, existing code works without changes
- Works with network downloads (inference + embedding models)
- Cancels ALL files in multi-file downloads (embedding: model + tokenizer)
- Platform-independent (Mobile + Web)
- Throws DownloadCancelledException for proper error handling
- Thread-safe cancellation
AssetSource - Flutter Assets
Copies models from Flutter assets (declared in pubspec.yaml).
Features:
- No network required
- Fast installation (local copy)
- Increases app size significantly
- Works offline
Example:
// 1. Add to pubspec.yaml
// assets:
//   - models/gemma-2b-it.bin
// 2. Install from asset
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('models/gemma-2b-it.bin')
  .install();
BundledSource - Native Resources
Production-Ready Offline Models: Include small models directly in your app bundle for instant availability without downloads.
Use Cases:
- Offline-first applications (works without internet from first launch)
- Small models (Gemma 3 270M ~300MB)
- Core features requiring guaranteed availability
- Not for large models (increases app size significantly)
Platform Setup:
Android (android/app/src/main/assets/models/)
# Place your model file
android/app/src/main/assets/models/gemma-3-270m-it.task
iOS (Add to Xcode project)
- Drag model file into Xcode project
- Check "Copy items if needed"
- Add to target membership
Web (Standard Flutter assets)
# pubspec.yaml
flutter:
  assets:
    - assets/models/gemma-3-270m-it.task
Features:
- Zero network dependency
- No installation delay
- No storage permission needed
- Direct path usage (no file copying)
Example:
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromBundled('gemma-3-270m-it.task')
  .install();
App Size Impact:
- Gemma 3 270M: ~300MB
- TinyLlama 1.1B: ~1.2GB
- Consider hosting large models for download instead
FileSource - External Files (Mobile Only)
References external files (e.g., user-selected via file picker).
Features:
- No copying (references original file)
- Protected from cleanup
- Web not supported (no local file system)
Example:
// Mobile only - after user selects file with file_picker
final path = '/data/user/0/com.app/files/model.task';
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromFile(path)
  .install();
Important: On web, FileSource only works with URLs or asset paths, not local file system paths.
Setup
- Download Model and optionally LoRA Weights: Obtain a pre-trained Gemma model (recommended: 2b or 2b-it) from Kaggle
- For multimodal support, download Gemma 3 Nano models or Gemma 3 Nano in LiteRT-LM format that support vision input
- Optionally, fine-tune a model for your specific use case
- If you have LoRA weights, you can use them to customize the model's behavior without retraining the entire model.
- There is an article that describes all approaches
- Platform specific setup:
iOS
- Set minimum iOS version in Podfile:
platform :ios, '16.0'  # Required for MediaPipe GenAI
- Enable file sharing in Info.plist:
<key>UIFileSharingEnabled</key>
<true/>
- Add network access description in Info.plist (for development):
<key>NSLocalNetworkUsageDescription</key>
<string>This app requires local network access for model inference services.</string>
- Enable performance optimization in Info.plist (optional):
<key>CADisableMinimumFrameDurationOnPhone</key>
<true/>
- Add memory entitlements in Runner.entitlements (for large models):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.developer.kernel.extended-virtual-addressing</key>
	<true/>
	<key>com.apple.developer.kernel.increased-memory-limit</key>
	<true/>
	<key>com.apple.developer.kernel.increased-debugging-memory-limit</key>
	<true/>
</dict>
</plist>
- Change the linking type of pods to static in Podfile:
use_frameworks! :linkage => :static
Android
- If you want to use a GPU to run the model, you need to declare the OpenCL native libraries in AndroidManifest.xml. If you plan to use only the CPU, you can skip this step.
Add the following to AndroidManifest.xml, above the closing </application> tag:
 <uses-native-library
     android:name="libOpenCL.so"
     android:required="false"/>
 <uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
 <uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
- For release builds with ProGuard/R8 enabled, the plugin automatically includes the necessary ProGuard rules. If you encounter issues with UnsatisfiedLinkError or missing classes in release builds, ensure your proguard-rules.pro includes:
# MediaPipe
-keep class com.google.mediapipe.** { *; }
-dontwarn com.google.mediapipe.**
# Protocol Buffers
-keep class com.google.protobuf.** { *; }
-dontwarn com.google.protobuf.**
# RAG functionality
-keep class com.google.ai.edge.localagents.** { *; }
-dontwarn com.google.ai.edge.localagents.**
Web
- Authentication: For gated models (Gemma 3 Nano, Gemma 3 1B/270M), you need to configure a HuggingFace token. See the HuggingFace Authentication section.
- Web currently works only with GPU backend models; CPU backend models are not yet supported by MediaPipe.
- Multimodal support (images) is fully supported on the web platform.
- Model formats: Use .litertlm files for optimal web compatibility (recommended for multimodal models)
- Add dependencies to the index.html file in the web folder
  <script type="module">
  import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest';
  window.FilesetResolver = FilesetResolver;
  window.LlmInference = LlmInference;
  </script>
Migration from Legacy to Modern API
If you're upgrading from the Legacy API, here are common migration patterns:
Installing Models
| Legacy API | Modern API | 
|---|---|
|  |  | 
|  |  | 
Checking Model Installation
| Legacy API | Modern API | 
|---|---|
|  |  | 
Key Migration Notes
- Simpler imports: Use package:flutter_gemma/core/api/flutter_gemma.dart
- Builder pattern: Chain methods for cleaner code
- Callback-based progress: Simpler than streams for most cases
- Type-safe sources: Compile-time validation of source types
- Breaking change: Progress values are now int (0-100) instead of a DownloadProgress object
- Separate files: Model and LoRA weights installed independently
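Because progress now arrives as an integer percentage, UI code that expects a 0.0-1.0 fraction (for example the `value` of Flutter's `LinearProgressIndicator`) needs a small conversion. A minimal sketch, assuming hypothetical helper names `progressFraction` and `progressLabel` that are not part of the plugin:

```dart
/// Converts an integer percentage (0-100) into a 0.0-1.0 fraction,
/// clamping out-of-range input. Hypothetical helper for illustration.
double progressFraction(int percent) => percent.clamp(0, 100) / 100;

/// Human-readable label for the same value.
String progressLabel(int percent) => '${percent.clamp(0, 100)}%';

void main() {
  for (final p in [0, 42, 100, 140]) {
    print('${progressLabel(p)} -> ${progressFraction(p)}');
  }
}
```

Clamping is a defensive choice here; it keeps a progress bar from overflowing if a callback ever reports an unexpected value.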
Model Creation and Inference
Modern API (Recommended):
// Create model with runtime configuration
final inferenceModel = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);
final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();
Legacy API (Still supported):
// Works with both Legacy and Modern installation methods
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,
  preferredBackend: PreferredBackend.gpu,
  maxTokens: 2048,
);
final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();
Usage (Legacy API)
Note: This is the Legacy API. For new projects, we recommend using the Modern API with builder pattern.
Legacy API features:
- Direct method calls on FlutterGemmaPlugin.instance.modelManager
- Stream-based progress tracking
- Manual state management
Modern API features:
- Fluent builder pattern
- Type-safe source types
- Callback-based progress
- Better error messages
The new API splits functionality into two parts:
- ModelFileManager: Manages model and LoRA weights file handling.
- InferenceModel: Handles model initialization and response generation.
- Import and access the plugin:
import 'package:flutter_gemma/flutter_gemma.dart';
final gemma = FlutterGemmaPlugin.instance;
- Managing Model Files with ModelFileManager
final modelManager = gemma.modelManager;
Place the model in the assets or upload it to a network drive, such as Firebase.
ATTENTION: You do not need to load the model every time the application starts; it is stored in the system files and only needs to be installed once. Please carefully review the example application. Use the installModelFromAsset and downloadModelFromNetwork methods only when you need to copy the model onto the device.
1. Loading Models from assets (available only in debug mode):
Don't forget to add your model to pubspec.yaml
- Loading from assets (loraPath is optional)
await modelManager.installModelFromAsset('model.bin', loraPath: 'lora_weights.bin');
- Loading from assets with Progress Status (loraPath is optional)
modelManager.installModelFromAssetWithProgress('model.bin', loraPath: 'lora_weights.bin').listen(
  (progress) {
    print('Loading progress: $progress%');
  },
  onDone: () {
    print('Model loading complete.');
  },
  onError: (error) {
    print('Error loading model: $error');
  },
);
2. Loading Models from network:
- For web usage, you will also need to enable CORS (Cross-Origin Resource Sharing) for your network resource. To enable CORS in Firebase, you can follow the guide in the Firebase documentation: Setting up CORS
- Loading from the network (loraUrl is optional).
await modelManager.downloadModelFromNetwork('https://example.com/model.bin', loraUrl: 'https://example.com/lora_weights.bin');
- Loading from the network with Progress Status (loraUrl is optional)
modelManager.downloadModelFromNetworkWithProgress('https://example.com/model.bin', loraUrl: 'https://example.com/lora_weights.bin').listen(
  (progress) {
    print('Loading progress: $progress%');
  },
  onDone: () {
    print('Model loading complete.');
  },
  onError: (error) {
    print('Error loading model: $error');
  },
);
- Loading LoRA Weights
- Loading LoRA weight from the network.
await modelManager.downloadLoraWeightsFromNetwork('https://example.com/lora_weights.bin');
- Loading LoRA weight from assets.
await modelManager.installLoraWeightsFromAsset('lora_weights.bin');
- Model Management: you can set the model and weights paths manually
await modelManager.setModelPath('model.bin');
await modelManager.setLoraWeightsPath('lora_weights.bin');
Model Replace Policy
Configure how the plugin handles switching between different models:
// Set policy to keep all models (default behavior)
await modelManager.setReplacePolicy(ModelReplacePolicy.keep);
// Set policy to replace old models (saves storage space)
await modelManager.setReplacePolicy(ModelReplacePolicy.replace);
// Check current policy
final currentPolicy = modelManager.replacePolicy;
Automatic Model Management
Use ensureModelReady() for seamless model switching that handles all scenarios automatically:
// Handles all cases:
// - Same model already loaded: does nothing
// - Different model with KEEP policy: loads new model, keeps old one
// - Different model with REPLACE policy: deletes old model, loads new one
// - Corrupted/invalid model: re-downloads automatically
await modelManager.ensureModelReady(
  'gemma-3n-E4B-it-int4.task',
  'https://huggingface.co/google/gemma-3n-E4B-it-litert-preview/resolve/main/gemma-3n-E4B-it-int4.task'
);
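The four cases listed in the comments above can be read as a small decision table. The sketch below expresses that table as a pure function. It is an illustration under assumptions, not the plugin's code: `SketchPolicy` and `planSwitch` are hypothetical names, and the step strings are placeholders for whatever the plugin actually does internally.

```dart
// Hypothetical names for illustration only; not the plugin's types.
enum SketchPolicy { keep, replace }

/// Returns the ordered steps needed to make [requested] the active model.
List<String> planSwitch({
  required String? current,
  required String requested,
  required bool requestedIsValidOnDisk,
  required SketchPolicy policy,
}) {
  // Same model with an intact file: nothing to do.
  if (current == requested && requestedIsValidOnDisk) return const [];

  final steps = <String>[];
  // The REPLACE policy removes the previous model when switching.
  if (current != null &&
      current != requested &&
      policy == SketchPolicy.replace) {
    steps.add('delete $current');
  }
  // A missing or corrupted file is (re-)downloaded before use.
  if (!requestedIsValidOnDisk) steps.add('download $requested');
  steps.add('activate $requested');
  return steps;
}

void main() {
  print(planSwitch(
    current: 'a.task',
    requested: 'b.task',
    requestedIsValidOnDisk: false,
    policy: SketchPolicy.replace,
  )); // [delete a.task, download b.task, activate b.task]
}
```

Keeping the decision separate from the side effects makes each case easy to unit test without touching the file system.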
You can delete the model and weights from the device. Deleting the model or LoRA weights will automatically close and clean up the inference. This ensures that there are no lingering resources or memory leaks when switching models or updating files.
await modelManager.deleteModel();
await modelManager.deleteLoraWeights();
5. Initialize:
Before performing any inference, you need to create a model instance. This ensures that your application is ready to handle requests efficiently.
Text-Only Models:
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt, // Required, model type to create
  preferredBackend: PreferredBackend.gpu, // Optional, backend type, default is PreferredBackend.gpu
  maxTokens: 512, // Optional, default is 1024
  loraRanks: [4, 8], // Optional, LoRA rank configuration for fine-tuned models
);
Multimodal Models:
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt, // Required, model type to create
  preferredBackend: PreferredBackend.gpu, // Optional, backend type
  maxTokens: 4096, // Recommended for multimodal models
  supportImage: true, // Enable image support
  maxNumImages: 1, // Optional, maximum number of images per message
  loraRanks: [4, 8], // Optional, LoRA rank configuration for fine-tuned models
);
6. Using Sessions for Single Inferences:
If you need to generate individual responses without maintaining a conversation history, use sessions. Sessions allow precise control over inference and must be properly closed to avoid memory leaks.
- Text-Only Session:
final session = await inferenceModel.createSession(
  temperature: 1.0, // Optional, default: 0.8
  randomSeed: 1, // Optional, default: 1
  topK: 1, // Optional, default: 1
  // topP: 0.9, // Optional nucleus sampling parameter
  // loraPath: 'path/to/lora.bin', // Optional LoRA weights path
  // enableVisionModality: true, // Enable vision for multimodal models
);
await session.addQueryChunk(Message.text(text: 'Tell me something interesting', isUser: true));
String response = await session.getResponse();
print(response);
await session.close(); // Always close the session when done
- Multimodal Session:
import 'dart:typed_data'; // For Uint8List
final session = await inferenceModel.createSession(
  enableVisionModality: true, // Enable image processing
);
// Text + Image message
final imageBytes = await loadImageBytes(); // Your image loading method
await session.addQueryChunk(Message.withImage(
  text: 'What do you see in this image?',
  imageBytes: imageBytes,
  isUser: true,
));
// Note: session.getResponse() returns String directly
String response = await session.getResponse();
print(response);
await session.close();
- Asynchronous Response Generation:
final session = await inferenceModel.createSession();
await session.addQueryChunk(Message.text(text: 'Tell me something interesting', isUser: true));
// Note: session.getResponseAsync() returns Stream<String>
session.getResponseAsync().listen((String token) {
  print(token);
}, onDone: () async {
  print('Stream closed');
  await session.close(); // Close the session only after the stream has finished
}, onError: (error) {
  print('Error: $error');
});
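When you need the complete text rather than individual tokens, a `Stream<String>` such as the one returned by `getResponseAsync()` can be accumulated with `await for`. The sketch below uses a fake token stream so it runs without a model; `fakeTokenStream` and `collectTokens` are hypothetical names for this example, not plugin APIs.

```dart
/// Stand-in for a token stream; emits a few tokens with a small delay.
Stream<String> fakeTokenStream() async* {
  for (final token in ['On-device ', 'models ', 'run ', 'offline.']) {
    await Future<void>.delayed(const Duration(milliseconds: 5));
    yield token;
  }
}

/// Accumulates tokens into one string, invoking [onPartial] after each
/// token so a UI can show the text as it grows.
Future<String> collectTokens(
  Stream<String> tokens, {
  void Function(String partial)? onPartial,
}) async {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    buffer.write(token);
    onPartial?.call(buffer.toString());
  }
  return buffer.toString();
}

Future<void> main() async {
  final text = await collectTokens(fakeTokenStream(), onPartial: print);
  print('Final: $text');
}
```

Because `await for` only returns once the stream is done, any cleanup placed after the call (such as closing a session) runs after generation has finished.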
7. Chat Scenario with Automatic Session Management
For chat-based applications, you can create a chat instance. Unlike sessions, the chat instance manages the conversation context and refreshes sessions when necessary.
Text-Only Chat:
final chat = await inferenceModel.createChat(
  temperature: 0.8, // Controls response randomness, default: 0.8
  randomSeed: 1, // Ensures reproducibility, default: 1
  topK: 1, // Limits vocabulary scope, default: 1
  // topP: 0.9, // Optional nucleus sampling parameter
  // tokenBuffer: 256, // Token buffer size, default: 256
  // loraPath: 'path/to/lora.bin', // Optional LoRA weights path
  // supportImage: false, // Enable image support, default: false
  // tools: [], // List of available tools, default: []
  // supportsFunctionCalls: false, // Enable function calling, default: false
  // isThinking: false, // Enable thinking mode, default: false
  // modelType: ModelType.gemmaIt, // Model type, default: ModelType.gemmaIt
);
Multimodal Chat:
final chat = await inferenceModel.createChat(
  temperature: 0.8, // Controls response randomness
  randomSeed: 1, // Ensures reproducibility
  topK: 1, // Limits vocabulary scope
  supportImage: true, // Enable image support in chat
  // tokenBuffer: 256, // Token buffer size for context management
);
Thinking Mode Chat (DeepSeek Models):
final chat = await inferenceModel.createChat(
  temperature: 0.8,
  randomSeed: 1,
  topK: 1,
  isThinking: true, // Enable thinking mode for DeepSeek models
  modelType: ModelType.deepSeek, // Specify DeepSeek model type
  // supportsFunctionCalls: true, // Enable function calling for DeepSeek models
);
- Synchronous Chat:
await chat.addQueryChunk(Message.text(text: 'User: Hello, who are you?', isUser: true));
ModelResponse response = await chat.generateChatResponse();
if (response is TextResponse) {
  print(response.token);
}
await chat.addQueryChunk(Message.text(text: 'User: Are you sure?', isUser: true));
ModelResponse response2 = await chat.generateChatResponse();
if (response2 is TextResponse) {
  print(response2.token);
}
- Multimodal Chat Example:
// Add text message
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
ModelResponse response1 = await chat.generateChatResponse();
if (response1 is TextResponse) {
  print(response1.token);
}
// Add image message
final imageBytes = await loadImageBytes();
await chat.addQueryChunk(Message.withImage(
  text: 'Can you analyze this image?',
  imageBytes: imageBytes,
  isUser: true,
));
ModelResponse response2 = await chat.generateChatResponse();
if (response2 is TextResponse) {
  print(response2.token);
}
// Add image-only message
await chat.addQueryChunk(Message.imageOnly(imageBytes: imageBytes, isUser: true));
ModelResponse response3 = await chat.generateChatResponse();
if (response3 is TextResponse) {
  print(response3.token);
}
- Asynchronous Chat (Streaming):
await chat.addQueryChunk(Message.text(text: 'User: Hello, who are you?', isUser: true));
chat.generateChatResponseAsync().listen((ModelResponse response) {
  if (response is TextResponse) {
    print(response.token);
  } else if (response is FunctionCallResponse) {
    print('Function call: ${response.name}');
  } else if (response is ThinkingResponse) {
    print('Thinking: ${response.content}');
  }
}, onDone: () {
  print('Chat stream closed');
}, onError: (error) {
  print('Chat error: $error');
});
- Function Calling
Enable your models to call external functions and integrate with other services. Note: Function calling is only supported by specific models - see the Model Support section below.
Step 1: Define Tools
Tools define the functions your model can call:
final List<Tool> _tools = [
  const Tool(
    name: 'change_background_color',
    description: "Changes the background color of the app. The color should be a standard web color name like 'red', 'blue', 'green', 'yellow', 'purple', or 'orange'.",
    parameters: {
      'type': 'object',
      'properties': {
        'color': {
          'type': 'string',
          'description': 'The color name',
        },
      },
      'required': ['color'],
    },
  ),
  const Tool(
    name: 'show_alert',
    description: 'Shows an alert dialog with a custom message and title.',
    parameters: {
      'type': 'object',
      'properties': {
        'title': {
          'type': 'string',
          'description': 'The title of the alert dialog',
        },
        'message': {
          'type': 'string',
          'description': 'The message content of the alert dialog',
        },
      },
      'required': ['title', 'message'],
    },
  ),
];
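Since each tool's `parameters` map carries a `required` list, the same schema can be reused to check the arguments a model supplies before running a function. This is a minimal sketch: `missingRequiredArgs` is a hypothetical helper, not a plugin API, and it checks only for presence, not types.

```dart
/// Returns the names listed under 'required' in [schema] that are absent
/// (or null) in [args]. Hypothetical helper for illustration.
List<String> missingRequiredArgs(
  Map<String, dynamic> schema,
  Map<String, dynamic> args,
) {
  final required =
      (schema['required'] as List?)?.cast<String>() ?? const <String>[];
  return [
    for (final name in required)
      if (args[name] == null) name,
  ];
}

void main() {
  const alertSchema = {
    'type': 'object',
    'properties': {
      'title': {'type': 'string'},
      'message': {'type': 'string'},
    },
    'required': ['title', 'message'],
  };
  print(missingRequiredArgs(alertSchema, {'title': 'Hi'})); // [message]
}
```

Reporting the missing names back to the model in the tool response gives it a chance to retry with complete arguments.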
Step 2: Create Chat with Tools
final chat = await inferenceModel.createChat(
  temperature: 0.8,
  randomSeed: 1,
  topK: 1,
  tools: _tools, // Pass your tools
  supportsFunctionCalls: true, // Enable function calling (required for tools)
  // tokenBuffer: 256, // Adjust if needed for function calling
);
Step 3: Handle Different Response Types
The model can now return two types of responses:
// Add user message
await chat.addQueryChunk(Message.text(text: 'Change the background to blue', isUser: true));
// Handle async responses
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text: ${response.token}');
    // Update your UI with the text
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function
    print('Function Call: ${response.name}(${response.args})');
    _handleFunctionCall(response);
  }
});
Step 4: Execute Function and Send Response Back
Future<void> _handleFunctionCall(FunctionCallResponse functionCall) async {
  // Execute the requested function
  Map<String, dynamic> toolResponse;
  
  switch (functionCall.name) {
    case 'change_background_color':
      final color = functionCall.args['color'] as String?;
      // Your implementation here
      toolResponse = {'status': 'success', 'message': 'Color changed to $color'};
      break;
    case 'show_alert':
      final title = functionCall.args['title'] as String?;
      final message = functionCall.args['message'] as String?;
      // Show alert dialog
      toolResponse = {'status': 'success', 'message': 'Alert shown'};
      break;
    default:
      toolResponse = {'error': 'Unknown function: ${functionCall.name}'};
  }
  
  // Send the tool response back to the model
  final toolMessage = Message.toolResponse(
    toolName: functionCall.name,
    response: toolResponse,
  );
  await chat.addQueryChunk(toolMessage);
  
  // The model will then generate a final response explaining what it did
  final finalResponse = await chat.generateChatResponse();
  if (finalResponse is TextResponse) {
    print('Model: ${finalResponse.token}');
  }
}
Function Calling Best Practices:
- Use descriptive function names and clear descriptions
- Specify required vs optional parameters
- Always handle function execution errors gracefully
- Send meaningful responses back to the model
- The model will only call functions when explicitly requested by the user
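As a rough illustration of handling function execution errors gracefully, the sketch below checks a model-supplied argument map against the 'required' list of a JSON-schema-style parameters map before running anything. The helper name missingRequiredArgs and the error payload shape are hypothetical and not part of the plugin; only the schema layout mirrors the tool definitions shown earlier.

```dart
/// Returns the names of required parameters that are absent from [args].
/// [parameters] is a JSON-schema-style map like the ones used in the tool
/// definitions. Hypothetical helper, not a flutter_gemma API.
List<String> missingRequiredArgs(
  Map<String, dynamic> parameters,
  Map<String, dynamic> args,
) {
  final required =
      (parameters['required'] as List?)?.cast<String>() ?? const <String>[];
  return [
    for (final name in required)
      if (args[name] == null) name,
  ];
}

void main() {
  const showAlertParameters = <String, dynamic>{
    'type': 'object',
    'properties': {
      'title': {'type': 'string'},
      'message': {'type': 'string'},
    },
    'required': ['title', 'message'],
  };

  // Simulate a call where the model forgot the 'message' argument.
  final missing = missingRequiredArgs(showAlertParameters, {'title': 'Hi'});
  if (missing.isNotEmpty) {
    // Report the problem back to the model instead of throwing.
    final toolResponse = {
      'status': 'error',
      'message': 'Missing required arguments: ${missing.join(', ')}',
    };
    print(toolResponse);
  }
}
```

Returning a structured error as the tool response, rather than throwing, gives the model a chance to retry with corrected arguments.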
- Thinking Mode (DeepSeek Models)
DeepSeek models support "thinking mode" where you can see the model's reasoning process before it generates the final response. This provides transparency into how the model approaches problems.
Enable Thinking Mode:
final chat = await inferenceModel.createChat(
  temperature: 0.8,
  randomSeed: 1,
  topK: 1,
  isThinking: true, // Enable thinking mode
  modelType: ModelType.deepSeek, // Required for DeepSeek models
  supportsFunctionCalls: true, // DeepSeek also supports function calls
  tools: _tools, // Optional: add tools for function calling
  // tokenBuffer: 256, // Token buffer for context management
);
Handle Thinking Responses:
chat.generateChatResponseAsync().listen((response) {
  if (response is ThinkingResponse) {
    // Model's reasoning process
    print('Model is thinking: ${response.content}');
    // Show thinking bubble in UI
    _showThinkingBubble(response.content);
    
  } else if (response is TextResponse) {
    // Final response after thinking
    print('Final answer: ${response.token}');
    _updateFinalResponse(response.token);
    
  } else if (response is FunctionCallResponse) {
    // DeepSeek can also call functions while thinking
    print('Function call: ${response.name}');
    _handleFunctionCall(response);
  }
});
Thinking Mode Features:
- Transparent Reasoning: See how the model thinks through problems
- Interactive UI: Show/hide thinking bubbles with expandable content
- Streaming Support: Thinking content streams in real time
- Function Integration: Models can think before calling functions
- DeepSeek Optimized: Designed specifically for DeepSeek model architecture
Example Thinking Flow:
- User asks: "Change the background to blue and explain why blue is calming"
- Model thinks: "I need to change the color first, then explain the psychology"
- Model calls: change_background_color(color: 'blue')
- Model explains: "Blue is calming because it's associated with sky and ocean..."
- Text Embedding Models (Modern API)
Generate vector embeddings from text using specialized embedding models. These models convert text into numerical vectors that can be used for semantic similarity, search, and RAG applications.
Supported Embedding Models:
- EmbeddingGemma-300M - 300M parameters, generates 768D embeddings with varying max sequence lengths (256, 512, 1024, 2048 tokens)
- Gecko-110m - 110M parameters, generates 768D embeddings with varying max sequence lengths (64, 256, 512 tokens)
Note: Numbers in model names (64, 256, 512, 1024, 2048) refer to max sequence length (context window size in tokens), NOT embedding dimension. All these models output 768-dimensional embeddings regardless of sequence length.
Install Embedding Model:
// Install from network with progress tracking
await FlutterGemma.installEmbedder()
  .modelFromNetwork(
    'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/embeddinggemma-300M_seq1024_mixed-precision.tflite',
    token: 'hf_your_token_here',  // Required for gated models
  )
  .tokenizerFromNetwork(
    'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/sentencepiece.model',
  )
  .withModelProgress((progress) => print('Model: $progress%'))
  .withTokenizerProgress((progress) => print('Tokenizer: $progress%'))
  .install();
// Or from assets
await FlutterGemma.installEmbedder()
  .modelFromAsset('models/embeddinggemma.tflite')
  .tokenizerFromAsset('models/sentencepiece.model')
  .install();
Generate Text Embeddings:
// Create embedding model instance
final embeddingModel = await FlutterGemma.getActiveEmbedder(
  preferredBackend: PreferredBackend.gpu, // Optional: use GPU acceleration
);
// Generate embedding for single text
final embedding = await embeddingModel.generateEmbedding('Hello, world!');
print('Embedding vector: ${embedding.take(5)}...'); // Show first 5 dimensions
print('Embedding dimension: ${embedding.length}');
// Generate embeddings for multiple texts
final embeddings = await embeddingModel.generateEmbeddings([
  'Hello, world!',
  'How are you?',
  'Flutter is awesome!'
]);
print('Generated ${embeddings.length} embeddings');
// Get embedding model dimension
final dimension = await embeddingModel.getDimension();
print('Model dimension: $dimension');
// Calculate cosine similarity between embeddings
// (requires `import 'dart:math' as math;` at the top of the file)
double cosineSimilarity(List<double> a, List<double> b) {
  double dotProduct = 0.0;
  double normA = 0.0;
  double normB = 0.0;
  for (int i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (math.sqrt(normA) * math.sqrt(normB));
}
final similarity = cosineSimilarity(embeddings[0], embeddings[1]);
print('Similarity: $similarity');
// Close model when done
await embeddingModel.close();
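To show how such embeddings might be used for semantic search, here is a minimal, plugin-independent sketch that ranks documents by cosine similarity to a query vector. The topK helper is hypothetical, and the toy 3-D vectors stand in for real 768-D embeddings.

```dart
import 'dart:math' as math;

double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Embeddings must have the same dimension');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  final denom = math.sqrt(normA) * math.sqrt(normB);
  // A zero vector has no direction, so treat its similarity as 0.
  return denom == 0 ? 0.0 : dot / denom;
}

/// Returns the indices of the [k] documents most similar to [query],
/// best match first. Hypothetical helper, not a flutter_gemma API.
List<int> topK(List<double> query, List<List<double>> docs, int k) {
  final scores = [for (final d in docs) cosineSimilarity(query, d)];
  final indices = List<int>.generate(docs.length, (i) => i)
    ..sort((x, y) => scores[y].compareTo(scores[x]));
  return indices.take(k).toList();
}

void main() {
  // Toy 3-D vectors stand in for real 768-D embeddings.
  final docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
  ];
  print(topK([1.0, 0.0, 0.0], docs, 2)); // prints [0, 2]
}
```

A linear scan like this is fine for small collections; larger ones would normally go through a vector store.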
Legacy API (Still supported):
// Create embedding model specification
final embeddingSpec = MobileModelManager.createEmbeddingSpec(
  name: 'EmbeddingGemma 1024',
  modelUrl: 'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/embeddinggemma-300M_seq1024_mixed-precision.tflite',
  tokenizerUrl: 'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/sentencepiece.model',
);
// Download with progress tracking
final mobileManager = FlutterGemmaPlugin.instance.modelManager as MobileModelManager;
mobileManager.downloadModelWithProgress(embeddingSpec, token: 'your_hf_token').listen(
  (progress) => print('Download progress: ${progress.overallProgress}%'),
  onError: (error) => print('Download error: $error'),
  onDone: () => print('Download completed'),
);
// Create embedding model instance
final embeddingModel = await FlutterGemmaPlugin.instance.createEmbeddingModel(
  modelPath: '/path/to/embeddinggemma-300M_seq1024_mixed-precision.tflite',
  tokenizerPath: '/path/to/sentencepiece.model',
  preferredBackend: PreferredBackend.gpu,
);
Important Notes:
- EmbeddingGemma models require a HuggingFace authentication token for gated repositories
- Embedding models use the same unified download and management system as inference models
- Each embedding model consists of both a model file (.tflite) and a tokenizer file (.model)
- Different maximum sequence lengths allow trade-offs between accuracy and performance
- Modern API provides separate progress tracking for model and tokenizer downloads
VectorStore Optimization (v0.11.7):
As of version 0.11.7, the VectorStore has been significantly optimized for better performance and storage efficiency:
Performance Improvements:
- 71% smaller storage: Binary BLOB format instead of JSON (3 KB vs 10.5 KB per 768D embedding)
- 6.7x faster reads: ~75 μs vs ~500 μs per document search
- 3.3x faster writes: ~45 μs vs ~150 μs per document insertion
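The 3 KB figure matches 768 dimensions stored as 32-bit floats (768 × 4 = 3072 bytes). The sketch below illustrates the general idea of packing an embedding into a binary blob and comparing it with its JSON form. It assumes a plain Float32 layout in host byte order; the plugin's actual on-disk format is not documented here, and both helper names are hypothetical.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Packs an embedding into raw 32-bit floats (4 bytes per dimension),
/// using the host byte order. Hypothetical helper.
Uint8List encodeEmbedding(List<double> embedding) {
  final floats = Float32List.fromList(embedding);
  return floats.buffer.asUint8List();
}

/// Restores an embedding from bytes produced by [encodeEmbedding].
List<double> decodeEmbedding(Uint8List bytes) {
  // Copy first so the float view starts at offset 0 of its own buffer.
  final copy = Uint8List.fromList(bytes);
  return copy.buffer.asFloat32List().toList();
}

void main() {
  final embedding = List<double>.generate(768, (i) => i / 768);
  final blob = encodeEmbedding(embedding);
  final json = jsonEncode(embedding);

  print('BLOB: ${blob.length} bytes'); // 768 * 4 = 3072 bytes
  print('JSON: ${utf8.encode(json).length} bytes');
  print('Round trip ok: ${decodeEmbedding(blob).length == embedding.length}');
}
```

The JSON size depends on how many digits each value needs, so the exact ratio varies with the data. Values are also rounded to 32-bit precision on the way in.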
New Features:
- Dynamic dimensions: Auto-detects any embedding size (256D, 384D, 512D, 768D, 1024D, 1536D, 3072D, 4096D+)
- iOS implementation: Full VectorStore support on iOS (was stubs only before v0.11.7)
- Cross-platform parity: Identical behavior on Android and iOS
Migration Notes:
- Breaking change for RAG users: Existing vector databases will be recreated on upgrade (re-indexing required)
- Impact: Minimal, since the RAG feature is new (introduced in v0.11.5)
- Automatic: Database schema upgrade happens automatically on first use
Common Embedding Dimensions:
- 256D: Gecko Small, efficient for mobile
- 384D: MiniLM models
- 512D: Mid-range models
- 768D: BERT-base (standard)
- 1024D: BERT-large, Cohere v3
- 1536D: OpenAI Ada
- 3072D: OpenAI Large
- 4096D: Qwen-3
- Checking Token Usage
You can check the token size of a prompt before inference. The accumulated context should not exceed maxTokens to ensure smooth operation.
int tokenCount = await session.sizeInTokens('Your prompt text here');
print('Prompt size in tokens: $tokenCount');
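One way to act on the token count is a small budget check before sending a prompt. This is a sketch under stated assumptions: fitsInContext is a hypothetical helper, and the default reserve of 256 tokens is only an illustrative value for leaving room for the model's reply.

```dart
/// Returns true if adding a prompt of [promptTokens] keeps the accumulated
/// context within [maxTokens], leaving [reserve] tokens free for the reply.
/// Hypothetical helper, not a flutter_gemma API.
bool fitsInContext({
  required int usedTokens,
  required int promptTokens,
  required int maxTokens,
  int reserve = 256,
}) {
  return usedTokens + promptTokens + reserve <= maxTokens;
}

void main() {
  // In an app, promptTokens would come from session.sizeInTokens(...).
  final ok = fitsInContext(usedTokens: 600, promptTokens: 120, maxTokens: 1024);
  print(ok ? 'Prompt fits' : 'Start a new session or shorten the prompt');
}
```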
- Closing the Model
When you no longer need to perform any further inferences, call the close method to release resources:
await inferenceModel.close();
If you need to run inference again later, remember to call createModel again before generating responses.
Message Types
The plugin now supports different types of messages:
// Text only
final textMessage = Message.text(text: "Hello!", isUser: true);
// Text + Image
final multimodalMessage = Message.withImage(
  text: "What's in this image?",
  imageBytes: imageBytes,
  isUser: true,
);
// Image only
final imageMessage = Message.imageOnly(imageBytes: imageBytes, isUser: true);
// Tool response (for function calling)
final toolMessage = Message.toolResponse(
  toolName: 'change_background_color',
  response: {'status': 'success', 'color': 'blue'},
);
// System information message
final systemMessage = Message.systemInfo(text: "Function completed successfully");
// Thinking content (for DeepSeek models)
final thinkingMessage = Message.thinking(text: "Let me analyze this problem...");
// Check if message contains image
if (message.hasImage) {
  print('This message contains an image');
}
// Create a copy of message
final copiedMessage = message.copyWith(text: "Updated text");
Response Types
The model can return different types of responses depending on capabilities:
// Handle different response types
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text token: ${response.token}');
    // Use response.token to update your UI incrementally
    
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function (Gemma 3 Nano, DeepSeek, Qwen2.5)
    print('Function: ${response.name}');
    print('Arguments: ${response.args}');
    
    // Execute the function and send response back
    _handleFunctionCall(response);
  } else if (response is ThinkingResponse) {
    // Model's reasoning process (DeepSeek models only)
    print('Thinking: ${response.content}');
    
    // Show thinking process in UI
    _showThinkingBubble(response.content);
  }
});
Response Types:
- TextResponse: Contains a text token (response.token) for regular model output
- FunctionCallResponse: Contains the function name (response.name) and arguments (response.args) when the model wants to call a function
- ThinkingResponse: Contains the model's reasoning process (response.content) for DeepSeek models with thinking mode enabled
Supported Models
Text-Only Models
- Gemma 2B & Gemma 7B
- Gemma-2 2B
- Gemma-3 1B
- Gemma 3 270M - Ultra-compact model
- TinyLlama 1.1B - Lightweight chat model
- Hammer 2.1 0.5B - Action model with function calling
- Llama 3.2 1B - Instruction-tuned model
- Phi-4
- DeepSeek
- Phi-2, Phi-3, Falcon-RW-1B, StableLM-3B
Multimodal Models (Vision + Text)
- Gemma 3 Nano E2B - 2B parameters with vision support
- Gemma 3 Nano E4B - 4B parameters with vision support
- Gemma 3 Nano E2B LitertLM - 2B parameters with vision support
- Gemma 3 Nano E4B LitertLM - 4B parameters with vision support
Text Embedding Models
All embedding models generate 768-dimensional vectors. The numbers in names (64/256/512/1024/2048) indicate maximum input sequence length in tokens, not embedding dimension.
| Model | Parameters | Dimensions | Max Seq Length | Size | Best For | Auth Required |
|---|---|---|---|---|---|---|
| Gecko 64 | 110M | 768D | 64 tokens | 110MB | Short queries, real-time search | No |
| Gecko 256 | 110M | 768D | 256 tokens | 114MB | Balanced speed/accuracy | No |
| Gecko 512 | 110M | 768D | 512 tokens | 116MB | Medium context documents | No |
| EmbeddingGemma 256 | 300M | 768D | 256 tokens | 179MB | High accuracy, short context | Yes |
| EmbeddingGemma 512 | 300M | 768D | 512 tokens | 179MB | High accuracy, medium context | Yes |
| EmbeddingGemma 1024 | 300M | 768D | 1024 tokens | 183MB | Long documents, detailed content | Yes |
| EmbeddingGemma 2048 | 300M | 768D | 2048 tokens | 196MB | Very long documents | Yes |
Performance Comparison (Android Pixel 8 with GPU acceleration):
- Gecko 64: ~109ms/doc embedding, 130ms search (fastest, about 2.6x faster than EmbeddingGemma 256)
- EmbeddingGemma 256: ~286ms/doc embedding, 342ms search (more accurate, 300M vs 110M parameters)
Use Cases:
- Gecko 64: Real-time search, mobile apps, short queries (≤64 tokens), fast inference
- Gecko 256/512: Balanced use cases, general-purpose embeddings, good speed/quality tradeoff
- EmbeddingGemma 256/512: High-quality embeddings, semantic search, better accuracy
- EmbeddingGemma 1024/2048: Long documents, detailed content, research papers, articles
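Since the variants differ only in maximum sequence length, choosing one can be reduced to picking the smallest length that fits the input. The following sketch encodes the lengths from the table above; the variants map and the smallestFittingVariant helper are hypothetical and not part of the plugin.

```dart
/// Max sequence lengths per model family, taken from the table above.
const variants = <String, List<int>>{
  'Gecko': [64, 256, 512],
  'EmbeddingGemma': [256, 512, 1024, 2048],
};

/// Returns the smallest max sequence length in [family] that can hold
/// [tokenCount] tokens, or null if the text is too long for every variant.
/// Hypothetical helper, not a flutter_gemma API.
int? smallestFittingVariant(String family, int tokenCount) {
  final lengths = variants[family];
  if (lengths == null) throw ArgumentError('Unknown family: $family');
  for (final length in lengths) {
    if (tokenCount <= length) return length;
  }
  return null;
}

void main() {
  print(smallestFittingVariant('Gecko', 50)); // prints 64
  print(smallestFittingVariant('EmbeddingGemma', 700)); // prints 1024
  print(smallestFittingVariant('Gecko', 900)); // prints null
}
```

When the result is null, the text would need to be split into chunks or embedded with a longer-context variant.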
Model Function Calling Support
Function calling is currently supported by the following models:
Models with Function Calling Support
- Gemma 3 Nano models (E2B, E4B) - Full function calling support
- Hammer 2.1 0.5B - Action model with strong function calling capabilities
- DeepSeek models - Function calling + thinking mode support
- Qwen models - Full function calling support
Models WITHOUT Function Calling Support
- Gemma 3 1B models - Text generation only
- Gemma 3 270M - Text generation only
- TinyLlama 1.1B - Text generation only
- Llama 3.2 1B - Text generation only
- Phi models - Text generation only
Important Notes:
- When using unsupported models with tools, the plugin will log a warning and ignore the tools
- Models will work normally for text generation even if function calling is not supported
- Check the supportsFunctionCalls property in your model configuration
Platform Support Details
Feature Comparison
| Feature | Android | iOS | Web | Notes |
|---|---|---|---|---|
| Text Generation | Full | Full | Full | All models supported |
| Image Input (Multimodal) | Full | Full | Full | Gemma 3 Nano models |
| Function Calling | Full | Full | Full | Select models only |
| Thinking Mode | Full | Full | Full | DeepSeek models |
| GPU Acceleration | Full | Full | Full | Recommended |
| CPU Backend | Full | Full | Not supported | MediaPipe limitation |
| Streaming Responses | Full | Full | Full | Real-time generation |
| LoRA Support | Full | Full | Full | Fine-tuned weights |
| Text Embeddings | Full | Full | Full | EmbeddingGemma, Gecko |
| File Downloads | Background | Background | In-memory | Platform-specific |
| Asset Loading | Full | Full | Full | All source types |
| Bundled Resources | Full | Full | Full | Native bundles |
| External Files (FileSource) | Full | Full | Not supported | No local FS on web |
Web Platform Specifics
Authentication
- Required for gated models: Gemma 3 Nano, Gemma 3 1B/270M, EmbeddingGemma
- Configuration: Use FlutterGemma.initialize(huggingFaceToken: '...') or pass a token per download
- Storage: Tokens stored in browser memory (not localStorage)
File Handling
- Downloads: Creates blob URLs in browser memory (no actual files)
- Storage: IndexedDB via WebFileSystemService
- FileSource: Only works with HTTP/HTTPS URLs or assets/paths
- Local file paths: Not supported (browser security restriction)
Backend Support
- GPU only: Web platform requires GPU backend (MediaPipe limitation)
- CPU models: Will fail to initialize on web
CORS Configuration
- Required for custom servers: Enable CORS headers on your model hosting server
- Firebase Storage: See CORS configuration docs
- HuggingFace: CORS already configured correctly
Memory Limitations
- Large models: May hit browser memory limits (2GB typical)
- Recommended: Use smaller models (1B-2B) for web platform
- Best models for web:
  - Gemma 3 270M (300MB)
  - Gemma 3 1B (500MB-1GB)
  - Gemma 3 Nano E2B (3GB) - requires 6GB+ device RAM
 
Mobile Platform Specifics
Android
- GPU Support: Requires OpenGL libraries in AndroidManifest.xml
- ProGuard: Automatic rules included for release builds
- Storage: Local file system in app documents directory
iOS
- Minimum version: iOS 16.0 required for MediaPipe GenAI
- Memory entitlements: Required for large models (see Setup section)
- Linking: Static linking required (use_frameworks! :linkage => :static)
- Storage: Local file system in app documents directory
You can find a complete example in the example folder.
Important Considerations
- Model Size: Larger models (such as 7b and 7b-it) might be too resource-intensive for on-device inference.
- Function Calling Support: Gemma 3 Nano, Hammer 2.1, DeepSeek, and Qwen models support function calling. Other models will ignore tools and log a warning.
- Thinking Mode: Only DeepSeek models support thinking mode. Enable with isThinking: true and modelType: ModelType.deepSeek.
- Multimodal Models: Gemma 3 Nano models with vision support require more memory and are recommended for devices with 8GB+ RAM.
- iOS Memory Requirements: Large models require memory entitlements in Runner.entitlements and minimum iOS 16.0.
- LoRA Weights: They provide efficient customization without the need for full model retraining.
- Development vs. Production: For production apps, do not embed the model or LoRA weights within your assets. Instead, load them once and store them securely on the device or via a network drive.
- Web Models: Currently, Web support is available only for GPU backend models. Multimodal support is fully implemented.
- Image Formats: The plugin automatically handles common image formats (JPEG, PNG, etc.) when using Message.withImage().
Troubleshooting
Multimodal Issues:
- Ensure you're using a multimodal model (Gemma 3 Nano E2B/E4B)
- Set supportImage: true when creating model and chat
- Check device memory - multimodal models require more RAM
Performance:
- Use GPU backend for better performance with multimodal models
- Consider using CPU backend for text-only models on lower-end devices
Memory Issues:
- iOS: Ensure Runner.entitlements contains memory entitlements (see iOS setup)
- iOS: Set minimum platform to iOS 16.0 in Podfile
- Reduce maxTokens if experiencing memory issues
- Use smaller models (1B-2B parameters) for devices with <6GB RAM
- Close sessions and models when not needed
- Monitor token usage with sizeInTokens()
iOS Build Issues:
- Ensure minimum iOS version is set to 16.0 in Podfile
- Use static linking: use_frameworks! :linkage => :static
- Clean and reinstall pods: cd ios && pod install --repo-update
- Check that all required entitlements are in Runner.entitlements
Advanced Usage
ModelThinkingFilter (Advanced)
For advanced users who need to manually process model responses, the ModelThinkingFilter class provides utilities for cleaning model outputs:
import 'package:flutter_gemma/core/extensions.dart';
// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse, 
  ModelType.deepSeek
);
// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - Special DeepSeek tokens
// - Extra whitespace and formatting
This is automatically handled by the chat API, but can be useful for custom inference implementations.
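For custom inference code, the kind of cleanup described above can be approximated with plain string operations. This is only a sketch and not the ModelThinkingFilter implementation: it assumes the reasoning is wrapped in <think>...</think> tags (an assumption about DeepSeek-style output) and that Gemma emits the <end_of_turn> marker mentioned above.

```dart
/// Strips <think>...</think> blocks and the <end_of_turn> marker, then trims
/// surrounding whitespace. Hypothetical helper; the tag names are assumptions.
String cleanResponse(String raw) {
  return raw
      .replaceAll(RegExp(r'<think>.*?</think>', dotAll: true), '')
      .replaceAll('<end_of_turn>', '')
      .trim();
}

void main() {
  const raw = '<think>The user greets me.</think>\nHello there!<end_of_turn>\n';
  print(cleanResponse(raw)); // prints: Hello there!
}
```

The dotAll flag lets the pattern span reasoning text that contains line breaks.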
What's New
- Text Embeddings - Generate vector embeddings with EmbeddingGemma and Gecko models for semantic search applications
- Unified Model Management - Single system for managing both inference and embedding models with automatic validation
Coming Soon:
- On-Device RAG Pipelines
- Desktop Support (macOS, Windows, Linux)
- Audio & Video Input
- Audio Output (Text-to-Speech)
- Web Caching
- System Instruction support
Support the Project
If you find Flutter Gemma useful and want to support its development, consider buying me a coffee! Your support helps me:
- Maintain and improve the plugin
- Keep documentation up-to-date
- Fix bugs and resolve issues faster
- Add new features and model support
- Test on more devices and platforms
Every contribution, no matter how small, makes a difference. Thank you for your support!
Libraries
- core/api/embedding_installation_builder
- core/api/flutter_gemma
- core/api/inference_installation_builder
- core/chat
- core/chat_event
- core/di/platform/mobile_service_factory: Mobile-platform service factory. This file is only compiled on iOS/Android platforms. Uses background_downloader for model downloads.
- core/di/platform/web_service_factory: Web-platform service factory. This file is only compiled on web platform. Uses dart:js_interop for browser-based downloads.
- core/di/service_registry
- core/domain/download_error
- core/domain/download_exception
- core/domain/model_source
- core/extensions
- core/function_call_parser
- core/handlers/asset_source_handler
- core/handlers/bundled_source_handler
- core/handlers/file_source_handler
- core/handlers/network_source_handler
- core/handlers/source_handler
- core/handlers/source_handler_registry
- core/handlers/web_asset_source_handler
- core/handlers/web_bundled_source_handler
- core/handlers/web_file_source_handler
- core/image_error_handler
- core/image_processor
- core/image_tokenizer
- core/infrastructure/background_downloader_service
- core/infrastructure/blob_url_manager
- core/infrastructure/flutter_asset_loader
- core/infrastructure/platform_file_system_service
- core/infrastructure/web_download_service
- core/infrastructure/web_file_system_service
- core/infrastructure/web_js_interop
- core/message
- core/migration/legacy_preferences_migrator
- core/model
- core/model_management/cancel_token
- core/model_management/constants/preferences_keys
- core/model_response
- core/multimodal_image_handler
- core/services/asset_loader
- core/services/download_service
- core/services/file_system_service
- core/services/model_repository
- core/services/protected_files_registry
- core/tool
- core/utils/file_name_utils
- core/vision_encoder_validator
- flutter_gemma
- flutter_gemma_interface
- mobile/flutter_gemma_mobile
- mobile/smart_downloader
- model_file_manager_interface
- pigeon.g
- rag/embedding_models
- web/flutter_gemma_web
- web/flutter_gemma_web_embedding_model
- web/llm_inference_web