Flutter Gemma

The plugin supports not only Gemma, but also other models. Here's the full list of supported models: Gemma 4 E2B/E4B, Gemma3n E2B/E4B, FastVLM 0.5B, Gemma-3 1B, Gemma 3 270M, FunctionGemma 270M, Qwen3 0.6B, Qwen 2.5, Phi-4 Mini, DeepSeek R1, SmolLM 135M.

Note: The flutter_gemma plugin supports Gemma 4 and Gemma3n (with multimodal vision and audio support), FastVLM (vision), Gemma-3, FunctionGemma, Qwen3, Qwen 2.5, Phi-4, DeepSeek R1, and SmolLM. Desktop platforms (macOS, Windows, Linux) require the .litertlm model format.

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

Bring the power of Google's lightweight Gemma language models and other on-device LLMs directly to your Flutter applications. With Flutter Gemma, you can add advanced AI capabilities to your app without relying on external servers.

Example of the plugin in use:

[Image: gemma_github_gif]

Features

Local Execution: Run Gemma and other LLMs (Qwen, DeepSeek, Phi, FastVLM, SmolLM, …) directly on user devices for enhanced privacy and offline functionality.
Platform Support: Compatible with iOS, Android, Web, macOS, Windows, and Linux platforms.
🖥️ Desktop Support: Native desktop apps (macOS, Windows, Linux) with GPU acceleration via LiteRT-LM, called directly from Dart through dart:ffi — no JVM/JRE bundling. See DESKTOP_SUPPORT.md for details.
🖼️ Multimodal Support: Text + Image input with Gemma 4, Gemma3n, and FastVLM vision models
🎙️ Audio Input: Record and send audio messages with Gemma3n E2B/E4B models (Android, iOS device, Desktop)
🛠️ Function Calling: Enable your models to call external functions and integrate with other services (supported by select models)
🧠 Thinking Mode: View the reasoning process of DeepSeek and Gemma 4 models with thinking blocks
🛑 Stop Generation: Cancel text generation mid-process on Android, iOS, Web, and Desktop
⚙️ Backend Switching: Choose between CPU and GPU backends for each model individually in the example app
🔍 Advanced Model Filtering: Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
📊 Model Sorting: Sort models alphabetically, by size, or use default order in the example app
LoRA Support: Efficient fine-tuning and integration of LoRA (Low-Rank Adaptation) weights for tailored AI behavior.
📥 Enhanced Downloads: Smart retry logic with exponential backoff for reliable model downloads
🔧 Download Reliability: Automatic restart logic for interrupted downloads (resume not supported by HuggingFace CDN)
📱 Android Foreground Service: Large downloads (>500MB) automatically use foreground service to bypass 9-minute timeout
🔧 Model Replace Policy: Configurable model replacement system (keep/replace) with automatic model switching
📊 Text Embeddings: Generate vector embeddings from text using EmbeddingGemma and Gecko models
🔎 On-device RAG: qdrant-edge vector store on native, wa-sqlite on Web. Payload-aware Filter (must / should / mustNot) for semantic search.
🔧 Unified Model Management: Single system for managing both inference and embedding models with automatic validation
💾 Web Persistent Caching: Models persist across browser restarts using Cache API (Web only)
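
The retry-with-exponential-backoff behavior listed under Enhanced Downloads can be pictured with the sketch below. It is a minimal illustration, not the plugin's implementation: `backoffDelay`, `retryWithBackoff`, the 1-second base, and the 30-second cap are all assumptions made for this example. Because resume is not supported, each retry would restart the download from the beginning.

```dart
import 'dart:async';

/// Delay before retry number [attempt] (0-based): base * 2^attempt, capped.
/// Hypothetical helper; the base and cap values are illustrative only.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration cap = const Duration(seconds: 30),
}) {
  final delay = base * (1 << attempt);
  return delay > cap ? cap : delay;
}

/// Runs [action], retrying on any exception up to [maxRetries] times and
/// waiting an exponentially growing delay between attempts.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxRetries = 10,
  Duration base = const Duration(seconds: 1),
}) async {
  var attempt = 0;
  while (true) {
    try {
      return await action();
    } catch (_) {
      // Give up once the retry budget is spent and surface the last error.
      if (attempt >= maxRetries) rethrow;
      await Future<void>.delayed(backoffDelay(attempt, base: base));
      attempt++;
    }
  }
}
```

With these defaults the waits grow as 1 s, 2 s, 4 s, 8 s, 16 s and then stay at the 30 s cap.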

What's new in 1.0

📦 Modular package split — the monolith is now a small core (flutter_gemma) plus opt-in packages, so your app ships only the native weight it uses: flutter_gemma_litertlm (.litertlm), flutter_gemma_mediapipe (.task/.bin), flutter_gemma_embeddings, flutter_gemma_rag_qdrant, flutter_gemma_rag_sqlite.
🔧 New FlutterGemma.initialize(...) registration — pass inferenceEngines, embeddingBackends, vectorStore for the packages you added. See Initialize Flutter Gemma.
✅ Every model / session / chat / embedding / RAG API is unchanged — migrating is just adding packages + the initialize call. See MIGRATION.md.
🧹 Legacy sqlite+local_hnsw vector store removed — native RAG runs on qdrant-edge (flutter_gemma_rag_qdrant); web on wa-sqlite (flutter_gemma_rag_sqlite).

See CHANGELOG.md for the full release history.

Model File Types

Flutter Gemma supports different model file formats, which are grouped into two types based on how chat templates are handled:

Type 1: MediaPipe-Managed Templates

.task files: MediaPipe-optimized format for mobile (Android/iOS)
.litertlm files: LiteRT-LM format for Android, iOS, and Desktop platforms

Both formats have identical behavior — MediaPipe handles chat templates internally.

Type 2: Manual Template Formatting

.bin files: Standard binary format
.tflite files: LiteRT format (formerly TensorFlow Lite)

Both formats require manual chat template formatting in your code.

Note: The plugin automatically detects the file extension and applies appropriate formatting. When specifying ModelFileType in your code:

Use ModelFileType.task for .task and .litertlm files (same behavior)
Use ModelFileType.binary for .bin and .tflite files (same behavior)
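
The grouping above can be sketched as a small extension lookup. This is only an illustration of the rule, not the plugin's detection code: `TemplateGroup` and `templateGroupFor` are hypothetical names that stand in for the plugin's `ModelFileType` values.

```dart
/// Hypothetical mirror of the two groups described above. The plugin's own
/// enum is ModelFileType, with the values `task` and `binary`.
enum TemplateGroup { task, binary }

/// Returns the group for [path] based on its file extension, or null when
/// the extension is not one of the four formats listed above.
TemplateGroup? templateGroupFor(String path) {
  final lower = path.toLowerCase();
  if (lower.endsWith('.task') || lower.endsWith('.litertlm')) {
    return TemplateGroup.task; // chat template handled for you
  }
  if (lower.endsWith('.bin') || lower.endsWith('.tflite')) {
    return TemplateGroup.binary; // manual chat template formatting
  }
  return null;
}
```

Note that a `-web.task` file also ends in `.task`, so it falls into the first group.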

Format by Platform

Format	Android	iOS	Web	Desktop	Use Case
`.task`	✅	✅	✅	❌	Older models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4)
`.litertlm`	✅	✅ ¹	⚠️ ²	✅	Newer models (Gemma 4, Qwen3, FastVLM + desktop for all)
`-web.task`	❌	❌	✅	❌	Web-specific builds (e.g. Gemma 4, Gemma3n)
`.bin`	✅	✅	✅	❌	Manual chat template formatting required
`.tflite`	✅	✅	✅	✅	Embeddings only (EmbeddingGemma, Gecko)

¹ iOS .litertlm runs on the FFI engine; vision and audio are supported on physical devices. The Simulator stays CPU-only because Metal sim has a 256 MB single-allocation cap.
² Web support for .litertlm is an early preview provided by flutter_gemma_litertlm (see the Web setup section).

Model Capabilities

The example app offers a curated list of models, each suited for different tasks. Here's a breakdown of the models available and their capabilities:

Model Family	Best For	Function Calling	Thinking Mode	Vision	Languages	Size
Gemma 4 E2B	Next-gen multimodal chat — text, image, audio	✅	✅	✅	Multilingual	2.4GB
Gemma 4 E4B	Next-gen multimodal chat — text, image, audio	✅	✅	✅	Multilingual	4.3GB
Gemma3n	On-device multimodal chat and image analysis	✅	❌	✅	Multilingual	3-6GB
FastVLM 0.5B	Fast vision-language inference	❌	❌	✅	Multilingual	0.5GB
Phi-4 Mini	Advanced reasoning and instruction following	✅	❌	❌	Multilingual	3.9GB
DeepSeek R1	High-performance reasoning and code generation	✅	✅	❌	Multilingual	1.7GB
Qwen3 0.6B	Compact multilingual chat with function calling	✅	✅	❌	Multilingual	586MB
Qwen 2.5	Strong multilingual chat and instruction following	✅	❌	❌	Multilingual	0.5-1.6GB
Gemma 3 1B	Balanced and efficient text generation	✅	❌	❌	Multilingual	0.5GB
Gemma 3 270M	Ideal for fine-tuning (LoRA) for specific tasks	❌	❌	❌	Multilingual	0.3GB
FunctionGemma 270M	Specialized for function calling on-device	✅	❌	❌	Multilingual	284MB
SmolLM 135M	Ultra-compact, resource-constrained devices	❌	❌	❌	English	135MB
TranslateGemma 4B †	Single-shot 55-language translation	❌	❌	❌	55 languages	2-4GB

† TranslateGemma is CPU-only for now. Google hasn't released a mobile/desktop .litertlm bundle (HF discussion #5 — "no concrete plans"). The example app uses the community-converted bundle from barakplasma/translategemma-4b-it-android-task-quantized, which keeps EMBEDDING_LOOKUP weights in float32 for MediaPipe .task compatibility. That layout crashes the LiteRT GPU partitioner on Metal/WebGPU across all platforms — tracked upstream at LiteRT-LM#1748. Until Google ships the litert-lm quantization CLI, translation runs on CPU only (≈90 s prefill on a 4 B int4 bundle on M-series Macs).

ModelType Reference

When installing models, you need to specify the correct ModelType. Use this table to find the right type for your model:

Model Family	ModelType	Examples
Gemma 4	`ModelType.gemma4`	Gemma 4 E2B, Gemma 4 E4B (native function-call tokens)
Gemma 3 / Gemma3n	`ModelType.gemmaIt`	Gemma 3 1B, Gemma 3 270M, Gemma3n E2B/E4B
DeepSeek	`ModelType.deepSeek`	DeepSeek R1
Qwen 2.5	`ModelType.qwen`	Qwen 2.5 1.5B, Qwen 2.5 0.5B
Qwen 3	`ModelType.qwen3`	Qwen3 0.6B
FunctionGemma	`ModelType.functionGemma`	FunctionGemma 270M IT
Phi	`ModelType.phi`	Phi-4 Mini
General	`ModelType.general`	FastVLM 0.5B, SmolLM 135M

Note: Gemma 4 uses ModelType.gemma4 so its native <\|tool_call>...<tool_call\|> tokens are routed through the LiteRT-LM SDK's chat-template path. For Gemma 3 and earlier, keep ModelType.gemmaIt.

Usage Example:

// Gemma models
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// DeepSeek models
await FlutterGemma.installModel(modelType: ModelType.deepSeek)
  .fromNetwork(url).install();

// FastVLM, SmolLM (general type)
await FlutterGemma.installModel(modelType: ModelType.general)
  .fromNetwork(url).install();

Installation

As of 1.0, flutter_gemma is split into a small core package plus opt-in packages for each engine / backend, so your app only pulls the native weight it actually uses.

Add the core package and the opt-in packages for the model formats and features you need to pubspec.yaml:

dependencies:
  flutter_gemma: latest_version              # Core — always required (no engine on its own)

  # Inference engines — add at least one:
  flutter_gemma_litertlm: latest_version     # .litertlm models (FFI; mobile + desktop + web)
  flutter_gemma_mediapipe: latest_version    # .task / .bin models (MediaPipe; mobile + web)

  # Optional — text embeddings + on-device RAG:
  flutter_gemma_embeddings: latest_version   # text embeddings (EmbeddingGemma / Gecko)
  flutter_gemma_rag_qdrant: latest_version   # RAG vector store (native: qdrant-edge)
  flutter_gemma_rag_sqlite: latest_version   # RAG vector store (web: wa-sqlite; native: sqlite3)

Pick by need:

You want to…	Add
Run `.litertlm` models (Gemma 4, Qwen3, FastVLM, + all desktop)	`flutter_gemma_litertlm`
Run `.task` / `.bin` models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4)	`flutter_gemma_mediapipe`
Generate text embeddings	`flutter_gemma_embeddings`
On-device RAG on native (Android/iOS/desktop)	`flutter_gemma_rag_qdrant`
On-device RAG on web	`flutter_gemma_rag_sqlite`

Core registers no engine by itself — you wire the packages you added in FlutterGemma.initialize(...) (see Initialize Flutter Gemma).

Run flutter pub get to install.

Migrating from 0.16.x (monolith)? See MIGRATION.md — the only breaking change is adding the opt-in packages and the initialize(...) call; every model / session / RAG API is unchanged.

Platform & Architecture Support

The plugin ships native prebuilts only for the architectures below. Other ABIs fail at native load with a typed error.

Platform	Supported architecture	Not supported
Android	`arm64-v8a` (full)	`armeabi-v7a`, `x86_64`¹
iOS device	`arm64`	—
iOS Simulator	`arm64` (Apple Silicon Mac)	`x86_64` (Intel Mac)
macOS	`arm64` (Apple Silicon)	`x86_64` (Intel Mac)
Linux	`x86_64`, `arm64`	—
Windows	`x86_64`	`arm64`

¹ MediaPipe text inference (.task / .bin) on Android also works on x86_64 and armeabi-v7a because Google ships those ABIs in tasks-genai. Everything else (.litertlm FFI, embedding via LiteRT FFI, image generation) is arm64-v8a only:

Android feature	arm64-v8a	x86_64	armeabi-v7a
Text inference (`.task` / `.bin`)	✅	✅	✅
`.litertlm` (FFI)	✅	❌	❌
Embedding (LiteRT FFI)	✅	❌	❌
Image generation (vision)	✅	❌	❌

If your Android app uses only the arm64-only features, restrict the build to arm64 so the Play Store does not offer broken APKs to incompatible devices:

android {
    defaultConfig {
        ndk { abiFilters 'arm64-v8a' }
    }
}

For development, prefer an Apple Silicon Mac — the Android emulator runs arm64-v8a natively, and macOS / iOS Simulator builds are arm64.

Setup

⚠️ Important: Complete platform-specific setup before using the plugin.

Download Model and optionally LoRA Weights: Obtain a model from the Supported Models section or HuggingFace

For multimodal support, download a Gemma3n model (including the .litertlm builds) that supports vision input.
Optionally, fine-tune a model for your specific use case.
If you have LoRA weights, you can use them to customize the model's behavior without retraining the entire model.
There is an article that describes all of these approaches.

Platform specific setup:

iOS — required by any inference engine package (flutter_gemma_litertlm and/or flutter_gemma_mediapipe)

Set minimum iOS version in Podfile:

platform :ios, '16.0'  # Required for MediaPipe GenAI

Enable file sharing in Info.plist:

<key>UIFileSharingEnabled</key>
<true/>

Add network access description in Info.plist (for development):

<key>NSLocalNetworkUsageDescription</key>
<string>This app requires local network access for model inference services.</string>

Enable performance optimization in Info.plist (optional):

<key>CADisableMinimumFrameDurationOnPhone</key>
<true/>

Add memory entitlements in Runner.entitlements (for large models):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.developer.kernel.extended-virtual-addressing</key>
	<true/>
	<key>com.apple.developer.kernel.increased-memory-limit</key>
	<true/>
	<key>com.apple.developer.kernel.increased-debugging-memory-limit</key>
	<true/>
</dict>
</plist>

Change the linking type of pods to static in Podfile:

use_frameworks! :linkage => :static

No host-side Podfile post_install is required on iOS — flutter_gemma patches the upstream LiteRT-LM dlopen path to use @executable_path/Frameworks/<X>.framework/<X> so dyld resolves Metal accelerators directly through the Native-Assets-bundled framework. This also keeps Runner.app/Frameworks/ App-Store-clean (fixes ITMS-90432, see #245).

Android

GPU (any engine): if you want to run on the GPU, add OpenCL support to the manifest. Required by both inference engines (flutter_gemma_litertlm and flutter_gemma_mediapipe). CPU-only? Skip this step.

Add the following to AndroidManifest.xml, just above the closing </application> tag:

 <uses-native-library
     android:name="libOpenCL.so"
     android:required="false"/>
 <uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
 <uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>

ProGuard/R8 (only if you use flutter_gemma_mediapipe): the package ships its own consumer ProGuard rules, so release builds work out of the box. If you still hit UnsatisfiedLinkError / missing MediaPipe classes, add to your proguard-rules.pro:

# MediaPipe
-keep class com.google.mediapipe.** { *; }
-dontwarn com.google.mediapipe.**

# Protocol Buffers
-keep class com.google.protobuf.** { *; }
-dontwarn com.google.protobuf.**

flutter_gemma_litertlm is delivered as a Native-Assets dylib (no MediaPipe Java classes), so it needs no ProGuard rules.

Web

Web runs on the GPU backend only (MediaPipe has no web CPU backend). Add the CDN script(s) for the engine package(s) you use to your web/index.html.

flutter_gemma_mediapipe (.task / -web.task models) — add:

  <script type="module">
  import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27';
  window.FilesetResolver = FilesetResolver;
  window.LlmInference = LlmInference;
  </script>

flutter_gemma_litertlm (.litertlm web models — early preview) — add the handshake below. The @litert-lm/core ESM doesn't assign window globals and module scripts are deferred, so Dart must await window.litertLmReady (which resolves to the Engine constructor) before any static interop:

  <script type="module">
  window.litertLmReady = (async () => {
    const m = await import('https://cdn.jsdelivr.net/npm/@litert-lm/core@0.12.1/+esm');
    window.Engine = m.Engine;
    return m.Engine;
  })();
  </script>

flutter_gemma_rag_sqlite (web RAG) — add the wa-sqlite loader; see that package's README for the exact <script> + Subresource-Integrity hash.

Model compatibility: mobile .task models often don't work on web — use the -web.task (MediaPipe) or .litertlm (LiteRT-LM) web variant. Check the model repo for web-compatible builds.

Desktop (macOS, Windows, Linux) — requires flutter_gemma_litertlm

⚠️ Desktop Model Format

Desktop is served exclusively by the flutter_gemma_litertlm package and uses LiteRT-LM format only (.litertlm files). There is no MediaPipe engine on desktop — .task / .bin models used on mobile/web are NOT compatible with desktop. (flutter_gemma_embeddings and flutter_gemma_rag_qdrant / flutter_gemma_rag_sqlite also support desktop.)

The native library is fetched at build time by the package's Native-Assets hook — no manual download/bundling. The setup below applies to flutter_gemma_litertlm.

Inference (LiteRT-LM C API) and embeddings (LiteRT C API) on all native platforms run via dart:ffi directly in the Dart process — no JVM, no gRPC, no separate server. Native libraries are downloaded by hook/build.dart (Native Assets) at build time and bundled into the app automatically.

Platform	Architecture	GPU Acceleration	Status
macOS	arm64 (Apple Silicon)	Metal	✅ Ready
macOS	x86_64 (Intel)	-	❌ Not Supported
Windows	x86_64	DirectX 12	✅ Ready
Windows	arm64	-	❌ Not Supported
Linux	x86_64	Vulkan	✅ Ready ¹
Linux	arm64	Vulkan	✅ Ready ¹

¹ Linux GPU requires a proper vendor Vulkan driver (NVIDIA / AMD / Intel). Mesa's llvmpipe software fallback is not sufficient for Gemma 4 — its hardcoded 128 MB maxStorageBufferRange is below the model's per-buffer requirement. Install the vendor driver (e.g. nvidia-driver-535-server on Ubuntu) before running on GPU.

macOS Setup:

macOS requires a small post_install block in your macos/Podfile. The Apple accelerator dylibs Google ships upstream (libGemmaModelConstraintProvider.dylib, libLiteRtMetalAccelerator.dylib, libLiteRtTopKMetalSampler.dylib) were linked without -Wl,-headerpad_max_install_names, so Dart Native Assets' JIT bundling path (used by dart run / dart build_runner / flutter test on a pure Dart library) cannot rewrite their install_name to a long absolute path inside .dart_tool/lib/ and aborts (#247). To unblock both dart run and flutter build macos, the plugin's hook/build.dart skips bundling those three through Native Assets on macOS, and we instead copy them into App.app/Contents/Frameworks/ ourselves and patch LiteRtLm.dylib's LC_LOAD_DYLIB reference to the new framework path.

Paste this into your macos/Podfile (replacing any existing post_install block) and run pod install:

post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
  end

  # flutter_gemma: bundle Apple accelerator dylibs as .framework bundles
  # into Contents/Frameworks/ and re-point LiteRtLm.dylib's LC_LOAD_DYLIB
  # reference to GemmaModelConstraintProvider's new path. 3-tier dylib
  # source fallback: Native Assets cache (pub.dev users) → plugin symlink
  # → in-repo prebuilt/. See README -> macOS Setup and #247/#255.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      phase_name = '[flutter_gemma] Setup LiteRT-LM macOS'

      # Only the app target embeds the Frameworks/ this phase patches.
      # RunnerTests inherits Runner's framework search paths and has no
      # Contents/Frameworks of its own — having the phase there creates a
      # cross-target dependency on Runner's framework output that Xcode
      # reports as "Cycle inside Flutter Assemble". Remove any stale copy
      # from non-app targets and skip them.
      unless user_target.name == 'Runner'
        user_target.build_phases
          .select { |p| p.respond_to?(:name) && p.name == phase_name }
          .each { |p| user_target.build_phases.delete(p) }
        next
      end

      existing = user_target.shell_script_build_phases.find { |p| p.name == phase_name }
      phase = existing || user_target.new_shell_script_build_phase(phase_name)
      # Declare a sentinel output so Xcode can order this phase in the
      # dependency graph instead of treating it as "runs every build" with
      # no outputs (the other half of the cycle warning). The script
      # `touch`es this file at the end.
      phase.output_paths = ['$(DERIVED_FILE_DIR)/flutter_gemma_litertlm_macos.stamp']
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Contents/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          exit 0
        fi
        # Sweep any leftover lib*.dylib symlinks from older flutter_gemma versions.
        for base in LiteRtMetalAccelerator LiteRtTopKMetalSampler GemmaModelConstraintProvider; do
          rm -f "${FRAMEWORKS}/lib${base}.dylib"
        done
        # Wrap each upstream dylib into a .framework bundle inside the app's
        # Contents/Frameworks/ so dlopen("@executable_path/../Frameworks/<X>.framework/<X>")
        # (the path the patched gpu_registry.cc uses) resolves at runtime.
        # Resolve dylib source — Native Assets cache (pub.dev), then path-dep fallbacks.
        for candidate in \
            "${HOME}/Library/Caches/flutter_gemma/native/macos_arm64" \
            "${PODS_ROOT}/../Flutter/ephemeral/.symlinks/plugins/flutter_gemma/native/litert_lm/prebuilt/macos_arm64" \
            "${SRCROOT}/../../native/litert_lm/prebuilt/macos_arm64"; do
          if [ -f "${candidate}/libGemmaModelConstraintProvider.dylib" ]; then
            PLUGIN_PREBUILT="${candidate}"
            break
          fi
        done
        if [ -z "${PLUGIN_PREBUILT:-}" ]; then
          echo "[flutter_gemma] ERROR: macOS companion dylibs not found. Run 'flutter clean && flutter pub get'."
          exit 1
        fi
        for base in GemmaModelConstraintProvider LiteRtMetalAccelerator LiteRtTopKMetalSampler; do
          src="${PLUGIN_PREBUILT}/lib${base}.dylib"
          if [ ! -f "${src}" ]; then
            echo "[flutter_gemma] WARNING: ${src} not found — runtime dlopen will fail"
            continue
          fi
          fw_dir="${FRAMEWORKS}/${base}.framework"
          mkdir -p "${fw_dir}/Versions/A/Resources"
          cp "${src}" "${fw_dir}/Versions/A/${base}"
          install_name_tool -id "@rpath/${base}.framework/Versions/A/${base}" \\
            "${fw_dir}/Versions/A/${base}" 2>/dev/null || true
          (cd "${fw_dir}" && ln -sfh A Versions/Current && ln -sfh "Versions/Current/${base}" "${base}" && ln -sfh "Versions/Current/Resources" Resources)
          cat > "${fw_dir}/Versions/A/Resources/Info.plist" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>CFBundleExecutable</key><string>${base}</string>
  <key>CFBundleIdentifier</key><string>dev.flutterberlin.flutter_gemma.${base}</string>
  <key>CFBundleVersion</key><string>1</string>
  <key>CFBundleShortVersionString</key><string>1.0</string>
  <key>CFBundlePackageType</key><string>FMWK</string>
</dict>
</plist>
EOF
        done
        # Re-point LiteRtLm.dylib's LC_LOAD_DYLIB at the new framework path.
        LITERTLM="${FRAMEWORKS}/LiteRtLm.framework/Versions/A/LiteRtLm"
        if [ -f "${LITERTLM}" ]; then
          install_name_tool -change \\
            @rpath/libGemmaModelConstraintProvider.dylib \\
            @rpath/GemmaModelConstraintProvider.framework/Versions/A/GemmaModelConstraintProvider \\
            "${LITERTLM}" 2>/dev/null || true
          codesign --force --sign - "${LITERTLM}" 2>/dev/null || true
        fi
        # Write the declared output so Xcode marks the phase up-to-date and
        # orders it deterministically (avoids the Flutter Assemble cycle).
        mkdir -p "$(dirname "${SCRIPT_OUTPUT_FILE_0}")"
        touch "${SCRIPT_OUTPUT_FILE_0}"
      SHELL
    end
  end
end

Add to macos/Runner/DebugProfile.entitlements and Release.entitlements:

<key>com.apple.security.cs.disable-library-validation</key>
<true/>

Windows Setup:

No additional configuration required. hook/build.dart (Native Assets) downloads LiteRtLm.dll + companion DLLs + the DXC runtime (dxil.dll, dxcompiler.dll v1.9.2602) from the GitHub release on first build, verifies them via SHA256, and bundles them next to your app.exe. End users need the Microsoft Visual C++ Redistributable 2019+; most modern Windows 10/11 systems already have it.

Linux Setup:

No additional configuration required. Build dependencies:

sudo apt install clang cmake ninja-build libgtk-3-dev lld

For GPU acceleration, install the vendor Vulkan driver (NVIDIA / AMD / Intel) in addition to the Vulkan loader. Mesa's llvmpipe software fallback caps maxStorageBufferRange at 128 MB and Gemma 4 will not run on it.

sudo apt install vulkan-tools libvulkan1
# Plus your vendor driver, e.g. NVIDIA:
sudo apt install nvidia-driver-535-server

📚 See DESKTOP_SUPPORT.md for the full desktop documentation.

Quick Start

⚠️ Important: Complete platform setup before running this code.

1. Install a Model (One Time)

import 'package:flutter_gemma/flutter_gemma.dart';

// Install model. URL example uses the .litertlm variant so the same code
// works on Desktop (Windows/macOS/Linux) and mobile/web. For web only, the
// `.task`/`-web.task` variants of the same model also work.
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
).fromNetwork(
  'https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_ekv4096.litertlm',
  token: 'your_hf_token',
).withProgress((progress) {
  print('Downloading: $progress%');
}).install();

Mobile/Web shortcut: if you don't target Desktop, you can substitute the URL with the .task build of the same model. Desktop targets need the .litertlm build — .task and .bin are MediaPipe-only.

2. Create and Use Model (Multiple Times)

// Create model with specific configuration
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

// Use model
final chat = await model.createChat();
await chat.addQueryChunk(Message.text(
  text: 'Explain quantum computing',
  isUser: true,
));
final response = await chat.generateChatResponse();

// Cleanup
await model.close();

System Instructions

Control model behavior with a system-level instruction:

final chat = await model.createChat(
  systemInstruction: 'You are a concise assistant. Always respond in bullet points.',
);

Platform support:

Android .litertlm / Desktop: Passed natively via ConversationConfig.systemInstruction
Android .task / iOS / Web: Prepended to first user message as fallback
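
The fallback path (prepending the instruction to the first user message) can be sketched as follows. This is an illustration under assumptions: `ChatTurn` and `withSystemFallback` are hypothetical names, and the blank-line separator between instruction and message is a guess, not the plugin's documented format.

```dart
/// Hypothetical message type for this sketch (not the plugin's Message).
class ChatTurn {
  const ChatTurn(this.text, {required this.isUser});
  final String text;
  final bool isUser;
}

/// Returns a copy of [history] with [systemInstruction] prepended to the
/// first user turn. All other turns are left untouched.
List<ChatTurn> withSystemFallback(
  List<ChatTurn> history,
  String? systemInstruction,
) {
  if (systemInstruction == null || systemInstruction.isEmpty) {
    return List.of(history);
  }
  final result = <ChatTurn>[];
  var applied = false;
  for (final turn in history) {
    if (turn.isUser && !applied) {
      // Assumed separator: one blank line between instruction and message.
      result.add(ChatTurn('$systemInstruction\n\n${turn.text}', isUser: true));
      applied = true;
    } else {
      result.add(turn);
    }
  }
  return result;
}
```

Only the first user turn carries the instruction, so later turns are sent unchanged.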

3. Multiple Instances from Same Model

// Install once
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// Create multiple instances
final quickModel = await FlutterGemma.getActiveModel(maxTokens: 512);
final deepModel = await FlutterGemma.getActiveModel(maxTokens: 4096);
// Both use the SAME model file!

Concurrent sessions (`openSession`)

A single loaded model can serve several independent dialogues at once. openSession() returns a session with its own conversation history, detached from the legacy model.session singleton; openChat() is the same for the higher-level chat API.

Why use it. The model weights (the big, expensive part — hundreds of MB to several GB) are loaded once and shared across every session; each session only adds its own lightweight conversation context. Without openSession, serving two independent conversations would mean either loading the model twice (doubling the weight memory) or constantly clearing and rebuilding one session's history when you switch between them.

When you'd reach for it:

Multiple chats in one app — e.g. a tabbed chat UI where each tab keeps its own thread, all backed by one loaded model.
Different roles / system instructions side by side — one session with a "translator" system instruction, another as a "code reviewer", without reloading weights between them.
Background work alongside an active chat — e.g. summarizing or tagging a document in one session while the user keeps chatting in another (they take turns on the accelerator — see the serialization note below).
A/B prompt comparison — run the same model with two different setups and compare, sharing the loaded weights.

If you only ever have one conversation at a time, stick with the simpler createSession() / createChat() singleton API — you don't need this.

final model = await FlutterGemma.getActiveModel(maxTokens: 1024);

final chatA = await model.openChat(); // independent context A
final chatB = await model.openChat(); // independent context B

await chatA.addQueryChunk(Message(text: 'My name is Alice.', isUser: true));
await chatA.generateChatResponse();

await chatB.addQueryChunk(Message(text: 'My name is Bob.', isUser: true));
await chatB.generateChatResponse();

// Each remembers only its own context.
await chatA.addQueryChunk(Message(text: 'What is my name?', isUser: true));
print(await chatA.generateChatResponse()); // "Alice"

model.sessions;        // all live sessions (legacy + open)
await chatA.session.close();  // closing one leaves the others usable

⚠️ Concurrent contexts, serialized inference. The sessions are logically independent, but only one session generates at a time — calling generateResponse() on a second session while another is still running blocks until the first finishes. Generation is not parallel. This is intentional: parallel on-device inference would contend for the accelerator and risk OOM.
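
The "independent contexts, one generation at a time" behavior can be pictured with a small async lock. The sketch below is not the plugin's code; `SerialQueue` is a hypothetical class that only shows how a second request waits for the first to finish.

```dart
import 'dart:async';

/// Minimal async lock: each call to [run] waits for the previous one to
/// finish before starting, so tasks execute strictly one at a time.
class SerialQueue {
  Future<void> _tail = Future<void>.value();

  Future<T> run<T>(Future<T> Function() task) {
    // Start this task only after everything queued before it has finished.
    final Future<T> result = _tail.then<T>((_) => task());
    // Keep the chain usable even if this task fails; the caller still sees
    // the error through the returned future.
    _tail = result.then<void>((_) {}, onError: (Object _) {});
    return result;
  }
}
```

In this picture, each session's generate call would go through the same queue, which is why a second session blocks instead of running in parallel. Your own code does not need such a lock, since the serialization described above happens inside the plugin.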

Per-platform behavior (transparent to your code — the API is identical):

Path	How it works
`.litertlm` — native (Android/iOS/macOS/Windows/Linux)	Engine allows one live conversation; sessions multiplex — the active session's history is replayed on switch.
`.litertlm` — web (`@litert-lm/core`)	Separate conversations; generation still serialized.
`.task` — MediaPipe (Android/iOS)	N real `LlmInferenceSession` live at once (each with its own KV cache); generation serialized by a mutex.
`.task` — MediaPipe web	❌ Not yet — `openSession()` throws `UnsupportedError`. Planned for a future release.

Memory: each open session holds its own context (~100–500 MB depending on model + maxTokens). On phones with large models (Gemma 4 E2B+), several concurrent sessions can OOM. Cap the count with maxConcurrentSessions: on getActiveModel(...) — openSession() throws StateError past the cap. Multi-session is most reliable on desktop and high-end mobile with small models (Gemma 3 1B / 270M).
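
To make the cap concrete, here is a minimal sketch of a registry that refuses to open more than a fixed number of sessions. `SessionPool` is a hypothetical class written for this example; only the `maxConcurrentSessions` name and the `StateError` behavior come from the text above.

```dart
/// Hypothetical registry illustrating a maxConcurrentSessions-style cap.
class SessionPool {
  SessionPool(this.maxConcurrentSessions);

  final int maxConcurrentSessions;
  final Set<int> _open = <int>{};
  int _nextId = 0;

  int get openCount => _open.length;

  /// Opens a session and returns its id; throws StateError past the cap.
  int open() {
    if (_open.length >= maxConcurrentSessions) {
      throw StateError('Session limit of $maxConcurrentSessions reached');
    }
    final id = _nextId++;
    _open.add(id);
    return id;
  }

  /// Closing a session frees its slot for a new one.
  void close(int id) {
    _open.remove(id);
  }
}
```

In an app, the practical pattern is the same: close sessions you no longer need before opening new ones, and treat the StateError as a signal to reuse or close an existing session.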

Installation Sources

// Network — .litertlm is the cross-platform default (Android/iOS/Desktop).
// For mobile-only or web-only apps you can substitute a .task URL of the
// same model.
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork('https://example.com/model.litertlm', token: 'optional')
  .install();

// Flutter assets
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromAsset('assets/models/model.litertlm')
  .install();

// Native bundle
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromBundled('model.litertlm')
  .install();

// External file
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromFile('/path/to/model.litertlm')
  .install();

Modern API vs Legacy API

Modern API (Recommended) ✅

Benefits:

✅ Cleaner, more intuitive
✅ Type-safe ModelSource
✅ Automatic active model management
✅ Install once, create many instances

Usage:

await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
final model = await FlutterGemma.getActiveModel(maxTokens: 2048);

Legacy API ⚠️ Deprecated

⚠️ DEPRECATED: This API is maintained for backwards compatibility only. New projects should use the Modern API above.

Still works but requires manual ModelType specification:

final model = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,  // Must specify every time
  maxTokens: 2048,
);

Initialize Flutter Gemma

Call FlutterGemma.initialize(...) once in main() and register the opt-in packages you added to pubspec.yaml. Core registers no engine on its own, so without this step getActiveModel() / createEmbeddingModel() throw a clear "add the engine package" error.

import 'package:flutter/widgets.dart';
import 'package:flutter_gemma/flutter_gemma.dart';
import 'package:flutter_gemma_litertlm/flutter_gemma_litertlm.dart';
import 'package:flutter_gemma_mediapipe/flutter_gemma_mediapipe.dart';
import 'package:flutter_gemma_embeddings/flutter_gemma_embeddings.dart';
import 'package:flutter_gemma_rag_qdrant/flutter_gemma_rag_qdrant.dart';

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  FlutterGemma.initialize(
    // Inference engines — add the ones whose packages you depend on:
    inferenceEngines: const [
      LiteRtLmEngine(),     // flutter_gemma_litertlm  — .litertlm models
      MediaPipeEngine(),    // flutter_gemma_mediapipe — .task / .bin models
    ],
    // Optional — embeddings (needed for RAG / generateEmbedding):
    embeddingBackends: const [
      LiteRtEmbeddingBackend(), // flutter_gemma_embeddings
    ],
    // Optional — RAG vector store (pick one; native here):
    vectorStore: QdrantVectorStore(), // flutter_gemma_rag_qdrant

    // Common settings:
    huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
    maxDownloadRetries: 10,
  );

  runApp(MyApp());
}

Which package provides each parameter:

Parameter	Provided by	Notes
`inferenceEngines: [LiteRtLmEngine()]`	`flutter_gemma_litertlm`	`.litertlm` (mobile + desktop + web)
`inferenceEngines: [MediaPipeEngine()]`	`flutter_gemma_mediapipe`	`.task` / `.bin` (mobile + web)
`embeddingBackends: [LiteRtEmbeddingBackend()]`	`flutter_gemma_embeddings`	text embeddings
`vectorStore: QdrantVectorStore()`	`flutter_gemma_rag_qdrant`	native RAG
`vectorStore: SqliteVectorStore()` / `WebSqliteVectorStore()`	`flutter_gemma_rag_sqlite`	native / web RAG

Add only the engines you ship. Passing both LiteRtLmEngine() and MediaPipeEngine() lets one app run both formats — the registry routes each model to the engine that handles its file type. On web, choose vectorStore: WebSqliteVectorStore() (flutter_gemma_rag_qdrant is native-only).

Common settings:

huggingFaceToken: Authentication token for gated models (Gemma3n, EmbeddingGemma)
maxDownloadRetries: Number of retry attempts for failed downloads (default: 10)
webStorageMode: (Web only) Storage strategy for model files (default: cacheApi)
- WebStorageMode.cacheApi: Cache API with Blob URLs (for models <2GB)
- WebStorageMode.streaming: OPFS streaming (for large models >2GB like E4B, 7B)
- WebStorageMode.none: No caching (ephemeral mode for testing)

Use WebStorageMode.streaming when shipping .litertlm web models — the @litert-lm/core engine consumes an OPFS ReadableStream and avoids Chrome's ~2 GB blob-fetch limit on Gemma 4 E2B/E4B web builds.
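The storage-mode guidance above can be summarized as a small decision function. This is a sketch only: StorageChoice is a local stand-in for the plugin's WebStorageMode enum, chooseStorage is a hypothetical helper, and treating "2GB" as 2 GiB (with exactly 2 GiB counted as large) is an assumption.

```dart
/// Local stand-in mirroring the three WebStorageMode names described above;
/// in an app you would use the plugin's own WebStorageMode enum instead.
enum StorageChoice { cacheApi, streaming, none }

/// Assumption: the "2GB" limit is interpreted as 2 GiB.
const int twoGiB = 2 * 1024 * 1024 * 1024;

/// Picks a storage mode following the documented guidance:
/// large or .litertlm web models stream from OPFS, small models use the
/// Cache API, and small throwaway test runs may skip caching entirely.
StorageChoice chooseStorage({
  required int modelBytes,
  bool isLiteRtLm = false,
  bool ephemeral = false,
}) {
  final large = modelBytes >= twoGiB;
  if (large || isLiteRtLm) return StorageChoice.streaming;
  return ephemeral ? StorageChoice.none : StorageChoice.cacheApi;
}
```

The function never returns none for a large model, which matches the warning that uncached models above 2 GB can exhaust Chrome's blob memory.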

Next Steps:

📖 Authentication Setup - Configure tokens for gated models
📦 Model Sources - Learn about different model sources
🌐 Platform Support - Web vs Mobile differences
🔄 Migration Guide - Upgrade from Legacy API
📚 Legacy API Documentation - For backwards compatibility

HuggingFace Authentication 🔐

Many models require authentication to download from HuggingFace. Never commit tokens to version control.

✅ Recommended: config.json Pattern

This pattern keeps tokens out of version control in both development and production builds.

Step 1: Create config template file config.json.example:

{
  "HUGGINGFACE_TOKEN": ""
}

Step 2: Copy and add your token:

cp config.json.example config.json
# Edit config.json and add your token from https://huggingface.co/settings/tokens

Step 3: Add to .gitignore:

# Never commit tokens!
config.json

Step 4: Run with config:

flutter run --dart-define-from-file=config.json

Step 5: Access in code:

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Read from environment (populated by --dart-define-from-file)
  const token = String.fromEnvironment('HUGGINGFACE_TOKEN');

  // Initialize with token (optional if all models are public)
  FlutterGemma.initialize(
    huggingFaceToken: token.isNotEmpty ? token : null,
  );

  runApp(MyApp());
}

Alternative: Environment Variables

export HUGGINGFACE_TOKEN=hf_your_token_here
flutter run --dart-define=HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN

Alternative: Per-Download Token

// Pass token directly for specific downloads
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/resolve/main/gemma-3n-E2B-it-int4.litertlm',
    token: 'hf_your_token_here',  // ⚠️ Not recommended - use config.json
  )
  .install();

Which Models Require Authentication?

Common gated models:

✅ Gemma3n (E2B, E4B) - google/ repos are gated
✅ Gemma 3 1B - litert-community/ requires access
✅ Gemma 3 270M - litert-community/ requires access
✅ EmbeddingGemma - litert-community/ requires access

Public models (no auth needed):

❌ DeepSeek, Qwen3, Qwen 2.5, SmolLM, Phi-4, FastVLM - Public repos

Get your token: https://huggingface.co/settings/tokens

Grant access to gated repos: Visit model page → "Request Access" button

Model Sources 📦

Flutter Gemma supports multiple model sources with different capabilities:

Source Type	Platform	Progress	Resume	Authentication	Use Case
NetworkSource	All	✅ Detailed	⚠️ Server-dependent	✅ Supported	HuggingFace, CDNs, private servers
AssetSource	All	⚠️ End only	❌ No	❌ N/A	Models bundled in app assets
BundledSource	All	⚠️ End only	❌ No	❌ N/A	Native platform resources
FileSource	Native (no Web)	⚠️ End only	❌ No	❌ N/A	User-selected files (file picker)

NetworkSource - Internet Downloads

Downloads models from HTTP/HTTPS URLs with full progress tracking and authentication.

Features:

✅ Progress tracking (0-100%)
⚠️ Resume after interruption (server-dependent, not supported by HuggingFace CDN)
✅ HuggingFace authentication
✅ Smart retry logic with exponential backoff
✅ Background downloads on mobile
✅ Cancellable downloads with CancelToken
✅ Android foreground service for large downloads (>500MB)
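For readers unfamiliar with exponential backoff, the idea is that each retry waits twice as long as the previous one, up to a ceiling. The sketch below illustrates the general technique only; backoffDelay, the 1-second base, and the 30-second cap are hypothetical values and do not describe the plugin's actual retry settings.

```dart
/// Delay before retry number [attempt] (1-based): base * 2^(attempt - 1),
/// limited to [cap]. Base and cap are illustrative, not the plugin's values.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration cap = const Duration(seconds: 30),
}) {
  if (attempt < 1) {
    throw ArgumentError.value(attempt, 'attempt', 'must be >= 1');
  }
  // Bound the exponent so the multiplier stays small on every platform.
  final exponent = attempt - 1 > 20 ? 20 : attempt - 1;
  final delay = base * (1 << exponent);
  return delay > cap ? cap : delay;
}

// Usage sketch: attempts 1, 2, 3 wait 1 s, 2 s, 4 s; attempt 6 would be
// 32 s and is capped at 30 s.
```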

Example:

// Public model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork('https://example.com/model.litertlm')
  .withProgress((progress) => print('$progress%'))
  .install();

// Private model with authentication
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/resolve/main/gemma-3n-E2B-it-int4.litertlm',
    token: 'hf_...',  // Or use FlutterGemma.initialize(huggingFaceToken: ...)
  )
  .withProgress((progress) => setState(() => _progress = progress))
  .install();

Android Foreground Service (Large Downloads):

Android has a 9-minute background execution limit. For large models (>500MB), you can use foreground service mode which shows a notification but bypasses this timeout:

// Auto-detect based on file size (>500MB = foreground) - DEFAULT
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url)  // foreground: null (auto-detect)
  .install();

// Force foreground mode (always show notification)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: true)
  .install();

// Force background mode (may fail for large files)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: false)
  .install();

Foreground Parameter:

null (default): Auto-detect based on file size. Files >500MB use foreground service.
true: Always use foreground service (shows notification, no timeout)
false: Never use foreground service (subject to 9-minute timeout)

Note: iOS uses native URLSession which handles long downloads automatically - no foreground service needed.
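The three-way rule for the foreground parameter can be written as a single expression. The helper below is a hypothetical illustration of that rule, not the plugin's code, and it assumes ">500MB" means more than 500 MiB.

```dart
/// Assumption: the documented ">500MB" threshold is read as 500 MiB.
const int foregroundThresholdBytes = 500 * 1024 * 1024;

/// Mirrors the documented rule for the `foreground` parameter:
/// null -> decide by file size, true -> always, false -> never.
bool usesForegroundService({bool? foreground, required int fileSizeBytes}) {
  return foreground ?? (fileSizeBytes > foregroundThresholdBytes);
}

// Usage sketch: a 600 MiB model with the default (null) setting uses the
// foreground service; forcing foreground: false opts out and accepts the
// background timeout.
```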

Cancelling Downloads:

Use CancelToken to cancel downloads in progress:

import 'package:flutter_gemma/core/model_management/cancel_token.dart';

// Create cancel token
final cancelToken = CancelToken();

// Start download with cancel token
final future = FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(url)
  .withCancelToken(cancelToken)  // ← Pass cancel token via builder
  .withProgress((progress) => print('Progress: $progress%'))
  .install();

// Cancel download from another part of your code
// (e.g., user pressed cancel button)
cancelToken.cancel('User cancelled download');

// Handle cancellation
try {
  await future;
  print('Download completed');
} catch (e) {
  if (CancelToken.isCancel(e)) {
    print('Download was cancelled by user');
  } else {
    print('Download failed: $e');
  }
}

// Check if cancelled
if (cancelToken.isCancelled) {
  print('Reason: ${cancelToken.cancelReason}');
}

CancelToken Features:

✅ Non-breaking: Optional parameter, existing code works without changes
✅ Works with network downloads (inference + embedding models)
✅ Cancels ALL files in multi-file downloads (embedding: model + tokenizer)
✅ Platform-independent (Mobile + Web)
✅ Throws DownloadCancelledException for proper error handling
✅ Thread-safe cancellation

AssetSource - Flutter Assets

Copies models from Flutter assets (declared in pubspec.yaml).

Features:

✅ No network required
✅ Fast installation (local copy)
⚠️ Increases app size significantly
✅ Works offline

Example:

// 1. Add to pubspec.yaml
// assets:
//   - models/gemma3-1b-it.litertlm

// 2. Install from asset
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('models/gemma3-1b-it.litertlm')
  .install();

BundledSource - Native Resources

Production-Ready Offline Models: Include small models directly in your app bundle for instant availability without downloads.

Use Cases:

✅ Offline-first applications (works without internet from first launch)
✅ Small models (Gemma 3 270M ~300MB)
✅ Core features requiring guaranteed availability
⚠️ Not for large models (increases app size significantly)

Platform Setup:

Android (android/app/src/main/assets/models/)

# Place your model file. .litertlm works for both Android and Desktop,
# .task is MediaPipe-only and won't load on Desktop.
android/app/src/main/assets/models/gemma3-270m-it-q8.litertlm

iOS (Add to Xcode project)

Drag model file into Xcode project (.litertlm for FFI; .task for MediaPipe)
Check "Copy items if needed"
Add to target membership

Web (Static files in web/ directory) — web uses MediaPipe only, so .task (or -web.task):

# Place model files in web/ directory
example/web/gemma3-270m-it.task

# Files are automatically copied to build/web/ during production build
flutter build web

⚠️ Web Platform Limitation:

Production only: Bundled resources work ONLY in production builds (flutter build web)
Debug mode: Files in web/ are NOT served by flutter run dev server
For development: Use NetworkSource or AssetSource instead

Features:

✅ Zero network dependency
✅ No installation delay
✅ No storage permission needed
✅ Direct path usage (no file copying)

Example:

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromBundled('gemma3-270m-it-q8.litertlm')
  .install();

App Size Impact:

SmolLM 135M: ~135MB
Gemma 3 270M: ~300MB
Qwen3 0.6B: ~586MB
Consider hosting large models for download instead

FileSource - External Files (Native)

References external files (e.g., user-selected via file picker). Works on Android, iOS, macOS, Linux, Windows. Not available on Web (no local file system).

Features:

✅ No copying (references original file)
✅ Protected from cleanup
❌ Web not supported (no local file system)

Example:

// Native only - after user selects file with file_picker
final path = '/data/user/0/com.app/files/model.litertlm';
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromFile(path)
  .install();

Important: On web, FileSource only works with URLs or asset paths, not local file system paths.

Migration from Legacy to Modern API 🔄

If you're upgrading from the Legacy API, here are common migration patterns:

Installing Models

Legacy API (network download):

// Network download
final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: 'https://example.com/model.bin',
);

await FlutterGemmaPlugin.instance.modelManager
  .downloadModelWithProgress(spec, token: token)
  .listen((progress) {
    print('${progress.overallProgress}%');
  });

Modern API (network download):

// Network download
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://example.com/model.bin',
    token: token,
  )
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

Legacy API (from assets):

// From assets
await modelManager.installModelFromAssetWithProgress(
  'model.bin',
  loraPath: 'lora.bin',
).listen((progress) {
  print('$progress%');
});

Modern API (from assets):

// From assets
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

// LoRA weights can be installed with the model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withLoraFromAsset('lora.bin')
  .install();

Checking Model Installation

Legacy API:

final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: url,
);

final isInstalled = await FlutterGemmaPlugin
  .instance.modelManager
  .isModelInstalled(spec);

Modern API:

final isInstalled = await FlutterGemma
  .isModelInstalled('model.bin');

Key Migration Notes

✅ Simpler imports: Use package:flutter_gemma/core/api/flutter_gemma.dart
✅ Builder pattern: Chain methods for cleaner code
✅ Callback-based progress: Simpler than streams for most cases
✅ Type-safe sources: Compile-time validation of source types
⚠️ Breaking change: Progress values are now an int (0-100) instead of a DownloadProgress object
⚠️ Separate files: Model and LoRA weights installed independently

Model Creation and Inference

Modern API (Recommended):

// Create model with runtime configuration
final inferenceModel = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Legacy API (Still supported):

// Works with both Legacy and Modern installation methods
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,
  preferredBackend: PreferredBackend.gpu,
  maxTokens: 2048,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Usage (Legacy API) ⚠️ DEPRECATED

The pre-Modern stream-based API (FlutterGemmaPlugin.instance.modelManager, installModelFromAsset, downloadModelFromNetworkWithProgress, etc.) is still supported but deprecated. New projects should use the Modern API above.

📚 Full Legacy API reference: docs/LEGACY_API.md

🖼️ Message Types

The plugin now supports different types of messages:

// Text only
final textMessage = Message.text(text: "Hello!", isUser: true);

// Text + Image
final multimodalMessage = Message.withImages(
  text: "What's in this image?",
  imageBytes: [imageBytes],
  isUser: true,
);

// Image only
final imageMessage = Message.imagesOnly(imageBytes: [imageBytes], isUser: true);

// Tool response (for function calling)
final toolMessage = Message.toolResponse(
  toolName: 'change_background_color',
  response: {'status': 'success', 'color': 'blue'},
);

// System information message
final systemMessage = Message.systemInfo(text: "Function completed successfully");

// Thinking content (for models with thinking mode, e.g. DeepSeek, Gemma 4)
final thinkingMessage = Message.thinking(text: "Let me analyze this problem...");

// Check if message contains image
if (message.hasImage) {
  print('This message contains an image');
}

// Create a copy of message
final copiedMessage = message.copyWith(text: "Updated text");

💬 Response Types

The model can return different types of responses depending on capabilities:

// Handle different response types
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text token: ${response.token}');
    // Use response.token to update your UI incrementally
    
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function (Gemma 4, Gemma3n, Gemma 3 1B,
    // FunctionGemma, DeepSeek, Qwen3, Qwen 2.5, Phi-4)
    print('Function: ${response.name}');
    print('Arguments: ${response.args}');
    
    // Execute the function and send response back
    _handleFunctionCall(response);
  } else if (response is ThinkingResponse) {
    // Model's reasoning process (models with thinking mode, e.g. DeepSeek, Gemma 4)
    print('Thinking: ${response.content}');
    
    // Show thinking process in UI
    _showThinkingBubble(response.content);
  }
});

Response Types:

TextResponse: Contains a text token (response.token) for regular model output
FunctionCallResponse: Contains function name (response.name) and arguments (response.args) when the model wants to call a function
ThinkingResponse: Contains the model's reasoning process (response.content) for models with thinking mode enabled (Gemma 4, DeepSeek, Qwen3)

🎯 Supported Models

Platform Support

Model	Size	Desktop	Mobile	Web
Gemma 4 E2B	2.4GB	✅	✅	✅
Gemma 4 E4B	4.3GB	✅	✅	✅
Gemma3n E2B	3.1GB	✅	✅	✅
Gemma3n E4B	6.5GB	✅	✅	✅
FastVLM 0.5B	0.5GB	✅	❌	❌
Gemma-3 1B	0.5GB	✅	✅	✅
Gemma 3 270M	0.3GB	✅	✅	✅
FunctionGemma 270M	284MB	✅	✅	❌
Qwen3 0.6B	586MB	✅	✅	✅
Qwen 2.5 1.5B	1.6GB	✅	✅	❌
Qwen 2.5 0.5B	0.5GB	❌	✅	❌
SmolLM 135M	135MB	❌	✅	❌
Phi-4 Mini	3.9GB	✅	✅	✅
DeepSeek R1	1.7GB	❌	✅	❌

📊 Text Embedding Models

All embedding models generate 768-dimensional vectors. The numbers in names (64/256/512/1024/2048) indicate maximum input sequence length in tokens, not embedding dimension.

Model	Parameters	Dimensions	Max Seq Length	Size	Best For	Auth Required
Gecko 64	110M	768D	64 tokens	110MB	Short queries, real-time search	❌
Gecko 256	110M	768D	256 tokens	114MB	Balanced speed/accuracy	❌
Gecko 512	110M	768D	512 tokens	116MB	Medium context documents	❌
EmbeddingGemma 256	300M	768D	256 tokens	179MB	High accuracy, short context	✅
EmbeddingGemma 512	300M	768D	512 tokens	179MB	High accuracy, medium context	✅
EmbeddingGemma 1024	300M	768D	1024 tokens	183MB	Long documents, detailed content	✅
EmbeddingGemma 2048	300M	768D	2048 tokens	196MB	Very long documents	✅

Performance Comparison (Android Pixel 8 with GPU acceleration):

Gecko 64: ~109ms/doc embedding, 130ms search (⚡ fastest - 2.6x faster than EmbeddingGemma)
EmbeddingGemma 256: ~286ms/doc embedding, 342ms search (🎯 more accurate - 300M vs 110M params)

Use Cases:

✅ Gecko 64: Real-time search, mobile apps, short queries (≤64 tokens), fast inference
✅ Gecko 256/512: Balanced use cases, general-purpose embeddings, good speed/quality tradeoff
✅ EmbeddingGemma 256/512: High-quality embeddings, semantic search, better accuracy
✅ EmbeddingGemma 1024/2048: Long documents, detailed content, research papers, articles
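Because the number in each model name is a maximum input length, choosing a variant comes down to finding the smallest one that still fits your text. The helper below is a hypothetical sketch of that selection; smallestVariantFor is not a plugin API, and how the plugin treats input longer than the model's limit is not covered here.

```dart
/// Returns the smallest sequence-length variant that fits [tokenCount],
/// or null when the text is longer than every variant (in which case it
/// would need to be split or shortened before embedding).
int? smallestVariantFor(int tokenCount, List<int> variants) {
  final sorted = [...variants]..sort();
  for (final maxLen in sorted) {
    if (tokenCount <= maxLen) return maxLen;
  }
  return null;
}

// Sequence lengths taken from the table above.
const geckoVariants = [64, 256, 512];
const embeddingGemmaVariants = [256, 512, 1024, 2048];

// Usage sketch: a 300-token passage needs at least the 512 variant of
// EmbeddingGemma, while a 40-token query fits Gecko 64.
```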

🔎 On-device RAG / Vector Store

Native platforms (Android, iOS, macOS, Linux, Windows) use qdrant-edge as the default vector store since 0.16. Web stays on wa-sqlite (qdrant-edge can't target WASM yet). Same Dart API on both — code is portable across platforms.

import 'package:flutter_gemma/flutter_gemma.dart';

// 1. Install an embedding model (any of Gecko / EmbeddingGemma)
await FlutterGemma.installEmbedder()
    .modelFromNetwork(
      'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/embeddinggemma-300M_seq256_mixed-precision.tflite',
      token: 'hf_...',
    )
    .tokenizerFromNetwork(
      'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/sentencepiece.model',
      token: 'hf_...',
    )
    .install();

// 2. Initialize the vector store (one shard per database path)
await FlutterGemmaPlugin.instance.initializeVectorStore('rag_store');

// 3. Add documents — let the plugin compute embeddings for you
for (final doc in docs) {
  await FlutterGemmaPlugin.instance.addDocument(
    id: doc.id,
    content: doc.content,
    metadata: '{"category":"science","lang":"en"}',
  );
}

// 3b. Or batch-embed yourself and feed pre-computed vectors via
//     addDocumentWithEmbedding(...) for higher throughput.
final embedder = FlutterGemmaPlugin.instance.initializedEmbeddingModel!;
final embeddings = await embedder.generateEmbeddings(
  docs.map((d) => d.content).toList(),
  taskType: TaskType.retrievalDocument,
);
for (var i = 0; i < docs.length; i++) {
  await FlutterGemmaPlugin.instance.addDocumentWithEmbedding(
    id: docs[i].id,
    content: docs[i].content,
    embedding: embeddings[i],
    metadata: '{"category":"science","lang":"en"}',
  );
}

// 4. Semantic search, with optional payload-aware Filter (native only)
final results = await FlutterGemmaPlugin.instance.searchSimilar(
  query: 'quantum entanglement',
  topK: 10,
  threshold: 0.0,
  filter: Filter(
    must: [FieldEquals(key: 'category', value: 'science')],
    mustNot: [FieldEquals(key: 'lang', value: 'fr')],
  ),
);

Filter supports must / should / mustNot lists of FieldEquals, FieldRange, FieldMatchAny conditions. On Web the filter argument is silently ignored — wa-sqlite has no payload-filter support.
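To make the must / should / mustNot combination concrete, here is a small self-contained sketch of how such a filter could be evaluated against a document's metadata JSON. It assumes Qdrant-style semantics (every must condition holds, at least one should condition holds when any are given, and no mustNot condition holds). EqualsCondition and passesFilter are hypothetical stand-ins written for illustration; they are not the plugin's Filter or FieldEquals classes, and the native backend may evaluate filters differently.

```dart
import 'dart:convert';

/// Hypothetical stand-in for one equality condition on a payload field.
class EqualsCondition {
  const EqualsCondition(this.key, this.value);
  final String key;
  final Object? value;

  bool matches(Map<String, dynamic> payload) => payload[key] == value;
}

/// Returns true when [metadataJson] passes the filter under the assumed
/// semantics: all of `must`, any of `should` (if non-empty), none of `mustNot`.
bool passesFilter(
  String metadataJson, {
  List<EqualsCondition> must = const [],
  List<EqualsCondition> should = const [],
  List<EqualsCondition> mustNot = const [],
}) {
  final payload = jsonDecode(metadataJson) as Map<String, dynamic>;
  final mustOk = must.every((c) => c.matches(payload));
  final shouldOk = should.isEmpty || should.any((c) => c.matches(payload));
  final mustNotOk = !mustNot.any((c) => c.matches(payload));
  return mustOk && shouldOk && mustNotOk;
}

// Usage sketch with the metadata string from the example above:
// passesFilter(
//   '{"category":"science","lang":"en"}',
//   must: const [EqualsCondition('category', 'science')],
//   mustNot: const [EqualsCondition('lang', 'fr')],
// ); // the document is kept
```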

Benchmarks comparing qdrant-edge to the legacy sqlite + local_hnsw backend across 5 platforms (5 000 documents, EmbeddingGemma 300M, 768-dim): see example/integration_test/benchmarks/comparison.md.

🛠️ Model Function Calling Support

Function calling is currently supported by the following models:

✅ Models with Function Calling Support

Gemma 4 (E2B, E4B) - Full function calling support
Gemma3n (E2B, E4B) - Full function calling support
Gemma 3 1B - Function calling support
FunctionGemma 270M - Google's specialized function calling model
DeepSeek R1 - Function calling + thinking mode support
Qwen models (0.5B, 0.6B, 1.5B) - Full function calling support
Phi-4 Mini - Advanced reasoning with function calling support

❌ Models WITHOUT Function Calling Support

Gemma 3 270M - Text generation only
SmolLM 135M - Text generation only
FastVLM 0.5B - Vision model, no function calling

Important Notes:

When using unsupported models with tools, the plugin will log a warning and ignore the tools
Models will work normally for text generation even if function calling is not supported
Check the supportsFunctionCalls property in your model configuration

Platform Support Details 🌐

Feature Comparison

Feature	Android	iOS	Web	Desktop	Notes
Text Generation	✅ Full	✅ Full	✅ Full	✅ Full	All models supported
Image Input (Multimodal)	✅ Full	✅ Full	✅ Full	✅ Full	Verified on macOS Metal and Linux Vulkan (Gemma 4 + Gemma 3n)
Audio Input	✅ Full	✅ Full ¹	❌ Not supported	✅ `.litertlm` only	Gemma3n E2B/E4B + Gemma 4; iOS device-only; Desktop via FFI
Function Calling	✅ Full	✅ Full	✅ Full	✅ Full	Gemma 4 native (SDK chat template)
Thinking Mode	✅ Full	✅ Full	❌ Not supported	✅ Full	Gemma 4 / DeepSeek / Qwen3; Web MediaPipe has no `extraContext`
Stop Generation	✅ Full	✅ Full	✅ Full	✅ Full	Cancel mid-process
GPU Acceleration	✅ Full	✅ Full	✅ Full	✅ Full	Metal/WebGPU/Vulkan/DX12
NPU Acceleration	✅ Full	❌ Not supported	❌ Not supported	✅ Windows	Android (.litertlm) + Windows Intel LunarLake/PantherLake
CPU Backend	✅ Full	✅ Full	❌ Not supported	✅ Full	MediaPipe limitation
Streaming Responses	✅ Full	✅ Full	✅ Full	✅ Full	Real-time generation
LoRA Support	✅ Full	✅ Full	✅ Full	❌ Not supported	LiteRT-LM limitation
Text Embeddings	✅ Full	✅ Full	✅ Full	✅ Full	EmbeddingGemma, Gecko
VectorStore (RAG)	✅ qdrant-edge	✅ qdrant-edge	✅ wa-sqlite (WASM)	✅ qdrant-edge	Semantic search + payload `Filter` (native)
File Downloads	✅ Background	✅ Background	✅ In-memory	✅ Background	Platform-specific
Asset Loading	✅ Full	✅ Full	✅ Full	❌ Not supported	Flutter assets N/A
Bundled Resources	✅ Full	✅ Full	✅ Full	❌ Not supported	Native bundles only
External Files (FileSource)	✅ Full	✅ Full	❌ Not supported	✅ Full	No local FS on web

Web column note: the Web ✅ marks above describe the MediaPipe .task web path (image input, function calling, etc.). Thinking Mode is not available on Web — MediaPipe web exposes no extraContext hook. The newer web .litertlm path (@litert-lm/core) is an early-preview subset — text-only, no vision/audio/thinking/function-calling. See Web .litertlm support & limitations.

Web Platform Specifics

Authentication

Required for gated models: Gemma3n, Gemma 3 1B/270M, EmbeddingGemma
Configuration: Use FlutterGemma.initialize(huggingFaceToken: '...') or pass token per-download
Storage: Tokens stored in browser memory (not localStorage)

File Handling

Downloads: Creates blob URLs in browser memory (no actual files)
Storage: IndexedDB via WebFileSystemService
FileSource: Only works with HTTP/HTTPS URLs or assets/ paths
Local file paths: ❌ Not supported (browser security restriction)

Web Storage Modes

Three Storage Modes:

1. Cache API Mode (default, WebStorageMode.cacheApi):

Uses browser Cache API with Blob URLs
Models persist across browser restarts
Best for models <2GB

2. Streaming Mode (WebStorageMode.streaming):

Uses OPFS with ReadableStream
Bypasses browser 2GB ArrayBuffer limit
Required for large models (E4B 4GB+, 7B, 27B)
Requires Chrome 86+, Edge 86+, Safari 15.2+

3. Ephemeral Mode (WebStorageMode.none):

Models stored in memory only
Cleared when browser closes
For testing/demos

// Default: Cache API for small models
FlutterGemma.initialize(webStorageMode: WebStorageMode.cacheApi);

// Streaming for large models (>2GB)
FlutterGemma.initialize(webStorageMode: WebStorageMode.streaming);

// Check if streaming is supported
final supported = await FlutterGemma.isStreamingSupported();

Backend Support

GPU only: See PreferredBackend Options table above

CORS Configuration

Required for custom servers: Enable CORS headers on your model hosting server
Firebase Storage: See CORS configuration docs
HuggingFace: CORS already configured correctly

Memory Limitations

Large models: May hit browser memory limits (2GB typical)
Recommended: Use smaller models (1B-2B) for web platform
Best models for web:
- Gemma 3 270M (300MB)
- Gemma 3 1B (500MB-1GB)
- Gemma3n E2B (3GB) - requires 6GB+ device RAM

Browser Cache Storage Limits

Browser	Max Model Size	Notes
Chrome/Firefox	~2 GB	ArrayBuffer limit
Safari	~50 MB	⚠️ Not suitable

Web `.litertlm` support & limitations

Web .litertlm inference (added in 0.16.2) runs Gemma .litertlm models (verified on Gemma 4 E2B/E4B web variants) in the browser through the upstream @litert-lm/core package (WebGPU + WASM). It is an early preview and is intentionally a subset of the native .litertlm path. MediaPipe .task on web is unaffected and remains fully supported.

Works on web .litertlm:

✅ Text generation (sync getResponse() and streaming getResponseAsync())
✅ Multi-turn chat with history (createChat / openChat)
✅ System instruction (via the conversation preface)
✅ Concurrent sessions (openSession) — serialized inference (see Concurrent sessions)
✅ Large models via OPFS streaming (WebStorageMode.streaming) — bypasses Chrome's ~2 GB blob limit
✅ GPU only (WebGPU is required; there is no CPU backend on web)

Not supported on web .litertlm yet (mobile/desktop only):

❌ Vision / image input — @litert-lm/core does not expose the Vision executor config; image inputs are dropped with a debug warning
❌ Audio input — same reason (no Audio executor config in the JS API)
❌ Thinking mode — extraContext thinking channel is not wired on web
❌ Function calling / tool calls — prefill+decode tool models aren't available on the web runtime
❌ LoRA weights — loraPath throws UnsupportedError
⚠️ stopGeneration() — closes the local Dart stream and calls the upstream conversation.cancel() to abort generation; the cancel is best-effort (the early-preview JS API may throw if nothing is in flight, which is swallowed)
⚠️ WebStorageMode.none + model > 2 GB — the engine fetch()es the in-memory blob and trips Chrome's ERR_BLOB_OUT_OF_MEMORY; use WebStorageMode.streaming for large models

These limits track the upstream @litert-lm/core early-preview API and will lift as Google extends the JS executor surface. For full vision / audio / thinking / function calling on web today, use MediaPipe .task web models instead.

Mobile Platform Specifics

Android

GPU Support: Requires the OpenCL native libraries to be declared in AndroidManifest.xml
ProGuard: Automatic rules included for release builds
Storage: Local file system in app documents directory

iOS

Minimum version: iOS 16.0 required for MediaPipe GenAI
Memory entitlements: Required for large models (see Setup section)
Linking: Static linking required (use_frameworks! :linkage => :static)
Storage: Local file system in app documents directory

Desktop Platform Specifics

Storage Locations

Desktop builds store downloaded models outside the user's Documents/ folder to avoid OneDrive / iCloud / Domain-Roaming sync corrupting FFI mmap of large .litertlm files (since 0.15.1):

Windows: %LOCALAPPDATA%\flutter_gemma\ (never OneDrive-synced)
macOS: ~/Library/Application Support/<bundle>/flutter_gemma/
Linux: ~/.local/share/<app>/flutter_gemma/

Models installed by 0.14.x / 0.15.0 builds that still live under Documents/ keep working via a fallback read (a debug log nudges users to re-install once for migration).
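If you need to locate these folders yourself (for example, to show the user where models live), the documented paths can be rebuilt from environment variables. The function below is an illustration only: desktopModelDir is hypothetical, appId stands in for the <bundle> / <app> placeholder, and the plugin's real path resolution may differ.

```dart
/// Reads a required environment variable or fails with a clear message.
String _need(Map<String, String> env, String name) =>
    env[name] ?? (throw StateError('Environment variable $name is not set'));

/// Illustrative only: rebuilds the documented desktop model locations.
/// [os] is expected to be 'windows', 'macos' or 'linux'.
String desktopModelDir({
  required String os,
  required Map<String, String> env,
  required String appId,
}) {
  switch (os) {
    case 'windows':
      return '${_need(env, 'LOCALAPPDATA')}\\flutter_gemma';
    case 'macos':
      return '${_need(env, 'HOME')}/Library/Application Support/$appId/flutter_gemma';
    case 'linux':
      return '${_need(env, 'HOME')}/.local/share/$appId/flutter_gemma';
    default:
      throw UnsupportedError('No desktop storage location documented for $os');
  }
}
```

In a real desktop app the os and env arguments could come from Platform.operatingSystem and Platform.environment in dart:io; passing them in keeps the function easy to test.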

You can find a complete example in the example folder.

Important Considerations

Model Size: Larger models (such as 7b and 7b-it) might be too resource-intensive for on-device inference.
Function Calling Support: Gemma 4, Gemma3n, Gemma 3 1B, FunctionGemma, DeepSeek, Qwen3, Qwen 2.5, and Phi-4 models support function calling. Other models will ignore tools and show a warning. See Model Function Calling Support.
Thinking Mode: Gemma 4, DeepSeek, and Qwen3 models support thinking mode. Enable with isThinking: true on the matching ModelType.
Multimodal Models: Gemma3n models with vision support require more memory and are recommended for devices with 8GB+ RAM.
iOS Memory Requirements: Large models require memory entitlements in Runner.entitlements and minimum iOS 16.0.
LoRA Weights: They provide efficient customization without the need for full model retraining.
Development vs. Production: For production apps, do not embed the model or LoRA weights within your assets. Instead, load them once and store them securely on the device or via a network drive.
Web Models: Currently, Web support is available only for GPU backend models. Multimodal support is fully implemented.
Image Formats: The plugin automatically handles common image formats (JPEG, PNG, etc.) when using Message.withImages() or Message.withImage().

🛟 Troubleshooting

Multimodal Issues:

Ensure you're using a multimodal model (Gemma 4, Gemma3n E2B/E4B, or FastVLM)
Set supportImage: true when creating model and chat
Check device memory - multimodal models require more RAM

Performance:

Use GPU backend for better performance with multimodal models
Consider using CPU backend for text-only models on lower-end devices

Memory Issues:

iOS: Ensure Runner.entitlements contains memory entitlements (see iOS setup)
iOS: Set minimum platform to iOS 16.0 in Podfile
Reduce maxTokens if experiencing memory issues
Use smaller models (1B-2B parameters) for devices with <6GB RAM
Close sessions and models when not needed
Monitor token usage with sizeInTokens()
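
The RAM guidance in this document (smaller 1B-2B models below 6GB, multimodal models recommended at 8GB+) can be encoded as a simple lookup. The sketch below is only an illustration: ModelTier and pickTier are hypothetical names, the middle tier for 6GB to 8GB devices is an assumption, and obtaining the device's RAM is left to the caller.

```dart
/// Hypothetical tiers for this example; not a flutter_gemma type.
enum ModelTier { smallTextOnly, textOnly, multimodal }

/// Maps total device RAM (in GB) to a model tier using the thresholds
/// from this document. The 6GB to 8GB middle tier is an assumption.
ModelTier pickTier(double ramGb) {
  if (ramGb < 6) return ModelTier.smallTextOnly; // 1B-2B parameter models
  if (ramGb < 8) return ModelTier.textOnly; // assumed: larger text-only models
  return ModelTier.multimodal; // vision models, 8GB+ recommended
}
```

An app could run such a check once at startup and use the result to decide which model to offer for download.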

iOS Build Issues:

Ensure minimum iOS version is set to 16.0 in Podfile
Use static linking: use_frameworks! :linkage => :static
Clean and reinstall pods: cd ios && pod install --repo-update
Check that all required entitlements are in Runner.entitlements

Advanced Usage

ModelThinkingFilter (Advanced)

For advanced users who need to manually process model responses, the ModelThinkingFilter class provides utilities for cleaning model outputs:

import 'package:flutter_gemma/core/extensions.dart';

// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse,
  ModelType.deepSeek
);

// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - <think>...</think> blocks (DeepSeek)
// - <|channel>thought\n...<channel|> blocks (Gemma 4 E2B/E4B)
// - Extra whitespace and formatting

This is automatically handled by the chat API, but can be useful for custom inference implementations.
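
To make the kind of cleanup described above concrete, here is a minimal pure-Dart sketch that strips <think>...</think> blocks and <end_of_turn> tags from a complete response string. It is not the ModelThinkingFilter implementation; stripThinking is a hypothetical name, and the sketch covers only two of the token types listed above.

```dart
/// Hypothetical stand-in for illustration only; NOT the plugin's
/// ModelThinkingFilter. Removes DeepSeek-style thinking blocks and
/// Gemma-style end-of-turn tags, then trims surrounding whitespace.
String stripThinking(String raw) {
  // dotAll lets '.' match newlines, so multi-line blocks are removed;
  // '.*?' is non-greedy, so each block is matched separately.
  final thinkBlock = RegExp(r'<think>.*?</think>', dotAll: true);
  return raw
      .replaceAll(thinkBlock, '')
      .replaceAll('<end_of_turn>', '')
      .trim();
}
```

This works on a full response. In a streaming setup a thinking block may be split across chunks, so a custom implementation would need to buffer text until the closing tag arrives before filtering.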

☕ Support the Project

If you find Flutter Gemma useful and want to support its development, consider buying me a coffee! Your support helps me:

🔧 Maintain and improve the plugin
📚 Keep documentation up-to-date
🐛 Fix bugs and resolve issues faster
✨ Add new features and model support
🧪 Test on more devices and platforms

Every contribution, no matter how small, makes a difference. Thank you for your support! 💙

Libraries

core/api/embedding_installation_builder
core/api/flutter_gemma
core/api/inference_installation_builder
core/chat
core/chat_event
core/di/platform/mobile_service_factory: Mobile-platform service factory. This file is only compiled on iOS/Android platforms. Uses background_downloader for model downloads.
core/di/platform/web_service_factory: Web-platform service factory. This file is only compiled on web platform. Uses dart:js_interop for browser-based downloads.
core/di/service_registry
core/domain/cache_metadata
core/domain/download_error
core/domain/download_exception
core/domain/model_source
core/domain/web_storage_mode
core/extensions
core/function_call_parser
core/handlers/asset_source_handler
core/handlers/bundled_source_handler
core/handlers/file_source_handler
core/handlers/network_source_handler
core/handlers/source_handler
core/handlers/source_handler_registry
core/handlers/web_asset_source_handler
core/handlers/web_asset_source_handler_stub: Stub implementation for non-web platforms. This file is used when dart:js_interop is not available.
core/handlers/web_bundled_source_handler
core/handlers/web_bundled_source_handler_stub
core/handlers/web_file_source_handler
core/handlers/web_file_source_handler_stub
core/handlers/web_network_source_handler
core/handlers/web_network_source_handler_stub
core/image_error_handler
core/image_processor
core/image_tokenizer
core/infrastructure/background_downloader_service
core/infrastructure/blob_url_manager
core/infrastructure/blob_url_manager_stub
core/infrastructure/flutter_asset_loader
core/infrastructure/flutter_asset_loader_stub: Stub implementation for platforms where dart:io is not available (web). This file is used when large_file_handler cannot be imported.
core/infrastructure/in_memory_model_repository
core/infrastructure/platform_file_system_service
core/infrastructure/shared_preferences_model_repository
core/infrastructure/shared_preferences_protected_registry
core/infrastructure/unconfigured_vector_store
core/infrastructure/url_utils
core/infrastructure/web_cache_interop: JavaScript interop for Cache API
core/infrastructure/web_cache_interop_stub: Stub implementation for non-web platforms
core/infrastructure/web_cache_service: Web cache service for persistent model storage
core/infrastructure/web_cache_service_stub
core/infrastructure/web_download_service
core/infrastructure/web_download_service_stub
core/infrastructure/web_file_system_service
core/infrastructure/web_js_interop
core/infrastructure/web_js_interop_stub
core/infrastructure/web_opfs_interop: JavaScript interop for OPFS (Origin Private File System)
core/infrastructure/web_opfs_interop_stub: Stub for OPFS interop on non-web platforms
core/infrastructure/web_opfs_service: Dart service wrapper for OPFS (Origin Private File System)
core/lifecycle/close_notifier
core/message
core/migration/legacy_preferences_migrator
core/model
core/model_management/cancel_token
core/model_management/constants/preferences_keys
core/model_management/managers/web_model_manager
core/model_response
core/multimodal_image_handler
core/parsing/deepseek_function_call_format
core/parsing/function_call_format
core/parsing/function_call_format_factory
core/parsing/function_gemma_format
core/parsing/json_function_call_format
core/parsing/json_parsing_utils
core/parsing/llama_function_call_format
core/parsing/phi_function_call_format
core/parsing/qwen_function_call_format
core/parsing/sdk_passthrough_function_call_format
core/parsing/sdk_response_parser
core/parsing/sdk_text_extractor
core/registry/embedding_backend_provider
core/registry/embedding_registry
core/registry/engine_registry
core/registry/inference_engine_provider
core/registry/runtime_config
core/services/asset_loader
core/services/download_service
core/services/file_system_service
core/services/model_repository
core/services/protected_files_registry
core/services/vector_store_filter
core/services/vector_store_repository
core/tool
core/utils/file_name_utils
core/utils/gemma_log
core/vision_encoder_validator
desktop/flutter_gemma_desktop
desktop/flutter_gemma_desktop_stub
flutter_gemma
flutter_gemma_interface
mobile/flutter_gemma_mobile
mobile/smart_downloader
model_file_manager_interface
pigeon.g
rag/embedding_models
web/flutter_gemma_web
web/web_image_format
web/web_model_source
