Flutter Gemma

The plugin supports not only Gemma, but also other models. Here's the full list of supported models: Gemma 4 E2B/E4B, Gemma3n E2B/E4B, FastVLM 0.5B, Gemma-3 1B, Gemma 3 270M, FunctionGemma 270M, Qwen3 0.6B, Qwen 2.5, Phi-4 Mini, DeepSeek R1, SmolLM 135M.

Note: The flutter_gemma plugin supports Gemma 4 and Gemma3n (with multimodal vision and audio support), FastVLM (vision), Gemma-3, FunctionGemma, Qwen3, Qwen 2.5, Phi-4, DeepSeek R1 and SmolLM. Desktop platforms (macOS, Windows, Linux) require the .litertlm model format.

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

Bring the power of Google's lightweight Gemma language models and other on-device LLMs directly to your Flutter applications. With Flutter Gemma, you can seamlessly incorporate advanced AI capabilities into your Flutter applications, all without relying on external servers.

Here is an example of the plugin in use:

[Animation: gemma_github_gif]

Features

  • Local Execution: Run Gemma and other LLMs (Qwen, DeepSeek, Phi, FastVLM, SmolLM, …) directly on user devices for enhanced privacy and offline functionality.
  • Platform Support: Compatible with iOS, Android, Web, macOS, Windows, and Linux platforms.
  • 🖥️ Desktop Support: Native desktop apps (macOS, Windows, Linux) with GPU acceleration via LiteRT-LM, called directly from Dart through dart:ffi, with no JVM/JRE bundling. See DESKTOP_SUPPORT.md for details.
  • 🖼️ Multimodal Support: Text + Image input with Gemma 4, Gemma3n, and FastVLM vision models
  • 🎙️ Audio Input: Record and send audio messages with Gemma3n E2B/E4B models (Android, iOS device, Desktop)
  • 🛠️ Function Calling: Enable your models to call external functions and integrate with other services (supported by select models)
  • 🧠 Thinking Mode: View the reasoning process of DeepSeek and Gemma 4 models with thinking blocks
  • 🛑 Stop Generation: Cancel text generation mid-process on Android, iOS, Web, and Desktop
  • ⚙️ Backend Switching: Choose between CPU and GPU backends for each model individually in the example app
  • 🔍 Advanced Model Filtering: Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
  • 📊 Model Sorting: Sort models alphabetically, by size, or use default order in the example app
  • LoRA Support: Efficient fine-tuning and integration of LoRA (Low-Rank Adaptation) weights for tailored AI behavior.
  • 📥 Enhanced Downloads: Smart retry logic with exponential backoff for reliable model downloads
  • 🔧 Download Reliability: Automatic restart logic for interrupted downloads (resume not supported by HuggingFace CDN)
  • 📱 Android Foreground Service: Large downloads (>500MB) automatically use foreground service to bypass 9-minute timeout
  • 🔧 Model Replace Policy: Configurable model replacement system (keep/replace) with automatic model switching
  • 📊 Text Embeddings: Generate vector embeddings from text using EmbeddingGemma and Gecko models
  • 🔎 On-device RAG: qdrant-edge vector store on native, wa-sqlite on Web. Payload-aware Filter (must / should / mustNot) for semantic search.
  • 🔧 Unified Model Management: Single system for managing both inference and embedding models with automatic validation
  • 💾 Web Persistent Caching: Models persist across browser restarts using Cache API (Web only)

What's new in 1.0

  • 📦 Modular package split: the monolith is now a small core (flutter_gemma) plus opt-in packages, so your app ships only the native weight it uses: flutter_gemma_litertlm (.litertlm), flutter_gemma_mediapipe (.task/.bin), flutter_gemma_embeddings, flutter_gemma_rag_qdrant, flutter_gemma_rag_sqlite.
  • 🔧 New FlutterGemma.initialize(...) registration: pass inferenceEngines, embeddingBackends, vectorStore for the packages you added. See Initialize Flutter Gemma.
  • ✅ Every model / session / chat / embedding / RAG API is unchanged: migrating is just adding packages + the initialize call. See MIGRATION.md.
  • 🧹 Legacy sqlite+local_hnsw vector store removed: native RAG runs on qdrant-edge (flutter_gemma_rag_qdrant); web on wa-sqlite (flutter_gemma_rag_sqlite).

See CHANGELOG.md for the full release history.

Model File Types

Flutter Gemma supports different model file formats, which are grouped into two types based on how chat templates are handled:

Type 1: MediaPipe-Managed Templates

  • .task files: MediaPipe-optimized format for mobile (Android/iOS)
  • .litertlm files: LiteRT-LM format for Android, iOS, and Desktop platforms

Both formats behave identically: MediaPipe handles chat templates internally.

Type 2: Manual Template Formatting

  • .bin files: Standard binary format
  • .tflite files: LiteRT format (formerly TensorFlow Lite)

Both formats require manual chat template formatting in your code.
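As a rough illustration of what manual chat template formatting means, the sketch below builds a prompt in the Gemma turn format (`<start_of_turn>` / `<end_of_turn>`). The `Turn` class and `formatGemmaPrompt` function are hypothetical helpers written for this example, not part of the plugin, and other model families use different templates.

```dart
// Hypothetical helper types for illustration; not part of flutter_gemma.
class Turn {
  const Turn(this.isUser, this.text);
  final bool isUser;
  final String text;
}

/// Renders a conversation in the Gemma turn format and leaves the prompt
/// open on a model turn so the model continues from there.
String formatGemmaPrompt(List<Turn> turns) {
  final buffer = StringBuffer();
  for (final turn in turns) {
    final role = turn.isUser ? 'user' : 'model';
    buffer.write('<start_of_turn>$role\n${turn.text}<end_of_turn>\n');
  }
  buffer.write('<start_of_turn>model\n');
  return buffer.toString();
}

void main() {
  print(formatGemmaPrompt(const [Turn(true, 'Explain quantum computing')]));
}
```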

Note: The plugin automatically detects the file extension and applies appropriate formatting. When specifying ModelFileType in your code:

  • Use ModelFileType.task for .task and .litertlm files (same behavior)
  • Use ModelFileType.binary for .bin and .tflite files (same behavior)
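The extension-to-type grouping above can be sketched in plain Dart. `SketchFileType` and `fileTypeFor` are hypothetical names that mirror the two groups; the plugin's own detection logic may differ.

```dart
// Local stand-in for the plugin's ModelFileType so the sketch runs
// without the package. Hypothetical, for illustration only.
enum SketchFileType { task, binary }

/// Maps a model file name to the template-handling group it belongs to.
SketchFileType fileTypeFor(String fileName) {
  final lower = fileName.toLowerCase();
  if (lower.endsWith('.task') || lower.endsWith('.litertlm')) {
    return SketchFileType.task; // chat templates handled by the runtime
  }
  if (lower.endsWith('.bin') || lower.endsWith('.tflite')) {
    return SketchFileType.binary; // manual chat template formatting
  }
  throw ArgumentError.value(
      fileName, 'fileName', 'Unsupported model file extension');
}

void main() {
  print(fileTypeFor('model.litertlm')); // SketchFileType.task
  print(fileTypeFor('model.bin')); // SketchFileType.binary
}
```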

Format by Platform

| Format | Android | iOS | Web | Desktop | Use Case |
| --- | --- | --- | --- | --- | --- |
| .task | ✅ | ✅ | ✅ | ❌ | Older models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4) |
| .litertlm | ✅ | ✅ ¹ | ❌ | ✅ | Newer models (Gemma 4, Qwen3, FastVLM + desktop for all) |
| -web.task | ❌ | ❌ | ✅ | ❌ | Web-specific builds (e.g. Gemma 4, Gemma3n) |
| .bin | ✅ | ✅ | ✅ | ❌ | Manual chat template formatting required |
| .tflite | ✅ | ✅ | ✅ | ✅ | Embeddings only (EmbeddingGemma, Gecko) |

¹ iOS .litertlm runs on the FFI engine; vision and audio are supported on physical devices. The Simulator stays CPU-only because Metal on the Simulator has a 256 MB single-allocation cap.

Model Capabilities

The example app offers a curated list of models, each suited for different tasks. Here's a breakdown of the models available and their capabilities:

| Model Family | Best For | Function Calling | Thinking Mode | Vision | Languages | Size |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 E2B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 2.4GB |
| Gemma 4 E4B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 4.3GB |
| Gemma3n | On-device multimodal chat and image analysis | ✅ | ❌ | ✅ | Multilingual | 3-6GB |
| FastVLM 0.5B | Fast vision-language inference | ❌ | ❌ | ✅ | Multilingual | 0.5GB |
| Phi-4 Mini | Advanced reasoning and instruction following | ✅ | ❌ | ❌ | Multilingual | 3.9GB |
| DeepSeek R1 | High-performance reasoning and code generation | ✅ | ✅ | ❌ | Multilingual | 1.7GB |
| Qwen3 0.6B | Compact multilingual chat with function calling | ✅ | ✅ | ❌ | Multilingual | 586MB |
| Qwen 2.5 | Strong multilingual chat and instruction following | ✅ | ❌ | ❌ | Multilingual | 0.5-1.6GB |
| Gemma 3 1B | Balanced and efficient text generation | ✅ | ❌ | ❌ | Multilingual | 0.5GB |
| Gemma 3 270M | Ideal for fine-tuning (LoRA) for specific tasks | ❌ | ❌ | ❌ | Multilingual | 0.3GB |
| FunctionGemma 270M | Specialized for function calling on-device | ✅ | ❌ | ❌ | Multilingual | 284MB |
| SmolLM 135M | Ultra-compact, resource-constrained devices | ❌ | ❌ | ❌ | English | 135MB |
| TranslateGemma 4B † | Single-shot 55-language translation | ❌ | ❌ | ❌ | 55 languages | 2-4GB |

† TranslateGemma is CPU-only for now. Google hasn't released a mobile/desktop .litertlm bundle (HF discussion #5: "no concrete plans"). The example app uses the community-converted bundle from barakplasma/translategemma-4b-it-android-task-quantized, which keeps EMBEDDING_LOOKUP weights in float32 for MediaPipe .task compatibility. That layout crashes the LiteRT GPU partitioner on Metal/WebGPU across all platforms; this is tracked upstream at LiteRT-LM#1748. Until Google ships the litert-lm quantization CLI, translation runs on CPU only (≈90 s prefill on a 4 B int4 bundle on M-series Macs).

ModelType Reference

When installing models, you need to specify the correct ModelType. Use this table to find the right type for your model:

| Model Family | ModelType | Examples |
| --- | --- | --- |
| Gemma 4 | ModelType.gemma4 | Gemma 4 E2B, Gemma 4 E4B (native function-call tokens) |
| Gemma 3 / Gemma3n | ModelType.gemmaIt | Gemma 3 1B, Gemma 3 270M, Gemma3n E2B/E4B |
| DeepSeek | ModelType.deepSeek | DeepSeek R1 |
| Qwen 2.5 | ModelType.qwen | Qwen 2.5 1.5B, Qwen 2.5 0.5B |
| Qwen 3 | ModelType.qwen3 | Qwen3 0.6B |
| FunctionGemma | ModelType.functionGemma | FunctionGemma 270M IT |
| Phi | ModelType.phi | Phi-4 Mini |
| General | ModelType.general | FastVLM 0.5B, SmolLM 135M |

Note: Gemma 4 uses ModelType.gemma4 so its native <|tool_call>...<tool_call|> tokens are routed through the LiteRT-LM SDK's chat-template path. For Gemma 3 and earlier, keep ModelType.gemmaIt.

Usage Example:

// Gemma models
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// DeepSeek models
await FlutterGemma.installModel(modelType: ModelType.deepSeek)
  .fromNetwork(url).install();

// Phi-4 (uses general type)
await FlutterGemma.installModel(modelType: ModelType.general)
  .fromNetwork(url).install();

Installation

As of 1.0, flutter_gemma is split into a small core package plus opt-in packages for each engine / backend, so your app only pulls the native weight it actually uses. Add the core package, then the packages for the model formats and features you need.

  1. Add the core package and the opt-in packages you need to pubspec.yaml:

    dependencies:
      flutter_gemma: latest_version              # Core - always required (no engine on its own)
    
      # Inference engines - add at least one:
      flutter_gemma_litertlm: latest_version     # .litertlm models (FFI; mobile + desktop + web)
      flutter_gemma_mediapipe: latest_version    # .task / .bin models (MediaPipe; mobile + web)
    
      # Optional - text embeddings + on-device RAG:
      flutter_gemma_embeddings: latest_version   # text embeddings (EmbeddingGemma / Gecko)
      flutter_gemma_rag_qdrant: latest_version   # RAG vector store (native: qdrant-edge)
      flutter_gemma_rag_sqlite: latest_version   # RAG vector store (web: wa-sqlite; native: sqlite3)
    

    Pick by need:

    | You want to… | Add |
    | --- | --- |
    | Run .litertlm models (Gemma 4, Qwen3, FastVLM, + all desktop) | flutter_gemma_litertlm |
    | Run .task / .bin models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4) | flutter_gemma_mediapipe |
    | Generate text embeddings | flutter_gemma_embeddings |
    | On-device RAG on native (Android/iOS/desktop) | flutter_gemma_rag_qdrant |
    | On-device RAG on web | flutter_gemma_rag_sqlite |

    Core registers no engine by itself; you wire the packages you added in FlutterGemma.initialize(...) (see Initialize Flutter Gemma).

  2. Run flutter pub get to install.

Migrating from 0.16.x (monolith)? See MIGRATION.md. The only breaking change is adding the opt-in packages and the initialize(...) call; every model / session / RAG API is unchanged.

Platform & Architecture Support

The plugin ships native prebuilts only for the architectures below. Other ABIs fail at native load with a typed error.

| Platform | Supported architecture | Not supported |
| --- | --- | --- |
| Android | arm64-v8a (full) | armeabi-v7a, x86_64 ¹ |
| iOS device | arm64 | - |
| iOS Simulator | arm64 (Apple Silicon Mac) | x86_64 (Intel Mac) |
| macOS | arm64 (Apple Silicon) | x86_64 (Intel Mac) |
| Linux | x86_64, arm64 | - |
| Windows | x86_64 | arm64 |

¹ MediaPipe text inference (.task / .bin) on Android also works on x86_64 and armeabi-v7a because Google ships those ABIs in tasks-genai. Everything else (.litertlm FFI, embedding via LiteRT FFI, image generation) is arm64-v8a only:

| Android feature | arm64-v8a | x86_64 | armeabi-v7a |
| --- | --- | --- | --- |
| Text inference (.task / .bin) | ✅ | ✅ | ✅ |
| .litertlm (FFI) | ✅ | ❌ | ❌ |
| Embedding (LiteRT FFI) | ✅ | ❌ | ❌ |
| Image generation (vision) | ✅ | ❌ | ❌ |

If your Android app uses only the arm64-only features, restrict the build to arm64 so the Play Store does not offer broken APKs to incompatible devices:

android {
    defaultConfig {
        ndk { abiFilters 'arm64-v8a' }
    }
}

For development, prefer an Apple Silicon Mac: the Android emulator runs arm64-v8a natively, and macOS / iOS Simulator builds are arm64.

Setup

⚠️ Important: Complete platform-specific setup before using the plugin.

  1. Download Model and optionally LoRA Weights: Obtain a model from the Supported Models section or HuggingFace
  2. Platform-specific setup:

iOS (required by any inference engine package: flutter_gemma_litertlm and/or flutter_gemma_mediapipe)

  • Set minimum iOS version in Podfile:
platform :ios, '16.0'  # Required for MediaPipe GenAI
  • Enable file sharing in Info.plist:
<key>UIFileSharingEnabled</key>
<true/>
  • Add network access description in Info.plist (for development):
<key>NSLocalNetworkUsageDescription</key>
<string>This app requires local network access for model inference services.</string>
  • Enable performance optimization in Info.plist (optional):
<key>CADisableMinimumFrameDurationOnPhone</key>
<true/>
  • Add memory entitlements in Runner.entitlements (for large models):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.developer.kernel.extended-virtual-addressing</key>
	<true/>
	<key>com.apple.developer.kernel.increased-memory-limit</key>
	<true/>
	<key>com.apple.developer.kernel.increased-debugging-memory-limit</key>
	<true/>
</dict>
</plist>
  • Change the linking type of pods to static in Podfile:
use_frameworks! :linkage => :static

No host-side Podfile post_install is required on iOS. flutter_gemma patches the upstream LiteRT-LM dlopen path to use @executable_path/Frameworks/<X>.framework/<X> so dyld resolves Metal accelerators directly through the Native-Assets-bundled framework. This also keeps Runner.app/Frameworks/ App-Store-clean (fixes ITMS-90432, see #245).

Android

  • GPU (any engine): if you want to run on the GPU, add OpenCL support to the manifest. Required by both inference engines (flutter_gemma_litertlm and flutter_gemma_mediapipe). CPU-only? Skip this step.

Add to AndroidManifest.xml, just above the closing </application> tag:

 <uses-native-library
     android:name="libOpenCL.so"
     android:required="false"/>
 <uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
 <uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
  • ProGuard/R8 (only if you use flutter_gemma_mediapipe): the package ships its own consumer ProGuard rules, so release builds work out of the box. If you still hit UnsatisfiedLinkError / missing MediaPipe classes, add to your proguard-rules.pro:
# MediaPipe
-keep class com.google.mediapipe.** { *; }
-dontwarn com.google.mediapipe.**

# Protocol Buffers
-keep class com.google.protobuf.** { *; }
-dontwarn com.google.protobuf.**

flutter_gemma_litertlm is delivered as a Native-Assets dylib (no MediaPipe Java classes), so it needs no ProGuard rules.

Web

Web runs on the GPU backend only (MediaPipe has no web CPU backend). Add the CDN script(s) for the engine package(s) you use to your web/index.html.

  • flutter_gemma_mediapipe (.task / -web.task models): add:
  <script type="module">
  import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27';
  window.FilesetResolver = FilesetResolver;
  window.LlmInference = LlmInference;
  </script>
  • flutter_gemma_litertlm (.litertlm web models, early preview): add the handshake below. The @litert-lm/core ESM doesn't assign window globals and module scripts are deferred, so Dart must await window.litertLmReady (which resolves to the Engine constructor) before any static interop:
  <script type="module">
  window.litertLmReady = (async () => {
    const m = await import('https://cdn.jsdelivr.net/npm/@litert-lm/core@0.12.1/+esm');
    window.Engine = m.Engine;
    return m.Engine;
  })();
  </script>
  • flutter_gemma_rag_sqlite (web RAG): add the wa-sqlite loader; see that package's README for the exact <script> + Subresource-Integrity hash.

Model compatibility: mobile .task models often don't work on web. Use the -web.task (MediaPipe) or .litertlm (LiteRT-LM) web variant, and check the model repo for web-compatible builds.

Desktop (macOS, Windows, Linux): requires flutter_gemma_litertlm

⚠️ Desktop Model Format

Desktop is served exclusively by the flutter_gemma_litertlm package and uses LiteRT-LM format only (.litertlm files). There is no MediaPipe engine on desktop, so .task / .bin models used on mobile/web are NOT compatible with desktop. (flutter_gemma_embeddings and flutter_gemma_rag_qdrant / flutter_gemma_rag_sqlite also support desktop.)

The native library is fetched at build time by the package's Native-Assets hook, with no manual download/bundling. The setup below applies to flutter_gemma_litertlm.

Inference (LiteRT-LM C API) and embeddings (LiteRT C API) on all native platforms run via dart:ffi directly in the Dart process: no JVM, no gRPC, no separate server. Native libraries are downloaded by hook/build.dart (Native Assets) at build time and bundled into the app automatically.

| Platform | Architecture | GPU Acceleration | Status |
| --- | --- | --- | --- |
| macOS | arm64 (Apple Silicon) | Metal | ✅ Ready |
| macOS | x86_64 (Intel) | - | ❌ Not Supported |
| Windows | x86_64 | DirectX 12 | ✅ Ready |
| Windows | arm64 | - | ❌ Not Supported |
| Linux | x86_64 | Vulkan | ✅ Ready ¹ |
| Linux | arm64 | Vulkan | ✅ Ready ¹ |

¹ Linux GPU requires a proper vendor Vulkan driver (NVIDIA / AMD / Intel). Mesa's llvmpipe software fallback is not sufficient for Gemma 4: its hardcoded 128 MB maxStorageBufferRange is below the model's per-buffer requirement. Install the vendor driver (e.g. nvidia-driver-535-server on Ubuntu) before running on GPU.

macOS Setup:

macOS requires a small post_install block in your macos/Podfile. The Apple accelerator dylibs Google ships upstream (libGemmaModelConstraintProvider.dylib, libLiteRtMetalAccelerator.dylib, libLiteRtTopKMetalSampler.dylib) were linked without -Wl,-headerpad_max_install_names, so Dart Native Assets' JIT bundling path (used by dart run / dart build_runner / flutter test on a pure Dart library) cannot rewrite their install_name to a long absolute path inside .dart_tool/lib/ and aborts (#247). To unblock both dart run and flutter build macos, the plugin's hook/build.dart skips bundling those three through Native Assets on macOS, and we instead copy them into App.app/Contents/Frameworks/ ourselves and patch LiteRtLm.dylib's LC_LOAD_DYLIB reference to the new framework path.

Paste this into your macos/Podfile (replacing any existing post_install block) and run pod install:

post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
  end

  # flutter_gemma: bundle Apple accelerator dylibs as .framework bundles
  # into Contents/Frameworks/ and re-point LiteRtLm.dylib's LC_LOAD_DYLIB
  # reference to GemmaModelConstraintProvider's new path. 3-tier dylib
  # source fallback: Native Assets cache (pub.dev users) -> plugin symlink
  # -> in-repo prebuilt/. See README -> macOS Setup and #247/#255.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      phase_name = '[flutter_gemma] Setup LiteRT-LM macOS'

      # Only the app target embeds the Frameworks/ this phase patches.
      # RunnerTests inherits Runner's framework search paths and has no
      # Contents/Frameworks of its own; having the phase there creates a
      # cross-target dependency on Runner's framework output that Xcode
      # reports as "Cycle inside Flutter Assemble". Remove any stale copy
      # from non-app targets and skip them.
      unless user_target.name == 'Runner'
        user_target.build_phases
          .select { |p| p.respond_to?(:name) && p.name == phase_name }
          .each { |p| user_target.build_phases.delete(p) }
        next
      end

      existing = user_target.shell_script_build_phases.find { |p| p.name == phase_name }
      phase = existing || user_target.new_shell_script_build_phase(phase_name)
      # Declare a sentinel output so Xcode can order this phase in the
      # dependency graph instead of treating it as "runs every build" with
      # no outputs (the other half of the cycle warning). The script
      # `touch`es this file at the end.
      phase.output_paths = ['$(DERIVED_FILE_DIR)/flutter_gemma_litertlm_macos.stamp']
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Contents/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          exit 0
        fi
        # Sweep any leftover lib*.dylib symlinks from older flutter_gemma versions.
        for base in LiteRtMetalAccelerator LiteRtTopKMetalSampler GemmaModelConstraintProvider; do
          rm -f "${FRAMEWORKS}/lib${base}.dylib"
        done
        # Wrap each upstream dylib into a .framework bundle inside the app's
        # Contents/Frameworks/ so dlopen("@executable_path/../Frameworks/<X>.framework/<X>")
        # (the path the patched gpu_registry.cc uses) resolves at runtime.
        # Resolve dylib source: Native Assets cache (pub.dev), then path-dep fallbacks.
        for candidate in \
            "${HOME}/Library/Caches/flutter_gemma/native/macos_arm64" \
            "${PODS_ROOT}/../Flutter/ephemeral/.symlinks/plugins/flutter_gemma/native/litert_lm/prebuilt/macos_arm64" \
            "${SRCROOT}/../../native/litert_lm/prebuilt/macos_arm64"; do
          if [ -f "${candidate}/libGemmaModelConstraintProvider.dylib" ]; then
            PLUGIN_PREBUILT="${candidate}"
            break
          fi
        done
        if [ -z "${PLUGIN_PREBUILT:-}" ]; then
          echo "[flutter_gemma] ERROR: macOS companion dylibs not found. Run 'flutter clean && flutter pub get'."
          exit 1
        fi
        for base in GemmaModelConstraintProvider LiteRtMetalAccelerator LiteRtTopKMetalSampler; do
          src="${PLUGIN_PREBUILT}/lib${base}.dylib"
          if [ ! -f "${src}" ]; then
            echo "[flutter_gemma] WARNING: ${src} not found - runtime dlopen will fail"
            continue
          fi
          fw_dir="${FRAMEWORKS}/${base}.framework"
          mkdir -p "${fw_dir}/Versions/A/Resources"
          cp "${src}" "${fw_dir}/Versions/A/${base}"
          install_name_tool -id "@rpath/${base}.framework/Versions/A/${base}" \\
            "${fw_dir}/Versions/A/${base}" 2>/dev/null || true
          (cd "${fw_dir}" && ln -sfh A Versions/Current && ln -sfh "Versions/Current/${base}" "${base}" && ln -sfh "Versions/Current/Resources" Resources)
          cat > "${fw_dir}/Versions/A/Resources/Info.plist" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>CFBundleExecutable</key><string>${base}</string>
  <key>CFBundleIdentifier</key><string>dev.flutterberlin.flutter_gemma.${base}</string>
  <key>CFBundleVersion</key><string>1</string>
  <key>CFBundleShortVersionString</key><string>1.0</string>
  <key>CFBundlePackageType</key><string>FMWK</string>
</dict>
</plist>
EOF
        done
        # Re-point LiteRtLm.dylib's LC_LOAD_DYLIB at the new framework path.
        LITERTLM="${FRAMEWORKS}/LiteRtLm.framework/Versions/A/LiteRtLm"
        if [ -f "${LITERTLM}" ]; then
          install_name_tool -change \\
            @rpath/libGemmaModelConstraintProvider.dylib \\
            @rpath/GemmaModelConstraintProvider.framework/Versions/A/GemmaModelConstraintProvider \\
            "${LITERTLM}" 2>/dev/null || true
          codesign --force --sign - "${LITERTLM}" 2>/dev/null || true
        fi
        # Write the declared output so Xcode marks the phase up-to-date and
        # orders it deterministically (avoids the Flutter Assemble cycle).
        mkdir -p "$(dirname "${SCRIPT_OUTPUT_FILE_0}")"
        touch "${SCRIPT_OUTPUT_FILE_0}"
      SHELL
    end
  end
end

Add to macos/Runner/DebugProfile.entitlements and Release.entitlements:

<key>com.apple.security.cs.disable-library-validation</key>
<true/>

Windows Setup:

No additional configuration required. hook/build.dart (Native Assets) downloads LiteRtLm.dll + companion DLLs + the DXC runtime (dxil.dll, dxcompiler.dll v1.9.2602) from the GitHub release on first build, verifies them via SHA256, and bundles them next to your app.exe. End users need the Microsoft Visual C++ Redistributable 2019+; most modern Windows 10/11 systems already have it.

Linux Setup:

No additional configuration required. Build dependencies:

sudo apt install clang cmake ninja-build libgtk-3-dev lld

For GPU acceleration, install the vendor Vulkan driver (NVIDIA / AMD / Intel) in addition to the Vulkan loader. Mesa's llvmpipe software fallback caps maxStorageBufferRange at 128 MB and Gemma 4 will not run on it.

sudo apt install vulkan-tools libvulkan1
# Plus your vendor driver, e.g. NVIDIA:
sudo apt install nvidia-driver-535-server

📚 Full Desktop Documentation →

Quick Start

⚠️ Important: Complete platform setup before running this code.

1. Install a Model (One Time)

import 'package:flutter_gemma/flutter_gemma.dart';

// Install model. URL example uses the .litertlm variant so the same code
// works on Desktop (Windows/macOS/Linux) and mobile/web. For web only, the
// `.task`/`-web.task` variants of the same model also work.
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
).fromNetwork(
  'https://huggingface.co/litert-community/Gemma3-1B-IT/resolve/main/Gemma3-1B-IT_multi-prefill-seq_q4_ekv4096.litertlm',
  token: 'your_hf_token',
).withProgress((progress) {
  print('Downloading: $progress%');
}).install();

Mobile/Web shortcut: if you don't target Desktop, you can substitute the URL with the .task build of the same model. Desktop targets need the .litertlm build, because .task and .bin are MediaPipe-only.

2. Create and Use Model (Multiple Times)

// Create model with specific configuration
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

// Use model
final chat = await model.createChat();
await chat.addQueryChunk(Message.text(
  text: 'Explain quantum computing',
  isUser: true,
));
final response = await chat.generateChatResponse();

// Cleanup
await model.close();

System Instructions

Control model behavior with a system-level instruction:

final chat = await model.createChat(
  systemInstruction: 'You are a concise assistant. Always respond in bullet points.',
);

Platform support:

  • Android .litertlm / Desktop: Passed natively via ConversationConfig.systemInstruction
  • Android .task / iOS / Web: Prepended to first user message as fallback

3. Multiple Instances from Same Model

// Install once
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// Create multiple instances
final quickModel = await FlutterGemma.getActiveModel(maxTokens: 512);
final deepModel = await FlutterGemma.getActiveModel(maxTokens: 4096);
// Both use the SAME model file!

Concurrent sessions (openSession)

A single loaded model can serve several independent dialogues at once. openSession() returns a session with its own conversation history, detached from the legacy model.session singleton; openChat() is the same for the higher-level chat API.

Why use it. The model weights (the big, expensive part, hundreds of MB to several GB) are loaded once and shared across every session; each session only adds its own lightweight conversation context. Without openSession, serving two independent conversations would mean either loading the model twice (doubling the weight memory) or constantly clearing and rebuilding one session's history when you switch between them.

When you'd reach for it:

  • Multiple chats in one app: e.g. a tabbed chat UI where each tab keeps its own thread, all backed by one loaded model.
  • Different roles / system instructions side by side: one session with a "translator" system instruction, another as a "code reviewer", without reloading weights between them.
  • Background work alongside an active chat: e.g. summarizing or tagging a document in one session while the user keeps chatting in another (they take turns on the accelerator; see the serialization note below).
  • A/B prompt comparison: run the same model with two different setups and compare, sharing the loaded weights.

If you only ever have one conversation at a time, stick with the simpler createSession() / createChat() singleton API; you don't need this.

final model = await FlutterGemma.getActiveModel(maxTokens: 1024);

final chatA = await model.openChat(); // independent context A
final chatB = await model.openChat(); // independent context B

await chatA.addQueryChunk(Message(text: 'My name is Alice.', isUser: true));
await chatA.generateChatResponse();

await chatB.addQueryChunk(Message(text: 'My name is Bob.', isUser: true));
await chatB.generateChatResponse();

// Each remembers only its own context.
await chatA.addQueryChunk(Message(text: 'What is my name?', isUser: true));
print(await chatA.generateChatResponse()); // "Alice"

model.sessions;        // all live sessions (legacy + open)
await chatA.session.close();  // closing one leaves the others usable

⚠️ Concurrent contexts, serialized inference. The sessions are logically independent, but only one session generates at a time: calling generateResponse() on a second session while another is still running blocks until the first finishes. Generation is not parallel. This is intentional: parallel on-device inference would contend for the accelerator and risk OOM.
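The serialization described above can be pictured with a small async lock in plain Dart. This is only a sketch of the general technique (chaining each job onto the previous one's completion); `GenerationLock` and `fakeGenerate` are hypothetical names, and the plugin's internal mechanism is not shown here.

```dart
import 'dart:async';

/// Minimal async lock: each job starts only after the previous one finished.
/// Hypothetical helper for illustration; not part of flutter_gemma.
class GenerationLock {
  Future<void> _tail = Future<void>.value();

  Future<T> run<T>(Future<T> Function() job) {
    final previous = _tail;
    final done = Completer<void>();
    _tail = done.future;
    // Release the lock whether the job succeeds or throws.
    return previous.then((_) => job()).whenComplete(() => done.complete());
  }
}

Future<void> main() async {
  final lock = GenerationLock();
  final log = <String>[];

  // Stand-in for a session's generation call.
  Future<String> fakeGenerate(String session, int ms) => lock.run(() async {
        log.add('$session start');
        await Future<void>.delayed(Duration(milliseconds: ms));
        log.add('$session end');
        return session;
      });

  // Both calls are issued together, but B waits until A has finished.
  await Future.wait([fakeGenerate('A', 30), fakeGenerate('B', 5)]);
  print(log); // [A start, A end, B start, B end]
}
```

Even though B's work is shorter, it cannot overtake A, which matches the "blocks until the first finishes" behavior described above.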

Per-platform behavior (transparent to your code; the API is identical):

| Path | How it works |
| --- | --- |
| .litertlm, native (Android/iOS/macOS/Windows/Linux) | Engine allows one live conversation; sessions multiplex, and the active session's history is replayed on switch. |
| .litertlm, web (@litert-lm/core) | Separate conversations; generation still serialized. |
| .task, MediaPipe (Android/iOS) | N real LlmInferenceSession instances live at once (each with its own KV cache); generation serialized by a mutex. |
| .task, MediaPipe web | ❌ Not yet: openSession() throws UnsupportedError. Planned for a future release. |

Memory: each open session holds its own context (~100–500 MB depending on model + maxTokens). On phones with large models (Gemma 4 E2B+), several concurrent sessions can OOM. Cap the count with maxConcurrentSessions: on getActiveModel(...); openSession() throws StateError past the cap. Multi-session is most reliable on desktop and high-end mobile with small models (Gemma 3 1B / 270M).
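One way to pick a value for maxConcurrentSessions is to divide a memory budget by the worst-case per-session context size quoted above. The helper below is a back-of-the-envelope sketch: `sessionCapFor` is a hypothetical function, and the budget figures are assumptions you would replace with measurements from your own devices.

```dart
/// Largest session count whose worst-case context memory fits the budget.
/// Always allows at least one session. Hypothetical helper for illustration.
int sessionCapFor({required int budgetMb, int perSessionMb = 500}) {
  if (budgetMb < 0 || perSessionMb <= 0) {
    throw ArgumentError('budgetMb must be >= 0 and perSessionMb must be > 0');
  }
  final cap = budgetMb ~/ perSessionMb; // integer division
  return cap < 1 ? 1 : cap;
}

void main() {
  print(sessionCapFor(budgetMb: 1500)); // 3
  print(sessionCapFor(budgetMb: 1500, perSessionMb: 100)); // 15
  print(sessionCapFor(budgetMb: 300)); // 1
}
```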

Installation Sources

// Network: .litertlm is the cross-platform default (Android/iOS/Desktop).
// For mobile-only or web-only apps you can substitute a .task URL of the
// same model.
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork('https://example.com/model.litertlm', token: 'optional')
  .install();

// Flutter assets
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromAsset('assets/models/model.litertlm')
  .install();

// Native bundle
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromBundled('model.litertlm')
  .install();

// External file
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromFile('/path/to/model.litertlm')
  .install();

Modern API vs Legacy API

Benefits of the Modern API:

  • ✅ Cleaner, more intuitive
  • ✅ Type-safe ModelSource
  • ✅ Automatic active model management
  • ✅ Install once, create many instances

Usage:

await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
final model = await FlutterGemma.getActiveModel(maxTokens: 2048);

Legacy API ⚠️ Deprecated

⚠️ DEPRECATED: This API is maintained for backwards compatibility only. New projects should use the Modern API above.

Still works but requires manual ModelType specification:

final model = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,  // Must specify every time
  maxTokens: 2048,
);

Initialize Flutter Gemma

Call FlutterGemma.initialize(...) once in main() and register the opt-in packages you added to pubspec.yaml. Core registers no engine on its own, so without this step getActiveModel() / createEmbeddingModel() throw a clear "add the engine package" error.

import 'package:flutter/widgets.dart';
import 'package:flutter_gemma/flutter_gemma.dart';
import 'package:flutter_gemma_litertlm/flutter_gemma_litertlm.dart';
import 'package:flutter_gemma_mediapipe/flutter_gemma_mediapipe.dart';
import 'package:flutter_gemma_embeddings/flutter_gemma_embeddings.dart';
import 'package:flutter_gemma_rag_qdrant/flutter_gemma_rag_qdrant.dart';

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  FlutterGemma.initialize(
    // Inference engines: add the ones whose packages you depend on.
    inferenceEngines: const [
      LiteRtLmEngine(),     // flutter_gemma_litertlm: .litertlm models
      MediaPipeEngine(),    // flutter_gemma_mediapipe: .task / .bin models
    ],
    // Optional: embeddings (needed for RAG / generateEmbedding).
    embeddingBackends: const [
      LiteRtEmbeddingBackend(), // flutter_gemma_embeddings
    ],
    // Optional: RAG vector store (pick one; native here).
    vectorStore: QdrantVectorStore(), // flutter_gemma_rag_qdrant

    // Common settings:
    huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
    maxDownloadRetries: 10,
  );

  runApp(MyApp());
}

Which parameter ← which package:

| Parameter | Provided by | Notes |
| --- | --- | --- |
| inferenceEngines: [LiteRtLmEngine()] | flutter_gemma_litertlm | .litertlm (mobile + desktop + web) |
| inferenceEngines: [MediaPipeEngine()] | flutter_gemma_mediapipe | .task / .bin (mobile + web) |
| embeddingBackends: [LiteRtEmbeddingBackend()] | flutter_gemma_embeddings | text embeddings |
| vectorStore: QdrantVectorStore() | flutter_gemma_rag_qdrant | native RAG |
| vectorStore: SqliteVectorStore() / WebSqliteVectorStore() | flutter_gemma_rag_sqlite | native / web RAG |

Add only the engines you ship. Passing both LiteRtLmEngine() and MediaPipeEngine() lets one app run both formats: the registry routes each model to the engine that handles its file type. On web, choose vectorStore: WebSqliteVectorStore() (flutter_gemma_rag_qdrant is native-only).
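The routing rule described above can be pictured with a small stand-alone sketch. It is not the plugin's registry: the enum and function names are hypothetical, and only the extension-to-engine mapping (.litertlm for LiteRT-LM, .task / .bin for MediaPipe) is taken from the table.

```dart
/// Hypothetical stand-in for the two engine families named above.
enum EngineKind { liteRtLm, mediaPipe }

/// Picks an engine family from the model file extension.
/// Returns null when no engine in this sketch handles the format.
EngineKind? engineForModelFile(String fileName) {
  final name = fileName.toLowerCase();
  if (name.endsWith('.litertlm')) return EngineKind.liteRtLm;
  if (name.endsWith('.task') || name.endsWith('.bin')) {
    return EngineKind.mediaPipe;
  }
  return null;
}
```

A null result corresponds to the case where the matching engine package was never registered, which is when the plugin reports its "add the engine package" error.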

Common settings:

  • huggingFaceToken: Authentication token for gated models (Gemma3n, EmbeddingGemma)
  • maxDownloadRetries: Number of retry attempts for failed downloads (default: 10)
  • webStorageMode: (Web only) Storage strategy for model files (default: cacheApi)
    • WebStorageMode.cacheApi: Cache API with Blob URLs (for models <2GB)
    • WebStorageMode.streaming: OPFS streaming (for large models >2GB like E4B, 7B)
    • WebStorageMode.none: No caching (ephemeral mode for testing)

Use WebStorageMode.streaming when shipping .litertlm web models: the @litert-lm/core engine consumes an OPFS ReadableStream and avoids Chrome's ~2 GB blob-fetch limit on Gemma 4 E2B/E4B web builds.
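The guidance above boils down to a simple decision, sketched here in plain Dart. StorageChoice and chooseWebStorage are hypothetical names used so the snippet runs without the plugin; the 2 GB boundary is interpreted as 2 GiB, which is an assumption.

```dart
/// Hypothetical mirror of the two persistent web storage modes.
enum StorageChoice { cacheApi, streaming }

/// Assumed byte value for the "~2 GB" limit mentioned in the text.
const int twoGiB = 2 * 1024 * 1024 * 1024;

/// Chooses streaming for .litertlm web models and for anything at or
/// above the 2 GiB boundary; otherwise falls back to the Cache API.
StorageChoice chooseWebStorage({
  required int modelBytes,
  required bool isLiteRtLm,
}) {
  if (isLiteRtLm || modelBytes >= twoGiB) return StorageChoice.streaming;
  return StorageChoice.cacheApi;
}
```

In an app, the chosen value would map onto WebStorageMode.cacheApi or WebStorageMode.streaming when calling FlutterGemma.initialize(webStorageMode: ...).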

HuggingFace Authentication 🔐

Many models require authentication to download from HuggingFace. Never commit tokens to version control.

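Recommended: keep the token in a git-ignored config.json and pass it at build time with --dart-define-from-file, as in the steps below. This keeps the token out of source code in both development and production builds.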

Step 1: Create config template file config.json.example:

{
  "HUGGINGFACE_TOKEN": ""
}

Step 2: Copy and add your token:

cp config.json.example config.json
# Edit config.json and add your token from https://huggingface.co/settings/tokens

Step 3: Add to .gitignore:

# Never commit tokens!
config.json

Step 4: Run with config:

flutter run --dart-define-from-file=config.json

Step 5: Access in code:

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Read from environment (populated by --dart-define-from-file)
  const token = String.fromEnvironment('HUGGINGFACE_TOKEN');

  // Initialize with token (optional if all models are public)
  FlutterGemma.initialize(
    huggingFaceToken: token.isNotEmpty ? token : null,
  );

  runApp(MyApp());
}

Alternative: Environment Variables

export HUGGINGFACE_TOKEN=hf_your_token_here
flutter run --dart-define=HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN

Alternative: Per-Download Token

// Pass token directly for specific downloads
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/resolve/main/gemma-3n-E2B-it-int4.litertlm',
    token: 'hf_your_token_here',  // ⚠️ Not recommended - use config.json
  )
  .install();

Which Models Require Authentication?

Common gated models:

  • ✅ Gemma3n (E2B, E4B) - google/ repos are gated
  • ✅ Gemma 3 1B - litert-community/ requires access
  • ✅ Gemma 3 270M - litert-community/ requires access
  • ✅ EmbeddingGemma - litert-community/ requires access

Public models (no auth needed):

  • ❌ DeepSeek, Qwen3, Qwen 2.5, SmolLM, Phi-4, FastVLM - Public repos

Get your token: https://huggingface.co/settings/tokens

Grant access to gated repos: Visit model page → "Request Access" button

Model Sources 📦

Flutter Gemma supports multiple model sources with different capabilities:

| Source Type | Platform | Progress | Resume | Authentication | Use Case |
| --- | --- | --- | --- | --- | --- |
| NetworkSource | All | ✅ Detailed | ⚠️ Server-dependent | ✅ Supported | HuggingFace, CDNs, private servers |
| AssetSource | All | ⚠️ End only | ❌ No | ❌ N/A | Models bundled in app assets |
| BundledSource | All | ⚠️ End only | ❌ No | ❌ N/A | Native platform resources |
| FileSource | Native (no Web) | ⚠️ End only | ❌ No | ❌ N/A | User-selected files (file picker) |

NetworkSource - Internet Downloads

Downloads models from HTTP/HTTPS URLs with full progress tracking and authentication.

Features:

  • ✅ Progress tracking (0-100%)
  • ⚠️ Resume after interruption (server-dependent, not supported by HuggingFace CDN)
  • ✅ HuggingFace authentication
  • ✅ Smart retry logic with exponential backoff
  • ✅ Background downloads on mobile
  • ✅ Cancellable downloads with CancelToken
  • ✅ Android foreground service for large downloads (>500MB)
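For readers unfamiliar with exponential backoff, here is a minimal sketch of how retry delays typically grow. It is not the plugin's retry code: the function name, the 1-second base, and the 60-second cap are assumptions chosen for illustration.

```dart
/// Hypothetical helper: delay before retry number [attempt] (0-based).
/// Doubles the base delay on each attempt and clamps it at [cap].
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1), // Assumed base delay.
  Duration cap = const Duration(seconds: 60), // Assumed maximum delay.
}) {
  if (attempt < 0) {
    throw ArgumentError.value(attempt, 'attempt', 'must be >= 0');
  }
  // Bound the shift so the multiplier stays small for any attempt count.
  final exponent = attempt > 20 ? 20 : attempt;
  final delay = base * (1 << exponent);
  return delay > cap ? cap : delay;
}
```

With these defaults the delays run 1 s, 2 s, 4 s, 8 s and so on, levelling off at the cap; maxDownloadRetries bounds how many such attempts are made.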

Example:

// Public model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork('https://example.com/model.litertlm')
  .withProgress((progress) => print('$progress%'))
  .install();

// Private model with authentication
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-lm/resolve/main/gemma-3n-E2B-it-int4.litertlm',
    token: 'hf_...',  // Or use FlutterGemma.initialize(huggingFaceToken: ...)
  )
  .withProgress((progress) => setState(() => _progress = progress))
  .install();

Android Foreground Service (Large Downloads):

Android has a 9-minute background execution limit. For large models (>500MB), you can use foreground service mode which shows a notification but bypasses this timeout:

// Auto-detect based on file size (>500MB = foreground) - DEFAULT
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url)  // foreground: null (auto-detect)
  .install();

// Force foreground mode (always show notification)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: true)
  .install();

// Force background mode (may fail for large files)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: false)
  .install();

Foreground Parameter:

  • null (default): Auto-detect based on file size. Files >500MB use foreground service.
  • true: Always use foreground service (shows notification, no timeout)
  • false: Never use foreground service (subject to 9-minute timeout)

Note: iOS uses native URLSession which handles long downloads automatically - no foreground service needed.
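The three foreground values described above can be summarised as a small decision function. This is a sketch of the documented behaviour, not the plugin's source: the function name is hypothetical, 500MB is read as 500 * 1024 * 1024 bytes, and the fallback for an unknown file size is an assumption.

```dart
/// Assumed byte value for the 500MB auto-detect threshold.
const int foregroundThresholdBytes = 500 * 1024 * 1024;

/// Hypothetical helper mirroring the foreground parameter semantics:
/// an explicit true/false wins, null auto-detects from the file size.
bool shouldUseForeground({bool? foreground, int? fileSizeBytes}) {
  if (foreground != null) return foreground;
  // Assumption: when the size is unknown, this sketch stays in background.
  if (fileSizeBytes == null) return false;
  return fileSizeBytes > foregroundThresholdBytes;
}
```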

Cancelling Downloads:

Use CancelToken to cancel downloads in progress:

import 'package:flutter_gemma/core/model_management/cancel_token.dart';

// Create cancel token
final cancelToken = CancelToken();

// Start download with cancel token
final future = FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(url)
  .withCancelToken(cancelToken)  // Pass the cancel token via the builder
  .withProgress((progress) => print('Progress: $progress%'))
  .install();

// Cancel download from another part of your code
// (e.g., user pressed cancel button)
cancelToken.cancel('User cancelled download');

// Handle cancellation
try {
  await future;
  print('Download completed');
} catch (e) {
  if (CancelToken.isCancel(e)) {
    print('Download was cancelled by user');
  } else {
    print('Download failed: $e');
  }
}

// Check if cancelled
if (cancelToken.isCancelled) {
  print('Reason: ${cancelToken.cancelReason}');
}

CancelToken Features:

  • ✅ Non-breaking: Optional parameter, existing code works without changes
  • ✅ Works with network downloads (inference + embedding models)
  • ✅ Cancels ALL files in multi-file downloads (embedding: model + tokenizer)
  • ✅ Platform-independent (Mobile + Web)
  • ✅ Throws DownloadCancelledException for proper error handling
  • ✅ Thread-safe cancellation

AssetSource - Flutter Assets

Copies models from Flutter assets (declared in pubspec.yaml).

Features:

  • ✅ No network required
  • ✅ Fast installation (local copy)
  • ⚠️ Increases app size significantly
  • ✅ Works offline

Example:

// 1. Add to pubspec.yaml
// assets:
//   - models/gemma3-1b-it.litertlm

// 2. Install from asset
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('models/gemma3-1b-it.litertlm')
  .install();

BundledSource - Native Resources

Production-Ready Offline Models: Include small models directly in your app bundle for instant availability without downloads.

Use Cases:

  • ✅ Offline-first applications (works without internet from first launch)
  • ✅ Small models (Gemma 3 270M ~300MB)
  • ✅ Core features requiring guaranteed availability
  • ⚠️ Not for large models (increases app size significantly)

Platform Setup:

Android (android/app/src/main/assets/models/)

# Place your model file. .litertlm works for both Android and Desktop,
# .task is MediaPipe-only and won't load on Desktop.
android/app/src/main/assets/models/gemma3-270m-it-q8.litertlm

iOS (Add to Xcode project)

  1. Drag model file into Xcode project (.litertlm for FFI; .task for MediaPipe)
  2. Check "Copy items if needed"
  3. Add to target membership

Web (static files in the web/ directory). Web uses MediaPipe only, so use .task (or -web.task):

# Place model files in web/ directory
example/web/gemma3-270m-it.task

# Files are automatically copied to build/web/ during production build
flutter build web

⚠️ Web Platform Limitation:

  • Production only: Bundled resources work ONLY in production builds (flutter build web)
  • Debug mode: Files in web/ are NOT served by flutter run dev server
  • For development: Use NetworkSource or AssetSource instead

Features:

  • ✅ Zero network dependency
  • ✅ No installation delay
  • ✅ No storage permission needed
  • ✅ Direct path usage (no file copying)

Example:

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromBundled('gemma3-270m-it-q8.litertlm')
  .install();

App Size Impact:

  • SmolLM 135M: ~135MB
  • Gemma 3 270M: ~300MB
  • Qwen3 0.6B: ~586MB
  • Consider hosting large models for download instead

FileSource - External Files (Native)

References external files (e.g., user-selected via file picker). Works on Android, iOS, macOS, Linux, Windows. Not available on Web (no local file system).

Features:

  • ✅ No copying (references original file)
  • ✅ Protected from cleanup
  • ❌ Web not supported (no local file system)

Example:

// Native only - after user selects file with file_picker
final path = '/data/user/0/com.app/files/model.litertlm';
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromFile(path)
  .install();

Important: On web, FileSource only works with URLs or asset paths, not local file system paths.

Migration from Legacy to Modern API 🔄

If you're upgrading from the Legacy API, here are common migration patterns:

Installing Models

Legacy API (network download):

// Network download
final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: 'https://example.com/model.bin',
);

await FlutterGemmaPlugin.instance.modelManager
  .downloadModelWithProgress(spec, token: token)
  .listen((progress) {
    print('${progress.overallProgress}%');
  });

Modern API (network download):

// Network download
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://example.com/model.bin',
    token: token,
  )
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

Legacy API (from assets):

// From assets
await modelManager.installModelFromAssetWithProgress(
  'model.bin',
  loraPath: 'lora.bin',
).listen((progress) {
  print('$progress%');
});

Modern API (from assets):

// From assets
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

// LoRA weights can be installed with the model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withLoraFromAsset('lora.bin')
  .install();

Checking Model Installation

Legacy API:

final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: url,
);

final isInstalled = await FlutterGemmaPlugin
  .instance.modelManager
  .isModelInstalled(spec);

Modern API:

final isInstalled = await FlutterGemma
  .isModelInstalled('model.bin');

Key Migration Notes

  • ✅ Simpler imports: Use package:flutter_gemma/core/api/flutter_gemma.dart
  • ✅ Builder pattern: Chain methods for cleaner code
  • ✅ Callback-based progress: Simpler than streams for most cases
  • ✅ Type-safe sources: Compile-time validation of source types
  • ⚠️ Breaking change: Progress values are now int (0-100) instead of DownloadProgress object
  • ⚠️ Separate files: Model and LoRA weights installed independently

Model Creation and Inference

Modern API (Recommended):

// Create model with runtime configuration
final inferenceModel = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Legacy API (Still supported):

// Works with both Legacy and Modern installation methods
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,
  preferredBackend: PreferredBackend.gpu,
  maxTokens: 2048,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Usage (Legacy API) ⚠️ DEPRECATED

The pre-Modern stream-based API (FlutterGemmaPlugin.instance.modelManager, installModelFromAsset, downloadModelFromNetworkWithProgress, etc.) is still supported but deprecated. New projects should use the Modern API above.

📚 Full Legacy API reference: docs/LEGACY_API.md

🖼️ Message Types

The plugin now supports different types of messages:

// Text only
final textMessage = Message.text(text: "Hello!", isUser: true);

// Text + Image
final multimodalMessage = Message.withImages(
  text: "What's in this image?",
  imageBytes: [imageBytes],
  isUser: true,
);

// Image only
final imageMessage = Message.imagesOnly(imageBytes: [imageBytes], isUser: true);

// Tool response (for function calling)
final toolMessage = Message.toolResponse(
  toolName: 'change_background_color',
  response: {'status': 'success', 'color': 'blue'},
);

// System information message
final systemMessage = Message.systemInfo(text: "Function completed successfully");

// Thinking content (for models with thinking mode, e.g. DeepSeek, Gemma 4)
final thinkingMessage = Message.thinking(text: "Let me analyze this problem...");

// Check if message contains image
if (message.hasImage) {
  print('This message contains an image');
}

// Create a copy of message
final copiedMessage = message.copyWith(text: "Updated text");

💬 Response Types

The model can return different types of responses depending on capabilities:

// Handle different response types
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text token: ${response.token}');
    // Use response.token to update your UI incrementally
    
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function (Gemma 4, Gemma3n, Gemma 3 1B,
    // FunctionGemma, DeepSeek, Qwen3, Qwen 2.5, Phi-4)
    print('Function: ${response.name}');
    print('Arguments: ${response.args}');
    
    // Execute the function and send response back
    _handleFunctionCall(response);
  } else if (response is ThinkingResponse) {
    // Model's reasoning process (models with thinking mode, e.g. DeepSeek, Gemma 4)
    print('Thinking: ${response.content}');
    
    // Show thinking process in UI
    _showThinkingBubble(response.content);
  }
});

Response Types:

  • TextResponse: Contains a text token (response.token) for regular model output
  • FunctionCallResponse: Contains function name (response.name) and arguments (response.args) when the model wants to call a function
  • ThinkingResponse: Contains the model's reasoning process (response.content) for models with thinking mode enabled (DeepSeek, Gemma 4)

🎯 Supported Models

Platform Support

| Model | Size | Desktop | Mobile | Web |
| --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2.4GB | ✅ | ✅ | ✅ |
| Gemma 4 E4B | 4.3GB | ✅ | ✅ | ✅ |
| Gemma3n E2B | 3.1GB | ✅ | ✅ | ✅ |
| Gemma3n E4B | 6.5GB | ✅ | ✅ | ✅ |
| FastVLM 0.5B | 0.5GB | ✅ | ❌ | ❌ |
| Gemma-3 1B | 0.5GB | ✅ | ✅ | ✅ |
| Gemma 3 270M | 0.3GB | ✅ | ✅ | ✅ |
| FunctionGemma 270M | 284MB | ✅ | ✅ | ❌ |
| Qwen3 0.6B | 586MB | ✅ | ✅ | ✅ |
| Qwen 2.5 1.5B | 1.6GB | ✅ | ✅ | ❌ |
| Qwen 2.5 0.5B | 0.5GB | ❌ | ✅ | ❌ |
| SmolLM 135M | 135MB | ❌ | ✅ | ❌ |
| Phi-4 Mini | 3.9GB | ✅ | ✅ | ✅ |
| DeepSeek R1 | 1.7GB | ❌ | ✅ | ❌ |

📊 Text Embedding Models

All embedding models generate 768-dimensional vectors. The numbers in names (64/256/512/1024/2048) indicate maximum input sequence length in tokens, not embedding dimension.

| Model | Parameters | Dimensions | Max Seq Length | Size | Best For | Auth Required |
| --- | --- | --- | --- | --- | --- | --- |
| Gecko 64 | 110M | 768D | 64 tokens | 110MB | Short queries, real-time search | ❌ |
| Gecko 256 | 110M | 768D | 256 tokens | 114MB | Balanced speed/accuracy | ❌ |
| Gecko 512 | 110M | 768D | 512 tokens | 116MB | Medium context documents | ❌ |
| EmbeddingGemma 256 | 300M | 768D | 256 tokens | 179MB | High accuracy, short context | ✅ |
| EmbeddingGemma 512 | 300M | 768D | 512 tokens | 179MB | High accuracy, medium context | ✅ |
| EmbeddingGemma 1024 | 300M | 768D | 1024 tokens | 183MB | Long documents, detailed content | ✅ |
| EmbeddingGemma 2048 | 300M | 768D | 2048 tokens | 196MB | Very long documents | ✅ |

Performance Comparison (Android Pixel 8 with GPU acceleration):

  • Gecko 64: ~109ms/doc embedding, 130ms search (⚡ fastest - 2.6x faster than EmbeddingGemma)
  • EmbeddingGemma 256: ~286ms/doc embedding, 342ms search (🎯 more accurate - 300M vs 110M params)

Use Cases:

  • ✅ Gecko 64: Real-time search, mobile apps, short queries (≤64 tokens), fast inference
  • ✅ Gecko 256/512: Balanced use cases, general-purpose embeddings, good speed/quality tradeoff
  • ✅ EmbeddingGemma 256/512: High-quality embeddings, semantic search, better accuracy
  • ✅ EmbeddingGemma 1024/2048: Long documents, detailed content, research papers, articles
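Embedding vectors such as the 768-dimensional ones above are commonly compared with cosine similarity. The text does not state which metric the plugin's vector store uses, so the sketch below is only an illustration of the general technique; the function name is hypothetical.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 when either vector has zero length (a convention chosen
/// for this sketch to avoid dividing by zero).
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length');
  }
  var dot = 0.0;
  var normA = 0.0;
  var normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}
```

Scores close to 1 indicate vectors pointing in nearly the same direction; a search threshold then discards results whose score falls below it.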

🔎 On-device RAG / Vector Store

Native platforms (Android, iOS, macOS, Linux, Windows) use qdrant-edge as the default vector store since 0.16. Web stays on wa-sqlite (qdrant-edge can't target WASM yet). The Dart API is the same on both, so code is portable across platforms.

import 'package:flutter_gemma/flutter_gemma.dart';

// 1. Install an embedding model (any of Gecko / EmbeddingGemma)
await FlutterGemma.installEmbedder()
    .modelFromNetwork(
      'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/embeddinggemma-300M_seq256_mixed-precision.tflite',
      token: 'hf_...',
    )
    .tokenizerFromNetwork(
      'https://huggingface.co/litert-community/embeddinggemma-300m/resolve/main/sentencepiece.model',
      token: 'hf_...',
    )
    .install();

// 2. Initialize the vector store (one shard per database path)
await FlutterGemmaPlugin.instance.initializeVectorStore('rag_store');

// 3. Add documents and let the plugin compute embeddings for you
for (final doc in docs) {
  await FlutterGemmaPlugin.instance.addDocument(
    id: doc.id,
    content: doc.content,
    metadata: '{"category":"science","lang":"en"}',
  );
}

// 3b. Or batch-embed yourself and feed pre-computed vectors via
//     addDocumentWithEmbedding(...) for higher throughput.
final embedder = FlutterGemmaPlugin.instance.initializedEmbeddingModel!;
final embeddings = await embedder.generateEmbeddings(
  docs.map((d) => d.content).toList(),
  taskType: TaskType.retrievalDocument,
);
for (var i = 0; i < docs.length; i++) {
  await FlutterGemmaPlugin.instance.addDocumentWithEmbedding(
    id: docs[i].id,
    content: docs[i].content,
    embedding: embeddings[i],
    metadata: '{"category":"science","lang":"en"}',
  );
}

// 4. Semantic search, with optional payload-aware Filter (native only)
final results = await FlutterGemmaPlugin.instance.searchSimilar(
  query: 'quantum entanglement',
  topK: 10,
  threshold: 0.0,
  filter: Filter(
    must: [FieldEquals(key: 'category', value: 'science')],
    mustNot: [FieldEquals(key: 'lang', value: 'fr')],
  ),
);

Filter supports must / should / mustNot lists of FieldEquals, FieldRange, FieldMatchAny conditions. On Web the filter argument is silently ignored because wa-sqlite has no payload-filter support.
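Because the filter is ignored on Web, one option is to post-filter results in Dart. The sketch below shows equality-only must / mustNot matching over a metadata JSON string. It assumes Qdrant-style semantics (every must condition matches, no mustNot condition matches), uses a hypothetical function name, and leaves out should, FieldRange and FieldMatchAny.

```dart
import 'dart:convert';

/// Hypothetical client-side check: does a document's metadata JSON satisfy
/// equality conditions? Assumed semantics: all [must] pairs match and no
/// [mustNot] pair matches.
bool matchesFilter(
  String metadataJson, {
  Map<String, Object?> must = const {},
  Map<String, Object?> mustNot = const {},
}) {
  final payload = jsonDecode(metadataJson) as Map<String, dynamic>;
  final allMust = must.entries.every((e) => payload[e.key] == e.value);
  final noneMustNot = mustNot.entries.every((e) => payload[e.key] != e.value);
  return allMust && noneMustNot;
}
```

For the metadata used in the example above, matchesFilter(metadata, must: {'category': 'science'}, mustNot: {'lang': 'fr'}) accepts English science documents and rejects French ones.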

Benchmarks comparing qdrant-edge to the legacy sqlite + local_hnsw backend across 5 platforms (5 000 documents, EmbeddingGemma 300M, 768-dim): see example/integration_test/benchmarks/comparison.md.

🛠️ Model Function Calling Support

Function calling is currently supported by the following models:

✅ Models with Function Calling Support

  • Gemma 4 (E2B, E4B) - Full function calling support
  • Gemma3n (E2B, E4B) - Full function calling support
  • Gemma 3 1B - Function calling support
  • FunctionGemma 270M - Google's specialized function calling model
  • DeepSeek R1 - Function calling + thinking mode support
  • Qwen models (0.5B, 0.6B, 1.5B) - Full function calling support
  • Phi-4 Mini - Advanced reasoning with function calling support

❌ Models WITHOUT Function Calling Support

  • Gemma 3 270M - Text generation only
  • SmolLM 135M - Text generation only
  • FastVLM 0.5B - Vision model, no function calling

Important Notes:

  • When using unsupported models with tools, the plugin will log a warning and ignore the tools
  • Models will work normally for text generation even if function calling is not supported
  • Check the supportsFunctionCalls property in your model configuration

Platform Support Details 🌐

Feature Comparison

| Feature | Android | iOS | Web | Desktop | Notes |
| --- | --- | --- | --- | --- | --- |
| Text Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | All models supported |
| Image Input (Multimodal) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Verified on macOS Metal and Linux Vulkan (Gemma 4 + Gemma 3n) |
| Audio Input | ✅ Full | ✅ Full ¹ | ❌ Not supported | ✅ .litertlm only | Gemma3n E2B/E4B + Gemma 4; iOS device-only; Desktop via FFI |
| Function Calling | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Gemma 4 native (SDK chat template) |
| Thinking Mode | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | Gemma 4 / DeepSeek / Qwen3; Web MediaPipe has no extraContext |
| Stop Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Cancel mid-process |
| GPU Acceleration | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Metal/WebGPU/Vulkan/DX12 |
| NPU Acceleration | ✅ Full | ❌ Not supported | ❌ Not supported | ✅ Windows | Android (.litertlm) + Windows Intel LunarLake/PantherLake |
| CPU Backend | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | MediaPipe limitation |
| Streaming Responses | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Real-time generation |
| LoRA Support | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| Text Embeddings | ✅ Full | ✅ Full | ✅ Full | ✅ Full | EmbeddingGemma, Gecko |
| VectorStore (RAG) | ✅ qdrant-edge | ✅ qdrant-edge | ✅ wa-sqlite (WASM) | ✅ qdrant-edge | Semantic search + payload Filter (native) |
| File Downloads | ✅ Background | ✅ Background | ✅ In-memory | ✅ Background | Platform-specific |
| Asset Loading | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Flutter assets N/A |
| Bundled Resources | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Native bundles only |
| External Files (FileSource) | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | No local FS on web |

Web column note: the Web ✅ marks above describe the MediaPipe .task web path (image input, function calling, etc.). Thinking Mode is not available on Web because MediaPipe web exposes no extraContext hook. The newer web .litertlm path (@litert-lm/core) is an early-preview subset: text-only, no vision/audio/thinking/function-calling. See Web .litertlm support & limitations.

Web Platform Specifics

Authentication

  • Required for gated models: Gemma3n, Gemma 3 1B/270M, EmbeddingGemma
  • Configuration: Use FlutterGemma.initialize(huggingFaceToken: '...') or pass token per-download
  • Storage: Tokens stored in browser memory (not localStorage)

File Handling

  • Downloads: Creates blob URLs in browser memory (no actual files)
  • Storage: IndexedDB via WebFileSystemService
  • FileSource: Only works with HTTP/HTTPS URLs or assets/ paths
  • Local file paths: ❌ Not supported (browser security restriction)

Web Storage Modes

Three Storage Modes:

1. Cache API Mode (default, WebStorageMode.cacheApi):

  • Uses browser Cache API with Blob URLs
  • Models persist across browser restarts
  • Best for models <2GB

2. Streaming Mode (WebStorageMode.streaming):

  • Uses OPFS with ReadableStream
  • Bypasses browser 2GB ArrayBuffer limit
  • Required for large models (E4B 4GB+, 7B, 27B)
  • Requires Chrome 86+, Edge 86+, Safari 15.2+

3. Ephemeral Mode (WebStorageMode.none):

  • Models stored in memory only
  • Cleared when browser closes
  • For testing/demos

// Default: Cache API for small models
FlutterGemma.initialize(webStorageMode: WebStorageMode.cacheApi);

// Streaming for large models (>2GB)
FlutterGemma.initialize(webStorageMode: WebStorageMode.streaming);

// Check if streaming is supported
final supported = await FlutterGemma.isStreamingSupported();

Backend Support

  • GPU only: WebGPU is required; there is no CPU backend on web (MediaPipe limitation)

CORS Configuration

  • Required for custom servers: Enable CORS headers on your model hosting server
  • Firebase Storage: See CORS configuration docs
  • HuggingFace: CORS already configured correctly

Memory Limitations

  • Large models: May hit browser memory limits (2GB typical)
  • Recommended: Use smaller models (1B-2B) for web platform
  • Best models for web:
    • Gemma 3 270M (300MB)
    • Gemma 3 1B (500MB-1GB)
    • Gemma3n E2B (3GB) - requires 6GB+ device RAM

Browser Cache Storage Limits

| Browser | Max Model Size | Notes |
| --- | --- | --- |
| Chrome/Firefox | ~2 GB | ArrayBuffer limit |
| Safari | ~50 MB | ⚠️ Not suitable |

Web .litertlm support & limitations

Web .litertlm inference (added in 0.16.2) runs Gemma .litertlm models (verified on Gemma 4 E2B/E4B web variants) in the browser through the upstream @litert-lm/core package (WebGPU + WASM). It is an early preview and is intentionally a subset of the native .litertlm path. MediaPipe .task on web is unaffected and remains fully supported.

Works on web .litertlm:

  • ✅ Text generation (sync getResponse() and streaming getResponseAsync())
  • ✅ Multi-turn chat with history (createChat / openChat)
  • ✅ System instruction (via the conversation preface)
  • ✅ Concurrent sessions (openSession) with serialized inference (see Concurrent sessions)
  • ✅ Large models via OPFS streaming (WebStorageMode.streaming), which bypasses Chrome's ~2 GB blob limit
  • ✅ GPU only (WebGPU is required; there is no CPU backend on web)

Not supported on web .litertlm yet (mobile/desktop only):

  • ❌ Vision / image input: @litert-lm/core does not expose the Vision executor config; image inputs are dropped with a debug warning
  • ❌ Audio input: same reason (no Audio executor config in the JS API)
  • ❌ Thinking mode: the extraContext thinking channel is not wired on web
  • ❌ Function calling / tool calls: prefill+decode tool models aren't available on the web runtime
  • ❌ LoRA weights: loraPath throws UnsupportedError
  • ⚠️ stopGeneration(): closes the local Dart stream and calls the upstream conversation.cancel() to abort generation; the cancel is best-effort (the early-preview JS API may throw if nothing is in flight, which is swallowed)
  • ⚠️ WebStorageMode.none + model > 2 GB: the engine fetch()es the in-memory blob and trips Chrome's ERR_BLOB_OUT_OF_MEMORY; use WebStorageMode.streaming for large models

These limits track the upstream @litert-lm/core early-preview API and will lift as Google extends the JS executor surface. For full vision / audio / thinking / function calling on web today, use MediaPipe .task web models instead.

Mobile Platform Specifics

Android

  • GPU Support: Requires OpenGL libraries in AndroidManifest.xml
  • ProGuard: Automatic rules included for release builds
  • Storage: Local file system in app documents directory

iOS

  • Minimum version: iOS 16.0 required for MediaPipe GenAI
  • Memory entitlements: Required for large models (see Setup section)
  • Linking: Static linking required (use_frameworks! :linkage => :static)
  • Storage: Local file system in app documents directory

Desktop Platform Specifics

Storage Locations

Desktop builds store downloaded models outside the user's Documents/ folder to avoid OneDrive / iCloud / Domain-Roaming sync corrupting FFI mmap of large .litertlm files (since 0.15.1):

  • Windows: %LOCALAPPDATA%\flutter_gemma\ (never OneDrive-synced)
  • macOS: ~/Library/Application Support/<bundle>/flutter_gemma/
  • Linux: ~/.local/share/<app>/flutter_gemma/

Models installed by 0.14.x / 0.15.0 builds that still live under Documents/ keep working via a fallback read (a debug log nudges users to re-install once for migration).

A complete example is available in the example folder.

Important Considerations

  • Model Size: Larger models (such as 7b and 7b-it) might be too resource-intensive for on-device inference.
  • Function Calling Support: Gemma 4, Gemma3n, Gemma 3 1B, FunctionGemma, DeepSeek, Qwen3, Qwen 2.5, and Phi-4 models support function calling. Other models will ignore tools and show a warning. See Model Function Calling Support.
  • Thinking Mode: Gemma 4, DeepSeek, and Qwen3 models support thinking mode. Enable with isThinking: true on the matching ModelType.
  • Multimodal Models: Gemma3n models with vision support require more memory and are recommended for devices with 8GB+ RAM.
  • iOS Memory Requirements: Large models require memory entitlements in Runner.entitlements and minimum iOS 16.0.
  • LoRA Weights: They provide efficient customization without the need for full model retraining.
  • Development vs. Production: For production apps, do not embed the model or LoRA weights within your assets. Instead, load them once and store them securely on the device or via a network drive.
  • Web Models: Currently, Web support is available only for GPU backend models. Multimodal support is fully implemented.
  • Image Formats: The plugin automatically handles common image formats (JPEG, PNG, etc.) when using Message.withImages() or Message.withImage().

🛟 Troubleshooting

Multimodal Issues:

  • Ensure you're using a multimodal model (Gemma3n E2B/E4B)
  • Set supportImage: true when creating model and chat
  • Check device memory - multimodal models require more RAM

Performance:

  • Use GPU backend for better performance with multimodal models
  • Consider using CPU backend for text-only models on lower-end devices

Memory Issues:

  • iOS: Ensure Runner.entitlements contains memory entitlements (see iOS setup)
  • iOS: Set minimum platform to iOS 16.0 in Podfile
  • Reduce maxTokens if experiencing memory issues
  • Use smaller models (1B-2B parameters) for devices with <6GB RAM
  • Close sessions and models when not needed
  • Monitor token usage with sizeInTokens()

iOS Build Issues:

  • Ensure minimum iOS version is set to 16.0 in Podfile
  • Use static linking: use_frameworks! :linkage => :static
  • Clean and reinstall pods: cd ios && pod install --repo-update
  • Check that all required entitlements are in Runner.entitlements

Advanced Usage

ModelThinkingFilter (Advanced)

For advanced users who need to manually process model responses, the ModelThinkingFilter class provides utilities for cleaning model outputs:

import 'package:flutter_gemma/core/extensions.dart';

// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse,
  ModelType.deepSeek
);

// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - <think>...</think> blocks (DeepSeek)
// - <|channel>thought\n...<channel|> blocks (Gemma 4 E2B/E4B)
// - Extra whitespace and formatting

The chat API applies this cleanup automatically; calling the filter directly is useful mainly for custom inference implementations.
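
To make the cleanup concrete, here is a minimal plain-Dart sketch of the same idea. It is not the plugin's ModelThinkingFilter implementation: stripThinking is a hypothetical helper that handles only two of the token types listed above (<think>...</think> blocks and <end_of_turn> tags).

```dart
/// Hypothetical stand-in for the kind of cleanup described above.
/// This is an illustration, not the plugin's actual filter.
String stripThinking(String raw) {
  // Remove DeepSeek-style <think>...</think> blocks; dotAll lets '.'
  // match newlines, and '.*?' keeps each match as short as possible.
  final withoutThink =
      raw.replaceAll(RegExp(r'<think>.*?</think>', dotAll: true), '');
  // Remove Gemma-style end-of-turn markers.
  final withoutEndOfTurn = withoutThink.replaceAll('<end_of_turn>', '');
  // Drop leading and trailing whitespace left behind by the removals.
  return withoutEndOfTurn.trim();
}

void main() {
  const raw = '<think>\nThe user greets me.\n</think>\nHello!<end_of_turn>';
  print(stripThinking(raw)); // prints "Hello!"
}
```

The non-greedy match matters when a response contains more than one thinking block: a greedy pattern would also delete the visible text between the first opening tag and the last closing tag.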

☕ Support the Project

If you find Flutter Gemma useful and want to support its development, consider buying me a coffee! Your support helps me:

  • 🔧 Maintain and improve the plugin
  • 📚 Keep documentation up-to-date
  • 🐛 Fix bugs and resolve issues faster
  • ✨ Add new features and model support
  • 🧪 Test on more devices and platforms

ko-fi

Every contribution, no matter how small, makes a difference. Thank you for your support! 💙

Libraries

core/api/embedding_installation_builder
core/api/flutter_gemma
core/api/inference_installation_builder
core/chat
core/chat_event
core/di/platform/mobile_service_factory
Mobile-platform service factory. This file is only compiled on iOS/Android platforms. Uses background_downloader for model downloads.
core/di/platform/web_service_factory
Web-platform service factory. This file is only compiled on web platform. Uses dart:js_interop for browser-based downloads.
core/di/service_registry
core/domain/cache_metadata
core/domain/download_error
core/domain/download_exception
core/domain/model_source
core/domain/web_storage_mode
core/extensions
core/function_call_parser
core/handlers/asset_source_handler
core/handlers/bundled_source_handler
core/handlers/file_source_handler
core/handlers/network_source_handler
core/handlers/source_handler
core/handlers/source_handler_registry
core/handlers/web_asset_source_handler
core/handlers/web_asset_source_handler_stub
Stub implementation for non-web platforms. This file is used when dart:js_interop is not available.
core/handlers/web_bundled_source_handler
core/handlers/web_bundled_source_handler_stub
core/handlers/web_file_source_handler
core/handlers/web_file_source_handler_stub
core/handlers/web_network_source_handler
core/handlers/web_network_source_handler_stub
core/image_error_handler
core/image_processor
core/image_tokenizer
core/infrastructure/background_downloader_service
core/infrastructure/blob_url_manager
core/infrastructure/blob_url_manager_stub
core/infrastructure/flutter_asset_loader
core/infrastructure/flutter_asset_loader_stub
Stub implementation for platforms where dart:io is not available (web). This file is used when large_file_handler cannot be imported.
core/infrastructure/in_memory_model_repository
core/infrastructure/platform_file_system_service
core/infrastructure/shared_preferences_model_repository
core/infrastructure/shared_preferences_protected_registry
core/infrastructure/unconfigured_vector_store
core/infrastructure/url_utils
core/infrastructure/web_cache_interop
JavaScript interop for Cache API
core/infrastructure/web_cache_interop_stub
Stub implementation for non-web platforms
core/infrastructure/web_cache_service
Web cache service for persistent model storage
core/infrastructure/web_cache_service_stub
core/infrastructure/web_download_service
core/infrastructure/web_download_service_stub
core/infrastructure/web_file_system_service
core/infrastructure/web_js_interop
core/infrastructure/web_js_interop_stub
core/infrastructure/web_opfs_interop
JavaScript interop for OPFS (Origin Private File System)
core/infrastructure/web_opfs_interop_stub
Stub for OPFS interop on non-web platforms
core/infrastructure/web_opfs_service
Dart service wrapper for OPFS (Origin Private File System)
core/lifecycle/close_notifier
core/message
core/migration/legacy_preferences_migrator
core/model
core/model_management/cancel_token
core/model_management/constants/preferences_keys
core/model_management/managers/web_model_manager
core/model_response
core/multimodal_image_handler
core/parsing/deepseek_function_call_format
core/parsing/function_call_format
core/parsing/function_call_format_factory
core/parsing/function_gemma_format
core/parsing/json_function_call_format
core/parsing/json_parsing_utils
core/parsing/llama_function_call_format
core/parsing/phi_function_call_format
core/parsing/qwen_function_call_format
core/parsing/sdk_passthrough_function_call_format
core/parsing/sdk_response_parser
core/parsing/sdk_text_extractor
core/registry/embedding_backend_provider
core/registry/embedding_registry
core/registry/engine_registry
core/registry/inference_engine_provider
core/registry/runtime_config
core/services/asset_loader
core/services/download_service
core/services/file_system_service
core/services/model_repository
core/services/protected_files_registry
core/services/vector_store_filter
core/services/vector_store_repository
core/tool
core/utils/file_name_utils
core/utils/gemma_log
core/vision_encoder_validator
desktop/flutter_gemma_desktop
desktop/flutter_gemma_desktop_stub
flutter_gemma
flutter_gemma_interface
mobile/flutter_gemma_mobile
mobile/smart_downloader
model_file_manager_interface
pigeon.g
rag/embedding_models
web/flutter_gemma_web
web/web_image_format
web/web_model_source