openSession method

@override

Future<InferenceModelSession> openSession({

double temperature = .8,
int randomSeed = 1,
int topK = 1,
double? topP,
String? loraPath,
bool? enableVisionModality,
bool? enableAudioModality,
String? systemInstruction,
bool enableThinking = false,
List<Tool> tools = const [],

})

override

Opens a new session detached from session. Each call returns a fresh independent session sharing the loaded model weights but with isolated context (history / KV cache). Use this for concurrent dialogues on a single loaded model.

Why: the (expensive) model weights are loaded once and shared across every session; each session only adds its own lightweight context. This lets one loaded model back several independent conversations — e.g. a tabbed chat UI, two different system instructions / roles side by side, or background summarization alongside an active chat — without reloading the weights or clearing+rebuilding a single session's history on every switch. If you only ever have one conversation at a time, use createSession / createChat instead.

Unlike createSession, this does NOT modify the legacy session field. Concurrent sessions are tracked separately and surface via sessions.

Concurrent contexts, serialized inference. The sessions are logically independent — each keeps its own conversation — but generation is serialized: only one session runs inference at a time. Calling getResponse() / getResponseAsync() on a second session while another is still generating blocks until the first finishes; the calls do NOT run in parallel. This is intentional (parallel on-device inference would contend for the accelerator and risk OOM) and is the same on every backend:

.litertlm (FFI, all native): the engine allows one live conversation at a time, so sessions multiplex — the active session's history is replayed into the single conversation on switch.
.litertlm (web, @litert-lm/core): separate conversations, but generation is still serialized.
.task (MediaPipe, Android/iOS): N real LlmInferenceSession live at once (each with its own KV cache), generation serialized by a mutex.

Memory caveat: each concurrent session holds its own context (~100-500 MB depending on model + maxTokens). On mobile with large models (Gemma 4 E2B+), several concurrent sessions can OOM the process. Multi-session is most reliable on desktop and on high-end mobile with small models (Gemma 3 1B / 270M). For larger models on phones the safer pattern is still close+recreate with InferenceChat's built-in history replay. Use maxConcurrentSessions on createModel to cap the count.

Not yet available on the MediaPipe web .task path — throws UnsupportedError there.

Throws StateError if maxConcurrentSessions (set on FlutterGemmaPlugin.createModel) is exceeded — close an existing session before opening a new one.

Implementation

@override
Future<InferenceModelSession> openSession({
  double temperature = .8,
  int randomSeed = 1,
  int topK = 1,
  double? topP,
  String? loraPath,
  bool? enableVisionModality,
  bool? enableAudioModality,
  String? systemInstruction,
  bool enableThinking = false,
  List<Tool> tools = const [],
}) async {
  if (_isClosed) {
    throw StateError(
        'Model is closed. Create a new instance to use it again');
  }
  final cap = maxConcurrentSessions;
  if (cap != null && _openSessions.length >= cap) {
    throw StateError(
      'Max concurrent sessions ($cap) reached. Close an existing session '
      'before opening a new one.',
    );
  }

  final id = _nextSessionId++;
  // MediaPipe holds N real LlmInferenceSession per engine — each call creates
  // an independent native session with its own KV cache. No singleton
  // overwrite; generation is serialized via [_generationMutex] on the
  // session objects, not here.
  await _platformService.createSessionForId(
    sessionId: id,
    randomSeed: randomSeed,
    temperature: temperature,
    topK: topK,
    topP: topP,
    loraPath: loraPath,
    enableVisionModality: enableVisionModality ?? supportImage,
    enableAudioModality: enableAudioModality ?? supportAudio,
    systemInstruction: systemInstruction,
    enableThinking: enableThinking,
  );

  late final MultiSessionMobileInferenceModelSession session;
  session = MultiSessionMobileInferenceModelSession(
    sessionId: id,
    modelType: modelType,
    fileType: fileType,
    supportImage: enableVisionModality ?? supportImage,
    supportAudio: enableAudioModality ?? supportAudio,
    systemInstruction: systemInstruction,
    generationMutex: _generationMutex,
    onClose: () => _openSessions.remove(session),
  );
  _openSessions.add(session);
  return session;
}

openSession method

Implementation

MobileInferenceModel class