createModel abstract method
Creates and returns a new InferenceModel instance.
modelType — model type to create.
fileType — format of the model file (defaults to ModelFileType.task).
maxTokens — the model's CONTEXT WINDOW: the total number of tokens
shared by input (system prompt + history + current message) AND the
generated output, i.e. the KV-cache budget. It is NOT the maximum
response length — to cap how much is generated, use maxOutputTokens
on InferenceModel.createSession. .litertlm models require a context
window of at least 1024 (their baked kv_cache_max_len); a smaller
value is clamped up to 1024 to avoid a native tensor-allocation crash
(#318). The default (1024) is safe for every supported model.
preferredBackend — backend preference (e.g., CPU, GPU).
loraRanks — optional list of supported LoRA ranks.
maxNumImages — maximum number of images (for multimodal models).
supportImage — whether the model supports images.
supportAudio — whether the model supports audio (Gemma 3n E4B only).
enableSpeculativeDecoding — Multi-Token Prediction toggle for Gemma 4
E2B/E4B (LiteRT-LM v0.11.0+). null honors the model's default;
true/false forces on/off. Older .litertlm files without an MTP
drafter ignore this flag at the SDK level.
maxConcurrentSessions — optional cap on the number of sessions open
at once via InferenceModel.openSession. null (the default) means no
cap, which preserves the previous behavior. When a cap is set, the
(cap+1)-th call to InferenceModel.openSession throws StateError. Use
this on mobile with large models to guard against out-of-memory errors
from multiple concurrent KV caches.
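The cap described for maxConcurrentSessions can be pictured with a small counter. The sketch below is illustrative only: SessionLimiter, open, close, and openCount are hypothetical names invented here, not part of the plugin, and the real bookkeeping inside InferenceModel.openSession may differ.

```dart
// Hypothetical stand-in for session bookkeeping; not the plugin's own code.
class SessionLimiter {
  SessionLimiter({this.maxConcurrentSessions});

  /// null means no cap, mirroring the documented default.
  final int? maxConcurrentSessions;
  int _open = 0;

  int get openCount => _open;

  /// Registers a new session, or throws StateError once the cap is reached.
  void open() {
    final cap = maxConcurrentSessions;
    if (cap != null && _open >= cap) {
      throw StateError('Session cap of $cap reached; close a session first.');
    }
    _open++;
  }

  /// Releases one slot; does nothing when no session is open.
  void close() {
    if (_open > 0) _open--;
  }
}

void main() {
  final limiter = SessionLimiter(maxConcurrentSessions: 2);
  limiter.open();
  limiter.open();
  try {
    limiter.open(); // third open with a cap of 2
  } on StateError catch (e) {
    print('Rejected: ${e.message}');
  }
  limiter.close();
  limiter.open(); // succeeds again once a slot is freed
  print('Open sessions: ${limiter.openCount}');
}
```

Under this reading, callers that hit the cap would close an existing session before opening another one.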
Implementation
Future<InferenceModel> createModel({
  required ModelType modelType,
  ModelFileType fileType = ModelFileType.task,
  int maxTokens = 1024, // Context window (input + output), not the response cap
  PreferredBackend? preferredBackend,
  List<int>? loraRanks,
  int? maxNumImages, // Maximum number of images (multimodal models)
  bool supportImage = false, // Whether the model supports images
  bool supportAudio = false, // Whether the model supports audio (Gemma 3n E4B)
  bool? enableSpeculativeDecoding,
  int? maxConcurrentSessions,
});
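Because maxTokens is the whole context window and not a response cap, the room left for generation shrinks as the input grows. The following sketch works through that arithmetic and the documented 1024 floor for .litertlm models. The helper names effectiveContextWindow and remainingOutputBudget are hypothetical, and leaving other file types unclamped is an assumption drawn from the description above, not from the plugin's source.

```dart
import 'dart:math' as math;

/// Minimum context window the description states for .litertlm models.
const int litertlmMinContext = 1024;

/// Hypothetical helper: the context window after the documented clamp.
/// Assumption: other file types pass through unchanged.
int effectiveContextWindow(int maxTokens, {required bool isLitertlm}) =>
    isLitertlm ? math.max(maxTokens, litertlmMinContext) : maxTokens;

/// Hypothetical helper: tokens left for generation once the input
/// (system prompt + history + current message) has been counted.
int remainingOutputBudget(int contextWindow, int inputTokens) =>
    math.max(0, contextWindow - inputTokens);

void main() {
  // A request for 512 on a .litertlm model is raised to 1024.
  final window = effectiveContextWindow(512, isLitertlm: true);
  print(window); // 1024

  // With 700 input tokens, 324 tokens remain for the response.
  print(remainingOutputBudget(window, 700)); // 324
}
```

To bound the response length itself, the description points to maxOutputTokens on InferenceModel.createSession; the budget computed here is only the upper limit the context window allows.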