llamadart 0.7.0
llamadart: ^0.7.0 copied to clipboard
Dart and Flutter local LLM inference with llama.cpp GGUF and LiteRT-LM across native platforms and web.
0.7.0 #
- LiteRT-LM backend and runtime selection:
- Added first-class
.litertlmrouting throughLlamaBackend()on native and web targets, with native bundle downloads fromleehack/litert-lm-nativeand web loading through@litert-lm/core. - Added
ModelParams.liteRtLmBackendso callers can select LiteRT-LM CPU, GPU, or Android NPU execution where supported.autochooses GPU on Android/macOS and CPU elsewhere on native targets. - Added cached Hugging Face
.litertlmloading throughloadModelSource(...), preserving the selected LiteRT-LM backend after the cache manager resolves the local file. - Added native LiteRT-LM tokenization, detokenization, log-level control,
runtime metrics, and high-level
ChatSessiontoken counting support. - Added
hooks.user_defines.llamadart.llamadart_native_tag,llamadart_native_repository, andllamadart_native_pathso apps can test a different compatible native runtime source without patchingllamadart. - Updated the Windows runtime fallback scanner to discover custom GitHub and
local archive cache namespaces when
.dart_tool/libis unavailable.
- Added first-class
- LiteRT-LM chat, templates, and generation quality:
- Added
GenerationParams.speculativeDecodingfor native LiteRT-LM. The default remains disabled; llama.cpp, WebGPU, and LiteRT-LM web reject the option until their speculative paths are implemented. - Fixed Gemma 4
.litertlmthinking and tool calling by replacing the stub template with the canonical Gemma 4 chat template, parsing the runtime thought channel as reasoning, and suppressing reasoning deltas when callers setenableThinking: false. - Added a filename-keyed
.litertlmchat-template registry seeded with Gemma 4/3/3n and Qwen 2.5/3. PassModelParams.chatTemplateto override detection for other models. - Fixed LiteRT-LM tool calling for grammar-using handlers by forwarding
supportsGrammarConstraintsfrom the activeNativeAutoBackenddelegate. - Stopped structured tool-call streams from leaking raw Hermes/Qwen JSON or
Gemma
<|tool_call>markers as assistant content before the finaltool_callschunk.
- Added
- Web and chat-app support:
- Added
ModelParams.preferMemory64andModelParams.modelBytesHintso large WebGPU GGUF models such as Gemma 4 E2B can choose the 64-bit bridge core before hitting the wasm32 address-space limit. - Fixed web
.litertlmchat-app turns by swallowing unsupported token-count refreshes, avoiding unsupportedminP/penaltyparameters for LiteRT-LM web generation, and replacing the stuck "Loading model 0%" label with an indeterminate load message. - Halved web
.litertlmload time by skipping WebGPUCacheStorageprefetch for LiteRT-LM models, which are fetched directly by@litert-lm/core. - Fixed web GGUF downloads reporting success before the bridge was ready by
awaiting
window.__llamadartBridgeReadyPromise, requiring the bridge prefetch API, and surfacing actionable errors for old bridge assets. - Allowed benign Hugging Face
?download=trueURLs to be prefetched into the browser cache while still skipping credentialed or signed URLs.
- Added
- Lifecycle, cancellation, and native stability:
- Fixed iOS
.litertlmloading by resolving embeddedLiteRtLmandStreamProxyframeworks from the app bundle, matching the macOS runtime path behavior. - Improved LiteRT-LM diagnostics before model load, including selected CPU/GPU/NPU backend reporting, platform availability errors, and complete dynamic-library candidate failures.
- Validated platform-specific LiteRT-LM companion libraries during native-asset setup so incomplete runtime bundles fail at build time.
- Hardened native and LiteRT-LM cancellation/disposal so in-flight generation no longer races token release, worker teardown, engine deletion, closed response ports, or stream writes after cancellation.
- Freed multimodal prompt buffers on tokenize/eval error paths, serialized multimodal projector load/unload, and closed a leaked native-backend handshake reply port.
- Fixed iOS
- Correctness and download resilience:
ChatSessionnow forwards empty-choices completion chunks instead of throwing, strips multiple<think>blocks, and trims history only on user-message turn boundaries.LlamaEngine.generatewraps unexpected backend errors inLlamaInferenceExceptionso callers catchingLlamaExceptionsee the documented error type.- Tool-call parsing now uses stable fallback ids, tolerates code-fence language tokens without trailing delimiters, and keeps commas inside quoted argument values.
- JSON-schema-to-GBNF conversion now resolves
$refs nested inside other$reftargets and fails loudly on unresolvable or external$refs. - Array grammar generation validates
minItems/maxItems, model downloads use connection and idle-read timeouts, and partial-download resume is restricted to files with stored validators.
- Benchmarks, docs, and validation:
- Added fair Gemma 4 LiteRT-LM versus llama.cpp/GGUF benchmark tooling for Android, macOS, and web, with speculative-decoding metrics, Pixel benchmark failure detection, and target-specific timeouts.
- Added
tool/gguf_chat_features_smoke.dartand thechat-app-web-gemma4-webgpu-smokeE2E scenario for real-model parser and WebGPU mem64 validation. - Updated README, website docs, and
doc/litert_lm_templates.mdfor backend selection, platform/runtime support, package-size controls, benchmark results, model templates, and current LiteRT-LM capability limits.
- Compatibility note: no public API breaking changes for existing GGUF /
llama.cpp callers. LiteRT-LM support is additive, with deprecated benchmark
wrappers retained for compatibility; unsupported llama.cpp-only parameters are
rejected for
.litertlmloads instead of being silently ignored.
0.6.17 #
- Native runtime sync:
- Updated native hook pinning and regenerated bindings through
leehack/llamadart-native@b9371, picking up llama.cppb9371. - Picked up the Apple mobile Metal stability fix that disables Metal
residency sets on iOS/tvOS/visionOS native bundles, avoiding affected
device context-creation failures such as
MTLLibraryErrorDomain Code=3.
- Updated native hook pinning and regenerated bindings through
- Compatibility note: no public API breaking changes in
0.6.17; existing0.6.16callers remain compatible. The release only refreshes the pinned native runtime and generated low-level bindings.
0.6.16 #
- Native runtime diagnostics:
- Fixed native
getVramInfo()so it reports free/total VRAM from llama.cpp GPU-class backend devices when available, using props-based memory reporting first and the legacy memory probe as a fallback. - Routed native VRAM probing through the ggml registry fallback path so Windows split bundles resolve backend-device symbols from the runtime that owns the device registry.
- Fixed native
- WebGPU and chat app fixes:
- Improved browser recovery for large remote WebGPU model/projector loads by retrying wasm32 model-staging aborts with the wasm64 core before surfacing memory-pressure failures.
- Improved the runnable chat app's web remote-model startup path so model
assets are prefetched into browser cache when available, browser
CacheStoragefailures fall back to direct network loading, and credentialed/signed model URLs skip persistent browser cache storage.
- Model download UX:
- Improved the runnable chat app's mobile download behavior so lifecycle pauses no longer deliberately cancel active foreground downloads; the app now lets short screen-lock/background interruptions continue when the OS permits and still keeps explicit pause/dispose cancellation paths.
- Added in-app and docs guidance for mobile large-model downloads, including resumable partial files, foreground Dart lifecycle limits, and the need for opt-in native background download/model-store integrations for robust cross-app GGUF management.
- Compatibility note: no public API breaking changes in
0.6.16; existing0.6.15callers remain compatible. The changes improve native VRAM diagnostics, WebGPU browser recovery, and chat app download lifecycle behavior.
0.6.15 #
- Fixes:
- Fixed GLM-OCR and other multimodal chat-template workarounds so image and audio content parts are preserved when tool-call normalization runs, system prompts are merged before leading media parts, and invalid tool-call serialization fails loudly instead of silently falling back to the wrong template shape.
- Testing:
- Added
tool/testing/run_local_e2e.dartas a discovery and orchestration entry point for heavyweight local-only Dart E2E, Flutter device, and Web/Playwright smoke scenarios. - Hardened the upstream llama.cpp chat/template E2E runner against current
llama.cpp target renames, dynamic backend library lookup, and full
test-chatserver/mtmd build requirements. - Documented that real-model/device/WebGPU scenarios remain skipped from
default CI and should be opted into explicitly with
--listand--dry-runfirst.
- Added
- Compatibility note: no public API breaking changes in
0.6.15; existing0.6.14callers remain compatible. The chat-template changes fix multimodal serialization behavior for affected templates, and the local E2E runner is additive.
0.6.14 #
- WebGPU bridge assets:
- Updated the default WebGPU bridge asset pin to
leehack/llama-web-bridge-assets@v0.1.16(llama.cppb9165), picking up the published JS bridge build, TypeScript declaration asset, and refreshed bridge docs.
- Updated the default WebGPU bridge asset pin to
- Docs:
- Added WebGPU readiness guidance covering browser capability checks, cross-origin isolation, bridge asset/version diagnostics, fallback behavior, model/configuration pressure, and the Flutter Web real-model smoke path.
- Model download UX:
- Added
ModelDownloadController, a dependency-free helper that turnsModelDownloadManagercache/download work into app-facing lifecycle states for resolving, cache checks, downloads, verification, ready, failed, cancelled, and retry flows. - Wired the runnable chat app example through a
ModelDownloadManageradapter so its model-management UI demonstrates the controller while preserving the example's multi-asset and web-cache service behavior.
- Added
- Compatibility note: no public API breaking changes in
0.6.14; the WebGPU bridge asset update andModelDownloadControllerare additive, and existing0.6.13callers remain compatible.
0.6.13 #
- Model source download/cache manager:
- Added
ModelSourcefor local paths, HTTP(S) URLs, and Hugging Facehf://owner/repo/path/to/model.ggufreferences, including deterministic cache keys and redacted metadata/log identities for signed URLs. - Added
ModelLoadOptions,ModelCachePolicy, resolver targets, and download/cache metadata/progress value models for package-managed model download and cache management. - Added native/file-backed
DefaultModelDownloadManagersupport for streaming HTTP downloads,.partfiles, atomic promotion, persisted metadata, authenticated bearer/custom headers, cancellation, retry, Range resume, cache hit/refresh/cache-only/no-cache policies, SHA-256 verification, cache listing, removal, clearing, and age/size pruning. - Improved Hugging Face source ergonomics:
hf://references now accept?revision=...for branch/ref names containing slashes, and docs clarify current single-file behavior, private/gated bearer-token usage, separatemmprojasset handling, sharded-GGUF limitations, and redaction guarantees. - Serialized concurrent stable-cache downloads for the same remote cache entry
across manager instances so duplicate callers do not race on shared
.partfiles or metadata, while distinct cache entries can still download in parallel and waiting-caller cancellation does not cancel the active download. - Hardened versioned cache metadata recovery: completed files can rebuild missing, malformed, or unsupported-schema sidecars without network access, while byte-count and stored/caller SHA-256 mismatches are treated as cache misses and safely re-downloaded.
- Clarified
ModelSource.path(...)option semantics: local paths now reject remote/download-only options (non-default cache policies, cache directories, authenticated headers, resume, and retry overrides) while continuing to support cancellation and optional local SHA-256 verification. - Added
LlamaEngine.loadModelSource(...)to route local sources through the existing native local loader, remote sources through the native download cache before local loading, and simple remote sources through URL-capable web backends when available. - Migrated server/testing helpers away from ad-hoc model downloads so examples dogfood the package-managed cache manager.
- Added
- State persistence API:
- Added
LlamaEngine.supportsStatePersistence,LlamaEngine.stateSaveFile(...), andLlamaEngine.stateLoadFile(...)so callers can persist and restore llama.cpp KV-cache state for fast raw-prompt resume/fork workflows. - Added
BackendStatePersistence,BackendStatePersistenceSupport, andStateLoadResultfor custom backend implementers and diagnostics. - Documented that state files are opaque llama.cpp artifacts tied to the same
model and runtime/build, that native paths use the app filesystem while web
paths use the bridge WASMFS virtual filesystem, and that
ChatSessionmessage history must be persisted separately. - Added WebGPU bridge state persistence wiring for bridge assets
v0.1.15+, including Dart JS interop, backend forwarding, and browser integration test coverage.
- Added
- Compatibility note: no public API breaking changes in
0.6.13; existingloadModel(...)callers are unchanged. Code that probes state persistence support should preferLlamaEngine.supportsStatePersistenceover structural backend type checks so web/router backends can report bridge-version-dependent support accurately.
0.6.12 #
- Native runtime sync:
- Updated native hook pinning to
leehack/llamadart-native@b9016, picking up the CUDA 12.8 Blackwell-capable native bundles. - Updated default web bridge asset pinning to
leehack/llama-web-bridge-assets@v0.1.14(llama.cppb9016) so native and web runtimes track the same upstream revision. - Picked up the bridge-side Qwen UTF-8 streaming stabilization and multimodal fallback narrowing, while preserving control-token output for parser consumers.
- Picked up the bridge-side BERT embedding thread-pool sizing fix so automatic thread selection does not exceed the compiled WebAssembly pthread pool.
- Updated native hook pinning to
- Load-time tuning knobs:
- Added
ModelParams.useMmap(defaulttrue) andModelParams.useMlock(defaultfalse), wired tollama_model_params.use_mmap/use_mlock. Lets callers turn off mmap for platforms where memory-mapped weights hurt throughput, or pin weights in RAM to avoid first-token paging spikes. - Added
ModelParams.flashAttentionwith theFlashAttention.{auto, enabled, disabled}enum, wired tollama_context_params.flash_attn_type. Explicit settings win over the existing automatic Android/Vulkan heuristics;autopreserves prior behavior. - Added
ModelParams.cacheTypeKandModelParams.cacheTypeVwith theKvCacheType.{f16, q8_0, q4_0}enum, wired tollama_context_params.type_k/type_v. Enables KV-cache quantization (Q8_0 ≈ halves KV memory; Q4_0 ≈ quarters it). When the user requests a non-F16 KV type withflashAttention: auto, the service auto-promotes flash attention to enabled — llama.cpp requires it for KV quantization. - Added
ModelParams.kvUnified(nullable) for explicit override ofllama_context_params.kv_unified.nullkeeps the existing auto-enable-when-multi-sequence behavior. - Added
ModelParams.ropeFrequencyBaseandModelParams.ropeFrequencyScale(both nullable) for context-extension overrides onllama_context_params.rope_freq_base/rope_freq_scale.nullkeeps the model's trained values. - Forwarded native-compatible
ModelParamsload tuning knobs through the WebGPU bridge path, includingmaxParallelSequences, flash attention, KV-cache type, KV-unified, RoPE, split-mode, and main-GPU options. - Matched native batch defaults on the WebGPU path so unset
batchSize/microBatchSizecascade ton_batch = n_ctxandn_ubatch = n_batch, avoiding first-embedding aborts for BERT-class/non-causal encoder models while preserving explicit caller values and Qwen3.5 web tuning.
- Added
- GPU device selection API:
- Added
ModelParams.mainGpuand wired it to llama.cppllama_model_params.main_gpu. - Added
ModelParams.splitModeand wired it to llama.cppllama_model_params.split_mode, enabling explicit single-GPU selection withModelSplitMode.none.
- Added
- Windows split-bundle loader fix:
- Resolved ggml backend registry/device APIs from the loaded ggml runtime DLL when the generated default FFI asset cannot see those symbols, restoring explicit Vulkan device selection in Windows split bundles.
- Native packaging size fix:
- Filtered backend-owned runtime dependencies during native asset bundling so CUDA runtime DLLs and OpenBLAS runtime libraries are emitted only when their owning backend module is selected.
- Kept unknown non-core runtime libraries bundled for compatibility with future native bundle layouts.
- Compatibility note: no public API breaking changes in
0.6.12.
0.6.11 #
- Native runtime syncs:
- Updated native hook pinning and regenerated bindings through
leehack/llamadart-native@b8955.
- Updated native hook pinning and regenerated bindings through
- Gemma 4 streaming fix:
- Parsed streamed
<|channel>thought ... <channel|>blocks into thinking deltas instead of leaking Gemma 4 thought markers into content output. - Added engine coverage for Gemma 4 thought-channel chunks split across native stream boundaries.
- Parsed streamed
- Release stability:
- Tracked the chat app lockfile so generated Flutter plugin metadata stays stable in CI and release validation.
- Compatibility note: no public API breaking changes in
0.6.11.
0.6.10 #
- Native runtime syncs:
- Updated native hook pinning and regenerated bindings through
leehack/llamadart-native@b8638.
- Updated native hook pinning and regenerated bindings through
- Multimodal context-safety hardening:
- Converted native multimodal prompt-evaluation overflow paths into Dart exceptions instead of allowing downstream sampling asserts.
- Downscaled staged chat-app image picks to a
384pxmax edge across Android, iOS, macOS, and Web to reduce multimodal context pressure. - Added a local-only macOS Qwen3.5 multimodal repro harness plus CI-safe provider coverage for the new overflow guidance.
- Gemma 4 template support and multimodal capability gating:
- Added built-in Gemma 4 template detection, rendering, and parsing support, including thinking and tool-call handling.
- Added runtime projector capability checks so multimodal flows and the chat app gate image/audio input against
supportsVision/supportsAudioinstead of model-family assumptions. - Documented current Gemma 4 projector behavior in the docs site and chat app guidance.
- Compatibility note: no public API breaking changes in
0.6.10.
0.6.9 #
- iOS deployment target alignment:
- Documented that iOS builds require a minimum deployment target of
16.4or newer across the README, docs site, and example docs. - Updated
example/chat_appiOS Podfile and Runner project settings to use deployment target16.4.
- Documented that iOS builds require a minimum deployment target of
- Android backend safety:
- Honored
ggml_backend_scoreduring asset-based backend fallback so unsupported Android CPU variant libraries are skipped before initialization. - Changed Android
autobackend resolution to prefer CPU by default while keeping Vulkan available for explicit opt-in. - Clarified that changing
hooks.user_definesrequiresflutter clean && flutter pub getbefore rebuilding.
- Honored
- Compatibility note: no public API breaking changes in
0.6.9.
0.6.8 #
- Native runtime sync:
- Updated native hook pinning and regenerated bindings to
leehack/llamadart-native@b8480. - Refreshed generated low-level FFI bindings to match the synced upstream headers.
- Updated native hook pinning and regenerated bindings to
- Compatibility note: no public API breaking changes in
0.6.8.
0.6.7 #
- Native runtime sync and Linux loader hardening:
- Updated native hook pinning and regenerated bindings to
leehack/llamadart-native@b8373. - Hardened Linux bundle loading for packaged apps and accepted versioned
libllamadartmappings so colocated native dependencies resolve more reliably at runtime.
- Updated native hook pinning and regenerated bindings to
- Hermes tool-call parsing fix:
- Fixed Hermes handler parsing when whitespace appears between
<tool_call>and the JSON payload.
- Fixed Hermes handler parsing when whitespace appears between
- Compatibility note: no public API breaking changes in
0.6.7.
0.6.6 #
- Runtime syncs:
- Updated native hook pinning to
leehack/llamadart-native@b8216. - Updated default web bridge asset pinning to
leehack/llama-web-bridge-assets@v0.1.10(llama.cppb8216).
- Updated native hook pinning to
- Qwen3.5 runtime stabilization (Android + Web):
- Switched bundled Qwen3.5 presets to Unsloth
Q4_K_MGGUFs across the example catalog and tooling. - Added Android-native perf diagnostics chips (
p_eval,eval,sample,reuse) backed by llama.cpp context timings with manual timing fallback when built-in counters report zero. - Restored a targeted Android Vulkan fast path for local Qwen3.5
0.8B/2B/4Bmodels by re-enabling KQV/op-offload/flash-attention where stable. - Updated Android chat app defaults to prefer CPU for Qwen3.5
0.8Band2B, and reduced Android0.8Bcontext to2048for lower first-token latency. - Hardened Android multimodal handling by downscaling staged images in the chat app and forcing Qwen3.5
0.8Bprojector work onto CPU on Android. - Fixed WebGPU Qwen prompt/control-token handling and committed companion bridge-side streaming/multimodal fixes required by the local chat app runtime.
- Switched bundled Qwen3.5 presets to Unsloth
- Compatibility note: no public API breaking changes in
0.6.6.
0.6.5 #
- Embedding API (native backend capability):
- Added
LlamaEngine.embed(...)andLlamaEngine.embedBatch(...)for direct vector generation. - Added optional backend capability interface
BackendEmbeddingsfor custom backend implementers. - Added optional backend batch capability
BackendBatchEmbeddingsand worker-side batch embedding request/response path to reduce isolate round-trip overhead inembedBatch(...). - Added
ModelParams.maxParallelSequences(n_seq_max) so contexts can reserve multiple sequence slots for true multi-sequence embedding batches. - Wired native isolate/worker/service embedding flow to llama.cpp embedding outputs with optional L2 normalization.
- Added embedding-focused tests for engine behavior and worker message contracts.
- Added
- Examples/docs:
- Added
example/basic_app/bin/llamadart_embedding_example.dart. - Added
example/basic_app/bin/llamadart_sqlite_vector_example.dartfor local embedding retrieval with SQLite vector search. - Updated example docs and top-level README with embedding usage snippets.
- Added
tool/testing/native_embedding_benchmark.dartto compare sequential embedding calls vsembedBatch(...)throughput (with optional--json-out). - Added
tool/testing/native_embedding_sweep.dartto run max-seq sweeps and dump CSV speedup reports for plotting.
- Added
- Web bridge sync:
- Added WebGPU bridge embedding APIs and wired web backend support for
LlamaEngine.embed(...)/embedBatch(...). - Updated default web bridge asset pinning to
leehack/llama-web-bridge-assets@v0.1.8. - Validated the
v0.1.8bridge bundle through local fetch-script checksum verification.
- Added WebGPU bridge embedding APIs and wired web backend support for
- WebGPU runtime tuning + multimodal stability (chat app/web):
- Reduced bridge log noise and improved runtime profile diagnostics for web sessions.
- Stabilized multimodal backend switching using resolved runtime mode behavior and added an E2E regression gate.
- Tuned streaming/typewriter pacing and token callback overhead to improve incremental render smoothness.
- Added GPU-path multimodal image-size capping to reduce runtime pressure on large image inputs.
- Chat app model catalog + stability:
- Updated
example/chat_apprecommended Qwen presets to the Qwen3.5 lineup (0.8B,2B,4B,9B) and removed older Qwen2.5/Qwen3 defaults from the in-app library. - Added multimodal projector (
mmproj) wiring for Qwen3.5 model cards and tuned safer multimodal defaults (contextSize: 8192,maxTokens: 1024). - Fixed Flutter text paint crashes caused by malformed UTF-16 streaming boundaries by aligning incremental reveal to surrogate-pair boundaries and sanitizing text/tool payload rendering paths.
- Added sanitizer unit coverage and refreshed chat-app README architecture/troubleshooting sections for multimodal and UTF-16 guidance.
- Updated
- Compatibility note: no public API breaking changes in
0.6.5.
0.6.4 #
-
Multimodal projector offload alignment:
- Updated native multimodal projector initialization to follow effective model-load configuration.
- CPU-only model settings (
preferredBackend: cpuorgpuLayers: 0) now also disable mmproj GPU offload.
-
Package metadata cleanup:
- Removed unused Flutter-only constraints/dependencies from the root
pubspec.yaml(environment.flutter,flutter,path_provider,json_rpc_2,integration_test) to keep the core package pure Dart. - Kept Flutter-specific dependencies scoped to Flutter example apps.
- Removed unused Flutter-only constraints/dependencies from the root
-
Backend selection safety and status accuracy:
- Added strict CPU-mode behavior in native backend preparation so
preferredBackend: cpuno longer initializes optional GPU backends during startup/model load probing. - Disabled context-time GPU offload knobs (
offload_kqv,op_offload, flash-attention auto path) when effective GPU layers resolve to zero, preventing GPU allocation attempts during context creation in CPU mode. - Added
ModelParams.batchSize(n_batch) andModelParams.microBatchSize(n_ubatch) so context batch sizing can be tuned independently fromcontextSizewhile preserving legacy defaults. - Split backend reporting into two semantics: selectable backend options (
getAvailableBackends) vs active runtime backend (getBackendName). - Added optional
BackendAvailabilitycapability andLlamaEngine.getAvailableBackends()to support safe settings UIs without forcing GPU initialization. - Added optional
BackendRuntimeDiagnosticscapability andLlamaEngine.getResolvedGpuLayers()to expose resolved native load-time layer count for runtime diagnostics. - Updated
example/chat_appto populate backend selector options from safe availability discovery while keeping active-backend status bound to effective runtime backend. - Improved native auto/explicit backend status resolution to avoid false CPU labeling on Apple consolidated runtimes and false GPU labeling when explicit backend falls back.
- Added strict CPU-mode behavior in native backend preparation so
-
Web model cache + large-model UX improvements (chat app):
- Updated web Download flow to prefetch model/mmproj bytes into browser Cache Storage with live progress and cancellation support.
- Added best-effort cache eviction for web model delete actions.
- Added large-model web load fallback to fetch-backed worker runtime path (bridge) to reduce contiguous
ArrayBufferpressure. - Added dedicated web bridge worker entry wiring and worker fallback diagnostics to improve worker startup reliability.
- Reduced synthetic load-progress dominance so bridge/network progress appears earlier during web model load.
- Added warning-only UI guidance for very large web models that may exceed browser memory limits at load time.
-
Web model-load resilience:
- Updated
WebGpuLlamaBackendto retry web model loads with reduced context sizes (and CPU fallback as last attempt) when bridge errors indicate browser memory pressure. - Added bridge config plumbing for optional wasm64 core assets (
llama_webgpu_core_mem64) with automatic fallback to wasm32 when unsupported. - Added explicit runtime diagnostics and error normalization for worker-thread and cross-origin-isolation requirements in large web model load flows.
- Updated default bridge asset pinning in chat app/docs/fetch script to
leehack/llama-web-bridge-assets@v0.1.5. - Updated HF static chat-app deploy workflow to emit COI
custom_headersin generated Space README frontmatter.
- Updated
-
Android arm64 CPU variant policy and loader hardening:
- Updated native hook tag pin from
b8138tob8157to consume Android arm64 CPU-variant runtime bundles. - Added Android arm64 CPU policy keys in hook config:
cpu_profile(fulldefault,compact) and advancedcpu_variantsoverride. - Added hook tests and Android hook integration coverage to verify pubspec-driven CPU variant packaging behavior.
- Hardened Android runtime backend loading to resolve CPU variant modules even when backend module directory discovery is unavailable.
- Added Android runtime smoke helper (
scripts/android_runtime_smoke.sh) and smoke-plan docs for device verification. - Compatibility note: no public API breaking changes.
android-arm64now defaults tocpu_profile: full, which may increase package size compared with baseline-only CPU packaging.
- Updated native hook tag pin from
0.6.3 #
- Native runtime sync (llama.cpp b8138):
- Synced bundled native runtime/assets and regenerated bindings from
b8099tob8138. - Pulled in Android arm64 ISA compatibility hardening (including STLUR guard changes) to prevent launch-time crashes on older devices.
- Synced bundled native runtime/assets and regenerated bindings from
- Example app performance and UX polish:
- Reduced settings-write overhead during frequent parameter adjustments.
- Improved model manager responsiveness during download progress updates.
- Smoothed chat streaming auto-follow and rendering to reduce unnecessary UI work.
- Web model handling improvements:
- Updated web "Download" behavior to verify remote model/mmproj availability without pre-buffering large GGUF payloads in app memory.
- Clarified that web cache population occurs when a model is first loaded.
- Stability and quality:
- Added safe fallback handling for invalid persisted log-level settings.
- Added regression tests for persisted settings fallback behavior.
- New example app:
- Added
example/tui_coding_agent, anocterm-based terminal coding agent with tool-calling loop, workspace-scoped file/command tools, and runtime model switching. - Default model source is GLM 4.7 Flash (
unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL) with support for custom local paths/URLs/Hugging Face shorthand. - Added stable text-protocol tool mode as the default (native template grammar tool-calling remains available via
--native-tool-callingfor experimentation).
- Added
0.6.2 #
- Native inference performance improvements:
- Reduced request overhead by caching model metadata and skipping
unnecessary prompt token counting in
create(...). - Improved native stream throughput with worker-side token chunk batching
and configurable thresholds (
streamBatchTokenThreshold,streamBatchByteThreshold). - Added prompt-prefix reuse for native text generation
(
reusePromptPrefix, enabled by default) with conservative full-replay fallback to preserve deterministic parity. - Optimized
ChatSessioncontext trimming using bounded turn-offset search to avoid repeated linear recount loops on long histories.
- Reduced request overhead by caching model metadata and skipping
unnecessary prompt token counting in
- Benchmarking and parity tooling:
- Added
tool/testing/native_inference_benchmark.dartfor TTFT, throughput, and latency measurement with tunable generation settings. - Added
tool/testing/native_prompt_reuse_parity.dartand curated prompt sets for deterministic prompt-reuse parity validation. - Added CI prompt-reuse parity checks to catch native reuse regressions.
- Added
0.6.1 #
-
Publishing compatibility fix:
- Moved hook backend-config support code out of
hook/src/intolib/src/hook/because pub.dev currently only allowshook/build.dartunder hook files. - Updated hook/test imports accordingly to keep native-assets backend selection behavior unchanged.
- Moved hook backend-config support code out of
-
llama.cpp parity expansion (Dart-native template/parser pipeline):
- Reworked template detection/render/parse routing to align with llama.cpp semantics across supported chat formats, including format-specific tool-call parsing and fallback behavior.
- Added PEG parity components in Dart (
peg_parser_builder,peg_chat_parser) and integrated parser-carrying render/parse flow for PEG-native/constructed formats. - Removed brittle fallback coercions that could mutate valid tool names/argument keys, preserving model-emitted tool payloads for dispatch parity.
- Hardened template capability detection with Jinja AST + execution probing, while preventing typed-content false positives caused by raw content stringification.
- [BREAKING] Removed legacy custom template-handler APIs:
ChatTemplateMatcher,ChatTemplateRoutingContext,ChatTemplateEngine.registerHandler(...),ChatTemplateEngine.unregisterHandler(...),ChatTemplateEngine.clearCustomHandlers(...),ChatTemplateEngine.registerTemplateOverride(...),ChatTemplateEngine.unregisterTemplateOverride(...),ChatTemplateEngine.clearTemplateOverrides(...), and per-callcustomHandlerId/ parsehandlerIdrouting. - Removed silent render/parse fallback paths so handler/parser failures are surfaced instead of downgraded to content-only output.
- Added llama.cpp-equivalent per-call template globals/time injection via
chatTemplateKwargsandtemplateNow.
-
Parity test coverage and tooling:
- Added vendored llama.cpp template parity integration coverage for detection + render + parse paths.
- Added upstream llama.cpp chat/template suite runners and local E2E harness (
run_llama_cpp_chat_tests.sh,run_template_parity_suites.sh). - Added mirrored unit tests for new internal template components (
peg_parser_builder,template_internal_metadata) to satisfy structure guards.
-
Test cleanup and maintainability:
- Reduced noisy diagnostics in template integration tests and centralized format sample parse payload fixtures for easier parity maintenance.
-
Native integration cleanup (llamadart-native migration):
- Added
tool/testing/prepare_llama_cpp_source.shto fetch/refreshggml-org/llama.cppinto.dart_tool/llama_cpp(orLLAMA_CPP_SOURCE_DIR) pinned to a resolved ref (LLAMA_CPP_REF, defaultlatestrelease tag). - Updated
tool/testing/run_llama_cpp_chat_tests.shto use prepared.dart_toolsource instead ofthird_party/llama_cpp, so local upstream chat-suite runs no longer depend on vendored source. - Updated template parity tests to resolve fixtures from
LLAMA_CPP_TEMPLATES_DIRor.dart_tool/llama_cpp/models/templatesinstead ofthird_party/llama_cpp. - Clarified README backend matrix notes:
KleidiAI/ZenDNNare CPU-path optimizations, not selectable runtime backend modules. - Runtime backend probing for split-module bundles now runs during backend initialization (not only after first model load), so device/backend availability is visible earlier in app flows.
- Native-assets hook output now refreshes emitted native files per build to prevent stale backend module carryover when backend config changes.
- Added
-
Linux runtime/link validation and backend loader hardening:
- Hardened split-module backend loading to avoid probing backends that are not bundled for the active platform/arch, reducing noisy optional-backend load failures.
- Added failed-backend memoization so missing optional modules are not retried on every model load.
- Tightened Linux cache source selection to the current ABI bundle (
linux-arm64vslinux-x64) when preparing runtime dependencies. - Added Linux backend/runtime setup guidance in README, including distro-specific package baselines (Ubuntu/Debian, Fedora/RHEL/CentOS, Arch).
- Added reproducible Docker link-check flows for baseline (
cpu/vulkan/blas) and optionalcuda/hipmodule dependency resolution. - Added
scripts/check_native_link_deps.shhelper plus dedicated validation images:docker/validation/Dockerfile.cuda-linkcheckanddocker/validation/Dockerfile.hip-linkcheck.
-
Chat example backend UX cleanup:
- Removed user-facing
Autobackend option from settings; only concrete runtime-detected backends are shown. - Added migration behavior that resolves legacy saved
Autopreference to the best detected backend at runtime.
- Removed user-facing
0.5.4 #
-
llama.cpp parity hardening:
ChatTemplateEnginenow preserves handler-provided tokens even when grammar is attached via params, avoiding token-loss regressions in tool/thinking formats.- Native stop-sequence handling now skips preserved tokens so parser-critical markers are not terminated early.
- Generic tool-instruction system injection now follows llama.cpp semantics more closely (replace first system content when supported, otherwise prepend to first message content).
- LFM2 output parsing now extracts reasoning more consistently across tool and non-tool output shapes.
-
Chat example loop/lifecycle hardening:
- Improved tool-loop guards (first-turn force-only behavior, duplicate/equivalent call suppression, per-tool budget, and loop-stop messaging).
- Added response fallback that can ground final answers from recent tool results when the model emits stale real-time disclaimers.
- Added assistant debug badges (
fmt:*,think:*,content:json,fallback:tool-result) and strengthened detach/exit disposal paths.
-
Parity/integration test robustness:
tool_calling_integration_testnow accepts both structuredtool_callsdeltas and XML-style<tool_call>payloads.- llama.cpp template-detection integration expectations were updated for current Ministral-family routing outcomes.
-
Documentation updates:
- Clarified chat app behavior when models return JSON-shaped assistant content (for example
{"response":"..."}) and documentedcontent:jsondiagnostics. - Documented example server sampling defaults (
penalty=1.0,top_p=0.95,min_p=0.05) and added a CLI README batch parity-matrix usage example.
- Clarified chat app behavior when models return JSON-shaped assistant content (for example
-
Chat app backend/status fixes:
- Backend switching now preserves configured
gpuLayerswhile still allowing load-time CPU enforcement. - Runtime backend labeling and GPU activity diagnostics now follow effective user selection, preventing false "VULKAN active" status when CPU mode is selected.
- Backend switching now preserves configured
-
Context size auto mode:
- Restored support for
Context Size: Autoby preserving0in persisted settings and passing auto behavior through to session context-limit resolution.
- Restored support for
-
Tool-call parsing fixes (Hermes):
- Introduced staged double-brace recovery: parse as-is first, unwrap one outer
{{...}}layer second, and only fall back to full_normalizeDoubleBraceswhen all braces are consistently doubled. - Added a consistency gate to
_normalizeDoubleBracesthat bails out on mixed single/double brace payloads to prevent corruption of valid nested JSON.
- Introduced staged double-brace recovery: parse as-is first, unwrap one outer
-
Tool-call parsing fixes (Magistral):
- Broadened whitespace skipping in
_extractJsonObjectto handle\n,\r, and\tbetween[ARGS]and the JSON body.
- Broadened whitespace skipping in
-
Example app (basic_app):
- Replaced
toList()buffering withawait forstreaming for real-time token yield. - Added
toolsparameter to every follow-upcreate()call and bounded tool-execution loop with_maxToolRounds = 10.
- Replaced
-
Test coverage:
- Added chat app regression tests for backend switching behavior and context-size auto persistence.
- Added regression tests for Hermes wrapped+nested double-brace payloads and Magistral
[ARGS]with newline/nested arguments.
-
Example rename (server):
- Renamed
example/api_servertoexample/llamadart_server. - Renamed the example package/bin entrypoint to
llamadart_server. - Updated llama.cpp tool-call parity defaults/docs to target
example/llamadart_server.
- Renamed
-
GLM 4.5 template parity:
- Added XML tool-call grammar generation for
<tool_call>payloads with<arg_key>/<arg_value>pairs. - Added GLM-specific preserved tokens and
<|user|>stop handling for tool-call flows. - Updated parser extraction to handle GLM XML tool calls from assistant content and reasoning blocks.
- Added XML tool-call grammar generation for
-
Template/native runtime fixes:
- Typed-content template rendering now activates only when messages actually include media parts.
- Native context reset now clears llama memory in-place instead of reinitializing the context.
0.5.3 #
- Sampling controls:
- Added
minPtoGenerationParamswith a default value of0.0andcopyWithsupport.
- Added
- Native backend parity:
- Added optional llama.cpp
min_psampler initialization inLlamaCppServicewhenminP > 0.
- Added optional llama.cpp
- Test coverage:
- Added unit coverage for
GenerationParams.minPdefault andcopyWithbehavior.
- Added unit coverage for
0.5.2 #
- Chat template parity hardening:
- Expanded llama.cpp parity across additional format handlers, including grammar construction, lazy-grammar triggers, preserved tokens, and parser behavior for tool-call payload extraction.
- Added shared
ToolCallGrammarUtilshelpers for wrapped object/array tool-call grammar generation and root-rule wrapping.
- Crash fix (grammar parsing):
- Fixed malformed GBNF escaping in Hermes/Command-R string rules that could cause runtime
llama_grammar_init_implparse failures during tool-calling generations.
- Fixed malformed GBNF escaping in Hermes/Command-R string rules that could cause runtime
- Test coverage expansion:
- Added and expanded handler-level parity tests (Apertus, LFM2, Nemotron V2, Magistral, Seed-OSS, Xiaomi MiMo, DeepSeek R1/V3, Hermes) and mirrored unit tests for new grammar utilities.
0.5.1 #
- Documentation fixes:
- Updated README internal links to absolute GitHub URLs so they resolve reliably on pub.dev.
- Updated release/migration wording after 0.5.0 publication and refreshed installation/version snippets.
- Corrected iOS simulator architecture notes and contributor prerequisites/build target docs.
- Publishing hygiene:
- Expanded
.pubignoreto exclude local build outputs, large model/test artifacts, and checked-outthird_partysources from package uploads.
- Expanded
0.5.0 #
-
[BREAKING] Public API Changes:
- Root exports were tightened; previously exposed internals such as
ToolRegistry,LlamaTokenizer, andChatTemplateProcessorare no longer part of the public package API. ChatSessionnow centers oncreate(...)streamingLlamaCompletionChunk; legacychat(...)/chatText(...)style usage must migrate.LlamaChatMessageconstructor names were standardized (.fromText,.withContent) in place of older named constructors.- Default
maxTokensinGenerationParamsincreased from512to4096. LlamaChatMessage.toJson()no longer includesnameontoolrole messages.ModelParams.logLevelwas removed; logging control now lives onLlamaEngineviasetDartLogLevel(...)andsetNativeLogLevel(...).LlamaBackendinterface changed for custom backend implementers (notablygetVramInfoand updatedapplyChatTemplate).- Model reload behavior is stricter:
loadModel(...)now requires unloading first. - Migration details are documented in
MIGRATION.md.
- Root exports were tightened; previously exposed internals such as
-
Template/Parser Parity Expansion:
- Added llama.cpp-aligned format detection and handlers for additional templates including FireFunction v2, Functionary v3.2, Functionary v3.1 (Llama 3.1), GPT-OSS, Seed-OSS, Nemotron V2, Apertus, Solar Open, EXAONE MoE, Xiaomi MiMo, and TranslateGemma.
- Improved parser parity for format-specific tool-calling and reasoning extraction, including
<|python_tag|>parsing for Llama 3 flows. - Narrowed generic grammar auto-application to generic/content-only routing to avoid interfering with format-specific tool schemas.
-
Template Extensibility APIs:
- Added global custom handler registration and template override APIs in
ChatTemplateEngine. - Added per-call
customTemplateandcustomHandlerIdrouting support and threaded handler identity into parse paths. - Added cookbook examples and regression tests for registration precedence and fallback behavior.
- Added global custom handler registration and template override APIs in
-
Logging Controls:
- Added split logging controls in
LlamaEngine:setDartLogLevelandsetNativeLogLevel, while keepingsetLogLevelas a convenience method. - Fixed native
nonelog level suppression so llama.cpp/ggml logs are fully muted when requested.
- Added split logging controls in
-
Chat App Improvements:
- Added model capability badges and per-model generation presets.
- Added template-aware tool enablement guardrails and separate Dart/native log level settings in the UI.
-
Test Suite Overhaul:
- Expanded template parity coverage (detection, handlers, grammar, workarounds, registry precedence, and integration scenarios).
- Added additional unit tests for exceptions, logging, and core model definitions.
0.4.0 #
- Cross-Platform Architecture:
- Refactored
LlamaBackendfor strict Web isolation using "Native-First" conditional exports, ensuring native performance and full web safety. - Standardized backend instantiation via a unified
LlamaBackend()factory across all examples and scripts.
- Refactored
- Web & Context Stability:
- Resolved "Max Tokens is 0" on Web by implementing
getLoadedContextInfo()and robust GGUF metadata fallback inLlamaEngine. - Improved numeric metadata extraction on Web for better compatibility with varied GGUF exporters.
- Resolved "Max Tokens is 0" on Web by implementing
- GBNF Grammar Stability:
- Resolved "Unexpected empty grammar stack" crash by reordering the sampler chain (filtering tokens via GBNF before performing probability-based sampling).
- Test Suite Overhaul:
- Pivoted from mock-based unit tests to real-world integration tests using the actual
llama.cppnative backend. - Ensured full verification of model loading, tokenization, text generation, and grammar constraints against physical models.
- Multi-Platform Configuration: Introduced
dart_test.yamland@TestOntags to enable seamless execution of all tests across VM and Chrome with a singledart testcommand.
- Pivoted from mock-based unit tests to real-world integration tests using the actual
- Robust Log Silencing:
- Implemented FD-level redirection (
dup2to/dev/null) forLlamaLogLevel.noneon native platforms. - This provides a crash-free alternative to FFI-based log callbacks, which were unstable during low-level native initialization (e.g., Metal).
- Implemented FD-level redirection (
- Project Hygiene:
- Achieved 100% clean
dart analyzeacross the core library and all example applications. - Replaced legacy stubs in the chat application with a clean, interface-based
ModelServicearchitecture.
- Achieved 100% clean
- Resumable Downloads:
- Implemented robust resumable downloads for large models using HTTP Range requests.
- Added persistent
.metafiles to track download progress across app restarts.
- Enhanced Download UI:
- Refined the
ModelCardwith a visual Pause/Resume toggle. - Added a Trash icon in the card header for full cancellation and data discard of active or partial downloads.
- Improved progress feedback with clear "Paused" and "Downloading" states.
- Refined the
- Multimodal Support (Vision & Audio): Integrated the experimental
mtmdmodule fromllama.cppfor native platforms.- Added
loadMultimodalProjectortoLlamaEngine. - Introduced
LlamaChatMessage.withContentandLlamaContentPart(Text, Image, Audio). - Fix: Resolved missing multimodal symbols in native builds by properly linking the
mtmdmodule.
- Added
- Moondream 2 & Phi-2 Optimization:
- Implemented a specialized
Question: / Answer:chat template fallback for Moondream models. - Added dynamic BOS token handling: Automatically disables BOS injection for models where BOS == EOS (like Moondream) to prevent immediate "End of Generation".
- Implemented a specialized
- Chat API Consolidation:
- Moved high-level
chat()andchatWithTools()logic fromLlamaEnginetoChatSession. LlamaEngineis now a dedicated low-level orchestrator for model loading, tokenization, and raw inference.
- Moved high-level
- Intelligent Tool Flow:
- Optional Tool Calls: Tools are no longer forced by default. The model now decides when to use a tool vs. responding directly based on context.
- Final Response Generation: After a tool returns a result, the model now generates a natural language response (without grammar constraints) to interpret the result for the user.
- forceToolCall: Added a session-level flag to re-enable strict grammar-constrained tool calls for smaller models (e.g., 0.5B - 1B).
- App Stability & Resources:
- Fixed a crash in the Flutter chat app during close/restart by implementing and using an idempotent
dispose()inChatService. - Added Qwen 2.5 3B and 7B models to the download list with clear RAM/VRAM requirements for testing complex instruction following and tool use.
- Fixed a crash in the Flutter chat app during close/restart by implementing and using an idempotent
- ChatSession Manager: Introduced a new high-level
ChatSessionclass to automatically manage conversation history and system prompts. - Context Window Management:
ChatSessionnow implements an automated sliding window to truncate history when the model's context limit is approached. - Windows Robustness:
- Improved export management for MSVC to ensure symbol visibility.
- Added Sccache support for Windows builds to significantly improve CI performance.
- Automated Lifecycle:
- Implemented GitHub Actions to automate
llama.cppupdates, regression testing, and release artifact generation.
- Implemented GitHub Actions to automate
- [BREAKING] API Changes:
LlamaChatMessage.rolenow returns aLlamaChatRoleenum instead of aString. All manual role string comparisons should be updated to use the enum.
- [DEPRECATED] API Changes:
- Default
LlamaChatMessageconstructor (string-based) is now deprecated; use.fromText()or.withContent()instead. LlamaChatMessage.roleStringis deprecated and will be removed in v1.0.
- Default
- Engine Upgrades: Upgraded core
llama.cppto tagb7898. - Robust Media Loading: Support for loading images and audio via both file paths and raw byte buffers.
- Bug Fixes: Improved native resource cleanup and fixed potential null-pointer crashes in the multimodal pipeline.
0.3.0 #
- [BREAKING] Removal of
LlamaService: The legacyLlamaServicefacade has been removed. UseLlamaEnginewithLlamaBackend()instead for all platforms. - LoRA Support: Added full support for Low-Rank Adaptation (LoRA) on all native platforms (iOS, Android, macOS, Linux, Windows).
- Web Improvements: Significantly enhanced the web implementation using
wllamav2 features, including native chat templating and threading info. - Logging Refactor: Implemented a unified logging architecture.
- Native Platforms: Simplified to an on/off toggle to ensure stability.
LlamaLogLevel.nonesuppresses all output; other levels enable default stderr logging. - Web: Supports full granular filtering (Debug, Info, Warn, Error).
- Native Platforms: Simplified to an on/off toggle to ensure stability.
- Stability Fixes: Resolved frequent "Cannot invoke native callback from a leaf call" crashes during Flutter Hot Restarts by refactoring native resource lifecycle.
- Improved Lifecycle: Removed
NativeFinalizerdependency to avoid race conditions. Explicitly calldispose()to release native resources. - Robust Loading: Improved model loading on all platforms with better instance cleanup, script injection, and URL-based loading support.
- Dynamic Adapters: Implemented APIs to dynamically add, update scale, or remove LoRA adapters at runtime.
- LoRA Training Pipeline: Added a comprehensive Jupyter Notebook for fine-tuning models and converting adapters to GGUF format.
- API Enhancements: Updated
ModelParamsto include initial LoRA configurations and introducedsupportsUrlLoadingfor better platform abstraction. - CLI Tooling: Updated the
basic_appexample to support testing LoRA adapters via the--loraflag.
0.2.0+b7883 #
- Project Rebrand: Renamed package from
llama_darttollamadart. - Pure Native Assets: Migrated to the modern Dart Native Assets mechanism (
hook/build.dart). - Zero Setup: Native binaries are now automatically downloaded and bundled at runtime based on the target platform and architecture.
- Version Alignment: Aligned package versioning and binary distribution with
llama.cpprelease tags (starting withb7883). - Logging Control: Implemented comprehensive logging interception for both
llamaandggmlbackends with configurable log levels. - Performance Optimization: Added token caching to message processing, significantly reducing latency in long conversations.
- Architecture Overhaul:
- Refactored Flutter Chat Example into a clean, layered architecture (Models, Services, Providers, Widgets).
- Rebuilt CLI Basic Example into a robust conversation tool with interactive and single-response modes.
- Cross-Platform GPU: Verified and improved hardware acceleration on macOS/iOS (Metal) and Android/Linux/Windows (Vulkan).
- New Build System: Consolidated all native source and build infrastructure into a unified
third_party/directory. - Windows Support: Added robust MinGW + Vulkan cross-compilation pipeline.
- UI Enhancements: Added fine-grained rebuilds using Selectors and isolated painting with RepaintBoundaries.
0.1.0 #
- WASM Support: Full support for running the Flutter app and LLM inference in WASM on the web.
- Performance Improvements: Optimized memory usage and loading times for web models.
- Enhanced Web Interop: Improved
wllamaintegration with better error handling and progress reporting. - Bug Fixes: Resolved minor UI issues on mobile and web layouts.
0.0.1 #
- Initial release.
- Supported platforms: iOS, macOS, Android, Linux, Windows, Web.
- Features:
- Text generation with
llama.cppbackend. - GGUF model support.
- Hardware acceleration (Metal, Vulkan).
- Flutter Chat Example.
- CLI Basic Example.
- Text generation with