llamadart 0.6.10
A Dart/Flutter plugin for llama.cpp - run LLM inference on any platform using GGUF models
Unreleased #
0.6.10 #
- Native runtime syncs:
  - Updated native hook pinning and regenerated bindings through `leehack/llamadart-native@b8638`.
- Multimodal context-safety hardening:
  - Converted native multimodal prompt-evaluation overflow paths into Dart exceptions instead of allowing downstream sampling asserts.
  - Downscaled staged chat-app image picks to a `384px` max edge across Android, iOS, macOS, and Web to reduce multimodal context pressure.
  - Added a local-only macOS Qwen3.5 multimodal repro harness plus CI-safe provider coverage for the new overflow guidance.
- Gemma 4 template support and multimodal capability gating:
  - Added built-in Gemma 4 template detection, rendering, and parsing support, including thinking and tool-call handling.
  - Added runtime projector capability checks so multimodal flows and the chat app gate image/audio input against `supportsVision`/`supportsAudio` instead of model-family assumptions.
  - Documented current Gemma 4 projector behavior in the docs site and chat app guidance.
- Compatibility note: no public API breaking changes in `0.6.10`.
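The capability gating described above can be sketched as follows. Only `supportsVision`/`supportsAudio` come from the changelog; the `ProjectorCaps` record shape and the helper function are hypothetical illustrations, not the published API:

```dart
// Chat-app gating sketch: image/audio attachment is enabled from the
// loaded projector's capabilities, not from model-family assumptions.
// `ProjectorCaps` and `canAttach` are hypothetical names.
typedef ProjectorCaps = ({bool supportsVision, bool supportsAudio});

bool canAttach(ProjectorCaps caps, {required bool isAudio}) =>
    isAudio ? caps.supportsAudio : caps.supportsVision;

void main() {
  // Example: a vision-only projector rejects audio input.
  const caps = (supportsVision: true, supportsAudio: false);
  print(canAttach(caps, isAudio: false)); // true
  print(canAttach(caps, isAudio: true)); // false
}
```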
0.6.9 #
- iOS deployment target alignment:
  - Documented that iOS builds require a minimum deployment target of `16.4` or newer across the README, docs site, and example docs.
  - Updated `example/chat_app` iOS Podfile and Runner project settings to use deployment target `16.4`.
- Android backend safety:
  - Honored `ggml_backend_score` during asset-based backend fallback so unsupported Android CPU variant libraries are skipped before initialization.
  - Changed Android `auto` backend resolution to prefer CPU by default while keeping Vulkan available for explicit opt-in.
  - Clarified that changing `hooks.user_defines` requires `flutter clean && flutter pub get` before rebuilding.
- Compatibility note: no public API breaking changes in `0.6.9`.
0.6.8 #
- Native runtime sync:
  - Updated native hook pinning and regenerated bindings to `leehack/llamadart-native@b8480`.
  - Refreshed generated low-level FFI bindings to match the synced upstream headers.
- Compatibility note: no public API breaking changes in `0.6.8`.
0.6.7 #
- Native runtime sync and Linux loader hardening:
  - Updated native hook pinning and regenerated bindings to `leehack/llamadart-native@b8373`.
  - Hardened Linux bundle loading for packaged apps and accepted versioned `libllamadart` mappings so colocated native dependencies resolve more reliably at runtime.
- Hermes tool-call parsing fix:
  - Fixed Hermes handler parsing when whitespace appears between `<tool_call>` and the JSON payload.
- Compatibility note: no public API breaking changes in `0.6.7`.
0.6.6 #
- Runtime syncs:
  - Updated native hook pinning to `leehack/llamadart-native@b8216`.
  - Updated default web bridge asset pinning to `leehack/llama-web-bridge-assets@v0.1.10` (llama.cpp `b8216`).
- Qwen3.5 runtime stabilization (Android + Web):
  - Switched bundled Qwen3.5 presets to Unsloth `Q4_K_M` GGUFs across the example catalog and tooling.
  - Added Android-native perf diagnostics chips (`p_eval`, `eval`, `sample`, `reuse`) backed by llama.cpp context timings, with a manual timing fallback when built-in counters report zero.
  - Restored a targeted Android Vulkan fast path for local Qwen3.5 `0.8B`/`2B`/`4B` models by re-enabling KQV/op-offload/flash-attention where stable.
  - Updated Android chat app defaults to prefer CPU for Qwen3.5 `0.8B` and `2B`, and reduced the Android `0.8B` context to `2048` for lower first-token latency.
  - Hardened Android multimodal handling by downscaling staged images in the chat app and forcing Qwen3.5 `0.8B` projector work onto the CPU on Android.
  - Fixed WebGPU Qwen prompt/control-token handling and committed companion bridge-side streaming/multimodal fixes required by the local chat app runtime.
- Compatibility note: no public API breaking changes in `0.6.6`.
0.6.5 #
- Embedding API (native backend capability):
  - Added `LlamaEngine.embed(...)` and `LlamaEngine.embedBatch(...)` for direct vector generation.
  - Added an optional backend capability interface, `BackendEmbeddings`, for custom backend implementers.
  - Added an optional backend batch capability, `BackendBatchEmbeddings`, and a worker-side batch embedding request/response path to reduce isolate round-trip overhead in `embedBatch(...)`.
  - Added `ModelParams.maxParallelSequences` (`n_seq_max`) so contexts can reserve multiple sequence slots for true multi-sequence embedding batches.
  - Wired the native isolate/worker/service embedding flow to llama.cpp embedding outputs with optional L2 normalization.
  - Added embedding-focused tests for engine behavior and worker message contracts.
- Examples/docs:
  - Added `example/basic_app/bin/llamadart_embedding_example.dart`.
  - Added `example/basic_app/bin/llamadart_sqlite_vector_example.dart` for local embedding retrieval with SQLite vector search.
  - Updated example docs and the top-level README with embedding usage snippets.
  - Added `tool/testing/native_embedding_benchmark.dart` to compare sequential embedding calls vs `embedBatch(...)` throughput (with optional `--json-out`).
  - Added `tool/testing/native_embedding_sweep.dart` to run max-seq sweeps and dump CSV speedup reports for plotting.
- Web bridge sync:
  - Added WebGPU bridge embedding APIs and wired web backend support for `LlamaEngine.embed(...)`/`embedBatch(...)`.
  - Updated default web bridge asset pinning to `leehack/llama-web-bridge-assets@v0.1.8`.
  - Validated the `v0.1.8` bridge bundle through local fetch-script checksum verification.
- WebGPU runtime tuning + multimodal stability (chat app/web):
  - Reduced bridge log noise and improved runtime profile diagnostics for web sessions.
  - Stabilized multimodal backend switching using resolved runtime mode behavior and added an E2E regression gate.
  - Tuned streaming/typewriter pacing and token callback overhead to improve incremental render smoothness.
  - Added GPU-path multimodal image-size capping to reduce runtime pressure on large image inputs.
- Chat app model catalog + stability:
  - Updated `example/chat_app` recommended Qwen presets to the Qwen3.5 lineup (`0.8B`, `2B`, `4B`, `9B`) and removed older Qwen2.5/Qwen3 defaults from the in-app library.
  - Added multimodal projector (`mmproj`) wiring for Qwen3.5 model cards and tuned safer multimodal defaults (`contextSize: 8192`, `maxTokens: 1024`).
  - Fixed Flutter text paint crashes caused by malformed UTF-16 streaming boundaries by aligning incremental reveal to surrogate-pair boundaries and sanitizing text/tool payload rendering paths.
  - Added sanitizer unit coverage and refreshed the chat-app README architecture/troubleshooting sections for multimodal and UTF-16 guidance.
- Compatibility note: no public API breaking changes in `0.6.5`.
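A minimal usage sketch of the embedding API introduced above. `embed(...)`, `embedBatch(...)`, and `maxParallelSequences` come from the changelog; the model path and the exact `loadModel`/`ModelParams` call shapes are assumptions for illustration:

```dart
import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  final engine = LlamaEngine(LlamaBackend());
  // maxParallelSequences (n_seq_max) reserves sequence slots so
  // embedBatch can run true multi-sequence batches. The `params:`
  // named-argument shape is an assumption.
  await engine.loadModel(
    'models/embedding-model.gguf', // hypothetical local path
    params: ModelParams(maxParallelSequences: 4),
  );

  // Single input -> one vector (optionally L2-normalized natively).
  final vec = await engine.embed('llamadart runs GGUF models');

  // Batched inputs -> one vector per input, with fewer isolate round-trips.
  final vecs = await engine.embedBatch(['doc one', 'doc two']);
  print('dims=${vec.length} batch=${vecs.length}');
}
```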
0.6.4 #
- Multimodal projector offload alignment:
  - Updated native multimodal projector initialization to follow the effective model-load configuration.
  - CPU-only model settings (`preferredBackend: cpu` or `gpuLayers: 0`) now also disable mmproj GPU offload.
- Package metadata cleanup:
  - Removed unused Flutter-only constraints/dependencies from the root `pubspec.yaml` (`environment.flutter`, `flutter`, `path_provider`, `json_rpc_2`, `integration_test`) to keep the core package pure Dart.
  - Kept Flutter-specific dependencies scoped to the Flutter example apps.
- Backend selection safety and status accuracy:
  - Added strict CPU-mode behavior in native backend preparation so `preferredBackend: cpu` no longer initializes optional GPU backends during startup/model-load probing.
  - Disabled context-time GPU offload knobs (`offload_kqv`, `op_offload`, flash-attention auto path) when effective GPU layers resolve to zero, preventing GPU allocation attempts during context creation in CPU mode.
  - Added `ModelParams.batchSize` (`n_batch`) and `ModelParams.microBatchSize` (`n_ubatch`) so context batch sizing can be tuned independently from `contextSize` while preserving legacy defaults.
  - Split backend reporting into two semantics: selectable backend options (`getAvailableBackends`) vs the active runtime backend (`getBackendName`).
  - Added an optional `BackendAvailability` capability and `LlamaEngine.getAvailableBackends()` to support safe settings UIs without forcing GPU initialization.
  - Added an optional `BackendRuntimeDiagnostics` capability and `LlamaEngine.getResolvedGpuLayers()` to expose the resolved native load-time layer count for runtime diagnostics.
  - Updated `example/chat_app` to populate backend selector options from safe availability discovery while keeping active-backend status bound to the effective runtime backend.
  - Improved native auto/explicit backend status resolution to avoid false CPU labeling on Apple consolidated runtimes and false GPU labeling when an explicit backend falls back.
- Web model cache + large-model UX improvements (chat app):
  - Updated the web Download flow to prefetch model/mmproj bytes into browser Cache Storage with live progress and cancellation support.
  - Added best-effort cache eviction for web model delete actions.
  - Added a large-model web load fallback to the fetch-backed worker runtime path (bridge) to reduce contiguous `ArrayBuffer` pressure.
  - Added dedicated web bridge worker entry wiring and worker fallback diagnostics to improve worker startup reliability.
  - Reduced synthetic load-progress dominance so bridge/network progress appears earlier during web model load.
  - Added warning-only UI guidance for very large web models that may exceed browser memory limits at load time.
- Web model-load resilience:
  - Updated `WebGpuLlamaBackend` to retry web model loads with reduced context sizes (and CPU fallback as a last attempt) when bridge errors indicate browser memory pressure.
  - Added bridge config plumbing for optional wasm64 core assets (`llama_webgpu_core_mem64`) with automatic fallback to wasm32 when unsupported.
  - Added explicit runtime diagnostics and error normalization for worker-thread and cross-origin-isolation requirements in large web model load flows.
  - Updated default bridge asset pinning in the chat app/docs/fetch script to `leehack/llama-web-bridge-assets@v0.1.5`.
  - Updated the HF static chat-app deploy workflow to emit COI `custom_headers` in generated Space README frontmatter.
- Android arm64 CPU variant policy and loader hardening:
  - Updated the native hook tag pin from `b8138` to `b8157` to consume Android arm64 CPU-variant runtime bundles.
  - Added Android arm64 CPU policy keys in the hook config: `cpu_profile` (`full` default, `compact`) and an advanced `cpu_variants` override.
  - Added hook tests and Android hook integration coverage to verify pubspec-driven CPU variant packaging behavior.
  - Hardened Android runtime backend loading to resolve CPU variant modules even when backend module directory discovery is unavailable.
  - Added an Android runtime smoke helper (`scripts/android_runtime_smoke.sh`) and smoke-plan docs for device verification.
  - Compatibility note: no public API breaking changes. `android-arm64` now defaults to `cpu_profile: full`, which may increase package size compared with baseline-only CPU packaging.
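The split backend-reporting semantics above can be sketched as follows. The method names come from the changelog; the return/element types and the surrounding function are assumptions:

```dart
import 'package:llamadart/llamadart.dart';

// Settings-UI sketch using the 0.6.4 split semantics:
// getAvailableBackends() lists selectable options without forcing GPU
// initialization; getBackendName() reports what is actually active;
// getResolvedGpuLayers() exposes the resolved native layer count.
Future<void> showBackendStatus(LlamaEngine engine) async {
  final options = await engine.getAvailableBackends(); // safe discovery
  final active = await engine.getBackendName(); // effective runtime
  final layers = await engine.getResolvedGpuLayers(); // diagnostics
  print('selectable: $options, active: $active, gpuLayers: $layers');
}
```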
0.6.3 #
- Native runtime sync (llama.cpp b8138):
  - Synced bundled native runtime/assets and regenerated bindings from `b8099` to `b8138`.
  - Pulled in Android arm64 ISA compatibility hardening (including STLUR guard changes) to prevent launch-time crashes on older devices.
- Example app performance and UX polish:
  - Reduced settings-write overhead during frequent parameter adjustments.
  - Improved model manager responsiveness during download progress updates.
  - Smoothed chat streaming auto-follow and rendering to reduce unnecessary UI work.
- Web model handling improvements:
  - Updated web "Download" behavior to verify remote model/mmproj availability without pre-buffering large GGUF payloads in app memory.
  - Clarified that web cache population occurs when a model is first loaded.
- Stability and quality:
  - Added safe fallback handling for invalid persisted log-level settings.
  - Added regression tests for persisted settings fallback behavior.
- New example app:
  - Added `example/tui_coding_agent`, a `nocterm`-based terminal coding agent with a tool-calling loop, workspace-scoped file/command tools, and runtime model switching.
  - Default model source is GLM 4.7 Flash (`unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL`) with support for custom local paths/URLs/Hugging Face shorthand.
  - Added a stable text-protocol tool mode as the default (native template grammar tool-calling remains available via `--native-tool-calling` for experimentation).
- Added
0.6.2 #
- Native inference performance improvements:
  - Reduced request overhead by caching model metadata and skipping unnecessary prompt token counting in `create(...)`.
  - Improved native stream throughput with worker-side token chunk batching and configurable thresholds (`streamBatchTokenThreshold`, `streamBatchByteThreshold`).
  - Added prompt-prefix reuse for native text generation (`reusePromptPrefix`, enabled by default) with a conservative full-replay fallback to preserve deterministic parity.
  - Optimized `ChatSession` context trimming using bounded turn-offset search to avoid repeated linear recount loops on long histories.
- Benchmarking and parity tooling:
  - Added `tool/testing/native_inference_benchmark.dart` for TTFT, throughput, and latency measurement with tunable generation settings.
  - Added `tool/testing/native_prompt_reuse_parity.dart` and curated prompt sets for deterministic prompt-reuse parity validation.
  - Added CI prompt-reuse parity checks to catch native reuse regressions.
0.6.1 #
- Publishing compatibility fix:
  - Moved hook backend-config support code out of `hook/src/` into `lib/src/hook/` because pub.dev currently only allows `hook/build.dart` under hook files.
  - Updated hook/test imports accordingly to keep native-assets backend selection behavior unchanged.
- llama.cpp parity expansion (Dart-native template/parser pipeline):
  - Reworked template detection/render/parse routing to align with llama.cpp semantics across supported chat formats, including format-specific tool-call parsing and fallback behavior.
  - Added PEG parity components in Dart (`peg_parser_builder`, `peg_chat_parser`) and integrated a parser-carrying render/parse flow for PEG-native/constructed formats.
  - Removed brittle fallback coercions that could mutate valid tool names/argument keys, preserving model-emitted tool payloads for dispatch parity.
  - Hardened template capability detection with Jinja AST + execution probing, while preventing typed-content false positives caused by raw content stringification.
  - [BREAKING] Removed legacy custom template-handler APIs: `ChatTemplateMatcher`, `ChatTemplateRoutingContext`, `ChatTemplateEngine.registerHandler(...)`, `ChatTemplateEngine.unregisterHandler(...)`, `ChatTemplateEngine.clearCustomHandlers(...)`, `ChatTemplateEngine.registerTemplateOverride(...)`, `ChatTemplateEngine.unregisterTemplateOverride(...)`, `ChatTemplateEngine.clearTemplateOverrides(...)`, and per-call `customHandlerId` / parse `handlerId` routing.
  - Removed silent render/parse fallback paths so handler/parser failures are surfaced instead of downgraded to content-only output.
  - Added llama.cpp-equivalent per-call template globals/time injection via `chatTemplateKwargs` and `templateNow`.
- Parity test coverage and tooling:
  - Added vendored llama.cpp template parity integration coverage for detection + render + parse paths.
  - Added upstream llama.cpp chat/template suite runners and a local E2E harness (`run_llama_cpp_chat_tests.sh`, `run_template_parity_suites.sh`).
  - Added mirrored unit tests for new internal template components (`peg_parser_builder`, `template_internal_metadata`) to satisfy structure guards.
- Test cleanup and maintainability:
  - Reduced noisy diagnostics in template integration tests and centralized format sample parse payload fixtures for easier parity maintenance.
- Native integration cleanup (llamadart-native migration):
  - Added `tool/testing/prepare_llama_cpp_source.sh` to fetch/refresh `ggml-org/llama.cpp` into `.dart_tool/llama_cpp` (or `LLAMA_CPP_SOURCE_DIR`) pinned to a resolved ref (`LLAMA_CPP_REF`, default: latest release tag).
  - Updated `tool/testing/run_llama_cpp_chat_tests.sh` to use the prepared `.dart_tool` source instead of `third_party/llama_cpp`, so local upstream chat-suite runs no longer depend on vendored source.
  - Updated template parity tests to resolve fixtures from `LLAMA_CPP_TEMPLATES_DIR` or `.dart_tool/llama_cpp/models/templates` instead of `third_party/llama_cpp`.
  - Clarified README backend matrix notes: `KleidiAI`/`ZenDNN` are CPU-path optimizations, not selectable runtime backend modules.
  - Runtime backend probing for split-module bundles now runs during backend initialization (not only after the first model load), so device/backend availability is visible earlier in app flows.
  - Native-assets hook output now refreshes emitted native files per build to prevent stale backend module carryover when the backend config changes.
- Linux runtime/link validation and backend loader hardening:
  - Hardened split-module backend loading to avoid probing backends that are not bundled for the active platform/arch, reducing noisy optional-backend load failures.
  - Added failed-backend memoization so missing optional modules are not retried on every model load.
  - Tightened Linux cache source selection to the current ABI bundle (`linux-arm64` vs `linux-x64`) when preparing runtime dependencies.
  - Added Linux backend/runtime setup guidance in the README, including distro-specific package baselines (Ubuntu/Debian, Fedora/RHEL/CentOS, Arch).
  - Added reproducible Docker link-check flows for baseline (`cpu`/`vulkan`/`blas`) and optional `cuda`/`hip` module dependency resolution.
  - Added a `scripts/check_native_link_deps.sh` helper plus dedicated validation images: `docker/validation/Dockerfile.cuda-linkcheck` and `docker/validation/Dockerfile.hip-linkcheck`.
- Chat example backend UX cleanup:
  - Removed the user-facing `Auto` backend option from settings; only concrete runtime-detected backends are shown.
  - Added migration behavior that resolves a legacy saved `Auto` preference to the best detected backend at runtime.
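The per-call template globals/time injection above might look like the following. Only the parameter names `chatTemplateKwargs` and `templateNow` come from the changelog; the call site (a `create(...)`-style session call accepting them) and the argument values are assumptions:

```dart
// Sketch: inject llama.cpp-style extra template globals and a fixed
// "now" so template rendering is deterministic across runs.
// `session` is typed dynamic here because the exact receiver of these
// parameters is not specified in the note.
Future<void> renderDeterministically(dynamic session) async {
  await session.create(
    'hello',
    chatTemplateKwargs: {'enable_thinking': false}, // extra template global
    templateNow: DateTime.utc(2025, 1, 1), // fixed template clock
  );
}
```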
- Removed user-facing
0.5.4 #
- llama.cpp parity hardening:
  - `ChatTemplateEngine` now preserves handler-provided tokens even when grammar is attached via params, avoiding token-loss regressions in tool/thinking formats.
  - Native stop-sequence handling now skips preserved tokens so parser-critical markers are not terminated early.
  - Generic tool-instruction system injection now follows llama.cpp semantics more closely (replace the first system content when supported, otherwise prepend to the first message content).
  - LFM2 output parsing now extracts reasoning more consistently across tool and non-tool output shapes.
- Chat example loop/lifecycle hardening:
  - Improved tool-loop guards (first-turn force-only behavior, duplicate/equivalent call suppression, per-tool budget, and loop-stop messaging).
  - Added a response fallback that can ground final answers in recent tool results when the model emits stale real-time disclaimers.
  - Added assistant debug badges (`fmt:*`, `think:*`, `content:json`, `fallback:tool-result`) and strengthened detach/exit disposal paths.
- Parity/integration test robustness:
  - `tool_calling_integration_test` now accepts both structured `tool_calls` deltas and XML-style `<tool_call>` payloads.
  - llama.cpp template-detection integration expectations were updated for current Ministral-family routing outcomes.
- Documentation updates:
  - Clarified chat app behavior when models return JSON-shaped assistant content (for example `{"response":"..."}`) and documented `content:json` diagnostics.
  - Documented example server sampling defaults (`penalty=1.0`, `top_p=0.95`, `min_p=0.05`) and added a CLI README batch parity-matrix usage example.
- Chat app backend/status fixes:
  - Backend switching now preserves configured `gpuLayers` while still allowing load-time CPU enforcement.
  - Runtime backend labeling and GPU activity diagnostics now follow the effective user selection, preventing false "VULKAN active" status when CPU mode is selected.
- Context size auto mode:
  - Restored support for `Context Size: Auto` by preserving `0` in persisted settings and passing auto behavior through to session context-limit resolution.
- Tool-call parsing fixes (Hermes):
  - Introduced staged double-brace recovery: parse as-is first, unwrap one outer `{{...}}` layer second, and only fall back to the full `_normalizeDoubleBraces` when all braces are consistently doubled.
  - Added a consistency gate to `_normalizeDoubleBraces` that bails out on mixed single/double brace payloads to prevent corruption of valid nested JSON.
- Tool-call parsing fixes (Magistral):
  - Broadened whitespace skipping in `_extractJsonObject` to handle `\n`, `\r`, and `\t` between `[ARGS]` and the JSON body.
- Example app (basic_app):
  - Replaced `toList()` buffering with `await for` streaming for real-time token yield.
  - Added the `tools` parameter to every follow-up `create()` call and bounded the tool-execution loop with `_maxToolRounds = 10`.
- Test coverage:
  - Added chat app regression tests for backend switching behavior and context-size auto persistence.
  - Added regression tests for Hermes wrapped+nested double-brace payloads and Magistral `[ARGS]` with newline/nested arguments.
- Example rename (server):
  - Renamed `example/api_server` to `example/llamadart_server`.
  - Renamed the example package/bin entrypoint to `llamadart_server`.
  - Updated llama.cpp tool-call parity defaults/docs to target `example/llamadart_server`.
- GLM 4.5 template parity:
  - Added XML tool-call grammar generation for `<tool_call>` payloads with `<arg_key>`/`<arg_value>` pairs.
  - Added GLM-specific preserved tokens and `<|user|>` stop handling for tool-call flows.
  - Updated parser extraction to handle GLM XML tool calls from assistant content and reasoning blocks.
- Template/native runtime fixes:
  - Typed-content template rendering now activates only when messages actually include media parts.
  - Native context reset now clears llama memory in-place instead of reinitializing the context.
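The staged double-brace recovery described under the Hermes fix can be sketched as a standalone function. This is an illustrative reimplementation of the stated strategy, not the package's internal `_normalizeDoubleBraces` code:

```dart
import 'dart:convert';

/// Sketch of the staged recovery: (1) parse the payload as-is,
/// (2) unwrap exactly one outer {{...}} layer, (3) only when braces
/// are *consistently* doubled, normalize them all, so valid nested
/// JSON with single braces is never corrupted.
Map<String, dynamic>? recoverToolCallJson(String payload) {
  final text = payload.trim();
  // Stage 1: already valid JSON.
  try {
    return jsonDecode(text) as Map<String, dynamic>;
  } on FormatException {
    // fall through
  }
  // Stage 2: unwrap a single outer {{ ... }} layer.
  if (text.startsWith('{{') && text.endsWith('}}')) {
    try {
      return jsonDecode(text.substring(1, text.length - 1))
          as Map<String, dynamic>;
    } on FormatException {
      // fall through
    }
  }
  // Stage 3: full normalization, gated on consistent doubling
  // (no lone '{' or '}' anywhere in the payload).
  final hasLoneBrace =
      RegExp(r'(?<!\{)\{(?!\{)|(?<!\})\}(?!\})').hasMatch(text);
  if (!hasLoneBrace) {
    try {
      return jsonDecode(text.replaceAll('{{', '{').replaceAll('}}', '}'))
          as Map<String, dynamic>;
    } on FormatException {
      // unrecoverable
    }
  }
  return null;
}
```

For example, `recoverToolCallJson('{{"name":"f","arguments":{}}}')` succeeds at stage 2, while a mixed payload like `{{"a": {"b": 1}}}` skips stage 3's rewrite because its inner braces are single.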
0.5.3 #
- Sampling controls:
  - Added `minP` to `GenerationParams` with a default value of `0.0` and `copyWith` support.
- Native backend parity:
  - Added optional llama.cpp `min_p` sampler initialization in `LlamaCppService` when `minP > 0`.
- Test coverage:
  - Added unit coverage for the `GenerationParams.minP` default and `copyWith` behavior.
0.5.2 #
- Chat template parity hardening:
  - Expanded llama.cpp parity across additional format handlers, including grammar construction, lazy-grammar triggers, preserved tokens, and parser behavior for tool-call payload extraction.
  - Added shared `ToolCallGrammarUtils` helpers for wrapped object/array tool-call grammar generation and root-rule wrapping.
- Crash fix (grammar parsing):
  - Fixed malformed GBNF escaping in Hermes/Command-R string rules that could cause runtime `llama_grammar_init_impl` parse failures during tool-calling generations.
- Test coverage expansion:
  - Added and expanded handler-level parity tests (Apertus, LFM2, Nemotron V2, Magistral, Seed-OSS, Xiaomi MiMo, DeepSeek R1/V3, Hermes) and mirrored unit tests for the new grammar utilities.
0.5.1 #
- Documentation fixes:
  - Updated README internal links to absolute GitHub URLs so they resolve reliably on pub.dev.
  - Updated release/migration wording after the 0.5.0 publication and refreshed installation/version snippets.
  - Corrected iOS simulator architecture notes and contributor prerequisites/build-target docs.
- Publishing hygiene:
  - Expanded `.pubignore` to exclude local build outputs, large model/test artifacts, and checked-out `third_party` sources from package uploads.
0.5.0 #
- [BREAKING] Public API Changes:
  - Root exports were tightened; previously exposed internals such as `ToolRegistry`, `LlamaTokenizer`, and `ChatTemplateProcessor` are no longer part of the public package API.
  - `ChatSession` now centers on `create(...)` streaming `LlamaCompletionChunk`; legacy `chat(...)`/`chatText(...)` style usage must migrate.
  - `LlamaChatMessage` constructor names were standardized (`.fromText`, `.withContent`) in place of older named constructors.
  - The default `maxTokens` in `GenerationParams` increased from `512` to `4096`.
  - `LlamaChatMessage.toJson()` no longer includes `name` on `tool` role messages.
  - `ModelParams.logLevel` was removed; logging control now lives on `LlamaEngine` via `setDartLogLevel(...)` and `setNativeLogLevel(...)`.
  - The `LlamaBackend` interface changed for custom backend implementers (notably `getVramInfo` and an updated `applyChatTemplate`).
  - Model reload behavior is stricter: `loadModel(...)` now requires unloading first.
  - Migration details are documented in `MIGRATION.md`.
- Template/Parser Parity Expansion:
  - Added llama.cpp-aligned format detection and handlers for additional templates including FireFunction v2, Functionary v3.2, Functionary v3.1 (Llama 3.1), GPT-OSS, Seed-OSS, Nemotron V2, Apertus, Solar Open, EXAONE MoE, Xiaomi MiMo, and TranslateGemma.
  - Improved parser parity for format-specific tool-calling and reasoning extraction, including `<|python_tag|>` parsing for Llama 3 flows.
  - Narrowed generic grammar auto-application to generic/content-only routing to avoid interfering with format-specific tool schemas.
- Template Extensibility APIs:
  - Added global custom handler registration and template override APIs in `ChatTemplateEngine`.
  - Added per-call `customTemplate` and `customHandlerId` routing support and threaded handler identity into parse paths.
  - Added cookbook examples and regression tests for registration precedence and fallback behavior.
- Logging Controls:
  - Added split logging controls in `LlamaEngine`: `setDartLogLevel` and `setNativeLogLevel`, while keeping `setLogLevel` as a convenience method.
  - Fixed native `none` log level suppression so llama.cpp/ggml logs are fully muted when requested.
- Chat App Improvements:
  - Added model capability badges and per-model generation presets.
  - Added template-aware tool enablement guardrails and separate Dart/native log level settings in the UI.
- Test Suite Overhaul:
  - Expanded template parity coverage (detection, handlers, grammar, workarounds, registry precedence, and integration scenarios).
  - Added additional unit tests for exceptions, logging, and core model definitions.
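A migration sketch for the `create(...)` streaming change above. `ChatSession`, `create(...)`, and `LlamaCompletionChunk` come from the changelog; the argument shape passed to `create` and the chunk's text accessor are assumptions:

```dart
// Legacy chat(...)/chatText(...) callers now consume a stream of
// LlamaCompletionChunk from create(...). The chunk field name `text`
// is a hypothetical accessor for illustration.
Future<String> ask(dynamic session, String prompt) async {
  final out = StringBuffer();
  await for (final chunk in session.create(prompt)) {
    out.write(chunk.text); // accumulate streamed completion text
  }
  return out.toString();
}
```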
0.4.0 #
- Cross-Platform Architecture:
- Refactored
LlamaBackendfor strict Web isolation using "Native-First" conditional exports, ensuring native performance and full web safety. - Standardized backend instantiation via a unified
LlamaBackend()factory across all examples and scripts.
- Refactored
- Web & Context Stability:
- Resolved "Max Tokens is 0" on Web by implementing
getLoadedContextInfo()and robust GGUF metadata fallback inLlamaEngine. - Improved numeric metadata extraction on Web for better compatibility with varied GGUF exporters.
- Resolved "Max Tokens is 0" on Web by implementing
- GBNF Grammar Stability:
- Resolved "Unexpected empty grammar stack" crash by reordering the sampler chain (filtering tokens via GBNF before performing probability-based sampling).
- Test Suite Overhaul:
- Pivoted from mock-based unit tests to real-world integration tests using the actual
llama.cppnative backend. - Ensured full verification of model loading, tokenization, text generation, and grammar constraints against physical models.
- Multi-Platform Configuration: Introduced
dart_test.yamland@TestOntags to enable seamless execution of all tests across VM and Chrome with a singledart testcommand.
- Pivoted from mock-based unit tests to real-world integration tests using the actual
- Robust Log Silencing:
- Implemented FD-level redirection (
dup2to/dev/null) forLlamaLogLevel.noneon native platforms. - This provides a crash-free alternative to FFI-based log callbacks, which were unstable during low-level native initialization (e.g., Metal).
- Implemented FD-level redirection (
- Project Hygiene:
- Achieved 100% clean
dart analyzeacross the core library and all example applications. - Replaced legacy stubs in the chat application with a clean, interface-based
ModelServicearchitecture.
- Achieved 100% clean
- Resumable Downloads:
- Implemented robust resumable downloads for large models using HTTP Range requests.
- Added persistent
.metafiles to track download progress across app restarts.
- Enhanced Download UI:
- Refined the
ModelCardwith a visual Pause/Resume toggle. - Added a Trash icon in the card header for full cancellation and data discard of active or partial downloads.
- Improved progress feedback with clear "Paused" and "Downloading" states.
- Refined the
- Multimodal Support (Vision & Audio): Integrated the experimental
mtmdmodule fromllama.cppfor native platforms.- Added
loadMultimodalProjectortoLlamaEngine. - Introduced
LlamaChatMessage.withContentandLlamaContentPart(Text, Image, Audio). - Fix: Resolved missing multimodal symbols in native builds by properly linking the
mtmdmodule.
- Added
- Moondream 2 & Phi-2 Optimization:
- Implemented a specialized
Question: / Answer:chat template fallback for Moondream models. - Added dynamic BOS token handling: Automatically disables BOS injection for models where BOS == EOS (like Moondream) to prevent immediate "End of Generation".
- Implemented a specialized
- Chat API Consolidation:
- Moved high-level
chat()andchatWithTools()logic fromLlamaEnginetoChatSession. LlamaEngineis now a dedicated low-level orchestrator for model loading, tokenization, and raw inference.
- Moved high-level
- Intelligent Tool Flow:
- Optional Tool Calls: Tools are no longer forced by default. The model now decides when to use a tool vs. responding directly based on context.
- Final Response Generation: After a tool returns a result, the model now generates a natural language response (without grammar constraints) to interpret the result for the user.
- forceToolCall: Added a session-level flag to re-enable strict grammar-constrained tool calls for smaller models (e.g., 0.5B - 1B).
- App Stability & Resources:
- Fixed a crash in the Flutter chat app during close/restart by implementing and using an idempotent
dispose()inChatService. - Added Qwen 2.5 3B and 7B models to the download list with clear RAM/VRAM requirements for testing complex instruction following and tool use.
- Fixed a crash in the Flutter chat app during close/restart by implementing and using an idempotent
- ChatSession Manager: Introduced a new high-level
ChatSessionclass to automatically manage conversation history and system prompts. - Context Window Management:
ChatSessionnow implements an automated sliding window to truncate history when the model's context limit is approached. - Windows Robustness:
- Improved export management for MSVC to ensure symbol visibility.
- Added Sccache support for Windows builds to significantly improve CI performance.
- Automated Lifecycle:
- Implemented GitHub Actions to automate
llama.cppupdates, regression testing, and release artifact generation.
- Implemented GitHub Actions to automate
- [BREAKING] API Changes:
LlamaChatMessage.rolenow returns aLlamaChatRoleenum instead of aString. All manual role string comparisons should be updated to use the enum.
- [DEPRECATED] API Changes:
  - The default string-based `LlamaChatMessage` constructor is now deprecated; use `.fromText()` or `.withContent()` instead.
  - `LlamaChatMessage.roleString` is deprecated and will be removed in v1.0.
- Default
- Engine Upgrades: Upgraded core `llama.cpp` to tag `b7898`.
- Robust Media Loading: Support for loading images and audio via both file paths and raw byte buffers.
- Bug Fixes: Improved native resource cleanup and fixed potential null-pointer crashes in the multimodal pipeline.
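To make the consolidated chat API concrete, a minimal sketch of the new flow is shown below. Only the names (`LlamaEngine`, `LlamaBackend`, `ChatSession`, `forceToolCall`, `LlamaChatRole`, `dispose()`) come from these release notes; the constructor arguments and method signatures are assumptions, not a verified API surface — consult the package API docs before copying.

```dart
// Hypothetical sketch of the consolidated chat flow. Constructor
// arguments and signatures are assumptions based on names in these
// release notes, not verified llamadart API.
import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  // LlamaEngine is now a low-level orchestrator only.
  final engine = LlamaEngine(LlamaBackend());
  await engine.loadModel('model.gguf');

  // ChatSession owns history, templating, chat(), and chatWithTools().
  final session = ChatSession(engine)
    ..forceToolCall = false; // default: the model decides when to use tools

  final reply = await session.chat('Hello!');
  print(reply);

  // Role comparisons migrate from strings to the enum:
  //   before: if (msg.role == 'assistant') ...
  //   after:  if (msg.role == LlamaChatRole.assistant) ...

  await engine.dispose(); // explicit cleanup; safe to call on shutdown paths
}
```

Setting `forceToolCall = true` would restore the old grammar-constrained behavior for small models that cannot reliably decide when to emit a tool call.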
0.3.0 #
- [BREAKING] Removal of `LlamaService`: The legacy `LlamaService` facade has been removed. Use `LlamaEngine` with `LlamaBackend()` instead on all platforms.
- LoRA Support: Added full support for Low-Rank Adaptation (LoRA) on all native platforms (iOS, Android, macOS, Linux, Windows).
- Web Improvements: Significantly enhanced the web implementation using `wllama` v2 features, including native chat templating and threading info.
- Logging Refactor: Implemented a unified logging architecture.
  - Native Platforms: Simplified to an on/off toggle to ensure stability. `LlamaLogLevel.none` suppresses all output; other levels enable default stderr logging.
  - Web: Supports full granular filtering (Debug, Info, Warn, Error).
- Stability Fixes: Resolved frequent "Cannot invoke native callback from a leaf call" crashes during Flutter Hot Restarts by refactoring native resource lifecycle.
- Improved Lifecycle: Removed the `NativeFinalizer` dependency to avoid race conditions. Explicitly call `dispose()` to release native resources.
- Robust Loading: Improved model loading on all platforms with better instance cleanup, script injection, and URL-based loading support.
- Dynamic Adapters: Implemented APIs to dynamically add, update scale, or remove LoRA adapters at runtime.
- LoRA Training Pipeline: Added a comprehensive Jupyter Notebook for fine-tuning models and converting adapters to GGUF format.
- API Enhancements: Updated `ModelParams` to include initial LoRA configurations and introduced `supportsUrlLoading` for better platform abstraction.
- CLI Tooling: Updated the `basic_app` example to support testing LoRA adapters via the `--lora` flag.
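A hedged migration sketch for the `LlamaService` removal and the LoRA additions in this release follows. The `ModelParams` field name shown for LoRA configuration is an illustrative guess (the notes only say `ModelParams` gained "initial LoRA configurations"); check the real API before relying on it.

```dart
// Illustrative migration only: the `loraAdapters` field name is an
// assumption; the notes confirm only that ModelParams now carries
// initial LoRA configuration.
import 'package:llamadart/llamadart.dart';

Future<void> main() async {
  // before 0.3.0: final service = LlamaService(...);
  // after 0.3.0:  LlamaEngine + an explicit backend on every platform.
  final engine = LlamaEngine(LlamaBackend());

  await engine.loadModel(
    'model.gguf',
    params: ModelParams(loraAdapters: ['adapter.gguf']), // hypothetical field
  );

  // NativeFinalizer was removed, so cleanup is now explicit:
  await engine.dispose();
}
```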
0.2.0+b7883 #
- Project Rebrand: Renamed the package from `llama_dart` to `llamadart`.
- Pure Native Assets: Migrated to the modern Dart Native Assets mechanism (`hook/build.dart`).
- Zero Setup: Native binaries are now automatically downloaded and bundled at runtime based on the target platform and architecture.
- Version Alignment: Aligned package versioning and binary distribution with `llama.cpp` release tags (starting with `b7883`).
- Logging Control: Implemented comprehensive logging interception for both `llama` and `ggml` backends with configurable log levels.
- Performance Optimization: Added token caching to message processing, significantly reducing latency in long conversations.
- Architecture Overhaul:
- Refactored Flutter Chat Example into a clean, layered architecture (Models, Services, Providers, Widgets).
- Rebuilt CLI Basic Example into a robust conversation tool with interactive and single-response modes.
- Cross-Platform GPU: Verified and improved hardware acceleration on macOS/iOS (Metal) and Android/Linux/Windows (Vulkan).
- New Build System: Consolidated all native source and build infrastructure into a unified `third_party/` directory.
- Windows Support: Added a robust MinGW + Vulkan cross-compilation pipeline.
- UI Enhancements: Added fine-grained rebuilds using `Selector` widgets and isolated painting with `RepaintBoundary`.
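The logging control added in this release might be configured as sketched below. Only `LlamaLogLevel` and its `none` value appear in these notes; the setter name and its location on the engine are assumptions.

```dart
// Hypothetical sketch: the `logLevel` setter is an assumed name.
// LlamaLogLevel and its `none` value come from these release notes.
import 'package:llamadart/llamadart.dart';

void configureLogging(LlamaEngine engine) {
  // Intercepts both llama and ggml backend output. On native platforms
  // this behaves as an on/off toggle: LlamaLogLevel.none suppresses all
  // output; any other level enables default stderr logging.
  engine.logLevel = LlamaLogLevel.none; // hypothetical setter
}
```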
0.1.0 #
- WASM Support: Full support for running the Flutter app and LLM inference in WASM on the web.
- Performance Improvements: Optimized memory usage and loading times for web models.
- Enhanced Web Interop: Improved `wllama` integration with better error handling and progress reporting.
- Bug Fixes: Resolved minor UI issues on mobile and web layouts.
0.0.1 #
- Initial release.
- Supported platforms: iOS, macOS, Android, Linux, Windows, Web.
- Features:
  - Text generation with the `llama.cpp` backend.
  - GGUF model support.
  - Hardware acceleration (Metal, Vulkan).
  - Flutter Chat Example.
  - CLI Basic Example.