llama_cpp_dart library

Dart FFI binding for llama.cpp targeting iOS, Android, and macOS.

Public API surface for the v1.0 rewrite. See plan.md for the full architecture.

Classes

AdaptivePConfig
Adaptive-P sampler configuration. Disabled when target < 0.
BackendDevice
One compute device available to llama.cpp at runtime. Read by LlamaEngine.devices after engine spawn.
ChatMessage
One entry in a chat conversation. Immutable, isolate-sendable.
ChatTemplate
Thin Dart wrapper over llama_chat_apply_template and llama_model_chat_template.
ContextParams
Declarative configuration for LlamaContext.create.
ContextShift
Shift configuration. Mirrors the n_keep / n_discard knobs from llama-server. Used by ContextShiftPolicy.auto.
DoneEvent
Terminal event of a generation stream.
DryConfig
DRY (Don't Repeat Yourself) sampler configuration. Disabled when multiplier <= 0.
DynamicTempConfig
Dynamic temperature (entropy-aware) configuration. Activated when range > 0; replaces the static temperature stage.
EngineChat
A multi-turn chat handle backed by an EngineSession.
EngineSession
A handle to a session running inside the engine worker isolate.
GenerationEvent
One event in a generation stream from Generator or LlamaSession.generate.
Generator
Drives the prefill + decode loop for one Request.
GrammarConfig
GBNF grammar constraint. Disabled when grammar is null or empty.
KnownChatTemplates
Chat-template identifiers that llama.cpp's llama_chat_apply_template recognizes via its magic-substring pattern matcher.
KvOverride
Override for a single GGUF metadata key. Used to patch model metadata at load time without re-quantizing the file.
LlamaBackends
Static helpers for enumerating ggml-backend devices.
LlamaBatch
Owned wrapper around llama_batch. Reusable across decode calls.
LlamaBindings
Raw FFI bindings for llama.cpp (llama.h + mtmd.h). Do not edit by hand.
LlamaContext
Inference context bound to a LlamaModel.
LlamaEngine
Off-thread llama.cpp inference engine.
LlamaLibrary
Process-wide owner of the loaded llama.cpp dynamic library.
LlamaLog
Control over llama.cpp / ggml log output.
LlamaMedia
One image or audio clip attached to a chat turn.
LlamaModel
A loaded GGUF model. Owns the underlying llama_model*.
LlamaSession
In-RAM conversation state on top of a LlamaContext.
LlamaVersion
Build-time pin info for this binding.
LlamaVocab
Read-only view over a model's vocabulary.
LogitBiasEntry
One token-level logit bias entry.
MirostatConfig
Mirostat sampler configuration. Replaces the top-k/top-p/min-p/typical/temp/dist stages when version is not MirostatVersion.off, since Mirostat is a terminal sampler.
ModelParams
Declarative configuration for LlamaModel.load.
MultimodalContext
Owned wrapper around an mtmd_context.
MultimodalParams
Declarative configuration for MultimodalContext.
Request
A single generation request. Immutable.
Sampler
Owned wrapper around a llama_sampler chain.
SamplerFactory
Builds a Sampler from declarative SamplerParams.
SamplerParams
Declarative configuration for sampling.
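
The "disabled when" conventions above suggest how a sampler is assembled declaratively. A minimal sketch (field and method names are assumptions, not the library's confirmed API; only the class names come from this listing):

```dart
final params = SamplerParams(
  temperature: 0.8,
  topK: 40,
  topP: 0.95,
  dry: DryConfig(multiplier: 0.8),   // enabled: multiplier > 0
  xtc: XtcConfig(probability: 0.0),  // disabled: probability <= 0
);
// SamplerFactory builds the owned llama_sampler chain from the config.
final sampler = SamplerFactory.build(params, vocab);
```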
ShiftEvent
Emitted when the generator has just performed a context shift to make room for the next decode. By the time this event fires, the KV cache has already been mutated; consumers that mirror the token list (such as LlamaSession) should drop the tokens at indices [nKeep, nKeep + nDiscard) from their own copy.
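
In pure Dart, the mirrored drop is a single removeRange call. A minimal sketch, assuming nKeep and nDiscard are read off the ShiftEvent:

```dart
// Mirror a context shift in a locally held token list.
// `nKeep` and `nDiscard` are assumed to be fields on ShiftEvent.
void mirrorShift(List<int> tokens, int nKeep, int nDiscard) {
  // Remove the same half-open span the engine discarded from the KV cache.
  tokens.removeRange(nKeep, nKeep + nDiscard);
}

void main() {
  final tokens = [0, 1, 2, 3, 4, 5];
  mirrorShift(tokens, 2, 2); // keep [0, 1], discard [2, 3]
  print(tokens); // [0, 1, 4, 5]
}
```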
StateMetadata
Snapshot of the engine-side identity that produced a state file. Used both as the metadata payload at save time and the comparison target at load time.
StopEog
The sampled token was an end-of-generation token.
StopMaxTokens
The configured maxTokens budget was reached.
StopReason
Reason a Generator stopped emitting tokens.
StopUserAbort
The consumer cancelled the stream.
TokenEvent
A token was sampled.
Tokenizer
Encode/decode between strings and token ids using a LlamaVocab.
Utf8Accumulator
Accumulates raw bytes from streamed token pieces and flushes only complete UTF-8 codepoints.
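
The idea behind this class can be sketched in pure Dart: buffer incoming bytes and decode only up to the last complete UTF-8 sequence, holding back a partial trailing sequence until its continuation bytes arrive. Class and method names below are illustrative, not the library's actual API:

```dart
import 'dart:convert';

// Sketch of a streaming UTF-8 flush (assumes a well-formed byte stream).
class Utf8Buffer {
  final _bytes = <int>[];

  /// Appends [piece] and returns every complete codepoint so far.
  String push(List<int> piece) {
    _bytes.addAll(piece);
    var end = _bytes.length;
    // Walk back over trailing continuation bytes (10xxxxxx).
    var i = end - 1;
    while (i >= 0 && (_bytes[i] & 0xC0) == 0x80) i--;
    if (i >= 0) {
      final lead = _bytes[i];
      // Sequence length implied by the lead byte.
      final need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
      if (end - i < need) end = i; // trailing sequence incomplete: hold it
    }
    final out = utf8.decode(_bytes.sublist(0, end));
    _bytes.removeRange(0, end);
    return out;
  }
}
```

Feeding the two bytes of "é" one at a time yields "" then "é": the lone lead byte is held until its continuation byte completes the sequence.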
XtcConfig
XTC (eXclude Top Choices) sampler configuration. Disabled when probability <= 0.
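
Tying several of these classes together, a hypothetical end-to-end flow might look like the sketch below. Only the class names come from this listing; every method, parameter, and field name here is an assumption:

```dart
import 'dart:io';

// Hypothetical usage sketch; not the library's confirmed API.
Future<void> run() async {
  // Spawn the worker isolate and load the model off-thread.
  final engine = await LlamaEngine.spawn(ModelParams(path: 'model.gguf'));
  final session = await engine.createSession(ContextParams());
  final chat = EngineChat(session);

  // Stream GenerationEvents for one user turn.
  await for (final event in chat.send(ChatMessage(role: 'user', content: 'Hi'))) {
    if (event is TokenEvent) stdout.write(event.piece);
    if (event is DoneEvent) break;
  }
  await engine.dispose();
}
```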

Enums

AttentionType
Attention type for embedding workloads. auto keeps the runtime default.
BackendDeviceType
What kind of compute device a backend represents. Mirrors ggml_backend_dev_type so callers can branch on hardware class without parsing the device name.
ContextShiftPolicy
Policy for what Generator does when the next decode would push past the context window.
FlashAttention
Three-state flag for FlashAttention.
KvCacheType
Tensor types usable for the K/V cache. Mirrors a curated subset of ggml_type. f16 is the llama.cpp default; q8_0 roughly halves KV-cache memory at a small quality cost and is the common choice on memory-constrained mobile devices.
KvOverrideType
Type of a KvOverride entry.
LlamaStateError
Failure category for state save/load; carried by LlamaStateException.
MediaKind
Kind of media carried by a LlamaMedia.
MirostatVersion
Mirostat algorithm version.
PoolingType
Embedding pooling strategy. auto lets the runtime / model decide.
RopeScalingType
RoPE scaling strategy. auto (the runtime default) lets the model decide.
SplitMode
How a multi-GPU model is split across devices.

Constants

defaultSeed → const int
Special seed value meaning "let the runtime pick a random seed."
stateCodecVersion → const int
Wire format version of the state codec. Bumped when the binary layout changes incompatibly. Files written with a different version are rejected on load.

Exceptions / Errors

ChatTemplateException
Thrown when chat template rendering fails.
LlamaContextException
Thrown when creating or operating on a LlamaContext fails.
LlamaDecodeException
Thrown when a llama_decode call fails.
LlamaException
Base type for all exceptions thrown by the llama_cpp_dart binding.
LlamaLibraryException
Thrown when the llama.cpp dynamic library can't be loaded or its symbols resolved.
LlamaLogException
Thrown when configuring llama.cpp / ggml log output fails.
LlamaModelLoadException
Thrown when LlamaModel.load fails.
LlamaStateException
Thrown when a state file can't be parsed or doesn't match the runtime.
LlamaTokenizeException
Thrown when tokenization or detokenization fails.
MultimodalException
Thrown when multimodal init / encoding fails.