llama_cpp_dart library
Dart FFI binding for llama.cpp targeting iOS, Android, and macOS.
Public API surface for the v1.0 rewrite. See plan.md for the full
architecture.
Classes
- AdaptivePConfig
- Adaptive-P sampler configuration. Disabled when target < 0.
- BackendDevice
- One compute device available to llama.cpp at runtime. Read by LlamaEngine.devices after engine spawn.
- ChatMessage
- One entry in a chat conversation. Immutable, isolate-sendable.
- ChatTemplate
- Thin Dart wrapper over llama_chat_apply_template and llama_model_chat_template.
- ContextParams
- Declarative configuration for LlamaContext.create.
- ContextShift
- Shift configuration. Mirrors the n_keep/n_discard knobs from llama-server. Used by ContextShiftPolicy.auto.
- DoneEvent
- Terminal event of a generation stream.
- DryConfig
- DRY (Don't Repeat Yourself) sampler configuration. Disabled when multiplier <= 0.
- DynamicTempConfig
- Dynamic temperature (entropy-aware) configuration. Activated when range > 0; replaces the static temperature stage.
- EngineChat
- A multi-turn chat handle backed by an EngineSession.
- EngineSession
- A handle to a session running inside the engine worker isolate.
- GenerationEvent
- One event in a generation stream from Generator or LlamaSession.generate.
- Generator
- Drives the prefill + decode loop for one Request.
- GrammarConfig
- GBNF grammar constraint. Disabled when grammar is null or empty.
- KnownChatTemplates
- Magic-substring chat templates that llama.cpp's llama_chat_apply_template pattern-matcher picks up.
- KvOverride
- Override for a single GGUF metadata key. Used to patch model metadata at load time without re-quantizing the file.
- LlamaBackends
- Static helpers for enumerating ggml-backend devices.
- LlamaBatch
- Owned wrapper around llama_batch. Reusable across decode calls.
- LlamaBindings
- Raw FFI bindings for llama.cpp (llama.h + mtmd.h). Do not edit by hand.
- LlamaContext
- Inference context bound to a LlamaModel.
- LlamaEngine
- Off-thread llama.cpp inference engine.
- LlamaLibrary
- Process-wide owner of the loaded llama.cpp dynamic library.
- LlamaLog
- Control over llama.cpp / ggml log output.
- LlamaMedia
- One image or audio clip attached to a chat turn.
- LlamaModel
- A loaded GGUF model. Owns the underlying llama_model*.
- LlamaSession
- In-RAM conversation state on top of a LlamaContext.
- LlamaVersion
- Build-time pin info for this binding.
- LlamaVocab
- Read-only view over a model's vocabulary.
- LogitBiasEntry
- One token-level logit bias entry.
- MirostatConfig
- Mirostat sampler configuration. Replaces top-k/top-p/min-p/typical/temp/dist when version is not MirostatVersion.off — Mirostat is a terminal sampler.
- ModelParams
- Declarative configuration for LlamaModel.load.
- MultimodalContext
- Owned wrapper around an mtmd_context.
- MultimodalParams
- Declarative configuration for MultimodalContext.
- Request
- A single generation request. Immutable.
- Sampler
- Owned wrapper around a llama_sampler chain.
- SamplerFactory
- Builds a Sampler from declarative SamplerParams.
- SamplerParams
- Declarative configuration for sampling (see the sampler sketch after this list).
- ShiftEvent
- Emitted when the generator has just performed a context shift to make room for the next decode. By the time this fires the KV cache has already been mutated; consumers that mirror the token list (like LlamaSession) should drop tokens at indices [nKeep, nKeep + nDiscard) from their own copy (see the event-handling sketch after this list).
- StateMetadata
- Snapshot of the engine-side identity that produced a state file. Used both as the metadata payload at save time and the comparison target at load time.
- StopEog
- The sampled token was an end-of-generation token.
- StopMaxTokens
- The configured maxTokens budget was reached.
- StopReason
- Reason a Generator stopped emitting tokens.
- StopUserAbort
- The consumer cancelled the stream.
- TokenEvent
- A token was sampled.
- Tokenizer
- Encode/decode between strings and token ids using a LlamaVocab.
- Utf8Accumulator
- Accumulates raw bytes from streamed token pieces and flushes only complete UTF-8 codepoints.
- XtcConfig
- XTC (eXclude Top Choices) sampler configuration. Disabled when probability <= 0.
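The event classes above form a small streaming protocol: a generation stream yields a TokenEvent per sampled token, a ShiftEvent when the generator has trimmed the KV cache, and a terminal DoneEvent carrying a StopReason. A minimal consumer might look like the sketch below; only the class names come from this list, while the import path, the generate() signature, and the field names (token, piece, nKeep, nDiscard, reason) are assumptions.

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

Future<String> collectResponse(LlamaSession session, Request request) async {
  final tokens = <int>[];      // local mirror of the session's token list
  final text = StringBuffer();

  await for (final event in session.generate(request)) {
    if (event is TokenEvent) {
      tokens.add(event.token);     // field names assumed
      text.write(event.piece);
    } else if (event is ShiftEvent) {
      // The KV cache has already been shifted; drop the same window
      // [nKeep, nKeep + nDiscard) from the local mirror.
      tokens.removeRange(event.nKeep, event.nKeep + event.nDiscard);
    } else if (event is DoneEvent) {
      print('stopped: ${event.reason}');   // reason is a StopReason subtype
    }
  }
  return text.toString();
}
```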
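SamplerParams aggregates the per-stage configs listed above (DryConfig, XtcConfig, MirostatConfig, DynamicTempConfig, ...), each with its own documented "disabled" sentinel. The sketch below guesses at the declarative shape; every field and constructor argument name is an assumption, not the confirmed API.

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

// A conventional temperature + top-k/top-p stack. DRY and XTC stay off
// because of their documented "disabled" sentinels.
final classic = SamplerParams(
  temperature: 0.8,                  // field names assumed
  topK: 40,
  topP: 0.95,
  dry: DryConfig(multiplier: 0.0),   // disabled: multiplier <= 0
  xtc: XtcConfig(probability: 0.0),  // disabled: probability <= 0
);

// Turning Mirostat on makes it the terminal sampler, so the
// top-k/top-p/min-p/typical/temp/dist stages no longer apply.
final mirostat = SamplerParams(
  mirostat: MirostatConfig(
    version: MirostatVersion.v2,     // member name assumed
    tau: 5.0,
    eta: 0.1,
  ),
);
```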
Enums
- AttentionType
- Attention type for embedding workloads. auto keeps the runtime default.
- BackendDeviceType
- What kind of compute device a backend represents. Mirrors ggml_backend_dev_type so callers can branch on hardware class without parsing the device name.
- ContextShiftPolicy
- Policy for what Generator does when the next decode would push past the context window.
- FlashAttention
- Three-state flag for FlashAttention.
- KvCacheType
- Tensor types usable for the K/V cache. Mirrors a curated subset of ggml_type. f16 is the llama.cpp default; q8_0 roughly halves KV-cache memory at a small quality cost and is the common choice on memory-constrained mobile devices (see the configuration sketch after this list).
- KvOverrideType
- Type of a KvOverride entry.
- LlamaStateError
- MediaKind
- Kind of media carried by a LlamaMedia.
- MirostatVersion
- Mirostat algorithm version.
- PoolingType
- Embedding pooling strategy. auto lets the runtime / model decide.
- RopeScalingType
- RoPE scaling strategy. auto (the runtime default) lets the model decide.
- SplitMode
- How a multi-GPU model is split across devices.
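Several of these enums are consumed by the params classes from the Classes section. The sketch below shows how KvCacheType and FlashAttention might be wired into ModelParams / ContextParams; the field names (nCtx, kvCacheTypeK, kvCacheTypeV, flashAttention, splitMode) and enum member names other than f16 / q8_0 are assumptions.

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

final modelParams = ModelParams(
  splitMode: SplitMode.none,          // keep the model on one device (member name assumed)
);

final contextParams = ContextParams(
  nCtx: 4096,                         // field names assumed
  flashAttention: FlashAttention.on,  // three-state flag; member name assumed
  // q8_0 roughly halves KV-cache memory relative to the f16 default.
  kvCacheTypeK: KvCacheType.q8_0,
  kvCacheTypeV: KvCacheType.q8_0,
);
```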
Constants
- defaultSeed → const int
- Special seed value meaning "let the runtime pick a random seed."
- stateCodecVersion → const int
- Wire format version of the state codec. Bumped when the binary layout changes incompatibly. Files written with a different version are rejected on load.
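Together, stateCodecVersion and StateMetadata decide whether a saved state file is accepted: the codec version must match exactly and the metadata must match the running engine identity. A hedged sketch of the load path, assuming a loadState method that performs those checks and throws LlamaStateException on mismatch:

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

Future<void> restoreIfCompatible(EngineSession session, String path) async {
  try {
    // A loader along these lines rejects files whose codec version differs
    // from stateCodecVersion, or whose StateMetadata does not match the
    // currently loaded model / context.
    await session.loadState(path);     // method name and signature assumed
  } on LlamaStateException catch (e) {
    print('state file rejected: $e; starting from an empty session');
  }
}
```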
Exceptions / Errors
- ChatTemplateException
- Thrown when chat template rendering fails.
- LlamaContextException
- LlamaDecodeException
- LlamaException
- Base type for all exceptions thrown by the llama_cpp_dart binding (see the sketch after this list).
- LlamaLibraryException
- LlamaLogException
- LlamaModelLoadException
- LlamaStateException
- Thrown when a state file can't be parsed or doesn't match the runtime.
- LlamaTokenizeException
- MultimodalException
- Thrown when multimodal init / encoding fails.
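All of these derive from LlamaException, so callers can catch a specific failure mode first and fall back to the base type. A sketch of that pattern, assuming LlamaModel.load takes a file path and a ModelParams (the real signature may differ):

```dart
import 'package:llama_cpp_dart/llama_cpp_dart.dart';

Future<LlamaModel?> tryLoadModel(String path) async {
  try {
    return await LlamaModel.load(path, ModelParams());   // signature assumed
  } on LlamaModelLoadException catch (e) {
    print('model could not be loaded: $e');    // bad path, bad GGUF, OOM, ...
  } on LlamaException catch (e) {
    print('other llama_cpp_dart failure: $e'); // base type catches the rest
  }
  return null;
}
```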