llama_cpp_dart library

Dart FFI binding for llama.cpp targeting iOS, Android, and macOS.

Public API surface for the v1.0 rewrite. See plan.md for the full architecture.

Classes

AdaptivePConfig
Adaptive-P sampler configuration. Disabled when target < 0.
BackendDevice
One compute device available to llama.cpp at runtime. Read by LlamaEngine.devices after engine spawn.
ChatMessage
One entry in a chat conversation. Immutable, isolate-sendable.
ChatTemplate
Thin Dart wrapper over llama_chat_apply_template and llama_model_chat_template.
ContextParams
Declarative configuration for LlamaContext.create.
ContextShift
Shift configuration. Mirrors the n_keep / n_discard knobs from llama-server. Used by ContextShiftPolicy.auto.
DoneEvent
Terminal event of a generation stream.
DryConfig
DRY (Don't Repeat Yourself) sampler configuration. Disabled when multiplier <= 0.
DynamicTempConfig
Dynamic temperature (entropy-aware) configuration. Activated when range > 0; replaces the static temperature stage.
EngineChat
A multi-turn chat handle backed by an EngineSession.
EngineSession
A handle to a session running inside the engine worker isolate.
GenerationEvent
One event in a generation stream from Generator or LlamaSession.generate.
Generator
Drives the prefill + decode loop for one Request.
GrammarConfig
GBNF grammar constraint. Disabled when grammar is null or empty.
KnownChatTemplates
Chat-template identifiers that llama.cpp's llama_chat_apply_template recognizes via its magic-substring pattern matcher.
KvOverride
Override for a single GGUF metadata key. Used to patch model metadata at load time without re-quantizing the file.
LlamaBackends
Static helpers for enumerating ggml-backend devices.
LlamaBatch
Owned wrapper around llama_batch. Reusable across decode calls.
LlamaBindings
Raw FFI bindings for llama.cpp (llama.h + mtmd.h). Do not edit by hand.
LlamaContext
Inference context bound to a LlamaModel.
LlamaEngine
Off-thread llama.cpp inference engine.
LlamaLibrary
Process-wide owner of the loaded llama.cpp dynamic library.
LlamaLog
Control over llama.cpp / ggml log output.
LlamaMedia
One image or audio clip attached to a chat turn.
LlamaModel
A loaded GGUF model. Owns the underlying llama_model*.
LlamaSession
In-RAM conversation state on top of a LlamaContext.
LlamaVersion
Build-time pin info for this binding.
LlamaVocab
Read-only view over a model's vocabulary.
LogitBiasEntry
One token-level logit bias entry.
MirostatConfig
Mirostat sampler configuration. Replaces the top-k/top-p/min-p/typical/temp/dist stages when version is not MirostatVersion.off, since Mirostat is a terminal sampler.
ModelParams
Declarative configuration for LlamaModel.load.
MultimodalContext
Owned wrapper around an mtmd_context.
MultimodalParams
Declarative configuration for MultimodalContext.
Request
A single generation request. Immutable.
Sampler
Owned wrapper around a llama_sampler chain.
SamplerFactory
Builds a Sampler from declarative SamplerParams.
SamplerParams
Declarative configuration for sampling.
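
The "disabled when" conventions above suggest how a sampler is assembled declaratively. A minimal sketch (field and method names are assumptions, not the library's confirmed API; only the class names come from this listing):

```dart
final params = SamplerParams(
  temperature: 0.8,
  topK: 40,
  topP: 0.95,
  dry: DryConfig(multiplier: 0.8),   // enabled: multiplier > 0
  xtc: XtcConfig(probability: 0.0),  // disabled: probability <= 0
);
// SamplerFactory builds the owned llama_sampler chain from the config.
final sampler = SamplerFactory.build(params, vocab);
```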
ShiftEvent
Emitted when the generator has just performed a context shift to make room for the next decode. By the time this event fires, the KV cache has already been mutated; consumers that mirror the token list (such as LlamaSession) should drop the tokens at indices [nKeep, nKeep + nDiscard) from their own copy.
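
In pure Dart, the mirrored drop is a single removeRange call. A minimal sketch, assuming nKeep and nDiscard are read off the ShiftEvent:

```dart
// Mirror a context shift in a locally held token list.
// `nKeep` and `nDiscard` are assumed to be fields on ShiftEvent.
void mirrorShift(List<int> tokens, int nKeep, int nDiscard) {
  // Remove the same half-open span the engine discarded from the KV cache.
  tokens.removeRange(nKeep, nKeep + nDiscard);
}

void main() {
  final tokens = [0, 1, 2, 3, 4, 5];
  mirrorShift(tokens, 2, 2); // keep [0, 1], discard [2, 3]
  print(tokens); // [0, 1, 4, 5]
}
```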
StateMetadata
Snapshot of the engine-side identity that produced a state file. Used both as the metadata payload at save time and the comparison target at load time.
StopEog
The sampled token was an end-of-generation token.
StopMaxTokens
The configured maxTokens budget was reached.
StopReason
Reason a Generator stopped emitting tokens.
StopUserAbort
The consumer cancelled the stream.
TokenEvent
A token was sampled.
Tokenizer
Encode/decode between strings and token ids using a LlamaVocab.
Utf8Accumulator
Accumulates raw bytes from streamed token pieces and flushes only complete UTF-8 codepoints.
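
The idea behind this class can be sketched in pure Dart: buffer incoming bytes and decode only up to the last complete UTF-8 sequence, holding back a partial trailing sequence until its continuation bytes arrive. Class and method names below are illustrative, not the library's actual API:

```dart
import 'dart:convert';

// Sketch of a streaming UTF-8 flush (assumes a well-formed byte stream).
class Utf8Buffer {
  final _bytes = <int>[];

  /// Appends [piece] and returns every complete codepoint so far.
  String push(List<int> piece) {
    _bytes.addAll(piece);
    var end = _bytes.length;
    // Walk back over trailing continuation bytes (10xxxxxx).
    var i = end - 1;
    while (i >= 0 && (_bytes[i] & 0xC0) == 0x80) i--;
    if (i >= 0) {
      final lead = _bytes[i];
      // Sequence length implied by the lead byte.
      final need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
      if (end - i < need) end = i; // trailing sequence incomplete: hold it
    }
    final out = utf8.decode(_bytes.sublist(0, end));
    _bytes.removeRange(0, end);
    return out;
  }
}
```

Feeding the two bytes of "é" one at a time yields "" then "é": the lone lead byte is held until its continuation byte completes the sequence.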
XtcConfig
XTC (eXclude Top Choices) sampler configuration. Disabled when probability <= 0.
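
Tying several of these classes together, a hypothetical end-to-end flow might look like the sketch below. Only the class names come from this listing; every method, parameter, and field name here is an assumption:

```dart
import 'dart:io';

// Hypothetical usage sketch; not the library's confirmed API.
Future<void> run() async {
  // Spawn the worker isolate and load the model off-thread.
  final engine = await LlamaEngine.spawn(ModelParams(path: 'model.gguf'));
  final session = await engine.createSession(ContextParams());
  final chat = EngineChat(session);

  // Stream GenerationEvents for one user turn.
  await for (final event in chat.send(ChatMessage(role: 'user', content: 'Hi'))) {
    if (event is TokenEvent) stdout.write(event.piece);
    if (event is DoneEvent) break;
  }
  await engine.dispose();
}
```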

Enums

AttentionType
Attention type for embedding workloads. auto keeps the runtime default.
BackendDeviceType
What kind of compute device a backend represents. Mirrors ggml_backend_dev_type so callers can branch on hardware class without parsing the device name.
ContextShiftPolicy
Policy for what Generator does when the next decode would push past the context window.
FlashAttention
Three-state flag for FlashAttention.
KvCacheType
Tensor types usable for the K/V cache. Mirrors a curated subset of ggml_type. f16 is the llama.cpp default; q8_0 roughly halves KV-cache memory at a small quality cost and is the common choice on memory-constrained mobile devices.
KvOverrideType
Type of a KvOverride entry.
LlamaStateError
Failure category for state save/load; carried by LlamaStateException.
MediaKind
Kind of media carried by a LlamaMedia.
MirostatVersion
Mirostat algorithm version.
PoolingType
Embedding pooling strategy. auto lets the runtime / model decide.
RopeScalingType
RoPE scaling strategy. auto (the runtime default) lets the model decide.
SplitMode
How a multi-GPU model is split across devices.

Constants

defaultSeed → const int
Special seed value meaning "let the runtime pick a random seed."
stateCodecVersion → const int
Wire format version of the state codec. Bumped when the binary layout changes incompatibly. Files written with a different version are rejected on load.

Exceptions / Errors

ChatTemplateException
Thrown when chat template rendering fails.
LlamaContextException
Thrown when creating or operating on a LlamaContext fails.
LlamaDecodeException
Thrown when a llama_decode call fails.
LlamaException
Base type for all exceptions thrown by the llama_cpp_dart binding.
LlamaLibraryException
Thrown when the llama.cpp dynamic library can't be loaded or its symbols resolved.
LlamaLogException
Thrown when configuring llama.cpp / ggml log output fails.
LlamaModelLoadException
Thrown when LlamaModel.load fails.
LlamaStateException
Thrown when a state file can't be parsed or doesn't match the runtime.
LlamaTokenizeException
Thrown when tokenization or detokenization fails.
MultimodalException
Thrown when multimodal init / encoding fails.