juice_llm_llamacpp #

Embedded on-device LLM runtime for juice_llm — an LlamaCppProvider that runs GGUF models on llama.cpp (Metal/CPU), in-process and private, via llama_cpp_dart.

final llm = LlmBloc.withConfig(LlmConfig(
  provider: LlamaCppProvider(libraryPath: '/path/to/libllama.dylib'), // macOS dev
  resolvePath: (model) => '/path/to/model.gguf',
));
llm.loadModel(myModel);
llm.generate(LlmRequest(requestId: 'r1', messages: [LlmMessage.user('hi')]));

Nothing else in the app changes — it's just an LlmProvider. Swap it for the EchoLlmProvider default (or an HTTP provider) without touching widgets or use cases.

Native binary (the one setup step) #

This package is pure Dart. The native llama.cpp library is llama_cpp_dart's concern — grab the prebuilt binary from its GitHub Releases:

Target	Artifact	Wire it up
macOS dev / CLI / tests	`macos-libllama.zip` (`libllama.dylib` + siblings)	unzip anywhere, `LlamaCppProvider(libraryPath: '…/libllama.dylib')`. Downloaded dylibs are Gatekeeper-quarantined — `xattr -dr com.apple.quarantine <dir>`.
iOS / macOS app	`llama.xcframework`	drag into Xcode → Embed & Sign, then `LlamaCppProvider(useProcessSymbols: true)` (no path; dyld resolves it).

Cancellation #

Cancelling the generation stream cancels the underlying generation. On the published llama_cpp_dart 0.9.0-dev.9 this is soft — token delivery stops immediately (the session reaches cancelled), but the worker finishes the current decode. That's fine for short generations (e.g. a one-line reflection).

True mid-decode interrupt lands when netdur/llama_cpp_dart#106 merges (an event-loop-starvation fix); no change is needed here when it does.

Models #

Any GGUF llama.cpp loads. Gemma 4 (Apache-2.0, mirror-able) is a good on-device pick: gemma-4-E2B-it-qat ~2.6 GB Q4 text-only. Memory (not policy) is the device ceiling — ~2.6 GB fits 8 GB-RAM iPhones comfortably, marginal on 6 GB; context length is the tuning knob. Weights are downloaded at runtime (a ModelSource), not bundled — see juice_llm's provisioning notes.

Embeddings #

embed() maps to llama_cpp_dart's embedding pass (for juice_llm's semantic search). It requires the engine to be configured for embeddings on the loaded model.

Status #

0.1.0 — built and verified end-to-end on macOS/Metal (real model through LlmBloc: streaming generation, KV reuse across requests, cancel). The embedded runtime behind juice_llm's LlmProvider seam.

juice_llm_llamacpp 0.1.0
juice_llm_llamacpp: ^0.1.0 copied to clipboard

Metadata

juice_llm_llamacpp #

Native binary (the one setup step) #

Cancellation #

Models #

Embeddings #

Status #

← Metadata

Publisher

Weekly Downloads

Metadata

Topics

Funding

License

Dependencies

More

juice_llm_llamacpp 0.1.0 juice_llm_llamacpp: ^0.1.0 copied to clipboard

Metadata

juice_llm_llamacpp #

Native binary (the one setup step) #

Cancellation #

Models #

Embeddings #

Status #

← Metadata

Publisher

Weekly Downloads

Metadata

Topics

Funding

License

Dependencies

More

juice_llm_llamacpp 0.1.0
juice_llm_llamacpp: ^0.1.0 copied to clipboard