juice_llm_llamacpp
Embedded on-device LLM runtime for juice_llm:
a LlamaCppProvider that runs GGUF models on llama.cpp (Metal/CPU),
in-process and private, via llama_cpp_dart.
```dart
final llm = LlmBloc.withConfig(LlmConfig(
  provider: LlamaCppProvider(libraryPath: '/path/to/libllama.dylib'), // macOS dev
  resolvePath: (model) => '/path/to/model.gguf',
));
llm.loadModel(myModel);
llm.generate(LlmRequest(requestId: 'r1', messages: [LlmMessage.user('hi')]));
```
Nothing else in the app changes — it's just an LlmProvider. Swap it for the
EchoLlmProvider default (or an HTTP provider) without touching widgets or use
cases.
Native binary (the one setup step)
This package is pure Dart. The native llama.cpp library is llama_cpp_dart's
concern — grab the prebuilt binary from its
GitHub Releases:
| Target | Artifact | Wire it up |
|---|---|---|
| macOS dev / CLI / tests | macos-libllama.zip (libllama.dylib + siblings) | Unzip anywhere, then LlamaCppProvider(libraryPath: '…/libllama.dylib'). Downloaded dylibs are Gatekeeper-quarantined; clear the flag with `xattr -dr com.apple.quarantine <dir>`. |
| iOS / macOS app | llama.xcframework | Drag into Xcode → Embed & Sign, then LlamaCppProvider(useProcessSymbols: true) (no path; dyld resolves it). |
Cancellation
Cancelling the generation stream cancels the underlying generation. On the
published llama_cpp_dart 0.9.0-dev.9 this is soft — token delivery stops
immediately (the session reaches cancelled), but the worker finishes the
current decode. That's fine for short generations (e.g. a one-line reflection).
True mid-decode interrupt lands when netdur/llama_cpp_dart#106 merges (an event-loop-starvation fix); no change is needed here when it does.
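The soft-cancel behaviour can be illustrated with a small simulation built only on dart:async. This is a sketch, not llama_cpp_dart's implementation: FakeDecodeWorker, its 20 ms "decode" delay, and the decodesCompleted counter are all hypothetical names invented for illustration.

```dart
import 'dart:async';

/// Hypothetical stand-in for a decode worker. It simulates soft
/// cancellation; it is not llama_cpp_dart's API.
class FakeDecodeWorker {
  /// Number of simulated decode steps that ran to completion.
  int decodesCompleted = 0;
  bool _cancelRequested = false;

  /// Emits one token per simulated decode step.
  Stream<String> generate(List<String> tokens) {
    late final StreamController<String> controller;

    Future<void> run() async {
      for (final token in tokens) {
        // The flag is only checked between steps, never mid-decode.
        if (_cancelRequested) break;
        // One simulated decode step; it cannot be interrupted.
        await Future<void>.delayed(const Duration(milliseconds: 20));
        decodesCompleted++;
        // Delivery stops as soon as the listener has cancelled.
        if (!_cancelRequested) controller.add(token);
      }
      unawaited(controller.close());
    }

    controller = StreamController<String>(
      onListen: run,
      onCancel: () {
        _cancelRequested = true;
      },
    );
    return controller.stream;
  }
}
```

If a listener cancels its subscription from inside the first token callback, no further tokens arrive, yet decodesCompleted still advances once more because the step already in flight runs to its end. That is the gap a true mid-decode interrupt would close.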
Models
Any GGUF model that llama.cpp can load will work. Gemma 4 (Apache-2.0,
mirror-able) is a good on-device pick: gemma-4-E2B-it-qat, ~2.6 GB at Q4,
text-only. Memory, not policy, is the device ceiling: ~2.6 GB fits 8 GB-RAM
iPhones comfortably and is marginal on 6 GB; context length is the tuning knob.
Weights are downloaded at runtime (a ModelSource), not bundled; see juice_llm's
provisioning notes.
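To see why context length is the knob to turn, a rough footprint estimate helps. The sketch below uses the standard KV-cache approximation (one key and one value vector per layer, per KV head, per context position). The function names are hypothetical, and the estimate ignores architecture-specific savings such as sliding-window attention or a quantized cache, so treat the result as an upper-bound guide rather than a measurement.

```dart
/// Approximate KV-cache size in bytes: one key and one value vector per
/// layer, per KV head, per context position.
int kvCacheBytes({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextTokens,
  int bytesPerElement = 2, // assumes 16-bit cache entries
}) =>
    2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;

/// Weights plus KV cache, in decimal gigabytes.
double estimatedFootprintGb({
  required double weightsGb,
  required int kvBytes,
}) =>
    weightsGb + kvBytes / 1e9;
```

With placeholder shape values (30 layers, 8 KV heads, head dimension 128; these are illustrative, not Gemma's published figures), a 4096-token context costs about 0.5 GB on top of the weights, and the cost doubles each time the context doubles. Read the real shape from the model's GGUF metadata before relying on any number.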
Embeddings
embed() maps to llama_cpp_dart's embedding pass (for juice_llm's semantic
search). It requires the engine to be configured for embeddings on the loaded
model.
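Once vectors are available, semantic search usually comes down to ranking by cosine similarity. The sketch below assumes embeddings arrive as List<double> (the return type of embed() is not stated here, so that is an assumption), and both function names are hypothetical helpers rather than part of juice_llm.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 when either vector has zero magnitude.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length.');
  }
  var dot = 0.0;
  var normA = 0.0;
  var normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Indices of [docs], ordered from most to least similar to [query].
List<int> rankBySimilarity(List<double> query, List<List<double>> docs) {
  final scores = [for (final d in docs) cosineSimilarity(query, d)];
  final indices = List<int>.generate(docs.length, (i) => i);
  indices.sort((x, y) => scores[y].compareTo(scores[x]));
  return indices;
}
```

A linear scan like this is adequate for a few thousand on-device documents; larger collections would want an index, which is outside this package's scope.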
Status
0.1.0 — built and verified end-to-end on macOS/Metal (real model through
LlmBloc: streaming generation, KV reuse across requests, cancel). The
embedded runtime behind juice_llm's LlmProvider seam.
Libraries
- juice_llm_llamacpp
- Embedded on-device LLM runtime for juice_llm: a LlamaCppProvider that implements LlmProvider by wrapping llama.cpp (GGUF weights, Metal/CPU) through llama_cpp_dart. Private and in-process; no server.