juice_llm_llamacpp 0.1.1

Embedded on-device LLM runtime for juice_llm — an LlamaCppProvider backed by llama.cpp (GGUF, Metal) via llama_cpp_dart.

juice_llm_llamacpp

Embedded on-device LLM runtime for juice_llm — an LlamaCppProvider that runs GGUF models on llama.cpp (Metal/CPU), in-process and private, via llama_cpp_dart.

final llm = LlmBloc.withConfig(LlmConfig(
  provider: LlamaCppProvider(libraryPath: '/path/to/libllama.dylib'), // macOS dev
  resolvePath: (model) => '/path/to/model.gguf',
));
llm.loadModel(myModel);
llm.generate(LlmRequest(requestId: 'r1', messages: [LlmMessage.user('hi')]));

Nothing else in the app changes — it's just an LlmProvider. Swap it for the EchoLlmProvider default (or an HTTP provider) without touching widgets or use cases.

Native binary (the one setup step)

This package is pure Dart. The native llama.cpp library is llama_cpp_dart's concern — grab the prebuilt binary from its GitHub Releases:

| Target | Artifact | Wire it up |
| --- | --- | --- |
| macOS dev / CLI / tests | macos-libllama.zip (libllama.dylib + siblings) | Unzip anywhere, then LlamaCppProvider(libraryPath: '…/libllama.dylib'). Downloaded dylibs are Gatekeeper-quarantined: xattr -dr com.apple.quarantine <dir>. |
| iOS / macOS app | llama.xcframework | Drag into Xcode → Embed & Sign, then LlamaCppProvider(useProcessSymbols: true) (no path; dyld resolves it). |

Cancellation

Cancelling the generation stream cancels the underlying generation. On the published llama_cpp_dart 0.9.0-dev.9 this is soft — token delivery stops immediately (the session reaches cancelled), but the worker finishes the current decode. That's fine for short generations (e.g. a one-line reflection).

True mid-decode interrupt lands when netdur/llama_cpp_dart#106 merges (an event-loop-starvation fix); no change is needed here when it does.
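The soft-cancel behaviour described above can be illustrated with a small simulation. This is only a sketch in plain Dart: `FakeWorker` is a hypothetical stand-in invented for illustration, not part of llama_cpp_dart or juice_llm, and the 10 ms delay merely stands in for one decode step.

```dart
import 'dart:async';

/// Hypothetical stand-in for a decode worker (NOT the llama_cpp_dart API).
/// It emits one token per simulated decode step and only notices a
/// cancellation between steps, which is what "soft" cancel means here.
class FakeWorker {
  int stepsCompleted = 0;
  bool _cancelRequested = false;

  Stream<String> generate(List<String> tokens) {
    late StreamController<String> controller;

    Future<void> run() async {
      for (final token in tokens) {
        if (_cancelRequested) break; // checked only between decode steps
        // Simulated decode step; it cannot be interrupted once started.
        await Future<void>.delayed(const Duration(milliseconds: 10));
        stepsCompleted++;
        // Token delivery stops as soon as the listener has cancelled.
        if (!_cancelRequested) controller.add(token);
      }
      await controller.close();
    }

    controller = StreamController<String>(
      onListen: run,
      onCancel: () {
        _cancelRequested = true;
      },
    );
    return controller.stream;
  }
}

Future<void> main() async {
  final worker = FakeWorker();
  final received = <String>[];
  late StreamSubscription<String> sub;
  sub = worker.generate(['a', 'b', 'c', 'd', 'e']).listen((token) {
    received.add(token);
    if (received.length == 2) sub.cancel(); // caller cancels after two tokens
  });

  // Give the simulated worker time to wind down.
  await Future<void>.delayed(const Duration(milliseconds: 200));
  print('received=$received stepsCompleted=${worker.stepsCompleted}');
}
```

In this simulation the listener stops receiving tokens the moment it cancels, while the worker still completes the decode step that was already in flight before it stops. That wasted step is the cost of a soft cancel, which is why it matters little for short generations.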

Models

Any GGUF model that llama.cpp can load will work. Gemma 4 (Apache-2.0, mirror-able) is a good on-device pick: gemma-4-E2B-it-qat is about 2.6 GB at Q4, text-only. Memory, not policy, is the device ceiling: about 2.6 GB fits 8 GB-RAM iPhones comfortably and is marginal on 6 GB, and context length is the tuning knob. Weights are downloaded at runtime (a ModelSource) rather than bundled; see juice_llm's provisioning notes.
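To see why context length is the tuning knob, it can help to estimate the KV cache that sits on top of the model weights. The sketch below uses the usual back-of-the-envelope formula for a standard transformer with an f16 cache. The architecture numbers are placeholders chosen for round arithmetic, not values from any real model card, and real runtimes (sliding-window attention, quantized caches) can differ substantially.

```dart
/// Rough KV-cache size in bytes for a standard transformer:
/// 2 (keys and values) * layers * context tokens * KV heads * head dim
/// * bytes per element.
int kvCacheBytes({
  required int layers,
  required int contextTokens,
  required int kvHeads,
  required int headDim,
  int bytesPerElement = 2, // assumes an f16 cache
}) =>
    2 * layers * contextTokens * kvHeads * headDim * bytesPerElement;

void main() {
  const gib = 1024 * 1024 * 1024;

  // Placeholder architecture values, NOT taken from a real model.
  for (final ctx in [2048, 8192, 32768]) {
    final bytes = kvCacheBytes(
      layers: 32,
      contextTokens: ctx,
      kvHeads: 8,
      headDim: 128,
    );
    print('ctx=$ctx -> KV cache ~${(bytes / gib).toStringAsFixed(2)} GiB');
  }
}
```

The cache grows linearly with context length, so on a memory-constrained phone, trimming the context is typically the cheapest way to claw back headroom without switching models.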

Embeddings

embed() maps to llama_cpp_dart's embedding pass (for juice_llm's semantic search). It requires the engine to be configured for embeddings on the loaded model.
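For context on what semantic search does with those vectors, here is a minimal cosine-similarity ranking in plain Dart. It assumes embeddings arrive as `List<double>`; the exact return type of embed() is not specified here, and the tiny three-dimensional vectors and document names are made up for illustration.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors, in [-1, 1].
/// Returns 0 if either vector is all zeros.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

void main() {
  // Made-up embeddings; real ones would come from the embedding pass.
  final query = [0.9, 0.1, 0.0];
  final docs = <String, List<double>>{
    'note-a': [1.0, 0.0, 0.0],
    'note-b': [0.0, 1.0, 0.0],
    'note-c': [0.7, 0.7, 0.0],
  };

  // Rank documents by similarity to the query, best match first.
  final ranked = docs.keys.toList()
    ..sort((x, y) => cosineSimilarity(query, docs[y]!)
        .compareTo(cosineSimilarity(query, docs[x]!)));
  print(ranked);
}
```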

Status

0.1.0: built and verified end-to-end on macOS/Metal (a real model through LlmBloc: streaming generation, KV reuse across requests, cancel). This is the embedded runtime behind juice_llm's LlmProvider seam.

Topics

#llm #on-device #llama-cpp #ai #juice

License

MIT (license)

Dependencies

flutter, juice_llm, llama_cpp_dart
