LLAMA.CPP DART
Overview
This Dart library, powered by llama.cpp, enables local text generation for both Dart console applications and Flutter apps. It offers a simple interface over a high-performance native backend, making it suitable for applications that need on-device, dynamic text generation.
Features
- Enables high-performance, asynchronous text generation using Dart isolates.
- Offers flexible configuration through customizable model and context parameters.
- Supports real-time text generation in Flutter apps with stream-based output.
Getting Started
To begin using the llama.cpp Dart library, make sure you meet the prerequisites below and follow the outlined steps. Note that this is a pure Dart package, not a Flutter plugin, so you build and supply the llama.cpp shared library yourself.
Building the llama.cpp Library
- Obtain the Library: Download or clone the llama.cpp library from its GitHub repository.
- Platform-Specific Build: Compile llama.cpp into a shared library using your system's C++ compiler. The output will be a .dll, .so, or .dylib file, depending on your operating system.
- Integrate with Your Dart Application: Move the compiled shared library to a directory accessible by your Dart application (see the sketch below for pointing the bindings at it).
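If the shared library is not somewhere the Dart runtime will find it automatically, you can point the bindings at it explicitly before loading a model. The snippet below is a minimal sketch: the static Llama.libraryPath setter and the file path shown are assumptions for illustration, so check the package API for the exact mechanism on your platform.

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

void main() {
  // Assumed setter: tell the bindings where the compiled llama.cpp shared
  // library lives (.dll on Windows, .so on Linux, .dylib on macOS).
  Llama.libraryPath = "build/libllama.dylib"; // example path, adjust for your platform
  // ...continue with ModelParams / ContextParams / Llama as in the samples below.
}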
Prerequisites
- Dart SDK (for console application)
- Flutter SDK (for Flutter application)
- Additional dependencies as per your project requirements
Installation
Add llama_cpp_dart to your project's dependencies (for example with dart pub add llama_cpp_dart, if you are using the published package), make the compiled shared library available as described above, and then adapt the sample code below.
Usage (Sample Code)
Dart Console Application
import 'dart:io';

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

void main() async {
  // Context configuration: batch size, context length, and rope scaling.
  ContextParams contextParams = ContextParams();
  int size = 8192 * 4; // 32768-token context
  contextParams.batch = 8192 ~/ 4;
  contextParams.context = size;
  contextParams.ropeFreqBase = 57200 * 4;
  contextParams.ropeFreqScale = 0.75 / 4;

  Llama llama = Llama(
      "mistral-7b-openorca.Q5_K_M.gguf", // Change this to the path of your model
      ModelParams(),
      contextParams);

  String prompt = "Your prompt here"; // Change this to your prompt
  llama.setPrompt(prompt);

  // Asynchronous generation: consume tokens as a stream.
  await for (String token in llama.prompt(prompt)) {
    stdout.write(token);
  }

  // Synchronous generation: pull tokens one by one until done.
  // (Use either this loop or the stream above; both are shown for illustration.)
  while (true) {
    var (token, done) = llama.getNext();
    stdout.write(token);
    if (done) {
      break;
    }
  }

  llama.dispose(); // Clean up native resources.
}
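The streaming API composes with ordinary Dart async code. The helper below is a small sketch, reusing the Llama.prompt stream from the sample above, that collects an entire completion into one string:

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

/// Collects every streamed token for [prompt] into a single string.
/// Relies on the same llama.prompt(...) stream used in the sample above.
Future<String> complete(Llama llama, String prompt) async {
  final buffer = StringBuffer();
  await for (final token in llama.prompt(prompt)) {
    buffer.write(token);
  }
  return buffer.toString();
}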
Flutter Application
import 'dart:async';
import 'package:flutter/material.dart';
import 'dart:io';
import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';
import 'package:llama_cpp_dart/llama_cpp_dart.dart';
void main() async {
runApp(const App());
}
class App extends StatelessWidget {
const App({super.key});
@override
Widget build(BuildContext context) {
return MaterialApp(
title: 'Flutter llama.cpp Demo',
themeMode: ThemeMode.dark,
darkTheme: ThemeData.dark(
useMaterial3: true,
),
home: const LandingPage(),
);
}
}
class LandingPage extends StatefulWidget {
const LandingPage({super.key});
@override
State<LandingPage> createState() => _LandingPageState();
}
class _LandingPageState extends State<LandingPage> {
final TextEditingController _modelPathController = TextEditingController();
final TextEditingController _promptController = TextEditingController();
final TextEditingController _resultController = TextEditingController();
LlamaProcessor? llamaProcessor;
StreamSubscription<String>? _streamSubscription;
bool isModelLoaded = false;
@override
void initState() {
super.initState();
_modelPathController.text = "";
_promptController.text = "### Human: divide by zero please\n### Assistant:";
// _extractModel(); // Uncomment to copy the bundled model out of assets on startup.
}
Future<void> _extractModel() async {
String model = "phi-2-dpo.Q5_K_S.gguf";
final directory = await getApplicationDocumentsDirectory();
final filePath = '${directory.path}/$model';
final fileExists = await File(filePath).exists();
if (!fileExists) {
final byteData = await rootBundle.load('assets/models/$model');
final file = File(filePath);
await file.writeAsBytes(byteData.buffer
.asUint8List(byteData.offsetInBytes, byteData.lengthInBytes));
}
_modelPathController.text = filePath;
setState(() {});
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
title: const Text('Model Interaction'),
),
body: Padding(
padding: const EdgeInsets.all(8.0),
child: Column(
children: [
TextField(
controller: _modelPathController,
decoration: const InputDecoration(
labelText: 'Model Path',
border: OutlineInputBorder(),
),
),
const SizedBox(height: 10),
TextField(
controller: _promptController,
decoration: const InputDecoration(
labelText: 'Prompt',
border: OutlineInputBorder(),
),
minLines: 5,
maxLines: null,
),
const SizedBox(height: 10),
Expanded(
child: TextField(
controller: _resultController,
decoration: const InputDecoration(
labelText: 'Result',
border: OutlineInputBorder(),
),
maxLines: null,
expands: true,
textAlignVertical: TextAlignVertical.top),
),
const SizedBox(height: 10),
Text(isModelLoaded ? 'Model Loaded' : 'Model Not Loaded'),
const SizedBox(height: 10),
Row(
mainAxisAlignment: MainAxisAlignment.spaceEvenly,
children: [
ElevatedButton(
onPressed: () {
llamaProcessor = LlamaProcessor(_modelPathController.text);
setState(() {
isModelLoaded = true;
});
},
child: const Text('Load Model'),
),
ElevatedButton(
onPressed: isModelLoaded
? () {
llamaProcessor?.unloadModel();
setState(() {
isModelLoaded = false;
});
}
: null,
child: const Text('Unload Model'),
),
ElevatedButton(
onPressed: isModelLoaded
? () {
_streamSubscription?.cancel();
_resultController.text = "";
_streamSubscription =
llamaProcessor?.stream.listen((data) {
_resultController.text += data;
}, onError: (error) {
_resultController.text = "Error: $error";
}, onDone: () {});
llamaProcessor?.prompt(_promptController.text);
}
: null,
child: const Text('Generate Answer'),
),
ElevatedButton(
onPressed: isModelLoaded
? () {
llamaProcessor?.stop();
}
: null,
child: const Text('Stop Generation'),
),
],
),
],
),
),
);
}
@override
void dispose() {
_streamSubscription?.cancel();
_modelPathController.dispose();
_promptController.dispose();
_resultController.dispose();
llamaProcessor?.unloadModel();
super.dispose();
}
}
Documentation
For more detailed information about the classes and their functionalities, please refer to the following documentation:
- ContextParams - Configuration settings for the Llama model.
- Llama - Interface for interacting with the Llama model.
- LlamaProcessor - Handles asynchronous operation of a Llama model in a separate isolate.
- LlamaSplitMode - Enumerates modes for splitting the Llama model across multiple GPUs.
- ModelParams - Configuration settings for how the model is split and operated across multiple GPUs; see the sketch below.
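As an illustration of how ModelParams and LlamaSplitMode fit together for multi-GPU use, the sketch below configures a model before loading it. The field names splitMode and mainGpu are assumptions made for this example, not confirmed API; consult the ModelParams and LlamaSplitMode documentation for the exact names.

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

void configureMultiGpu() {
  ModelParams modelParams = ModelParams();
  // Hypothetical field names, shown only to illustrate the relationship
  // between ModelParams and LlamaSplitMode; check the class docs for the real API.
  // modelParams.splitMode = LlamaSplitMode.layer; // split layers across GPUs
  // modelParams.mainGpu = 0;                      // GPU used for small tensors
  Llama llama = Llama("path/to/model.gguf", modelParams, ContextParams());
  llama.dispose();
}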
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Libraries
- llama_cpp_dart
- High-level API for llama.cpp.
- llama_cpp_dart_ffi
- Low-level ffigen API to llama.cpp's C API, wrapping libllama.so.