
A Flutter package for running language models on device with support for completion, instruct mode, tool calling, streaming, constrained generation, LoRA, and multi-modal inputs (images, audio).

Llamafu #


Run AI models directly on mobile devices. No cloud. No latency. Complete privacy.

Llamafu is a Flutter FFI plugin that brings the power of large language models to your mobile apps. Built on llama.cpp, it delivers high-performance inference with support for text generation, vision, tool calling, and more—all running locally on the device.

Why Llamafu? #

Feature          Benefit
100% On-Device   No API keys, no network calls, works offline
Privacy First    Data never leaves the device
Low Latency      No round-trip to cloud servers
Cost Effective   No per-token API charges

Features #

Core Capabilities

  • Text generation with streaming support
  • Chat completions with conversation history
  • Embeddings generation for semantic search

Advanced AI

  • Vision/multimodal (images, audio) with LLaVA, Qwen2-VL
  • Tool calling / function calling
  • Structured JSON output with schema validation
  • Grammar-constrained generation (GBNF)

Customization

  • LoRA adapter loading and hot-swapping
  • Fine-grained sampling controls (temperature, top-k, top-p, penalties)
  • Configurable context size and threading

Platform Support

  • Android (API 21+) and iOS (12.0+)
  • Optimized native code via FFI
  • GPU acceleration where available

Requirements #

Platform   Minimum Version
Flutter    3.10.0+
Dart SDK   3.1.0+
Android    API 21+ (Android 5.0), NDK 21+
iOS        12.0+, Xcode 14+

Models must be in GGUF format. Quantized versions (e.g. Q4_K_M, Q8_0) are recommended for mobile.

Installation #

flutter pub add llamafu

Or add manually to pubspec.yaml:

dependencies:
  llamafu: ^0.1.0

Quick Start #

import 'package:llamafu/llamafu.dart';

void main() async {
  // Load a GGUF model
  final llamafu = await Llamafu.init(
    modelPath: '/path/to/model.gguf',
    threads: 4,
    contextSize: 2048,
  );

  // Generate text
  final result = await llamafu.complete(
    prompt: 'Explain quantum computing in simple terms:',
    maxTokens: 256,
    temperature: 0.7,
  );

  print(result);

  // Always clean up resources
  llamafu.close();
}

That's it! The model runs entirely on the device—no internet required.
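
Llamafu.init expects a filesystem path, so a model shipped as a Flutter asset must first be copied out of the asset bundle. A minimal sketch, assuming the path_provider package is added as an extra dependency (bundling is only practical for small models; larger ones are usually downloaded on first launch):

import 'dart:io';

import 'package:flutter/services.dart' show rootBundle;
import 'package:path_provider/path_provider.dart';

/// Copies a bundled GGUF model into the documents directory and
/// returns a filesystem path that Llamafu.init can open.
Future<String> copyModelFromAssets(String assetName) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/$assetName');
  if (!await file.exists()) {
    final data = await rootBundle.load('assets/$assetName');
    await file.writeAsBytes(
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes),
      flush: true,
    );
  }
  return file.path;
}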

Documentation #

Guide               Description
Getting Started     Installation and first steps
Model Guide         Choosing and obtaining models
High-Level APIs     Chat, LoRA, and multimodal
Tool Calling        Function calling and JSON output
Performance Guide   Memory and speed optimization
API Reference       Complete API documentation
Architecture        Technical design (for contributors)
Building            Build from source
Contributing        Development guidelines

Usage Examples #

Text Generation #

final result = await llamafu.complete(
  prompt: 'Write a function to sort an array:',
  maxTokens: 300,
  temperature: 0.7,
  topK: 40,
  topP: 0.9,
  repeatPenalty: 1.1,
);
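
Chat #

Chat completions with conversation history are a listed feature, but the exact API isn't shown on this page; the method name and ChatMessage type below are assumptions for illustration, so check the API reference for the real signatures:

// Hypothetical chat API (method name and ChatMessage type are assumed)
final reply = await llamafu.chat(
  messages: [
    ChatMessage(role: 'system', content: 'You are a helpful assistant.'),
    ChatMessage(role: 'user', content: 'What is the GGUF format?'),
  ],
  maxTokens: 200,
);

Streaming #

Streaming is likewise a listed feature; here is a sketch assuming a method that returns a Stream of tokens (the name completeStream is an assumption, not confirmed on this page):

// Hypothetical streaming API: render tokens as they arrive
final stream = llamafu.completeStream(
  prompt: 'Tell me a short story:',
  maxTokens: 200,
);

await for (final token in stream) {
  stdout.write(token); // stdout is from dart:io; in a widget, append to state instead
}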

Multimodal (Vision) #

// Load vision model with projection
final llamafu = await Llamafu.init(
  modelPath: '/path/to/llava-model.gguf',
  mmprojPath: '/path/to/mmproj.gguf',
);

final result = await llamafu.multimodalComplete(
  prompt: 'Describe this image:',
  mediaInputs: [
    MediaInput(type: MediaType.image, data: '/path/to/image.jpg'),
  ],
  maxTokens: 200,
);
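
The package description also lists audio inputs. Assuming MediaType has an audio variant mirroring the image case above (an assumption; verify the enum in the API reference), the call looks the same:

// Audio input (MediaType.audio is assumed from the package description)
final transcript = await llamafu.multimodalComplete(
  prompt: 'Summarize this recording:',
  mediaInputs: [
    MediaInput(type: MediaType.audio, data: '/path/to/audio.wav'),
  ],
  maxTokens: 200,
);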

LoRA Adapters #

// Load and apply adapter
final adapter = await llamafu.loadLoraAdapter('/path/to/adapter.gguf');
await llamafu.applyLoraAdapter(adapter, scale: 0.8);

// Generate with adapter
final result = await llamafu.complete(
  prompt: 'Translate to French: Hello',
  maxTokens: 50,
);

// Remove adapter
await llamafu.removeLoraAdapter(adapter);
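
Hot-swapping, as listed under Features, composes the same calls: remove the current adapter, then apply another, without reloading the base model (paths here are placeholders):

// Swap adapters between generations; the base model stays loaded
await llamafu.removeLoraAdapter(adapter);
final otherAdapter = await llamafu.loadLoraAdapter('/path/to/other-adapter.gguf');
await llamafu.applyLoraAdapter(otherAdapter, scale: 1.0);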

Structured Output #

const jsonGrammar = '''
root ::= object
object ::= "{" ws string ":" ws value "}" ws
string ::= "\\"" [a-zA-Z]+ "\\""
value ::= string | number
number ::= [0-9]+
ws ::= [ ]*
''';

final result = await llamafu.completeWithGrammar(
  prompt: 'Generate user data:',
  grammarStr: jsonGrammar,
  grammarRoot: 'root',
  maxTokens: 100,
);
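
During constrained generation the grammar acts as a filter on sampling: at each step, tokens that would make the output underivable from the root rule are masked out, so the final text is guaranteed to match the grammar.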

Tool Calling #

// Define tools
final weatherTool = Tool(
  name: 'get_weather',
  description: 'Get weather for a location',
  parameters: {
    'type': 'object',
    'properties': {
      'location': {'type': 'string'},
    },
    'required': ['location'],
  },
);

// Generate tool call
final toolCall = await llamafu.generateToolCall(
  prompt: "What's the weather in Paris?",
  tools: [weatherTool],
);

print(toolCall.name);       // "get_weather"
print(toolCall.arguments);  // {"location": "Paris"}
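
The model only produces the call; executing it is up to your app. A sketch of the dispatch step, assuming arguments is a map and where fetchWeather is a hypothetical function of your own:

// Dispatch the generated call to real app code
if (toolCall.name == 'get_weather') {
  final location = toolCall.arguments['location'] as String;
  final weather = await fetchWeather(location); // your own implementation
  // Feed `weather` back into a follow-up prompt to produce the final answer.
}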

JSON Output #

// Generate JSON matching a schema
final result = await llamafu.generateJson(
  prompt: 'Extract: John is 25 years old',
  schema: {
    'type': 'object',
    'properties': {
      'name': {'type': 'string'},
      'age': {'type': 'integer'},
    },
    'required': ['name', 'age'],
  },
);

print(result);  // {"name": "John", "age": 25}
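
Assuming generateJson returns the raw JSON string (as the print above suggests), it decodes straight into a map with dart:convert:

import 'dart:convert';

final user = jsonDecode(result) as Map<String, dynamic>;
print(user['name']); // John
print(user['age']);  // 25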

Tokenization #

// Tokenize
final tokens = await llamafu.tokenize('Hello world');
print('Token count: ${tokens.length}');

// Detokenize
final text = await llamafu.detokenize(tokens);

// Model info
final info = await llamafu.getModelInfo();
print('Vocab: ${info.vocabularySize}');
print('Context: ${info.contextLength}');
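
Embeddings #

Embeddings are a listed core capability, but no API for them is shown on this page; the method name embed below is an assumption for illustration. Cosine similarity is the usual way to compare the resulting vectors for semantic search:

import 'dart:math';

// Hypothetical embeddings API (method name assumed; see the API reference)
final a = await llamafu.embed('How do I reset my password?');
final b = await llamafu.embed('Password reset instructions');

double cosine(List<double> x, List<double> y) {
  var dot = 0.0, nx = 0.0, ny = 0.0;
  for (var i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
    nx += x[i] * x[i];
    ny += y[i] * y[i];
  }
  return dot / (sqrt(nx) * sqrt(ny));
}

print(cosine(a, b)); // closer to 1.0 means more semantically similar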

Supported Models #

Works with any GGUF-format model. Popular choices for mobile:

Category     Models
General      LLaMA 3, Mistral, Phi-3, Qwen2, Gemma 2
Code         Code LLaMA, DeepSeek Coder, StarCoder2
Vision       LLaVA, Qwen2-VL, Moondream
Small/Fast   Phi-3 Mini, TinyLlama, Gemma 2B

Recommended quantizations for mobile: Q4_K_M (best quality/size), Q4_0 (fastest), Q8_0 (highest quality)

Find models at Hugging Face or convert your own with llama.cpp.


For Developers #

Building from Source #

# Clone with submodules
git clone --recursive https://github.com/neul-labs/llamafu.git
cd llamafu

# Setup development environment
make setup

# Build and test
make build
make test

Platform-specific builds:

make build-android    # Android AAR
make build-ios        # iOS framework
make build-local      # Local dev with GPU support

Project Structure #

llamafu/
├── lib/src/
│   ├── llamafu_base.dart       # High-level Dart API
│   └── llamafu_bindings.dart   # FFI bindings (dart:ffi)
├── android/src/main/cpp/
│   ├── llamafu.h               # C API header
│   └── llamafu.cpp             # Native implementation
├── ios/Classes/                # iOS native code
├── llama.cpp/                  # llama.cpp submodule (inference engine)
├── test/                       # Comprehensive test suite
├── tools/                      # Build and test scripts
├── example/                    # Example Flutter app
└── docs/                       # Documentation

Architecture #

┌─────────────────────────────────────────────────────────┐
│                    Your Flutter App                      │
├─────────────────────────────────────────────────────────┤
│              Llamafu Dart API (lib/src/)                │
│         High-level, type-safe, async interface          │
├─────────────────────────────────────────────────────────┤
│           FFI Bindings (llamafu_bindings.dart)          │
│              dart:ffi ↔ Native C bridge                 │
├─────────────────────────────────────────────────────────┤
│            Native C++ Layer (llamafu.cpp)               │
│              RAII, memory safety, validation            │
├─────────────────────────────────────────────────────────┤
│                    llama.cpp Engine                      │
│        High-performance inference, GGUF loading         │
└─────────────────────────────────────────────────────────┘

Performance Tips #

  • Use quantized models — Q4_K_M offers the best quality/size tradeoff for mobile
  • Right-size context — Smaller context = less memory (start with 2048, increase if needed)
  • Tune threading — threads: Platform.numberOfProcessors - 1 is a good default (see the sketch after this list)
  • Clean up resources — Always call close() when done to free native memory
  • Stream responses — Use streaming for better perceived performance in chat UIs
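
For the threading tip, dart:io exposes the core count directly; a small sketch:

import 'dart:io';
import 'dart:math';

// Leave one core free for the UI; never go below one worker thread
final threads = max(1, Platform.numberOfProcessors - 1);

final llamafu = await Llamafu.init(
  modelPath: '/path/to/model.gguf',
  threads: threads,
  contextSize: 2048,
);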

Error Handling #

try {
  final result = await llamafu.complete(prompt: input, maxTokens: 200);
} on LlamafuException catch (e) {
  switch (e.code) {
    case LlamafuErrorCode.modelLoadFailed:
      print('Could not load model: ${e.message}');
    case LlamafuErrorCode.outOfMemory:
      print('Not enough memory—try a smaller model or context size');
    default:
      print('Error: ${e.message}');
  }
}
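
Pairing this with try/finally guarantees close() runs even when generation throws, so native memory is always released:

final llamafu = await Llamafu.init(modelPath: '/path/to/model.gguf');
try {
  final result = await llamafu.complete(prompt: input, maxTokens: 200);
  print(result);
} finally {
  llamafu.close(); // free native memory even on failure
}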

Contributing #

Contributions are welcome! See docs/contributing.md for guidelines.

# Run the full test suite
make test

# Format and lint
dart format . && dart analyze


License #

MIT License. See LICENSE for details.


Acknowledgments #

Built on the shoulders of giants:

  • llama.cpp — The inference engine that makes this possible
  • ggml — Tensor library for efficient ML computation