smollm2 1.0.0

Pure Dart inference engine for SmolLM2 language models, delivering surprisingly capable local LLM results without requiring CUDA.

smollm2 #


smollm2 is a pure Dart LLM inference engine, local language model runtime, and Hugging Face exporter for SmolLM2 language models.

It allows you to:

  • run SmolLM2 text generation locally
  • export Hugging Face SmolLM2 checkpoints into an optimized Dart binary format
  • use Q8 or Q16 quantized weights
  • generate deterministic or seeded outputs
  • embed inference directly inside Dart applications

No Python runtime, no llama.cpp dependency, and no external native bindings are required.


Features #

  • 🧠 Pure Dart transformer inference
  • ⚡ SIMD optimized math kernels
  • 💾 Built-in Q8 and Q16 quantization formats
  • 🔁 KV cache for autoregressive generation
  • 🌀 RoPE positional embeddings
  • 🎲 Temperature + repetition penalty + deterministic seed
  • 📦 Hugging Face safetensors exporter
  • 🖥 CLI runner included
  • 🔧 Programmatic API for Dart apps

Supported Models #

This package is designed for the SmolLM2 family published by Hugging Face Smol Models Research.

Typical supported checkpoints:

  • SmolLM2-135M-Instruct
  • SmolLM2-360M-Instruct

Other SmolLM2 variants with the same architecture may also work.


Installation #

Add to pubspec.yaml:

dependencies:
  smollm2: ^1.0.0

or install with:

dart pub add smollm2

Model Export #

Before inference, a Hugging Face SmolLM2 checkpoint must be converted into the package's own optimized .bin format.

Directory Mode #

dart run bin/export_smollm2.dart -Q8 models/smollm2-135m-instruct/

or

dart run bin/export_smollm2.dart -Q16 models/smollm2-135m-instruct/

Expected directory contents:

config.json
tokenizer.json
model.safetensors

or:

config.json
tokenizer.json
model.safetensors.index.json + shard files

The exporter automatically detects whether the model is single-file or sharded.
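The detection step can be pictured with a small sketch. This is not the exporter's actual code: `detectLayout` and `CheckpointLayout` are hypothetical names, and the index file name is assumed to be `model.safetensors.index.json`, as in the sharded download listing later in this README.

```dart
import 'dart:io';

/// The two checkpoint layouts described above, plus a fallback.
enum CheckpointLayout { singleFile, sharded, unknown }

/// Hypothetical helper: guesses the layout from the files present in [dir].
CheckpointLayout detectLayout(Directory dir) {
  bool has(String name) => File('${dir.path}/$name').existsSync();

  // Both layouts need the config and the tokenizer.
  if (!has('config.json') || !has('tokenizer.json')) {
    return CheckpointLayout.unknown;
  }
  if (has('model.safetensors')) return CheckpointLayout.singleFile;
  // Assumed index file name for sharded checkpoints.
  if (has('model.safetensors.index.json')) return CheckpointLayout.sharded;
  return CheckpointLayout.unknown;
}

void main() {
  final dir = Directory('models/smollm2-135m-instruct');
  print(detectLayout(dir));
}
```

Checking for the single file first keeps the common case cheap; a real exporter would presumably also verify that every shard named in the index exists.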

Generated output example:

models/smollm2-135m-instruct/smollm2-q8.bin

Explicit File Mode #

dart run bin/export_smollm2.dart \
  config.json \
  tokenizer.json \
  model.safetensors \
  smollm2-q8.bin

Export Notes #

Available quantization formats:

  • -Q8 → smaller file, faster loading
  • -Q16 → larger file, better numeric precision
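To illustrate the size versus precision trade-off, here is a minimal sketch of symmetric per-tensor 8-bit quantization. The README does not document the exact scheme behind -Q8 and -Q16, so this is a generic textbook formulation; `quantizeQ8` and `QuantizedTensor` are hypothetical names, not part of the package API.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// Quantized values plus the scale needed to recover approximate floats.
class QuantizedTensor {
  final Int8List values;
  final double scale;
  const QuantizedTensor(this.values, this.scale);

  /// Approximate original value at index [i].
  double dequantize(int i) => values[i] * scale;
}

/// Generic symmetric quantization (an assumption, not the package's scheme):
/// the largest absolute weight maps to 127, everything else scales linearly.
QuantizedTensor quantizeQ8(Float32List weights) {
  var maxAbs = 0.0;
  for (final w in weights) {
    maxAbs = math.max(maxAbs, w.abs());
  }
  final scale = maxAbs == 0.0 ? 1.0 : maxAbs / 127.0;
  final q = Int8List(weights.length);
  for (var i = 0; i < weights.length; i++) {
    q[i] = (weights[i] / scale).round().clamp(-127, 127).toInt();
  }
  return QuantizedTensor(q, scale);
}

void main() {
  final weights = Float32List.fromList([0.5, -1.0, 0.25, 0.0]);
  final t = quantizeQ8(weights);
  for (var i = 0; i < weights.length; i++) {
    print('${weights[i]} -> ${t.values[i]} -> ${t.dequantize(i)}');
  }
}
```

Each weight costs one byte here, while a 16-bit representation costs two, which is why an 8-bit file is roughly half the size at the price of a coarser rounding step.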

The exporter converts:

  • configuration
  • tokenizer vocabulary
  • merge pairs
  • all transformer weights

into a single portable binary file optimized for Dart runtime loading.

Custom Binary Format #

The exporter writes the Hugging Face model checkpoint into a single custom SMOL binary file designed specifically for fast Dart loading and low runtime overhead.

This binary format stores, in sequence:

  • package header and format version
  • quantization metadata
  • model configuration
  • tokenizer vocabulary
  • tokenizer merge pairs
  • all transformer tensors already converted to the selected quantized representation

Unlike Hugging Face safetensors, which require parsing many named tensors and JSON metadata at runtime, the SMOL format is a direct sequential memory layout. This allows the Dart engine to read the file in one pass with minimal allocations and without expensive tensor name resolution.

Additional advantages:

  • faster startup time
  • much lower parsing complexity
  • portable single-file deployment
  • deterministic tensor ordering
  • direct compatibility with Q8/Q16 internal kernels

The file begins with the magic bytes SMOL, followed by a version field, making the format extensible for future quantization modes and runtime improvements.
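As a quick sanity check, an application could verify the magic bytes before handing a file to the engine. The sketch below only tests the four ASCII bytes of `SMOL`; it does not decode the version field, because its width and byte order are not documented here. `hasSmolMagic` is a hypothetical helper, not part of the package API.

```dart
import 'dart:io';

/// ASCII codes for the characters 'S', 'M', 'O', 'L'.
const smolMagic = [0x53, 0x4D, 0x4F, 0x4C];

/// Returns true when [bytes] starts with the magic bytes SMOL.
bool hasSmolMagic(List<int> bytes) {
  if (bytes.length < smolMagic.length) return false;
  for (var i = 0; i < smolMagic.length; i++) {
    if (bytes[i] != smolMagic[i]) return false;
  }
  return true;
}

void main(List<String> args) {
  final path = args.isNotEmpty
      ? args.first
      : 'models/smollm2-135m-instruct/smollm2-q8.bin';
  final file = File(path);
  if (!file.existsSync()) {
    print('File not found: $path');
    return;
  }
  // Read only the first four bytes instead of loading the whole model.
  final raf = file.openSync();
  try {
    final header = raf.readSync(smolMagic.length);
    print(hasSmolMagic(header) ? 'SMOL file' : 'Not a SMOL file');
  } finally {
    raf.closeSync();
  }
}
```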


CLI Inference #

Run a local generation:

dart run bin/smollm2.dart

Default parameters:

model           = models/smollm2-135m-instruct/smollm2-q16.bin
prompt          = The capital of France is
maxTokens       = 60
temperature     = 0.0
repeatPenalty   = 1.09
seed            = auto-generated

CLI Options #

dart run bin/smollm2.dart [options]

Option  Description
-m      Model .bin file path
-p      Prompt text
-n      Maximum generated tokens
-t      Temperature
-r      Repetition penalty
-s      Seed
-h      Help

Example #

dart run bin/smollm2.dart \
  -m models/smollm2-360m-instruct/smollm2-q8.bin \
  -p "Explain what quantum computing is in simple terms." \
  -n 120 \
  -t 0.7 \
  -r 1.12 \
  -s 12345

Programmatic Usage #

import 'package:smollm2/smollm2.dart';

Future<void> main() async {
  final model = SmolLM2();

  // Load a model previously exported to the SMOL .bin format.
  await model.load('models/smollm2-135m-instruct/smollm2-q16.bin');

  // Generate up to 80 tokens; a fixed seed makes sampled output reproducible.
  await model.generate(
    'Write a short poem about the sea.',
    maxTokens: 80,
    temperature: 0.8,
    repeatPenalty: 1.10,
    seed: 42,
  );
}

Generation Parameters #

Temperature #

Controls randomness.

Value      Behavior
0.0        Fully deterministic / greedy
0.2 - 0.5  Conservative
0.6 - 0.9  Balanced
1.0+       Highly creative / unstable
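The effect of temperature can be sketched as follows. This is the standard textbook formulation (divide the logits by the temperature, then apply softmax), not necessarily the package's exact code; `softmaxWithTemperature`, `argmax`, and `sampleToken` are hypothetical names.

```dart
import 'dart:math' as math;

/// Index of the largest logit (the greedy choice).
int argmax(List<double> logits) {
  var best = 0;
  for (var i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

/// Softmax over logits divided by [temperature], which must be > 0.
List<double> softmaxWithTemperature(List<double> logits, double temperature) {
  final scaled = [for (final l in logits) l / temperature];
  // Subtract the maximum for numerical stability before exponentiating.
  final maxLogit = scaled.reduce(math.max);
  final exps = [for (final s in scaled) math.exp(s - maxLogit)];
  final sum = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / sum];
}

/// Temperature 0.0 falls back to greedy decoding; otherwise sample.
int sampleToken(List<double> logits, double temperature, math.Random rng) {
  if (temperature <= 0.0) return argmax(logits);
  final probs = softmaxWithTemperature(logits, temperature);
  var r = rng.nextDouble();
  for (var i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r < 0) return i;
  }
  return probs.length - 1; // Guard against rounding error.
}

void main() {
  final logits = [1.0, 3.0, 2.0];
  print(sampleToken(logits, 0.0, math.Random(42))); // greedy: index 1
  print(softmaxWithTemperature(logits, 0.5));
  print(softmaxWithTemperature(logits, 1.5));
}
```

Lower temperatures concentrate probability on the top logit, while higher temperatures flatten the distribution, which matches the behavior ranges in the table.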

Repetition Penalty #

Discourages token loops and repeated phrases.

Typical values:

Value        Behavior
1.00         disabled
1.05 - 1.10  light control
1.10 - 1.20  strong control
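A common way to implement a repetition penalty is to weaken the logits of tokens that have already appeared: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below shows that widely used formulation; whether the package applies exactly this rule is an assumption, and `applyRepetitionPenalty` is a hypothetical name.

```dart
/// Penalizes, in place, the logits of every token id in [seenTokens].
/// A penalty of 1.0 leaves the logits unchanged.
void applyRepetitionPenalty(
  List<double> logits,
  Iterable<int> seenTokens,
  double penalty,
) {
  // Use a set so a token seen several times is penalized only once.
  for (final id in seenTokens.toSet()) {
    final logit = logits[id];
    logits[id] = logit > 0 ? logit / penalty : logit * penalty;
  }
}

void main() {
  final logits = [2.0, -1.0, 0.5];
  applyRepetitionPenalty(logits, [0, 1, 0], 2.0);
  print(logits); // [1.0, -2.0, 0.5]
}
```

The penalty of 2.0 is exaggerated for readability; with values such as 1.05 to 1.20 the adjustment is much gentler.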

Seed #

Generation is reproducible when the same:

  • prompt
  • model
  • temperature
  • repetition penalty
  • seed

are used together.

A random seed can also be generated automatically:


final seed = SmolLM2.generateSeed();
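The reproducibility guarantee rests on seeded pseudo-random number generation. The sketch below demonstrates the principle with `Random` from `dart:math`; which generator the engine uses internally is not stated in this README, and `draw` is a hypothetical helper.

```dart
import 'dart:math';

/// Draws [count] pseudo-random integers from a generator seeded with [seed].
List<int> draw(int seed, int count) {
  final rng = Random(seed);
  return [for (var i = 0; i < count; i++) rng.nextInt(1000)];
}

void main() {
  final first = draw(42, 5);
  final second = draw(42, 5);
  print(first);
  print(second);
  // The same seed yields the same sequence, so sampling decisions repeat.
  print(first.join(',') == second.join(',')); // true
}
```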

Runtime Statistics #

After generation the engine reports:

  • prompt token count
  • generated token count
  • total tokens
  • prompt ingestion speed
  • generation speed

Example:

--- stats ---
prompt tokens    : 6
generated tokens : 60
total tokens     : 66
prompt ingest    : 0.842 s (7.12 tk/s)
generation       : 5.101 s (11.76 tk/s)
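The throughput figures are tokens divided by elapsed seconds. A minimal sketch of that arithmetic, using the generation line from the example above (`tokensPerSecond` is a hypothetical helper, not the engine's own reporting code):

```dart
/// Tokens processed per second over [elapsed]; 0.0 if no time has passed.
double tokensPerSecond(int tokens, Duration elapsed) {
  final seconds = elapsed.inMicroseconds / Duration.microsecondsPerSecond;
  return seconds > 0 ? tokens / seconds : 0.0;
}

void main() {
  // 60 generated tokens in 5.101 s, as in the sample stats.
  final rate = tokensPerSecond(60, const Duration(milliseconds: 5101));
  print('generation : ${rate.toStringAsFixed(2)} tk/s'); // 11.76 tk/s
}
```

In an application, a `Stopwatch` started before generation and stopped afterwards would supply the elapsed duration.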

Downloading SmolLM2 Models from Hugging Face #

SmolLM2 checkpoints can be downloaded directly from Hugging Face using the companion package huggingface_downloader, a Dart CLI utility for resumable and structured model downloads.

This is especially useful because LLM checkpoints are large and may include multiple shard files.

Install globally:

dart pub global activate huggingface_downloader

Download SmolLM2-135M-Instruct #

huggingface_downloader \
  HuggingFaceTB/SmolLM2-135M-Instruct \
  ./models/smollm2-135m \
  --llm-only

Download SmolLM2-360M-Instruct #

huggingface_downloader \
  HuggingFaceTB/SmolLM2-360M-Instruct \
  ./models/smollm2-360m \
  --llm-only

What --llm-only Does #

The --llm-only flag downloads only the files required for language model export and inference, skipping unrelated repository assets such as README files, training metadata, images, or auxiliary resources.

Typical downloaded structure:

models/smollm2-135m/HuggingFaceTB/SmolLM2-135M-Instruct/
├── config.json
├── tokenizer.json
└── model.safetensors

For sharded checkpoints:

models/smollm2-360m/HuggingFaceTB/SmolLM2-360M-Instruct/
├── config.json
├── tokenizer.json
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
└── model-00002-of-00002.safetensors

Next Step After Download #

After downloading, continue to Model Export to convert the checkpoint into the optimized native SMOL binary format.


Internal Architecture #

smollm2 implements the full SmolLM2 forward pass in Dart:

  • tokenizer encoding + BPE merges
  • embedding lookup
  • RMSNorm
  • QKV projections
  • RoPE application
  • grouped-query attention
  • KV cache storage
  • softmax attention
  • SwiGLU MLP
  • final projection to logits
  • temperature/repetition sampling
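As an illustration of one step in this pipeline, here is a minimal RMSNorm sketch. It follows the standard definition (scale each element by the reciprocal root mean square, then by a learned weight) and is not taken from the package source; the function name and the default `eps` are assumptions, since the real epsilon comes from the model configuration.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
Float32List rmsNorm(Float32List x, Float32List weight, {double eps = 1e-5}) {
  var sumSquares = 0.0;
  for (final v in x) {
    sumSquares += v * v;
  }
  final invRms = 1.0 / math.sqrt(sumSquares / x.length + eps);
  final out = Float32List(x.length);
  for (var i = 0; i < x.length; i++) {
    out[i] = x[i] * invRms * weight[i];
  }
  return out;
}

void main() {
  final x = Float32List.fromList([3.0, 4.0]);
  final weight = Float32List.fromList([1.0, 1.0]);
  // With unit weights the output has a root mean square close to 1.
  print(rmsNorm(x, weight));
}
```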

Optimizations include:

  • SIMD Float32x4 vector math
  • reusable activation buffers
  • cached FP32 embedding matrix
  • precomputed RoPE sin/cos tables
  • quantized tensor loading
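To give a feel for the SIMD optimization, the sketch below computes a dot product four lanes at a time with `Float32x4` from `dart:typed_data`. It is an independent illustration rather than the package's kernel, and `dotSimd` is a hypothetical name.

```dart
import 'dart:typed_data';

/// Dot product of [a] and [b], processing four floats per iteration.
/// Assumes both lists start at a 16-byte aligned offset in their buffers,
/// which holds for freshly allocated Float32List instances.
double dotSimd(Float32List a, Float32List b) {
  assert(a.length == b.length);
  final n = a.length;
  final chunks = n >> 2; // Number of complete 4-lane chunks.
  final a4 = Float32x4List.view(a.buffer, a.offsetInBytes, chunks);
  final b4 = Float32x4List.view(b.buffer, b.offsetInBytes, chunks);

  var acc = Float32x4.zero();
  for (var i = 0; i < chunks; i++) {
    acc += a4[i] * b4[i]; // Lane-wise multiply and accumulate.
  }
  var sum = acc.x + acc.y + acc.z + acc.w;

  // Scalar tail for lengths that are not a multiple of four.
  for (var i = chunks << 2; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

void main() {
  final a = Float32List.fromList([1, 2, 3, 4, 5, 6]);
  final b = Float32List.fromList([2, 2, 2, 2, 2, 2]);
  print(dotSimd(a, b)); // 42.0
}
```

Matrix-vector products in the attention and MLP layers reduce to many such dot products, so this inner loop is where vectorization pays off most.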

Performance Goals #

This project focuses on:

  • portability
  • simplicity
  • pure Dart experimentation
  • educational transformer implementation
  • local offline inference

It is not intended to outperform native CUDA/Metal inference engines, but aims to provide a lightweight and hackable LLM runtime fully inside Dart.


Future Improvements #

Possible future additions:

  • top-k / top-p sampling
  • chat template helpers
  • streaming callback API
  • CUDA/GPU accelerated tensor kernels
  • Metal / Vulkan backend experimentation
  • additional quantization modes
  • isolate-based parallel tensor ops
  • batched token generation
  • conversational chat session helpers

Issues & Feature Requests #

Please report bugs or request features via the issue tracker.


Author #

Graciliano M. Passos: gmpassos@GitHub


License #

Apache License - Version 2.0
