smollm2 #

smollm2 is a pure Dart LLM inference engine, local language model runtime, and Hugging Face exporter for SmolLM2 language models.

It allows you to:

run SmolLM2 text generation locally
export Hugging Face SmolLM2 checkpoints into an optimized Dart binary format
use Q8 or Q16 quantized weights
generate deterministic or seeded outputs
embed inference directly inside Dart applications

No Python runtime, no llama.cpp dependency, and no external native bindings are required.

Features #

🧠 Pure Dart transformer inference
⚡ SIMD optimized math kernels
💾 Built-in Q8 and Q16 quantization formats
🔁 KV cache for autoregressive generation
🌀 RoPE positional embeddings
🎲 Temperature + repetition penalty + deterministic seed
📦 Hugging Face safetensors exporter
🖥 CLI runner included
🔧 Programmatic API for Dart apps

Supported Models #

This package is designed for the SmolLM2 family published by Hugging Face Smol Models Research.

Typical supported checkpoints:

SmolLM2-135M-Instruct
SmolLM2-360M-Instruct

Other SmolLM2 variants with the same architecture may also work.

Installation #

Add to pubspec.yaml:

dependencies:
  smollm2: ^1.0.0

or install with:

dart pub add smollm2

Model Export #

Before inference, a Hugging Face SmolLM2 checkpoint must be converted into the package's own optimized .bin format.

Directory Mode #

dart run bin/export_smollm2.dart -Q8 models/smollm2-135m-instruct/

dart run bin/export_smollm2.dart -Q16 models/smollm2-135m-instruct/

Expected directory contents:

config.json
tokenizer.json
model.safetensors

or:

config.json
tokenizer.json
model.safetensors.index.json + shard files

The exporter automatically detects whether the model is single-file or sharded.
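
The detection step can be pictured as a simple file-existence check. The sketch below is illustrative only: `detectCheckpointLayout` and `CheckpointLayout` are hypothetical names, not part of the smollm2 API, and the index file name is assumed to be `model.safetensors.index.json`, as in the download listing later in this document.

```dart
import 'dart:io';

/// Hypothetical helper type, not part of the smollm2 API.
enum CheckpointLayout { singleFile, sharded, unknown }

/// Checks which weight files are present in [dir].
CheckpointLayout detectCheckpointLayout(String dir) {
  final sep = Platform.pathSeparator;
  // Assumption: a sharded checkpoint ships an index file next to its shards.
  if (File('$dir${sep}model.safetensors.index.json').existsSync()) {
    return CheckpointLayout.sharded;
  }
  if (File('$dir${sep}model.safetensors').existsSync()) {
    return CheckpointLayout.singleFile;
  }
  return CheckpointLayout.unknown;
}
```

Checking for the index file first means a directory that contains both kinds of file is treated as sharded; the real exporter may resolve that case differently.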

Generated output example:

models/smollm2-135m-instruct/smollm2-q8.bin

Explicit File Mode #

dart run bin/export_smollm2.dart \
  config.json \
  tokenizer.json \
  model.safetensors \
  smollm2-q8.bin

Export Notes #

Available quantization formats:

-Q8 → smaller file, faster loading
-Q16 → larger file, better numeric precision
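
To give a feel for the size/precision trade-off, here is a minimal sketch of symmetric 8-bit quantization with one scale per tensor. This is an assumption made for illustration: the README does not document the actual Q8 or Q16 scheme used by the exporter, and `Q8Tensor`, `quantizeQ8`, and `dequantizeQ8` are hypothetical names.

```dart
import 'dart:typed_data';

/// Hypothetical container: int8 values plus one float scale per tensor.
class Q8Tensor {
  final Int8List values;
  final double scale;
  Q8Tensor(this.values, this.scale);
}

/// Symmetric quantization: w is approximated by q * scale, q in [-127, 127].
Q8Tensor quantizeQ8(Float32List weights) {
  var maxAbs = 0.0;
  for (final w in weights) {
    if (w.abs() > maxAbs) maxAbs = w.abs();
  }
  // Avoid dividing by zero for an all-zero tensor.
  final scale = maxAbs == 0.0 ? 1.0 : maxAbs / 127.0;
  final q = Int8List(weights.length);
  for (var i = 0; i < weights.length; i++) {
    q[i] = (weights[i] / scale).round().clamp(-127, 127).toInt();
  }
  return Q8Tensor(q, scale);
}

/// Restores approximate float weights from the quantized form.
Float32List dequantizeQ8(Q8Tensor t) {
  final out = Float32List(t.values.length);
  for (var i = 0; i < out.length; i++) {
    out[i] = t.values[i] * t.scale;
  }
  return out;
}
```

A 16-bit variant of the same idea would use a wider integer range, which is one plausible reason a Q16 file is larger but numerically closer to the original weights.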

The exporter converts:

configuration
tokenizer vocabulary
merge pairs
all transformer weights

into a single portable binary file optimized for Dart runtime loading.

Custom Binary Format #

The exporter writes the Hugging Face model checkpoint into a single custom SMOL binary file designed specifically for fast Dart loading and low runtime overhead.

This binary format stores, in sequence:

package header and format version
quantization metadata
model configuration
tokenizer vocabulary
tokenizer merge pairs
all transformer tensors already converted to the selected quantized representation

Unlike Hugging Face safetensors, which require parsing many named tensors and JSON metadata at runtime, the SMOL format is a direct sequential memory layout. This allows the Dart engine to read the file in one pass with minimal allocations and without expensive tensor name resolution.

Additional advantages:

faster startup time
much lower parsing complexity
portable single-file deployment
deterministic tensor ordering
direct compatibility with Q8/Q16 internal kernels

The file begins with the magic bytes SMOL, followed by a version field, making the format extensible for future quantization modes and runtime improvements.
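
A quick sanity check on an exported file could compare its first four bytes against the ASCII string `SMOL`. The helper below is a hypothetical sketch, not part of the package API. It deliberately stops at the magic bytes, because the width and byte order of the version field are not specified here.

```dart
import 'dart:typed_data';

/// Returns true when [bytes] starts with the ASCII magic "SMOL".
bool hasSmolMagic(Uint8List bytes) {
  const magic = [0x53, 0x4D, 0x4F, 0x4C]; // 'S', 'M', 'O', 'L'
  if (bytes.length < magic.length) return false;
  for (var i = 0; i < magic.length; i++) {
    if (bytes[i] != magic[i]) return false;
  }
  return true;
}
```

It could be applied to the result of `File(path).readAsBytesSync()` from `dart:io` before handing the file to the engine.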

CLI Inference #

Run a local generation:

dart run bin/smollm2.dart

Default parameters:

model           = models/smollm2-135m-instruct/smollm2-q16.bin
prompt          = The capital of France is
maxTokens       = 60
temperature     = 0.0
repeatPenalty   = 1.09
seed            = auto-generated

CLI Options #

dart run bin/smollm2.dart [options]

Option	Description
`-m`	Model `.bin` file path
`-p`	Prompt text
`-n`	Maximum generated tokens
`-t`	Temperature
`-r`	Repetition penalty
`-s`	Seed
`-h`	Help

Example #

dart run bin/smollm2.dart \
  -m models/smollm2-360m-instruct/smollm2-q8.bin \
  -p "Explain what quantum computing is in simple terms." \
  -n 120 \
  -t 0.7 \
  -r 1.12 \
  -s 12345

Programmatic Usage #

import 'package:smollm2/smollm2.dart';

Future<void> main() async {
  final model = SmolLM2();

  await model.load('models/smollm2-135m-instruct/smollm2-q16.bin');

  await model.generate(
    'Write a short poem about the sea.',
    maxTokens: 80,
    temperature: 0.8,
    repeatPenalty: 1.10,
    seed: 42,
  );
}

Generation Parameters #

Temperature #

Controls randomness.

Value	Behavior
`0.0`	Fully deterministic / greedy
`0.2 - 0.5`	Conservative
`0.6 - 0.9`	Balanced
`1.0+`	Highly creative / unstable
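
The table can be read against the usual definition of temperature sampling: logits are divided by the temperature before the softmax, and a temperature of 0 falls back to picking the highest logit. The sketch below shows that standard technique; `sampleToken` is a hypothetical function, and the package's own sampler may differ in detail.

```dart
import 'dart:math';

/// Picks a token index from [logits].
/// A temperature of 0 takes the argmax; otherwise the index is sampled
/// from softmax(logits / temperature).
int sampleToken(List<double> logits, double temperature, Random rng) {
  if (temperature <= 0.0) {
    var best = 0;
    for (var i = 1; i < logits.length; i++) {
      if (logits[i] > logits[best]) best = i;
    }
    return best;
  }
  // Subtract the max logit before exponentiating for numerical stability.
  final maxLogit = logits.reduce(max);
  final weights = [
    for (final l in logits) exp((l - maxLogit) / temperature),
  ];
  final total = weights.reduce((a, b) => a + b);
  // Walk the cumulative distribution until it passes a uniform draw.
  var r = rng.nextDouble() * total;
  for (var i = 0; i < weights.length; i++) {
    r -= weights[i];
    if (r <= 0.0) return i;
  }
  return weights.length - 1; // Guard against rounding error.
}
```

Lower temperatures sharpen the distribution toward the top token, while values above 1.0 flatten it, which is why high settings tend to produce less stable text.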

Repetition Penalty #

Discourages token loops and repeated phrases.

Typical values:

Value	Behavior
`1.00`	disabled
`1.05 - 1.10`	light control
`1.10 - 1.20`	strong control
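
One common way to implement a repetition penalty is to scale down the logit of every token that has already been generated. The sketch below follows that widely used formulation; it is an assumption that smollm2 does the same, and `applyRepeatPenalty` is a hypothetical name.

```dart
/// Applies a repetition penalty in place to [logits] for every token id
/// that already appears in [history].
void applyRepeatPenalty(
    List<double> logits, Iterable<int> history, double penalty) {
  if (penalty == 1.0) return; // 1.00 means disabled.
  for (final id in history.toSet()) {
    final v = logits[id];
    // Dividing a positive logit or multiplying a negative one both
    // push the score down, so the token becomes less likely.
    logits[id] = v > 0 ? v / penalty : v * penalty;
  }
}
```

With this formulation a penalty of 1.00 leaves the logits untouched, which matches the "disabled" row in the table.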

Seed #

Generation is reproducible when the same:

prompt
model
temperature
repetition penalty
seed

are used together.

A random seed can also be generated automatically:

final seed = SmolLM2.generateSeed();
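
Seeded reproducibility rests on the fact that a pseudo-random generator created with the same seed yields the same sequence of draws. The sketch below demonstrates this with `Random` from `dart:math`; whether smollm2 uses that class internally is an assumption, and `seededDraws` is a hypothetical helper.

```dart
import 'dart:math';

/// Draws [count] pseudo-random integers below [upper] from a generator
/// created with [seed].
List<int> seededDraws(int seed, int count, {int upper = 1000}) {
  final rng = Random(seed);
  return [for (var i = 0; i < count; i++) rng.nextInt(upper)];
}
```

Calling `seededDraws(42, 5)` twice in the same program returns two identical lists, whereas a different seed starts a different sequence.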

Runtime Statistics #

After generation the engine reports:

prompt token count
generated token count
total tokens
prompt ingestion speed
generation speed

Example:

--- stats ---
prompt tokens    : 6
generated tokens : 60
total tokens     : 66
prompt ingest    : 0.842 s (7.12 tk/s)
generation       : 5.101 s (11.76 tk/s)
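
The speed figures are token counts divided by elapsed seconds. A minimal sketch of that arithmetic follows; `formatThroughput` is a hypothetical helper and not the package's own reporting code.

```dart
/// Formats a throughput string such as "5.101 s (11.76 tk/s)".
String formatThroughput(int tokens, Duration elapsed) {
  final seconds = elapsed.inMicroseconds / Duration.microsecondsPerSecond;
  // Report a rate of zero instead of dividing by zero.
  final rate = seconds > 0 ? tokens / seconds : 0.0;
  return '${seconds.toStringAsFixed(3)} s (${rate.toStringAsFixed(2)} tk/s)';
}
```

For the generation line above, 60 tokens in 5.101 s gives 60 / 5.101 ≈ 11.76 tk/s.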

Downloading SmolLM2 Models from Hugging Face #

SmolLM2 checkpoints can be downloaded directly from Hugging Face using the companion package huggingface_downloader, a Dart CLI utility for resumable and structured model downloads.

This is especially useful because LLM checkpoints are large and may include multiple shard files.

Install globally:

dart pub global activate huggingface_downloader

Download SmolLM2-135M-Instruct #

huggingface_downloader \
  HuggingFaceTB/SmolLM2-135M-Instruct \
  ./models/smollm2-135m \
  --llm-only

Download SmolLM2-360M-Instruct #

huggingface_downloader \
  HuggingFaceTB/SmolLM2-360M-Instruct \
  ./models/smollm2-360m \
  --llm-only

What `--llm-only` Does #

The --llm-only flag downloads only the files required for language model export and inference, skipping unrelated repository assets such as README files, training metadata, images, or auxiliary resources.

Typical downloaded structure:

models/smollm2-135m/HuggingFaceTB/SmolLM2-135M-Instruct/
├── config.json
├── tokenizer.json
└── model.safetensors

For sharded checkpoints:

models/smollm2-360m/HuggingFaceTB/SmolLM2-360M-Instruct/
├── config.json
├── tokenizer.json
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
└── model-00002-of-00002.safetensors

Next Step After Download #

After downloading, continue to Model Export to convert the checkpoint into the optimized native SMOL binary format.

Internal Architecture #

smollm2 implements the full SmolLM2 forward pass in Dart:

tokenizer encoding + BPE merges
embedding lookup
RMSNorm
QKV projections
RoPE application
grouped-query attention
KV cache storage
softmax attention
SwiGLU MLP
final projection to logits
temperature/repetition sampling
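
As an illustration of one step in this pipeline, RMSNorm scales each activation by the inverse root mean square of the vector and then by a learned per-channel weight. The sketch below shows the standard formula; the function name and the default `eps` are assumptions, since the real value comes from the model configuration.

```dart
import 'dart:math';
import 'dart:typed_data';

/// RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
Float32List rmsNorm(Float32List x, Float32List weight, {double eps = 1e-5}) {
  var sumSquares = 0.0;
  for (final v in x) {
    sumSquares += v * v;
  }
  // One reciprocal square root is shared by every element of the vector.
  final invRms = 1.0 / sqrt(sumSquares / x.length + eps);
  final out = Float32List(x.length);
  for (var i = 0; i < x.length; i++) {
    out[i] = x[i] * invRms * weight[i];
  }
  return out;
}
```

Unlike LayerNorm, this normalization does not subtract the mean, so it needs only one pass over the vector to compute its statistics.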

Optimizations include:

SIMD Float32x4 vector math
reusable activation buffers
cached FP32 embedding matrix
precomputed RoPE sin/cos tables
quantized tensor loading
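
The Float32x4 optimization can be illustrated with a dot product, the core operation inside matrix-vector multiplication. The sketch below uses the `Float32x4` and `Float32x4List` types from `dart:typed_data`; `dotSimd` is a hypothetical function, and the package's actual kernels may be organized differently.

```dart
import 'dart:typed_data';

/// Dot product of two equal-length vectors using 4-wide SIMD lanes.
/// Assumes both lists start at a 16-byte aligned offset in their buffers,
/// which holds for freshly allocated Float32List instances.
double dotSimd(Float32List a, Float32List b) {
  final n = a.length;
  final lanes = n ~/ 4;
  // View the first lanes * 4 floats as Float32x4 values without copying.
  final a4 = Float32x4List.view(a.buffer, a.offsetInBytes, lanes);
  final b4 = Float32x4List.view(b.buffer, b.offsetInBytes, lanes);
  var acc = Float32x4.zero();
  for (var i = 0; i < lanes; i++) {
    acc += a4[i] * b4[i];
  }
  var sum = acc.x + acc.y + acc.z + acc.w;
  // Scalar tail for lengths that are not a multiple of four.
  for (var i = lanes * 4; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Each loop iteration multiplies and accumulates four floats at once, and the four partial sums are combined only at the end.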

Performance Goals #

This project focuses on:

portability
simplicity
pure Dart experimentation
educational transformer implementation
local offline inference

It is not intended to outperform native CUDA/Metal inference engines, but aims to provide a lightweight and hackable LLM runtime fully inside Dart.

Future Improvements #

Possible future additions:

top-k / top-p sampling
chat template helpers
streaming callback API
CUDA/GPU accelerated tensor kernels
Metal / Vulkan backend experimentation
additional quantization modes
isolate-based parallel tensor ops
batched token generation
conversational chat session helpers

Issues & Feature Requests #

Please report bugs or request features via the issue tracker.

Author #

Graciliano M. Passos: gmpassos@GitHub

License #

Apache License - Version 2.0
