A comprehensive pure Dart PDF library + OCR-free Document Understanding Transformer (Donut).
No native dependencies — works on all platforms (Android, iOS, Web, macOS, Windows, Linux).

| Module | Description |
|--------|-------------|
| `dart_mupdf` | Pure Dart PDF engine inspired by PyMuPDF — parse, extract text/images, annotate, merge, create PDFs |
| `donut` | Pure Dart Donut implementation — Swin Transformer encoder + mBART decoder for structured document extraction from images (receipts, invoices, forms) |
```yaml
dependencies:
  dart_mupdf_donut: ^1.0.0
```

```shell
dart pub get
```
```dart
import 'package:dart_mupdf_donut/dart_mupdf.dart';

// Open a PDF
final doc = DartMuPDF.openBytes(pdfBytes);
print('Pages: ${doc.pageCount}');
print('Title: ${doc.metadata.title}');

// Extract text
final page = doc.getPage(0);
print(page.getText());

// Search
for (final rect in page.searchFor('invoice')) {
  print('Found at: $rect');
}

// Extract images
for (final img in page.getImages()) {
  final data = doc.extractImage(img.xref);
  print('Image: ${data.width}x${data.height}');
}

// Table of contents
for (final entry in doc.getToc()) {
  print('${" " * (entry.level - 1)}${entry.title} → p.${entry.pageNumber}');
}

doc.close();
```
| Feature | PyMuPDF (Python) | dart_mupdf_donut (Dart) |
|---------|------------------|-------------------------|
| Open PDF from file/bytes | ✅ `fitz.open()` | ✅ `DartMuPDF.openFile()` / `openBytes()` |
| Page count & metadata | ✅ `doc.page_count` | ✅ `doc.pageCount` / `doc.metadata` |
| Extract plain text | ✅ `page.get_text()` | ✅ `page.getText()` |
| Extract text blocks | ✅ `page.get_text("blocks")` | ✅ `page.getTextBlocks()` |
| Extract text words | ✅ `page.get_text("words")` | ✅ `page.getTextWords()` |
| Extract text as dict | ✅ `page.get_text("dict")` | ✅ `page.getTextDict()` |
| Search text | ✅ `page.search_for()` | ✅ `page.searchFor()` |
| Get images list | ✅ `page.get_images()` | ✅ `page.getImages()` |
| Extract image bytes | ✅ `doc.extract_image()` | ✅ `doc.extractImage()` |
| Get links | ✅ `page.get_links()` | ✅ `page.getLinks()` |
| Table of contents | ✅ `doc.get_toc()` | ✅ `doc.getToc()` |
| Annotations | ✅ `page.annots()` | ✅ `page.getAnnotations()` |
| Insert text | ✅ `page.insert_text()` | ✅ `page.insertText()` |
| Insert image | ✅ `page.insert_image()` | ✅ `page.insertImage()` |
| Merge PDFs | ✅ `doc.insert_pdf()` | ✅ `doc.insertPdf()` |
| Delete / rotate / copy pages | ✅ | ✅ |
| Create new PDF | ✅ `fitz.open()` | ✅ `DartMuPDF.createPdf()` |
| Save to bytes / file | ✅ `doc.tobytes()` | ✅ `doc.toBytes()` / `doc.save()` |
| Encryption & auth | ✅ | ✅ `doc.isEncrypted` / `doc.authenticate()` |
| Embedded files | ✅ `doc.embfile_*` | ✅ `doc.embeddedFiles` |
| Form fields | ✅ `page.widgets()` | ✅ `page.getWidgets()` |
| Page labels | ✅ | ✅ `doc.getPageLabels()` |
| PDF repair | ✅ `garbage=3` | ✅ `doc.save(garbage: 3)` |
```dart
final text = page.getText();          // plain text
final blocks = page.getTextBlocks();  // blocks with position
final words = page.getTextWords();    // individual words
final dict = page.getTextDict();      // full structure
final html = page.getText(format: TextFormat.html);
```
```dart
doc.deletePage(2);
doc.getPage(0).setRotation(90);
doc.movePage(from: 5, to: 0);
doc.select([0, 2, 4]);
doc.copyPage(0, to: doc.pageCount);
```
```dart
for (final a in page.getAnnotations()) {
  print('${a.type}: ${a.content}');
}
page.addHighlightAnnot(quads);
page.addTextAnnot(Point(100, 100), 'Note');
```
```dart
final shape = Shape(pageWidth: 595, pageHeight: 842);

shape.drawRect(Rect(50, 50, 300, 200));
shape.finish(color: [1, 0, 0], fill: [0.9, 0.9, 1.0], width: 1);

shape.drawCircle(Point(200, 400), 50);
shape.finish(color: [0, 0, 1], width: 1.5);

final stream = shape.commit();
```
```dart
import 'dart:io';
import 'package:dart_mupdf_donut/donut.dart';

// 1. Configure & build model
final config = DonutConfig.base();
final model = DonutModel(config);

// 2. Load pretrained weights (from HuggingFace export)
await model.loadWeights('path/to/donut-model/');
model.loadTokenizerFromFile('path/to/tokenizer.json');

// 3. Run inference on a receipt image
final bytes = File('receipt.jpg').readAsBytesSync();
final result = model.inferenceFromBytes(
  imageBytes: bytes,
  prompt: '<s_cord-v2>',
);

// 4. Structured JSON output
print(result.json);
// {
//   "menu": [
//     {"nm": "Cappuccino", "price": "4.50"},
//     {"nm": "Croissant", "price": "3.00"}
//   ],
//   "total": {"total_price": "7.50"}
// }
```
```dart
import 'package:dart_mupdf_donut/dart_mupdf.dart';
import 'package:dart_mupdf_donut/donut.dart';

// Render the first PDF page to PNG...
final doc = DartMuPDF.openBytes(pdfBytes);
final page = doc.getPage(0);
final png = page.getPixmap(dpi: 150).toPng();

// ...then feed it to Donut
final model = DonutModel(DonutConfig.base());
await model.loadWeights('model/');
model.loadTokenizerFromFile('tokenizer.json');

final result = model.inferenceFromBytes(
  imageBytes: png,
  prompt: '<s_cord-v2>',
);
print(result.json);
doc.close();
```
```
        Document Image
              │
              ▼
┌─────────────────────────┐
│      Swin Encoder       │  Hierarchical vision transformer
│  Patch Embed → Stages   │  Window attention + patch merging
│  [2, 2, 14, 2] layers   │  Output: (1, N, 1024) features
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│      BART Decoder       │  Auto-regressive text decoder
│   Cross-attention to    │  4 layers × 16 heads
│    encoder features     │  Generates structured tokens
└───────────┬─────────────┘
            │
            ▼
    Structured JSON Output
```
Key insight: Donut skips OCR entirely — it learns to read documents end-to-end from pixels to structured data.
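As a sanity check on the shapes in the diagram, the encoder's output length N follows from simple arithmetic. This is a sketch assuming the base config's 1280×960 input, a 4×4 patch embedding, stage-1 width 128, and one 2× patch-merging step between stages; the variable names are illustrative, not the package API:

```python
# Trace feature-map shapes through a Swin-style encoder
# (assumes the donut-base hyperparameters; illustrative only).
input_h, input_w = 1280, 960   # base config input size
patch = 4                      # 4x4 patch embedding
embed_dim = 128                # stage-1 channel width

h, w, c = input_h // patch, input_w // patch, embed_dim
for stage in range(3):         # each merge halves H, W and doubles C
    h, w, c = h // 2, w // 2, c * 2

n_tokens = h * w               # sequence length fed to the decoder
print(n_tokens, c)             # → 1200 1024
```

With the base config this gives N = 1200 tokens of width 1024, matching the `(1, N, 1024)` shape in the diagram.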
| Task | Prompt | Output |
|------|--------|--------|
| Receipt Parsing (CORD-v2) | `<s_cord-v2>` | Menu items, prices, totals |
| Document Classification (RVL-CDIP) | `<s_rvlcdip>` | "letter", "invoice", etc. |
| Visual QA (DocVQA) | `<s_docvqa><s_question>…</s_question><s_answer>` | Free-text answer |
| Text Reading (SynthDoG) | `<s_synthdog>` | OCR-free text |
```dart
// Default (matches HuggingFace donut-base)
final config = DonutConfig.base();

// Custom
final config = DonutConfig(
  inputSize: [1280, 960],
  windowSize: 10,
  encoderLayer: [2, 2, 14, 2],
  decoderLayer: 4,
  maxLength: 1536,
  encoderEmbedDim: 128,
  encoderNumHeads: [4, 8, 16, 32],
  decoderEmbedDim: 1024,
  decoderFfnDim: 4096,
  decoderNumHeads: 16,
  vocabSize: 57522,
);

// Small (for testing / dev)
final config = DonutConfig.small();

// From HuggingFace config.json
final config = DonutConfig.fromJson(jsonDecode(str));
```
```dart
// From file bytes (PNG, JPEG, etc.)
final tensor = DonutImageUtils.preprocessBytes(imageBytes, config);
// → Tensor shape [1, 3, H, W] with ImageNet normalization

// Pipeline description
print(DonutImageUtils.describePipeline(config));
// 1. Decode → RGB
// 2. Resize to fit [H, W] (aspect ratio preserved)
// 3. Pad with white
// 4. Normalize: (px/255 − μ) / σ (ImageNet stats)
// 5. → Tensor [1, 3, H, W]

// Debug: tensor back to image
final debugImg = DonutImageUtils.tensorToImage(tensor);
```
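The arithmetic behind steps 2 and 4 of the pipeline can be sketched in plain Python. The helpers `fitted_size` and `normalize` are illustrative, not part of the package; the target size and per-channel statistics are the standard donut-base / ImageNet values:

```python
# Preprocessing arithmetic sketch (the real pipeline operates on
# full pixel arrays; this only shows the per-value math).
target_h, target_w = 1280, 960
mean = (0.485, 0.456, 0.406)   # ImageNet per-channel mean
std  = (0.229, 0.224, 0.225)   # ImageNet per-channel std

def fitted_size(h, w):
    """Resize to fit the target while preserving aspect ratio (step 2)."""
    scale = min(target_h / h, target_w / w)
    return round(h * scale), round(w * scale)

def normalize(px, channel):
    """Map a 0-255 pixel value into normalized range (step 4)."""
    return (px / 255.0 - mean[channel]) / std[channel]

print(fitted_size(600, 800))   # e.g. a landscape 600x800 photo → (720, 960)
print(normalize(0, 0))         # black pixel, red channel (negative value)
```

The remainder of the 1280×960 canvas not covered by the fitted image is filled with white (step 3) before normalization.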
Donut uses XML-like tokens for structured output:
```dart
// JSON → Donut tokens
final tokens = DonutModel.json2token({
  'menu': [
    {'nm': 'Latte', 'price': '5.0'},
    {'nm': 'Muffin', 'price': '3.5'},
  ],
  'total': {'total_price': '8.5'},
});
// → '<s_menu><s_nm>Latte</s_nm><s_price>5.0</s_price><sep/>
//    <s_nm>Muffin</s_nm><s_price>3.5</s_price></s_menu>
//    <s_total><s_total_price>8.5</s_total_price></s_total>'

// Donut tokens → JSON
final json = DonutModel.token2json(tokens);
// → {'menu': [...], 'total': {'total_price': '8.5'}}
```
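The serialization direction is mechanical enough to sketch in a few lines of Python. This is a simplified re-implementation for illustration only; the package's actual `json2token`/`token2json` also handle special tokens, nesting edge cases, and malformed sequences:

```python
def json2token(obj):
    """Serialize JSON into Donut's XML-like token string (simplified)."""
    if isinstance(obj, dict):
        # each key becomes a paired <s_key>…</s_key> tag
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        # list items are joined with the <sep/> token
        return "<sep/>".join(json2token(v) for v in obj)
    return str(obj)              # leaf values pass through as text

tokens = json2token({
    "menu": [{"nm": "Latte", "price": "5.0"},
             {"nm": "Muffin", "price": "3.5"}],
    "total": {"total_price": "8.5"},
})
print(tokens)
```

Running this on the example above reproduces the token string shown in the comments; `token2json` is the inverse parse of the same tag grammar.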
```dart
import 'package:dart_mupdf_donut/donut.dart';

final a = Tensor.zeros([2, 3]);
final b = Tensor.ones([2, 3]);
final c = a + b;                        // element-wise add
final d = c * Tensor.full([2, 3], 2.0);

// Matrix multiply
final y = Tensor.ones([2, 4]).matmul(Tensor.ones([4, 3]));

// Reshape, permute, transpose
final r = y.reshape([1, 2, 3]);
final p = r.permute([0, 2, 1]);

// Reductions
final s = y.sum(1);
final m = y.mean(0);

// Activations
final g = y.gelu();
final sm = y.softmax(1);
```
```dart
final linear = Linear(512, 256);
final output = linear.forward(input);

final norm = LayerNorm(256);
final normalized = norm.forward(output);

final embed = Embedding(50000, 1024);
final embedded = embed.forward([1, 42, 100]);

final attn = MultiHeadAttention(embedDim: 1024, numHeads: 16);
final attended = attn.forward(query, key, value);

final conv = Conv2d(3, 64, 4, stride: 4);
final features = conv.forward(imageTensor);
```
```shell
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('naver-clova-ix/donut-base-finetuned-cord-v2',
                  local_dir='./donut-cord-v2/')
"
```
```python
import torch, json, base64
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-cord-v2"
)

# Export every parameter as shape + dtype + base64-encoded float32 buffer
weights = {}
for name, param in model.named_parameters():
    t = param.detach().cpu().float().numpy()
    weights[name] = {
        "shape": list(t.shape),
        "dtype": "float32",
        "data": base64.b64encode(t.tobytes()).decode("ascii"),
    }

with open("weights.json", "w") as f:
    json.dump(weights, f)
```
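To sanity-check the export format, each record can be decoded back with the standard library alone. This sketch assumes native little-endian float32 buffers, as `numpy.tobytes()` produces on x86/ARM; the record here is a tiny hand-built example, not real model weights:

```python
import base64
import struct

# A minimal record in the same shape the export script writes:
# a flat float32 buffer (base64) plus its tensor shape.
record = {
    "shape": [2, 2],
    "dtype": "float32",
    "data": base64.b64encode(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)).decode("ascii"),
}

raw = base64.b64decode(record["data"])
n = len(raw) // 4                        # float32 = 4 bytes per value
values = list(struct.unpack(f"<{n}f", raw))
print(record["shape"], values)           # → [2, 2] [1.0, 2.0, 3.0, 4.0]
```

The Dart-side `DonutWeightLoader` performs the equivalent decode when it reads `weights.json`.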
| Class | Description |
|-------|-------------|
| `DartMuPDF` | Entry point — `openFile()`, `openBytes()`, `createPdf()` |
| `Document` | PDF document — pages, metadata, TOC, merge, save |
| `Page` | Single page — text, images, links, annotations, render |
| `Pixmap` | Pixel buffer — convert, export PNG |
| `Shape` | Drawing API — lines, rects, circles, text |
| `TextPage` | Parsed text layer — blocks, words, search |
| Class | Description |
|-------|-------------|
| `DonutModel` | Main model — encoder + decoder + inference |
| `DonutConfig` | All hyperparameters (input size, layers, dims) |
| `DonutResult` | Output — `.tokens`, `.text`, `.json` |
| `DonutTokenizer` | SentencePiece BPE tokenizer |
| `DonutImageUtils` | Resize, normalize, pad images |
| `DonutWeightLoader` | Load safetensors / JSON weights |
| Class | Description |
|-------|-------------|
| `SwinEncoder` | Swin Transformer visual encoder |
| `BartDecoder` | mBART auto-regressive decoder |
| `SwinTransformerBlock` | Window attention block |
| `BartDecoderLayer` | Self-attn → cross-attn → FFN |
| `WindowAttention` | Shifted window attention |
| `PatchEmbed` | Conv2d patch embedding |
| `PatchMerging` | 2× spatial downsample |
| Class | Description |
|-------|-------------|
| `Tensor` | N-dim array — matmul, softmax, GELU, broadcasting |
| `Linear` | `y = xW^T + b` |
| `LayerNorm` | Layer normalization |
| `Embedding` | Token / position lookup |
| `Conv2d` | 2D convolution |
| `MultiHeadAttention` | Scaled dot-product attention |
| `FeedForward` | 2-layer MLP + GELU |
| Class | Description |
|-------|-------------|
| `Point` | 2D point |
| `Rect` | Bounding rectangle |
| `IRect` | Integer rectangle |
| `Matrix` | 3×3 transformation matrix |
| `Quad` | Quadrilateral (4 points) |
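For intuition on the `Matrix` entry: PDF transformation matrices are 3×3 homogeneous matrices whose third column is fixed, so they are conventionally stored as six numbers `(a, b, c, d, e, f)`, and applying one to a point reduces to two dot products. An illustrative Python sketch, not the package's `Matrix`/`Point` API:

```python
# PDF-style affine transform: (x, y) ↦ (a·x + c·y + e, b·x + d·y + f)
def transform(m, x, y):
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

identity = (1, 0, 0, 1, 0, 0)
rotate90 = (0, 1, -1, 0, 0, 0)      # 90° counter-clockwise about the origin

print(transform(identity, 100, 100))  # → (100, 100)
print(transform(rotate90, 100, 0))    # → (0, 100)
```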
| Model | Task | HuggingFace ID |
|-------|------|----------------|
| Donut Base | General | `naver-clova-ix/donut-base` |
| CORD v2 | Receipt parsing | `naver-clova-ix/donut-base-finetuned-cord-v2` |
| RVL-CDIP | Doc classification | `naver-clova-ix/donut-base-finetuned-rvlcdip` |
| DocVQA | Visual QA | `naver-clova-ix/donut-base-finetuned-docvqa` |
| Platform | Status |
|----------|--------|
| Android | ✅ |
| iOS | ✅ |
| Web | ✅ |
| macOS | ✅ |
| Windows | ✅ |
| Linux | ✅ |
MIT — see LICENSE for details.
- PyMuPDF: github.com/pymupdf/PyMuPDF
- Donut: Kim et al., "OCR-free Document Understanding Transformer", ECCV 2022 — arXiv:2111.15664
- Swin Transformer: Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV 2021
- BART: Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension", ACL 2020