pdf_ocr_vlm 1.2.0 copy "pdf_ocr_vlm: ^1.2.0" to clipboard
pdf_ocr_vlm: ^1.2.0 copied to clipboard

Vision-language-model OCR engine for dart_pdf_editor — add a selectable, searchable text layer over scanned PDF pages using a self-hosted SOTA OCR model (dots.ocr) or any HTTP OCR service.

pdf_ocr_vlm #

License: Apache-2.0

A pluggable OCR engine for the dart-pdf suite. It adds an invisible, selectable, searchable text layer over scanned (image-only) PDF pages, using a state-of-the-art vision-language OCR model that you self-host — or any HTTP OCR service you wrap to a small JSON contract.

dart_pdf_editor ships the OCR seam (PdfOcrEngine, PdfEditor.applyOcr) but no engine — OCR is a large native/GPU/cloud subsystem that doesn't belong inside a pure-Dart PDF toolkit. This package fills that seam over HTTP, so the heavy model runs out of process (a Docker container, a GPU box, a cloud endpoint) and your Flutter app stays thin.

┌──────────────┐   page raster (PNG)    ┌──────────────────────┐
│ dart_pdf_*   │ ─────────────────────► │  OCR service (HTTP)  │
│ applyOcr()   │                        │  e.g. dots.ocr/vLLM  │
│              │ ◄───────────────────── │  on a GPU / cloud    │
└──────────────┘   words + pixel boxes  └──────────────────────┘
        │
        ▼  inject invisible text at the recognized boxes
   selectable · searchable · copyable · extractable PDF

Why a VLM, and which one? (SOTA, mid-2026) #

Document OCR has moved from detector+recognizer pipelines (Tesseract, EasyOCR) to vision-language models that read layout, reading order, and text in one pass and return structured JSON with bounding boxes. The current leaders (open-weight, self-hostable, and returning boxes — which is what an over-the-scan text layer needs):

Model Size Notes
dots.ocr (rednote-hilab/dots.ocr) ~1.7B Layout + reading order + text in one model, 100+ languages, returns bbox+category+text JSON. This package's default preset.
PaddleOCR-VL ~0.9B Strong multilingual; OpenAI-compatible serving.
GOT-OCR 2.0 580M Runs on ~4 GB VRAM; Markdown/LaTeX output.
Qwen3-VL / DeepSeek-OCR / GLM-OCR 0.9–3B General VLMs / OCR-specialized; box quality varies.
Cloud frontier (Gemini 3 Flash, Claude, GPT) Highest accuracy, no GPU to run, per-call cost; box support varies.

This package defaults to dots.ocr on vLLM: it is open, small enough for a single consumer GPU, multilingual, and — crucially — returns per-block pixel bounding boxes that map cleanly onto the page. You can point the same engine at any of the others (see Other backends).

Sources for the landscape above: the definitive OCR-in-2026 guide, best open-source OCR tools, dots.ocr on GitHub / Hugging Face, and vLLM OCR recipes.


Install #

dependencies:
  dart_pdf_editor: ^1.2.0
  pdf_ocr_vlm: ^1.2.0

pdf_ocr_vlm works wherever Flutter runs (mobile, desktop, web) — it only does an HTTP POST, so the model can live anywhere reachable. Make sure the OCR service is CORS-enabled if you call it from a web build.


Quick start #

import 'package:dart_pdf_editor/dart_pdf_editor.dart';
import 'package:pdf_document/pdf_document.dart';
import 'package:pdf_ocr_vlm/pdf_ocr_vlm.dart';

Future<Uint8List> ocrEntirePdf(Uint8List bytes) async {
  final editor = PdfEditor(PdfDocument.open(bytes));

  // Talk to a vLLM server hosting dots.ocr (see "Run the model" below).
  final engine = VlmOcrEngine.dotsOcr(
    endpoint: Uri.parse('http://localhost:8000/v1/chat/completions'),
  );

  for (var page = 0; page < editor.document.pageCount; page++) {
    final spans = await editor.applyOcr(page, engine, pixelRatio: 2);
    debugPrint('page $page: wrote $spans text spans');
  }
  engine.close();

  return editor.save(); // the scan now has a selectable/searchable layer
}

applyOcr rasterizes the page, hands the raster to the engine, and writes each recognized word as invisible text (render mode 3) placed exactly over the scan — the page looks identical, but text can now be selected, searched, copied, and extracted. Pass visible: true to burn the layer in (useful for debugging the box alignment).

Try it in the example app #

The suite's example app (packages/dart_pdf_editor/example) wires this in: More actions ▸ Add OCR text layer… opens a dialog to supply the service endpoint, model name, and an optional API key/token (sent as Authorization: Bearer …), then OCRs every page and opens the result in a new tab. Point it at a server from Run the model.

Wire it into the editor UI #

IconButton(
  icon: const Icon(Icons.document_scanner),
  tooltip: 'OCR this page',
  onPressed: () async {
    final editor = PdfEditor(PdfDocument.open(currentBytes));
    await editor.applyOcr(
      pageIndex,
      engine,
      pixelRatio: 2.5,      // raise for small type
      minConfidence: 0.30,  // drop junk
    );
    // applyOcr mutates the editor's document in place; save the bytes and
    // re-open them in your viewer/editor as the new document.
    final ocrdBytes = editor.save();
    onDocumentReady(ocrdBytes);
  },
)

Run the model (dots.ocr on vLLM) #

The official image serves an OpenAI-compatible chat endpoint, which VlmOcrEngine.dotsOcr speaks directly — no adapter server needed.

With Docker + an NVIDIA GPU:

docker run --gpus all -p 8000:8000 \
  rednotehilab/dots.ocr:vllm-openai-v0.9.1 \
  vllm serve /workspace/weights/DotsOCR \
    --served-model-name model \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --chat-template-content-format string \
    --trust-remote-code

Or with a local vLLM install:

pip install -U vllm transformers
huggingface-cli download rednote-hilab/dots.ocr --local-dir ./DotsOCR
vllm serve ./DotsOCR \
  --served-model-name model \
  --gpu-memory-utilization 0.95 \
  --chat-template-content-format string \
  --trust-remote-code

Then point the engine at it:

final engine = VlmOcrEngine.dotsOcr(
  endpoint: Uri.parse('http://YOUR_GPU_HOST:8000/v1/chat/completions'),
  model: 'model',          // must match --served-model-name
  apiKey: null,            // set if you front it with an auth proxy
  // categories: {...},    // which dots.ocr layout blocks become text
  // minConfidence: 0.0,   // dots.ocr returns no per-cell score → keep 0
);

dotsOcr sends the page image plus the dots.ocr layout prompt, then reads the JSON array the model returns ([{bbox, category, text}, ...]), keeps the text-bearing blocks (Text, Title, Section-header, List-item, Caption, Footnote, Page-header, Page-footerPicture and Table are skipped by default), and maps each pixel bbox onto the page. Override categories: to include tables/formulas.

No GPU? dots.ocr also runs (slowly) on CPU via vLLM/transformers for trials, and the same preset works against a cloud-hosted dots.ocr or any OpenAI-compatible OCR VLM — just change endpoint, model, and apiKey.


The simple JSON contract (any OCR server) #

If you'd rather front your own engine (PaddleOCR, Surya, docTR, Tesseract, a custom pipeline), wrap it in a tiny HTTP service that speaks this contract and use the default constructor — no preset, no custom Dart.

RequestPOST <endpoint>, Content-Type: application/json:

{
  "image": "<base64 PNG of the page>",
  "image_format": "png",
  "width": 1224,
  "height": 1584,
  "page": 0,
  "languages": ["en"]      // present only if you pass languages:
}

Response200, a list of recognized fragments. Boxes are in raster pixels, top-left origin (the same width×height you were sent):

{
  "spans": [
    { "text": "Invoice",  "bbox": [96, 110, 320, 156], "confidence": 0.98 },
    { "text": "Total",    "bbox": [96, 980, 240, 1020], "confidence": 0.95 }
  ]
}
final engine = VlmOcrEngine(
  endpoint: Uri.parse('http://localhost:8001/ocr'),
  languages: const ['en'],
  minConfidence: 0.3,
);

The default parser is lenient: the list may be top-level or under any of spans / words / lines / results / regions / cells / data; text may be text / transcription / content; a box may be a 4-number bbox / box / bounding_box / rect, or a polygon under polygon / poly / points / quad; confidence may be confidence / score / conf (default 1.0). So most off-the-shelf OCR JSON drops in unchanged.

Reference adapter (≈30 lines, FastAPI + PaddleOCR) #

# pip install fastapi uvicorn paddleocr pillow
import base64, io
from fastapi import FastAPI, Request
from PIL import Image
from paddleocr import PaddleOCR

app = FastAPI()
ocr = PaddleOCR(use_angle_cls=True, lang="en")

@app.post("/ocr")
async def recognize(req: Request):
    body = await req.json()
    img = Image.open(io.BytesIO(base64.b64decode(body["image"]))).convert("RGB")
    import numpy as np
    result = ocr.ocr(np.array(img), cls=True)
    spans = []
    for line in (result[0] or []):
        poly, (text, conf) = line
        spans.append({"text": text, "polygon": poly, "confidence": float(conf)})
    return {"spans": spans}
# uvicorn server:app --host 0.0.0.0 --port 8001

Other backends #

requestBody and responseParser are the two seams; override either to target a different service without leaving Dart.

A cloud VLM (custom prompt + parser) #

final engine = VlmOcrEngine(
  endpoint: Uri.parse('https://api.example.com/v1/chat/completions'),
  headers: {'authorization': 'Bearer $apiKey'},
  model: 'some-vision-model',
  prompt: 'Return a JSON array of {bbox:[x0,y0,x1,y1] in pixels, text}.',
  requestBody: openAiChatRequestBody,          // reuse the chat encoder
  responseParser: (json, page) {
    // navigate choices[0].message.content yourself, then build words…
    // return List<VlmOcrWord> with pixel-space Rects.
  },
);

VlmOcrWord carries a pixel-space Rect; applyOcr maps it to PDF user space for you (PdfOcrPageImage.userSpaceRect undoes the crop box and /Rotate that the raster already baked in), so a parser never does page geometry.


API surface #

Symbol Purpose
VlmOcrEngine(...) Generic engine for the simple JSON contract.
VlmOcrEngine.dotsOcr(...) Preset for dots.ocr on an OpenAI-compatible vLLM endpoint.
VlmOcrInput The rendered page (base64 PNG + dims + hints) given to a request builder.
VlmOcrWord One recognized fragment in pixel coordinates.
defaultVlmRequestBody / defaultVlmResponseParser The simple-contract default hooks.
openAiChatRequestBody Chat-completions request encoder (image + prompt).
dotsOcrResponseParser, dotsOcrLayoutPrompt, dotsOcrTextCategories dots.ocr building blocks.
VlmOcrException Thrown on transport / status / parse failures.

applyOcr's own options live in dart_pdf_editor: pixelRatio (OCR raster resolution — 2 ≈ 144 dpi, raise for small type), minConfidence, visible, and font.


Tips & limitations #

  • Resolution drives accuracy. Start at pixelRatio: 2; small or dense type wants 2.53. Higher = larger PNG = slower request.
  • The layer is byte-encoded text. Code points outside Latin-1 still position correctly (selection/search boxes line up) but render as ? if you make the layer visible; invisible layers extract the original text.
  • Already-digital PDFs don't need OCR — they have real text already. Use this for scans and image-only pages.
  • Box granularity follows the model. dots.ocr returns block/line boxes, so selection snaps to lines, not individual words — fine for search and copy. A word-level engine (PaddleOCR/Tesseract) over the simple contract gives word boxes.
  • Network & privacy. Page rasters leave the device. For sensitive documents, self-host the model on infrastructure you control.

License #

Apache-2.0 — see LICENSE.

1
likes
0
points
356
downloads

Publisher

verified publisherbenmilanko.com

Weekly Downloads

Vision-language-model OCR engine for dart_pdf_editor — add a selectable, searchable text layer over scanned PDF pages using a self-hosted SOTA OCR model (dots.ocr) or any HTTP OCR service.

Repository (GitHub)
View/report issues

Topics

#pdf #ocr #text-recognition #scanned-documents

License

unknown (license)

Dependencies

dart_pdf_editor, flutter, http, pdf_document

More

Packages that depend on pdf_ocr_vlm