html-to-markdown

High-performance HTML to Markdown converter for Dart via flutter_rust_bridge. Pure-Dart binding (no Flutter SDK required) suitable for server and CLI workloads; Flutter apps can also consume it via the standard pub.dev workflow. Published as h2m on pub.dev because html_to_markdown and html-to-markdown are taken.

What This Package Provides

Same renderer as every binding — output matches Rust, Python, Node.js, Ruby, PHP, Go, Java, .NET, Elixir, R, Dart, Swift, Zig, C FFI, and WASM.
Structured conversion result — Markdown plus metadata, links, headings, images, tables, and warnings where the binding exposes them.
Production defaults — HTML is parsed with the Rust core, sanitized by default, and rendered without runtime-specific Markdown drift.
Dart package — flutter_rust_bridge API for Dart and Flutter projects.

Installation

dart pub add h2m

Performance Snapshot

Quick Start

Basic conversion:

With conversion options:

Architecture

The converter routes each input through one of three tiers based on a fast prescan of the byte stream:

Tier-1 — single-pass byte scanner. Handles 110+ HTML tags directly. Bails on any construct it cannot prove byte-equivalent to Tier-2.
Tier-2 — DOM walker. Picks up Tier-1 bails and inputs the classifier rejected up front.
Tier-3 — standards-conformant parser. Engaged for malformed HTML requiring full HTML5 repair.

The dispatcher is invisible to the caller. Output is byte-identical across tiers — enforced by a 116-snapshot oracle.
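
The routing described above can be pictured with a small, library-independent model. This is a conceptual sketch in Python, not the converter's actual implementation: the prescan rules, the bail condition, and every function name below are hypothetical. All three toy tiers share one stand-in renderer, which is how this sketch mimics the guarantee that output does not depend on the tier that produced it.

```python
import re


class Bail(Exception):
    """Raised by the fast path when it cannot guarantee identical output."""


def _render(html: str) -> str:
    # Stand-in renderer shared by every tier: strip tags, collapse whitespace.
    return " ".join(re.sub(r"<[^>]*>", " ", html).split())


def prescan(html: str) -> int:
    """Toy classifier: choose a starting tier from cheap textual checks."""
    if html.count("<") != html.count(">"):
        return 3  # unbalanced brackets: treat as malformed (assumption)
    if "<script" in html.lower():
        return 2  # rejected up front for the fast path (assumption)
    return 1


def tier1_scan(html: str) -> str:
    # The fast path refuses anything it does not handle in this toy model.
    if "<table" in html.lower():
        raise Bail("tables are left to the DOM walker here")
    return _render(html)


def tier2_dom(html: str) -> str:
    return _render(html)


def tier3_repair(html: str) -> str:
    # Crude "repair": close a dangling tag so the shared renderer can strip it.
    if html.count("<") > html.count(">"):
        html += ">"
    return _render(html)


def convert(html: str) -> tuple[int, str]:
    """Return (tier_used, output); a real caller would only see the output."""
    tier = prescan(html)
    if tier == 3:
        return 3, tier3_repair(html)
    if tier == 2:
        return 2, tier2_dom(html)
    try:
        return 1, tier1_scan(html)
    except Bail:
        return 2, tier2_dom(html)
```

The tier number is returned here only so the fallback path can be observed; in the converter itself the dispatcher is hidden from the caller.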

Capabilities

16 languages, one Rust core. Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, C ABI.
CommonMark-compatible Markdown with GFM-style tables.
Djot output: set output_format = "djot" (see Djot Output Format section below).
Real-HTML robust: unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings handled without losing content.
Metadata extraction, visitor API, inline images, configurable preprocessing presets.
Per-group regression gates in CI: every PR runs the bench harness against per-group thresholds.

API Reference

Core Function

Options

ConversionOptions – Key configuration fields:

heading_style: Heading format ("underlined" | "atx" | "atx_closed") — default: "atx"
list_indent_width: Spaces per indent level — default: 2
bullets: Bullet characters cycle — default: "-*+"
wrap: Enable text wrapping — default: false
wrap_width: Wrap at column — default: 80
code_language: Default fenced code block language — default: none
extract_metadata: Enable metadata extraction into result.metadata — default: true
output_format: Output markup format ("markdown" | "djot" | "plain") — default: "markdown"
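
To make the list-related defaults concrete, here is a short Python sketch of how bullets and list_indent_width could interact. It assumes the bullets string is cycled by nesting depth, which the field description suggests but does not spell out; render_list and bullet_for_depth are hypothetical helpers, not part of the package.

```python
def bullet_for_depth(depth: int, bullets: str = "-*+") -> str:
    # Assumption: the bullet character is chosen by nesting depth, wrapping around.
    return bullets[depth % len(bullets)]


def render_list(items, depth=0, bullets="-*+", list_indent_width=2):
    """Render a nested Python list of strings as Markdown list lines."""
    lines = []
    for item in items:
        if isinstance(item, list):
            # A nested list is rendered one level deeper.
            lines.extend(render_list(item, depth + 1, bullets, list_indent_width))
        else:
            indent = " " * (list_indent_width * depth)
            lines.append(f"{indent}{bullet_for_depth(depth, bullets)} {item}")
    return lines
```

With the documented defaults, a list nested four levels deep would use "-", "*", "+", and then "-" again under this assumption.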

Djot Output Format

The library supports converting HTML to Djot, a lightweight markup language similar to Markdown but with a different syntax for some elements. Set output_format to "djot" to use this format.

Syntax Differences

Element	Markdown	Djot
Strong	`**text**`	`*text*`
Emphasis	`*text*`	`_text_`
Strikethrough	`~~text~~`	`{-text-}`
Inserted/Added	N/A	`{+text+}`
Highlighted	N/A	`{=text=}`
Subscript	N/A	`~text~`
Superscript	N/A	`^text^`
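
The delimiter pairs for the two formats can be captured in a small lookup. The Python sketch below uses standard Markdown and Djot inline syntax; the wrap helper is hypothetical, and falling back to bare text where Markdown has no equivalent is an assumption of this sketch, not a statement about what the converter emits.

```python
# (open, close) delimiters per element and format; None marks "N/A".
DELIMITERS = {
    "strong":        {"markdown": ("**", "**"), "djot": ("*", "*")},
    "emphasis":      {"markdown": ("*", "*"),   "djot": ("_", "_")},
    "strikethrough": {"markdown": ("~~", "~~"), "djot": ("{-", "-}")},
    "inserted":      {"markdown": None,         "djot": ("{+", "+}")},
    "highlighted":   {"markdown": None,         "djot": ("{=", "=}")},
    "subscript":     {"markdown": None,         "djot": ("~", "~")},
    "superscript":   {"markdown": None,         "djot": ("^", "^")},
}


def wrap(element: str, text: str, output_format: str) -> str:
    """Wrap text in the delimiters for the given element and format."""
    pair = DELIMITERS[element][output_format]
    if pair is None:
        # Assumption for this sketch only: emit the bare text when N/A.
        return text
    opening, closing = pair
    return f"{opening}{text}{closing}"
```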

Example Usage

Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.

Plain Text Output

Set output_format to "plain" to strip all markup and return only visible text. This bypasses the Markdown conversion pipeline entirely for maximum speed.

Plain text mode is useful for search indexing, text extraction, and feeding content to LLMs.
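
As a rough picture of what "only visible text" means, the sketch below extracts text with Python's standard html.parser module. This is a swapped-in technique for illustration, not the library's Rust implementation; the VisibleText class, the to_plain helper, and the choice of hidden and block-level tags are all assumptions.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    HIDDEN = {"script", "style"}  # content never shown to a reader
    BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")  # keep block boundaries apart

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._hidden_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def to_plain(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.text()
```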

Metadata Extraction

Metadata extraction collects document properties, headings, links, images, and structured data in the same pass as the conversion, all through the standard convert() function.

Use Cases:

SEO analysis – Extract title, description, Open Graph tags, Twitter cards
Table of contents generation – Build structured outlines from heading hierarchy
Content migration – Document all external links and resources
Accessibility audits – Check for images without alt text, empty links, invalid heading hierarchy
Link validation – Classify and validate anchor, internal, external, email, and phone links

Zero Overhead When Disabled: Metadata extraction happens during the HTML parsing pass and adds negligible overhead when enabled. It is controlled by the extract_metadata field in ConversionOptions (default: true, per the Options list); set it to false to skip extraction. The result is available at result.metadata.
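
The idea of gathering headings, classified links, and images without alt text in one traversal can be sketched with Python's standard html.parser. This illustrates the single-pass approach only; it is not the library's extractor, and the MetadataCollector class, its field names, and the classification rules are assumptions. In particular, treating every absolute http(s) URL as external is a simplification.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def classify(href: str) -> str:
    """Rough link classification (simplified rules, assumed for this sketch)."""
    if href.startswith("#"):
        return "anchor"
    if href.startswith("mailto:"):
        return "email"
    if href.startswith("tel:"):
        return "phone"
    if href.startswith(("http://", "https://", "//")):
        return "external"
    return "internal"


class MetadataCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []            # (level, text)
        self.links = []               # (kind, href)
        self.images_missing_alt = []  # src of images with no alt text
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADING_TAGS:
            self._level, self._text = int(tag[1]), []
        elif tag == "a" and attrs.get("href"):
            self.links.append((classify(attrs["href"]), attrs["href"]))
        elif tag == "img" and not attrs.get("alt"):
            self.images_missing_alt.append(attrs.get("src") or "")

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None
```

The collected headings list is enough to build a table of contents, and images_missing_alt covers the accessibility-audit use case listed above.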

Example: Quick Start

Visitor Pattern

The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Pass a visitor as the third argument to convert().

Use Cases:

Custom Markdown dialects – Convert to Obsidian, Notion, or other flavors
Content filtering – Remove tracking pixels, ads, or unwanted elements
URL rewriting – Rewrite CDN URLs, add query parameters, validate links
Accessibility validation – Check alt text, heading hierarchy, link text
Analytics – Track element usage, link destinations, image sources

Supported Visitor Methods: 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
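
The callback mechanism can be illustrated with a toy renderer in Python. Nothing here is the package's API: the Visitor base class, the visit_link and visit_image names, the SKIP sentinel, and the node tuples are all hypothetical stand-ins for the general pattern, in which a callback can keep the default rendering, replace it, or drop the element.

```python
SKIP = object()  # sentinel: drop the element from the output


class Visitor:
    """Default callbacks return None, meaning "use the default rendering"."""

    def visit_link(self, href, text):
        return None

    def visit_image(self, src, alt):
        return None


class CdnRewriter(Visitor):
    def visit_link(self, href, text):
        # Upgrade plain-HTTP CDN links to HTTPS (example.com is a placeholder).
        if href.startswith("http://cdn.example.com/"):
            secure = "https://" + href[len("http://"):]
            return f"[{text}]({secure})"
        return None

    def visit_image(self, src, alt):
        # Drop anything that looks like a tracking pixel.
        return SKIP if "pixel" in src else None


def render(nodes, visitor):
    """Render ("text", s), ("link", href, text), ("image", src, alt) nodes."""
    out = []
    for node in nodes:
        if node[0] == "text":
            out.append(node[1])
            continue
        if node[0] == "link":
            _, href, text = node
            result, default = visitor.visit_link(href, text), f"[{text}]({href})"
        else:
            _, src, alt = node
            result, default = visitor.visit_image(src, alt), f"![{alt}]({src})"
        if result is SKIP:
            continue
        out.append(default if result is None else result)
    return "".join(out)
```

Returning None for "no opinion" keeps each visitor small: a subclass only overrides the callbacks it cares about, which matters when there are dozens of them.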

Example: Quick Start

Examples

Part of Xberg

Xberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
html-to-markdown — fast, lossless HTML→Markdown engine.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

Contributing

We welcome contributions! Please see our Contributing Guide for details on:

Setting up the development environment
Running tests locally
Submitting pull requests
Reporting issues

All contributions must follow our code quality standards (enforced via pre-commit hooks):

Proper test coverage (Rust 95%+, language bindings 80%+)
Formatting and linting checks
Documentation for public APIs

License

Support

If you find this library useful, consider sponsoring the project.

Have questions or run into issues? We're here to help:

GitHub Issues: github.com/xberg-io/html-to-markdown/issues
Discord Community: discord.gg/xt9WY3GnKR
