html_readability

A pure Dart port of Mozilla's Readability.js
Extract the main readable content from any HTML document.

What is this?

html_readability brings Mozilla's Readability algorithm to the Dart ecosystem as a single pure Dart library with no native or platform dependencies (DOM parsing uses the pure Dart html package). Mozilla Readability is the engine behind Firefox Reader View, the feature that strips clutter from web pages and presents just the article text.

Give it any HTML string and it returns the article title and clean text — no browser, no JavaScript runtime, no platform channel. It runs anywhere Dart runs: Flutter apps, server-side Dart, CLI tools, or WASM.

Why not just strip tags?

Naively removing HTML tags produces garbage on real-world pages. Navigation menus, sidebars, ad blocks, cookie banners, share widgets, comment sections, and footer links all end up in your output. Mozilla spent years refining heuristics that reliably find the actual article content on virtually any web page. This package ports those heuristics to Dart so you don't have to reinvent them.

Quick start

```dart
import 'package:html_readability/html_readability.dart';

// htmlString holds the raw HTML of the page to process.
final result = readability(htmlString);

result.title;        // "How the Brain Processes Language"
result.textContent;  // Clean plain text: no tags, no nav, no ads
result.htmlContent;  // Cleaned article HTML with structure preserved
```

That's it. One function, one result object.

How the algorithm works

This is a port of the scoring and extraction pipeline from Mozilla's Readability.js, the same code Firefox uses when you click the Reader View icon. Here's what happens under the hood:

1. Preprocessing

The document is cleaned of elements that never contain article content:

- <script>, <style>, <noscript>, and stylesheet <link> elements are removed
- Consecutive <br> tags are collapsed into paragraph breaks
- <font> tags are replaced with <span> to normalize the DOM

2. Unlikely candidate removal

Elements whose class or id attributes match patterns associated with non-content are removed. Mozilla's regex patterns identify common names like sidebar, comment, footer, banner, menu, social, sponsor, ad-break, and many more. Semantic elements (<article>, <main>, <section>) are always preserved. A companion "ok maybe" regex rescues elements containing words like article, content, body, or main from being incorrectly stripped.
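The two-regex check described above can be sketched in a few lines. This is only an illustration: the patterns are abbreviated to the names mentioned in this section, the function name `isUnlikelyCandidate` is hypothetical, and the exemption for semantic elements is left out.

```dart
// Abbreviated, illustrative patterns; the package's real lists are longer.
final unlikelyCandidates = RegExp(
  r'sidebar|comment|footer|banner|menu|social|sponsor|ad-break',
  caseSensitive: false,
);
final okMaybeCandidate = RegExp(
  r'article|content|body|main',
  caseSensitive: false,
);

/// Returns true when the class/id string looks like non-content and is
/// not rescued by the "ok maybe" pattern. (Hypothetical helper.)
bool isUnlikelyCandidate({String className = '', String id = ''}) {
  final matchString = '$className $id';
  return unlikelyCandidates.hasMatch(matchString) &&
      !okMaybeCandidate.hasMatch(matchString);
}

void main() {
  print(isUnlikelyCandidate(className: 'sidebar-widget')); // true
  print(isUnlikelyCandidate(className: 'article-footer')); // false (rescued)
  print(isUnlikelyCandidate(id: 'story-text')); // false
}
```

The rescue step matters in practice: a wrapper named article-footer matches the unlikely pattern through footer, but the word article keeps it in the document.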

3. Content scoring

This is the heart of the algorithm. Every <p>, <td>, <pre>, and qualifying <li> element is evaluated:

- Base score: 1 point per content block
- Punctuation bonus: +1 per comma (a signal of real prose)
- Length bonus: +1 per 100 characters, capped at 3
- Class/ID weight: ±25 points based on whether the element's class and id match "positive" patterns (like article, post, entry, content) or "negative" patterns (like comment, sidebar, widget, ad)
- Tag weight: <div> gets +5, <article> gets +10, <form> gets -3, headings get -5
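As a rough sketch of the scoring rules listed above, the three helpers below compute the per-block score, the class/ID weight, and the tag weight. The function names and the abbreviated patterns are assumptions for illustration, as is the choice to check class and id separately (each contributing +25 or -25); the package's internals may be organized differently.

```dart
import 'dart:math' as math;

// Abbreviated, illustrative patterns; the package's real lists are longer.
final positivePattern =
    RegExp(r'article|post|entry|content', caseSensitive: false);
final negativePattern =
    RegExp(r'comment|sidebar|widget', caseSensitive: false);

/// Base score of one content block: 1 point, +1 per comma,
/// +1 per 100 characters (capped at 3).
int blockScore(String text) {
  final commas = ','.allMatches(text).length;
  final lengthBonus = math.min(text.length ~/ 100, 3);
  return 1 + commas + lengthBonus;
}

/// Class/ID weight. Assumption: class and id are checked separately.
int classWeight({String className = '', String id = ''}) {
  var weight = 0;
  for (final value in [className, id]) {
    if (value.isEmpty) continue;
    if (negativePattern.hasMatch(value)) weight -= 25;
    if (positivePattern.hasMatch(value)) weight += 25;
  }
  return weight;
}

/// Tag weight for the tags named above; other tags get 0 in this sketch.
int tagWeight(String tag) {
  switch (tag.toLowerCase()) {
    case 'div':
      return 5;
    case 'article':
      return 10;
    case 'form':
      return -3;
    case 'h1':
    case 'h2':
    case 'h3':
    case 'h4':
    case 'h5':
    case 'h6':
      return -5;
    default:
      return 0;
  }
}

void main() {
  const paragraph = 'Short, but with commas, and some prose.';
  print(blockScore(paragraph)); // 1 base + 2 commas + 0 length bonus = 3
  print(classWeight(className: 'post-content')); // 25
  print(classWeight(className: 'sidebar', id: 'comments')); // -50
  print(tagWeight('div')); // 5
}
```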

Scores propagate upward to ancestor elements: the parent receives the full score, the grandparent receives half, and deeper ancestors receive progressively smaller fractions. This causes the innermost container that wraps the most content to accumulate the highest score — exactly the element we want.
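The propagation rule can be illustrated with a small helper. The text above only fixes the first two levels (full score, then half); the divisor for deeper ancestors used here (level × 3) follows Mozilla's Readability.js and is an assumption about this port. The name `ancestorShare` is hypothetical.

```dart
/// Share of a block's score credited to the ancestor at [level]
/// (0 = parent, 1 = grandparent, 2 and above = deeper ancestors).
/// Assumption: deeper levels divide by level * 3, as in Readability.js.
double ancestorShare(double score, int level) {
  final divider = level == 0 ? 1 : (level == 1 ? 2 : level * 3);
  return score / divider;
}

void main() {
  const score = 12.0;
  print(ancestorShare(score, 0)); // 12.0 (parent, full score)
  print(ancestorShare(score, 1)); // 6.0 (grandparent, half)
  print(ancestorShare(score, 2)); // 2.0 (great-grandparent, one sixth)
}
```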

Finally, every candidate's score is adjusted by its link density (the ratio of text inside <a> tags to total text). Navigation-heavy blocks have high link density and are penalized accordingly. Hash-only links (#anchors) receive a reduced penalty coefficient of 0.3.
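A minimal sketch of the link density adjustment follows. The `Link` class and both function names are hypothetical, and the final formula (score × (1 − link density)) is taken from Mozilla's Readability.js rather than stated in this README, so treat it as an assumption.

```dart
/// One <a> element inside a candidate: its href and the length of its text.
class Link {
  final String href;
  final int textLength;
  const Link(this.href, this.textLength);
}

/// Ratio of link text to total text. Hash-only links count at 0.3.
double linkDensity(int totalTextLength, List<Link> links) {
  if (totalTextLength == 0) return 0;
  var linkLength = 0.0;
  for (final link in links) {
    final coefficient = link.href.startsWith('#') ? 0.3 : 1.0;
    linkLength += link.textLength * coefficient;
  }
  return linkLength / totalTextLength;
}

/// Assumption: the score is scaled by (1 - density), as in Readability.js.
double adjustedScore(double score, double density) => score * (1 - density);

void main() {
  final density =
      linkDensity(200, const [Link('/news', 50), Link('#top', 100)]);
  print(density); // (50 + 100 * 0.3) / 200, about 0.4
  print(adjustedScore(40, density)); // about 24
}
```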

4. Candidate selection

The top 5 scoring candidates are evaluated. For each one, the algorithm:

1. Extracts the candidate and its qualifying siblings (siblings scoring at least 20% of the candidate's score, or sharing the same class name with low link density)
2. Cleans the extracted content (see step 5)
3. Checks if the result has enough text (≥ 500 characters)

If the top candidate produces too little content after cleaning — for example, a newsletter form that scores high from its label <p> tags — the algorithm moves to the next candidate. It also compares candidates: if a candidate produces under 1000 characters but a lower-ranked candidate has 3x more raw text, the smaller one is skipped.
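The fall-through between candidates might look roughly like the sketch below. `Candidate` and `pickCandidate` are hypothetical names, and the comparison is interpreted as "a lower-ranked candidate has at least 3x this candidate's raw text", which is one reading of the rule above rather than a confirmed detail.

```dart
/// A scored candidate with its text length before and after cleaning.
class Candidate {
  final String name;
  final int rawTextLength;
  final int cleanedTextLength;
  const Candidate(this.name, this.rawTextLength, this.cleanedTextLength);
}

/// Picks the first acceptable candidate from a list sorted best-first.
/// Returns null when none passes, which is where the fallbacks take over.
Candidate? pickCandidate(List<Candidate> ranked) {
  final top = ranked.take(5).toList();
  for (var i = 0; i < top.length; i++) {
    final candidate = top[i];
    // Too little text after cleaning: try the next candidate.
    if (candidate.cleanedTextLength < 500) continue;
    // Assumption: "3x more raw text" is measured against this candidate.
    final dwarfed = candidate.cleanedTextLength < 1000 &&
        top.skip(i + 1).any(
            (other) => other.rawTextLength >= 3 * candidate.rawTextLength);
    if (dwarfed) continue;
    return candidate;
  }
  return null;
}

void main() {
  const ranked = [
    Candidate('newsletter form', 300, 40),
    Candidate('teaser', 900, 700),
    Candidate('article body', 6000, 5500),
  ];
  print(pickCandidate(ranked)?.name); // article body
}
```

In this example the form is rejected for having too little cleaned text, and the teaser is skipped because the article body below it holds far more raw text.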

If no scored candidate produces enough content, the algorithm falls back to:

- The common ancestor of the top two candidates (useful for listing pages where many small articles share a parent container)
- Semantic elements: <main>, [role="main"], or <article>

5. Cleaning

The extracted content goes through several cleaning passes:

- Tag removal: forms, fieldsets, iframes, inputs, buttons, footers, asides, navigation, and share/social elements are removed
- Conditional cleanup: tables, divs, sections, and lists are evaluated individually. They're removed if they have high link density, too many inputs, too few paragraphs relative to images, or suspiciously short content, unless they contain enough commas (≥ 10, indicating real prose) or are inside a <figure>
- Header cleanup: headings (<h1>–<h6>) with negative class weight are removed
- Empty element removal: elements with no text and no media (<img>, <video>, <audio>) are recursively removed

6. Retry with relaxed heuristics

If the first attempt produces fewer than 500 characters, the algorithm retries up to 3 more times with progressively relaxed settings:

| Attempt | Unlikely stripping | Class weighting | Conditional cleaning |
| --- | --- | --- | --- |
| 1 | Yes | Yes | Yes |
| 2 | No | Yes | Yes |
| 3 | Yes | No | Yes |
| 4 | No | No | No |

This mirrors Mozilla's flag-based retry system (FLAG_STRIP_UNLIKELYS, FLAG_WEIGHT_CLASSES, FLAG_CLEAN_CONDITIONALLY). Each retry restores the original DOM from a clone, so previous modifications don't accumulate.
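The retry loop can be sketched as follows, with the flag combinations copied from the table above. `Flags`, `attempts`, and `extractWithRetries` are hypothetical names, the `extract` callback stands in for a full extraction pass on a fresh copy of the DOM, and returning the longest result when every attempt falls short is an assumption borrowed from Readability.js.

```dart
/// The three switches that each attempt can enable or disable.
class Flags {
  final bool stripUnlikelys;
  final bool weightClasses;
  final bool cleanConditionally;
  const Flags(this.stripUnlikelys, this.weightClasses, this.cleanConditionally);
}

// One entry per attempt, in the order given in the table above.
const attempts = [
  Flags(true, true, true),
  Flags(false, true, true),
  Flags(true, false, true),
  Flags(false, false, false),
];

/// Runs [extract] with each flag set until the text reaches [minLength].
/// Assumption: if no attempt is long enough, the longest result wins.
String extractWithRetries(
  String Function(Flags flags) extract, {
  int minLength = 500,
}) {
  var best = '';
  for (final flags in attempts) {
    final text = extract(flags);
    if (text.length >= minLength) return text;
    if (text.length > best.length) best = text;
  }
  return best;
}

void main() {
  // Stand-in extractor: only succeeds once unlikely stripping is off.
  String fakeExtract(Flags flags) =>
      flags.stripUnlikelys ? 'short' : 'x' * 600;

  print(extractWithRetries(fakeExtract).length); // 600
}
```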

7. Title extraction

The title is extracted separately, checking in order:

1. <meta property="og:title"> (Open Graph)
2. <title> tag, with site name removal: the text is split on |, -, –, —, /, and », and the longest part is kept
3. <h1> element
4. The fallbackTitle parameter
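The lookup order and the site name removal can be sketched like this. Both function names are hypothetical. Two details are assumptions: separators only count when surrounded by whitespace (so hyphenated words survive), and an empty string is returned when no source yields a title.

```dart
// Separators: | - en dash (\u2013) em dash (\u2014) / and » (\u00BB).
// Assumption: a separator must have whitespace on both sides.
final titleSeparators = RegExp(r'\s+[|\-\u2013\u2014/\u00BB]\s+');

/// Splits a <title> value on separators and keeps the longest part.
String stripSiteName(String title) {
  final parts = title
      .split(titleSeparators)
      .map((part) => part.trim())
      .where((part) => part.isNotEmpty);
  if (parts.isEmpty) return title.trim();
  return parts.reduce((a, b) => b.length > a.length ? b : a);
}

/// Returns the first non-empty title in the documented order.
String pickTitle({
  String? ogTitle,
  String? titleTag,
  String? h1,
  String? fallbackTitle,
}) {
  final candidates = [
    ogTitle,
    titleTag == null ? null : stripSiteName(titleTag),
    h1,
    fallbackTitle,
  ];
  for (final candidate in candidates) {
    if (candidate != null && candidate.trim().isNotEmpty) {
      return candidate.trim();
    }
  }
  return ''; // Assumption: empty string when nothing is available.
}

void main() {
  print(stripSiteName('How the Brain Processes Language | Example News'));
  // How the Brain Processes Language
  print(pickTitle(titleTag: '   ', fallbackTitle: 'Untitled'));
  // Untitled
}
```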

API reference

`readability()`

```dart
ReadabilityResult readability(
  String html, {
  String? fallbackTitle,
});
```

| Parameter | Type | Description |
| --- | --- | --- |
| `html` | `String` | Raw HTML string to extract content from |
| `fallbackTitle` | `String?` | Title to use when none can be determined from the document |

`ReadabilityResult`

| Property | Type | Description |
| --- | --- | --- |
| `title` | `String` | Article title extracted from metadata or document structure |
| `textContent` | `String` | Article content as clean plain text (no HTML tags) |
| `htmlContent` | `String` | Article content as cleaned HTML with structure preserved |

Testing

This package is validated against all 130 real-world HTML fixtures from Mozilla's Readability test suite. These fixtures cover a wide range of real websites including:

NYTimes, BBC, CNN, Wikipedia, Medium, The Guardian, The Verge, Engadget, BuzzFeed, Ars Technica, Hacker News, WordPress blogs, Blogger, Tumblr, Yahoo, Washington Post, WebMD, and many more.

```shell
dart test
```

The fixture files (29 MB) are excluded from the published package but are available in the repository.

Regex patterns reference

These are ported directly from Mozilla Readability.js:

| Pattern | Purpose | Examples |
| --- | --- | --- |
| Unlikely candidates | Elements to remove before scoring | `sidebar`, `banner`, `comment`, `footer`, `menu`, `social`, `sponsor`, `ad-break`, `popup` |
| OK maybe | Rescue elements from unlikely removal | `article`, `body`, `content`, `main`, `column`, `shadow` |
| Positive | Boost element score via class/ID | `article`, `content`, `entry`, `post`, `text`, `blog`, `story` |
| Negative | Penalize element score via class/ID | `comment`, `footer`, `sidebar`, `widget`, `share`, `promo`, `sponsor`, `ad` |
| Share elements | Remove during cleaning | `share`, `sharedaddy` |

Differences from Mozilla Readability.js

This is a faithful port but not a line-by-line translation. Key differences:

- Language: Pure Dart using the html package for DOM parsing instead of browser DOM or jsdom
- Enhanced candidate fallback: When the top-scoring candidate produces poor content after cleaning, lower-ranked candidates are tried automatically
- Common ancestor fallback: For listing/homepage pages where content is spread across many small <article> elements, the algorithm finds their common ancestor
- List content scoring: <li> elements with substantial text (≥ 80 characters, no nested <p>) are scored alongside paragraphs, improving extraction for list-heavy content like release notes and changelogs

Acknowledgements

This package is based on Mozilla's Readability.js, the library behind Firefox Reader View. The test fixtures are sourced from the same project and are used under the Apache 2.0 license.
