html_readability 0.1.0

Pure Dart port of Mozilla's Readability algorithm. Extracts the main readable content from any HTML document.

html_readability

A pure Dart port of Mozilla's Readability.js
Extract the main readable content from any HTML document.


What is this?

html_readability brings Mozilla's Readability algorithm to the Dart ecosystem as a single pure-Dart library whose only dependency is the html package. Mozilla Readability is the engine behind Firefox Reader View, the feature that strips clutter from web pages and presents just the article text. This package is a faithful Dart port of that algorithm.

Give it any HTML string and it returns the article title, clean plain text, and cleaned article HTML, with no browser, JavaScript runtime, or platform channel involved. It runs anywhere Dart runs: Flutter apps, server-side Dart, CLI tools, or WASM.

Why not just strip tags?

Naively removing HTML tags produces garbage on real-world pages. Navigation menus, sidebars, ad blocks, cookie banners, share widgets, comment sections, and footer links all end up in your output. Mozilla spent years refining heuristics that reliably find the actual article content on virtually any web page. This package ports those heuristics to Dart so you don't have to reinvent them.

Quick start

import 'package:html_readability/html_readability.dart';

final result = readability(htmlString);

result.title;        // "How the Brain Processes Language"
result.textContent;  // Clean plain text — no tags, no nav, no ads
result.htmlContent;  // Cleaned article HTML with structure preserved

That's it. One function, one result object.

How the algorithm works

This is a port of the scoring and extraction pipeline from Mozilla's Readability.js, the same code Firefox uses when you click the Reader View icon. Here's what happens under the hood:

1. Preprocessing

The document is cleaned of elements that never contain article content:

  • <script>, <style>, <noscript>, and stylesheet <link> elements are removed
  • Consecutive <br> tags are collapsed into paragraph breaks
  • <font> tags are replaced with <span> to normalize the DOM
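
The three rules above can be approximated at the string level. This is a minimal sketch only: the package operates on a parsed DOM, while the hypothetical `preprocess` function below uses regular expressions to stay dependency-free, and it assumes a run of `<br>` tags sits inside a `<p>`.

```dart
// String-level sketch of preprocessing. Regular expressions stand in for the
// DOM operations the package performs, purely to keep this dependency-free.
String preprocess(String html) {
  var out = html;

  // Remove <script>, <style>, and <noscript> elements including their bodies.
  out = out.replaceAll(
    RegExp(r'<(script|style|noscript)\b[^>]*>.*?</\1\s*>',
        caseSensitive: false, dotAll: true),
    '',
  );

  // Remove stylesheet <link> elements.
  out = out.replaceAll(
    RegExp(r'''<link\b[^>]*rel\s*=\s*["']?stylesheet["']?[^>]*>''',
        caseSensitive: false),
    '',
  );

  // Collapse runs of two or more <br> tags into a paragraph break.
  // Assumption: the run sits inside a <p>, so closing and reopening is valid.
  out = out.replaceAll(
    RegExp(r'(?:<br\s*/?>\s*){2,}', caseSensitive: false),
    '</p><p>',
  );

  // Rename <font> and </font> to <span> and </span>, keeping attributes.
  out = out.replaceAllMapped(
    RegExp(r'<(/?)font\b', caseSensitive: false),
    (m) => '<${m[1]}span',
  );

  return out;
}
```

Regex-based HTML rewriting is fragile on malformed markup, which is one reason a real implementation works on DOM nodes instead.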

2. Unlikely candidate removal

Elements whose class or id attributes match patterns associated with non-content are removed. Mozilla's regex patterns identify common names like sidebar, comment, footer, banner, menu, social, sponsor, ad-break, and many more. Semantic elements (<article>, <main>, <section>) are always preserved. A companion "ok maybe" regex rescues elements containing words like article, content, body, or main from being incorrectly stripped.
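
As a rough illustration of how the two patterns interact, the sketch below uses abbreviated regexes built only from the names listed above. The function name `shouldStrip` and its parameters are hypothetical, and the real pattern lists are longer.

```dart
// Abbreviated patterns for illustration; the ported lists contain more names.
final unlikelyCandidates = RegExp(
  r'sidebar|comment|footer|banner|menu|social|sponsor|ad-break',
  caseSensitive: false,
);
final okMaybeCandidate =
    RegExp(r'article|content|body|main', caseSensitive: false);

// Semantic tags that are never stripped at this stage.
const alwaysKept = {'article', 'main', 'section'};

/// Returns true when an element should be removed before scoring.
bool shouldStrip(String tag, String className, String id) {
  if (alwaysKept.contains(tag.toLowerCase())) return false;
  final matchString = '$className $id';
  return unlikelyCandidates.hasMatch(matchString) &&
      !okMaybeCandidate.hasMatch(matchString);
}
```

With these patterns, a `<div class="sidebar">` is stripped, while a `<div class="main-sidebar">` survives because `main` triggers the rescue pattern.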

3. Content scoring

This is the heart of the algorithm. Every <p>, <td>, <pre>, and qualifying <li> element is evaluated:

  • Base score: 1 point per content block
  • Punctuation bonus: +1 per comma (a signal of real prose)
  • Length bonus: +1 per 100 characters, capped at 3
  • Class/ID weight: ±25 points based on whether the element's class and id match "positive" patterns (like article, post, entry, content) or "negative" patterns (like comment, sidebar, widget, ad)
  • Tag weight: <div> gets +5, <article> gets +10, <form> gets -3, headings get -5
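
The text-based part of the score and the tag weights can be written down directly from the list above. This is a sketch under the stated rules, not the package's code; `baseContentScore` and `tagWeight` are hypothetical names.

```dart
import 'dart:math' as math;

/// Score contributed by one content block's text:
/// 1 base point, +1 per comma, +1 per 100 characters (at most 3).
double baseContentScore(String text) {
  final trimmed = text.trim();
  var score = 1.0;
  score += ','.allMatches(trimmed).length;
  score += math.min(trimmed.length ~/ 100, 3);
  return score;
}

/// Initial weight of a container element, by tag name.
int tagWeight(String tag) {
  switch (tag.toLowerCase()) {
    case 'div':
      return 5;
    case 'article':
      return 10;
    case 'form':
      return -3;
    case 'h1':
    case 'h2':
    case 'h3':
    case 'h4':
    case 'h5':
    case 'h6':
      return -5;
    default:
      return 0;
  }
}
```

For example, a 250-character paragraph with two commas would score 1 + 2 + 2 = 5 under these rules.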

Scores propagate upward to ancestor elements: the parent receives the full score, the grandparent receives half, and deeper ancestors receive progressively smaller fractions. This causes the innermost container that wraps the most content to accumulate the highest score — exactly the element we want.
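
The propagation rule can be sketched as follows. The parent and grandparent dividers (1 and 2) come from the description above; the `level * 3` divider for deeper ancestors is how Readability.js does it and is assumed here for the port. Ancestors are represented by hypothetical string keys rather than DOM nodes.

```dart
/// Divider applied at each ancestor level (0 = parent, 1 = grandparent).
/// Deeper levels use level * 3, as in Readability.js (assumed for this port).
int scoreDivider(int level) {
  if (level == 0) return 1;
  if (level == 1) return 2;
  return level * 3;
}

/// Adds [contentScore] to each ancestor, nearest ancestor first.
/// Returns the running totals so several paragraphs can accumulate.
Map<String, double> propagateScore(
  double contentScore,
  List<String> ancestors, {
  Map<String, double>? scores,
}) {
  final totals = scores ?? <String, double>{};
  for (var level = 0; level < ancestors.length; level++) {
    final share = contentScore / scoreDivider(level);
    totals.update(ancestors[level], (v) => v + share, ifAbsent: () => share);
  }
  return totals;
}
```

A paragraph scoring 12 would add 12 to its parent, 6 to its grandparent, and 2 to its great-grandparent.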

Finally, every candidate's score is adjusted by its link density (the ratio of text inside <a> tags to total text). Navigation-heavy blocks have high link density and are penalized accordingly. Hash-only links (#anchors) receive a reduced penalty coefficient of 0.3.
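
A minimal sketch of the link-density adjustment, working on text lengths instead of DOM nodes. The `LinkText` class and both function names are hypothetical, and the `score * (1 - density)` form is the one used by Readability.js, assumed here for the port.

```dart
/// Text length of one <a> element, plus whether its href is a bare #anchor.
class LinkText {
  const LinkText(this.length, {this.hashOnly = false});
  final int length;
  final bool hashOnly;
}

/// Ratio of link text to total text; hash-only links count at 0.3.
double linkDensity(int totalTextLength, List<LinkText> links) {
  if (totalTextLength == 0) return 0;
  var linkLength = 0.0;
  for (final link in links) {
    linkLength += link.length * (link.hashOnly ? 0.3 : 1.0);
  }
  return linkLength / totalTextLength;
}

/// Scales a candidate's score down by its link density (assumed formula).
double adjustForLinks(double score, double density) => score * (1 - density);
```

A block whose text is half ordinary links loses half its score, while the same amount of in-page anchor text costs it only 15%.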

4. Candidate selection

The top 5 scoring candidates are evaluated. For each one, the algorithm:

  • Extracts the candidate and its qualifying siblings (siblings scoring at least 20% of the candidate's score, or sharing the same class name with low link density)
  • Cleans the extracted content (see step 5)
  • Checks if the result has enough text (≥ 500 characters)

If the top candidate produces too little content after cleaning — for example, a newsletter form that scores high from its label <p> tags — the algorithm moves to the next candidate. It also compares candidates: if a candidate produces under 1000 characters but a lower-ranked candidate has 3x more raw text, the smaller one is skipped.

If no scored candidate produces enough content, the algorithm falls back to:

  1. The common ancestor of the top two candidates (useful for listing pages where many small articles share a parent container)
  2. Semantic elements: <main>, [role="main"], or <article>
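
The selection loop can be sketched with precomputed lengths standing in for real extraction and cleaning. Everything here is illustrative: the `Candidate` class and function names are hypothetical, and the "3x" rule is read as comparing raw text lengths, which is one possible interpretation of the description above. A `null` result means the caller should move on to the fallbacks.

```dart
/// A scored element with its text length before and after cleaning.
class Candidate {
  const Candidate(this.name, this.score, this.rawLength, this.cleanedLength);
  final String name;
  final double score;
  final int rawLength;
  final int cleanedLength;
}

/// A sibling joins the extraction if it scores at least 20% of the candidate.
bool siblingQualifies(double siblingScore, double candidateScore) =>
    siblingScore >= candidateScore * 0.2;

/// Picks the first of the top 5 candidates that yields enough text.
Candidate? selectCandidate(List<Candidate> candidates) {
  final ranked = [...candidates]..sort((a, b) => b.score.compareTo(a.score));
  final top = ranked.take(5).toList();
  for (var i = 0; i < top.length; i++) {
    final c = top[i];
    if (c.cleanedLength < 500) continue; // too little text after cleaning
    // Skip short results when a lower-ranked candidate has 3x the raw text.
    final dwarfed = c.cleanedLength < 1000 &&
        top.skip(i + 1).any((o) => o.rawLength >= 3 * c.rawLength);
    if (dwarfed) continue;
    return c;
  }
  return null; // caller falls back to common ancestor or semantic elements
}
```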

5. Cleaning

The extracted content goes through several cleaning passes:

  • Tag removal: forms, fieldsets, iframes, inputs, buttons, footers, asides, navigation, and share/social elements are removed
  • Conditional cleanup: tables, divs, sections, and lists are evaluated individually. They're removed if they have high link density, too many inputs, too few paragraphs relative to images, or suspiciously short content — unless they contain enough commas (≥ 10, indicating real prose) or are inside a <figure>
  • Header cleanup: headings (<h1> through <h6>) with negative class weight are removed
  • Empty element removal: elements with no text and no media (<img>, <video>, <audio>) are recursively removed

6. Retry with relaxed heuristics

If the first attempt produces fewer than 500 characters, the algorithm retries up to 3 more times with progressively relaxed settings:

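| Attempt | Unlikely stripping | Class weighting | Conditional cleaning |
|---------|--------------------|-----------------|----------------------|
| 1       | Yes                | Yes             | Yes                  |
| 2       | No                 | Yes             | Yes                  |
| 3       | Yes                | No              | Yes                  |
| 4       | No                 | No              | No                   |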

This mirrors Mozilla's flag-based retry system (FLAG_STRIP_UNLIKELYS, FLAG_WEIGHT_CLASSES, FLAG_CLEAN_CONDITIONALLY). Each retry restores the original DOM from a clone, so previous modifications don't accumulate.
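
The retry schedule in the table can be expressed as data plus a loop. This is a sketch: `Flags`, `attempts`, and `extractWithRetries` are hypothetical names, the extraction itself is passed in as a callback, and returning the longest result when every attempt falls short is an assumption borrowed from Readability.js.

```dart
/// One row of the retry table.
class Flags {
  const Flags({
    required this.stripUnlikelys,
    required this.weightClasses,
    required this.cleanConditionally,
  });
  final bool stripUnlikelys;
  final bool weightClasses;
  final bool cleanConditionally;
}

/// The four attempts, in the order listed in the table.
const attempts = [
  Flags(stripUnlikelys: true, weightClasses: true, cleanConditionally: true),
  Flags(stripUnlikelys: false, weightClasses: true, cleanConditionally: true),
  Flags(stripUnlikelys: true, weightClasses: false, cleanConditionally: true),
  Flags(stripUnlikelys: false, weightClasses: false, cleanConditionally: false),
];

/// Runs [extractOnce] with each flag set until the text is long enough.
/// [extractOnce] is expected to start from a fresh copy of the DOM each time.
String extractWithRetries(
  String Function(Flags flags) extractOnce, {
  int charThreshold = 500,
}) {
  var best = '';
  for (final flags in attempts) {
    final text = extractOnce(flags);
    if (text.length >= charThreshold) return text;
    if (text.length > best.length) best = text; // remember longest so far
  }
  return best; // assumption: fall back to the longest attempt
}
```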

7. Title extraction

The title is extracted separately, checking in order:

  1. <meta property="og:title"> (Open Graph)
  2. <title> tag (with site name removal: splits on separators such as |, -, /, and », and keeps the longest part)
  3. <h1> element
  4. The fallbackTitle parameter
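
The fallback chain and the site-name removal can be sketched on plain strings. The function names are hypothetical, the separator set is limited to |, -, /, and », and requiring whitespace around a separator (so hyphenated words stay intact) is an assumption of this sketch, as is returning an empty string when nothing is found.

```dart
/// Splits a <title> on common separators and keeps the longest part.
/// Assumption: a separator only counts when surrounded by whitespace.
String stripSiteName(String title) {
  final parts = title
      .split(RegExp(r'\s+[|\-/»]\s+'))
      .map((p) => p.trim())
      .where((p) => p.isNotEmpty)
      .toList();
  if (parts.isEmpty) return title.trim();
  return parts.reduce((a, b) => b.length > a.length ? b : a);
}

/// Returns the first non-empty title source, in priority order.
String pickTitle({
  String? ogTitle,
  String? titleTag,
  String? h1,
  String? fallbackTitle,
}) {
  final candidates = [
    ogTitle,
    titleTag == null ? null : stripSiteName(titleTag),
    h1,
    fallbackTitle,
  ];
  for (final candidate in candidates) {
    final trimmed = candidate?.trim() ?? '';
    if (trimmed.isNotEmpty) return trimmed;
  }
  return ''; // assumption: empty string when no source yields a title
}
```

Keeping the longest part is a heuristic: it works when the headline is longer than the site name, which is the common case.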

API reference

readability()

ReadabilityResult readability(
  String html, {
  String? fallbackTitle,
});
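
| Parameter     | Type    | Description                                               |
|---------------|---------|-----------------------------------------------------------|
| html          | String  | Raw HTML string to extract content from                   |
| fallbackTitle | String? | Title to use when none can be determined from the document |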

ReadabilityResult

| Property    | Type   | Description                                                  |
|-------------|--------|--------------------------------------------------------------|
| title       | String | Article title extracted from metadata or document structure  |
| textContent | String | Article content as clean plain text (no HTML tags)           |
| htmlContent | String | Article content as cleaned HTML with structure preserved     |

Testing

This package is validated against all 130 real-world HTML fixtures from Mozilla's Readability test suite. These fixtures cover a wide range of real websites including:

NYTimes, BBC, CNN, Wikipedia, Medium, The Guardian, The Verge, Engadget, BuzzFeed, Ars Technica, Hacker News, WordPress blogs, Blogger, Tumblr, Yahoo, Washington Post, WebMD, and many more.

dart test

The fixture files (29 MB) are excluded from the published package but are available in the repository.

Regex patterns reference

These are ported directly from Mozilla Readability.js:

| Pattern             | Purpose                               | Examples                                                                  |
|---------------------|---------------------------------------|---------------------------------------------------------------------------|
| Unlikely candidates | Elements to remove before scoring     | sidebar, banner, comment, footer, menu, social, sponsor, ad-break, popup  |
| OK maybe            | Rescue elements from unlikely removal | article, body, content, main, column, shadow                              |
| Positive            | Boost element score via class/ID      | article, content, entry, post, text, blog, story                          |
| Negative            | Penalize element score via class/ID   | comment, footer, sidebar, widget, share, promo, sponsor, ad               |
| Share elements      | Remove during cleaning                | share, sharedaddy                                                         |
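
To show how the positive and negative patterns turn into the ±25 class/ID weight, here is a sketch with abbreviated regexes taken from the examples above. The name `classWeight` is hypothetical. The bare `ad` entry is left out because, as a plain substring, it would also match words like `header`; the ported pattern presumably anchors it.

```dart
// Abbreviated patterns for illustration; the ported lists contain more names.
final positive = RegExp(
  r'article|content|entry|post|text|blog|story',
  caseSensitive: false,
);
final negative = RegExp(
  r'comment|footer|sidebar|widget|share|promo|sponsor',
  caseSensitive: false,
);

/// Class/ID weight: class and id are checked separately, 25 points each way.
int classWeight(String className, String id) {
  var weight = 0;
  for (final value in [className, id]) {
    if (value.isEmpty) continue;
    if (negative.hasMatch(value)) weight -= 25;
    if (positive.hasMatch(value)) weight += 25;
  }
  return weight;
}
```

A value that matches both lists, such as `article-footer`, nets out to zero.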

Differences from Mozilla Readability.js

This is a faithful port but not a line-by-line translation. Key differences:

  • Language: Pure Dart using the html package for DOM parsing instead of browser DOM or jsdom
  • Enhanced candidate fallback: When the top-scoring candidate produces poor content after cleaning, lower-ranked candidates are tried automatically
  • Common ancestor fallback: For listing/homepage pages where content is spread across many small <article> elements, the algorithm finds their common ancestor
  • List content scoring: <li> elements with substantial text (≥ 80 characters, no nested <p>) are scored alongside paragraphs, improving extraction for list-heavy content like release notes and changelogs
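
The `<li>` qualification rule can be sketched on the item's inner HTML. This is a string-level approximation with a hypothetical function name; the package presumably checks the parsed DOM instead.

```dart
/// True when a list item has at least 80 characters of text
/// and contains no nested <p> element.
bool liQualifies(String innerHtml) {
  // Matches "<p>" or "<p ...>" but not "<pre>" or "<picture>".
  final hasParagraph =
      RegExp(r'<p[\s>]', caseSensitive: false).hasMatch(innerHtml);
  // Strip tags to approximate the item's text content.
  final text = innerHtml.replaceAll(RegExp(r'<[^>]*>'), '').trim();
  return !hasParagraph && text.length >= 80;
}
```

Items that wrap their text in `<p>` are skipped here because the paragraph itself is already scored, which avoids counting the same text twice.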

Acknowledgements

This package is based on Mozilla's Readability.js, the library behind Firefox Reader View. The test fixtures are sourced from the same project and are used under the Apache 2.0 license.

Topics

#html #readability #text-extraction #web-scraping

License

MIT

Dependencies

html
