Trafilatura Dart: Discover and Extract Text Data on the Web #

Introduction #

Trafilatura Dart is a cutting-edge Dart package and command-line tool designed to gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data. It includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments.

This is a Dart port of the popular Python trafilatura library, migrated by kamranxdev (Kamran Khan).

Features #

Advanced web crawling and text discovery:
- Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
- Smart crawling and URL management (filtering and deduplication)
Parallel processing of online and offline input:
- Live URLs, efficient and polite processing of download queues
- Previously downloaded HTML files and parsed HTML trees
Robust and configurable extraction of key elements:
- Main text (common patterns and generic algorithms)
- Metadata (title, author, date, site name, categories and tags)
- Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks
- Optional elements: comments, links, images, tables
Multiple output formats:
- TXT and Markdown
- CSV
- JSON
- HTML, XML and XML-TEI
Actively maintained Dart implementation:
- Compatible with Dart 3.0+ and Flutter
- Null-safe code

Platform Support #

Trafilatura Dart works across all Dart and Flutter platforms:

✅ Fully Supported Platforms #

Platform	Status	Notes
Dart VM	✅ Full support	All features available
Flutter Android	✅ Full support	Mobile web scraping
Flutter iOS	✅ Full support	Mobile web scraping
Flutter Web	✅ Full support	Browser-based extraction
Flutter Desktop (Windows)	✅ Full support	Native desktop apps
Flutter Desktop (macOS)	✅ Full support	Native desktop apps
Flutter Desktop (Linux)	✅ Full support	Native desktop apps
Command-line	✅ Full support	Via `dart pub global activate`

📦 Dependency Platform Support #

All dependencies are pure Dart packages with full cross-platform support:

Package	Version	Platforms	Purpose
`html`	^0.15.4	All platforms	HTML parsing and DOM manipulation
`xml`	^6.4.0	All platforms	XML generation and parsing
`http`	^1.1.0	All platforms	HTTP client for web requests
`crypto`	^3.0.3	All platforms	Hashing (Simhash, MD5)
`charset`	^2.0.1	All platforms	Character encoding detection
`intl`	^0.18.1	All platforms	Date parsing & formatting
`args`	^2.4.2	All platforms	CLI argument parsing
`convert`	^3.1.1	All platforms	Data encoding/decoding
`collection`	^1.18.0	All platforms	Collection utilities
`path`	^1.8.3	All platforms	File path manipulation

Note: No platform-specific or native dependencies required. The package works identically across all platforms.

Installation #

As a dependency #

Add to your pubspec.yaml:

dependencies:
  trafilatura: ^1.0.0

Then run:

dart pub get

Global CLI installation #

dart pub global activate trafilatura

Usage #

As a Dart library #

import 'package:trafilatura/trafilatura.dart';

void main() async {
  // Extract text from HTML string
  const html = '''
    <html>
      <body>
        <article>
          <h1>Article Title</h1>
          <p>This is the main content of the article.</p>
        </article>
      </body>
    </html>
  ''';
  
  final text = extract(html);
  print(text);
  
  // Extract with options
  final result = extract(
    html,
    includeFormatting: true,
    includeLinks: true,
    includeImages: true,
    outputFormat: 'xml',
  );
  
  // Extract metadata
  final metadata = extractMetadata(html);
  print('Title: ${metadata?.title}');
  print('Author: ${metadata?.author}');
  
  // Download and extract from URL
  final content = await fetchAndExtract('https://example.org');
  print(content);
}

Bare extraction (returns structured data) #

final result = bareExtraction(html);
print(result.text);
print(result.title);
print(result.author);
print(result.date);

Output formats #

// Plain text (default)
final text = extract(html);

// JSON output
final json = extract(html, outputFormat: 'json');

// XML output
final xml = extract(html, outputFormat: 'xml');

// XML-TEI output
final tei = extract(html, outputFormat: 'xmltei');

// CSV output
final csv = extract(html, outputFormat: 'csv');

Command-line usage #

# Extract from URL
trafilatura -u https://example.org

# Extract from file
trafilatura -i input.html

# Process directory
trafilatura --input-dir ./pages --output-dir ./output

# Include formatting
trafilatura -u https://example.org --formatting --links

# Output as JSON
trafilatura -u https://example.org -f json

# Discover feed URLs
trafilatura --feed https://example.org

# Discover sitemap URLs
trafilatura --sitemap https://example.org

# Crawl website
trafilatura --crawl https://example.org --limit 100

CLI Options #

-i, --input-file         Name of input file for batch processing
    --input-dir          Read files from a specified directory
-u, --URL                Custom URL download
    --parallel           Number of parallel downloads (default: 4)
-o, --output-dir         Write results to specified directory
    --feed               Look for feeds and/or pass a feed URL
    --sitemap            Look for sitemaps for the given website
    --crawl              Crawl a fixed number of pages
-f, --fast               Fast extraction without fallback
    --formatting         Include text formatting
    --links              Include links with targets
    --images             Include image sources
    --no-comments        Don't output comments
    --no-tables          Don't output table elements
    --target-language    Target language (ISO 639-1 code)
    --output-format      Output format (txt, json, xml, xmltei, csv)

Feed and Sitemap Discovery #

// Find feed URLs in HTML
final feeds = findFeedUrls(html, baseUrl);

// Extract URLs from feed
final feedUrls = extractFeedUrls(feedContent);

// Find sitemap URLs
final sitemaps = getSitemapUrls('https://example.org');

// Extract URLs from sitemap
final urls = extractSitemapUrls(sitemapContent);

Deduplication #

// Content store for duplicate detection
final store = ContentStore(threshold: 0.9);

store.add('Content of first document');

if (store.isDuplicate('Similar content')) {
  print('Duplicate detected');
}

// URL deduplication
final uniqueUrls = deduplicateUrls(urlList);

Configuration #

// Use custom configuration
final config = Extractor(
  minOutputSize: 100,
  minExtractedSize: 50,
  includeComments: true,
  includeTables: true,
  includeFormatting: true,
);

final result = extract(html, config: config);

Running Tests #

# Run all tests
dart test

# Run specific test file
dart test test/unit_test.dart

# Run with coverage
dart test --coverage=coverage

Development #

# Get dependencies
dart pub get

# Analyze code
dart analyze

# Format code
dart format lib/ bin/ test/

# Run the CLI
dart run bin/trafilatura.dart --help

License #

This package is distributed under the Apache 2.0 license.

Contributing #

Contributions of all kinds are welcome! Please read the Contributing guide for more information.

Dart Port #

This Dart port was created and is maintained by kamranxdev (Kamran Khan).

The port migrates the original Python implementation to Dart, leveraging Dart-native packages and idioms while preserving the core extraction algorithms and functionality.

Key Dart Dependencies #

Package	Purpose
`html`	HTML parsing
`xml`	XML generation and parsing
`http`	HTTP client for downloads
`crypto`	Hashing for deduplication
`args`	Command-line argument parsing
`intl`	Internationalization
`charset`	Character encoding detection

Original Project #

Based on the original Python trafilatura library by Adrien Barbaresi:

Citation #

If you use this library in academic work, please cite the original paper and this Dart port:

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

@software{khan-trafilatura-dart,
  title = {{Trafilatura Dart: A Dart Port of Trafilatura}},
  author = "Khan, Kamran",
  url = "https://github.com/kamranxdev/trafilatura-dart",
  year = 2026,
  version = {1.0.0},
}

trafilatura 1.0.0
trafilatura: ^1.0.0 copied to clipboard

Metadata

Trafilatura Dart: Discover and Extract Text Data on the Web #

Introduction #

Features #

Platform Support #

✅ Fully Supported Platforms #

📦 Dependency Platform Support #

Installation #

As a dependency #

Global CLI installation #

Usage #

As a Dart library #

Bare extraction (returns structured data) #

Output formats #

Command-line usage #

CLI Options #

Feed and Sitemap Discovery #

Deduplication #

Configuration #

Running Tests #

Development #

License #

Contributing #

Dart Port #

Key Dart Dependencies #

Original Project #

Citation #

← Metadata

Documentation

Publisher

Weekly Downloads

Metadata

License

Dependencies

More

trafilatura 1.0.0 trafilatura: ^1.0.0 copied to clipboard

Metadata

Trafilatura Dart: Discover and Extract Text Data on the Web #

Introduction #

Features #

Platform Support #

✅ Fully Supported Platforms #

📦 Dependency Platform Support #

Installation #

As a dependency #

Global CLI installation #

Usage #

As a Dart library #

Bare extraction (returns structured data) #

Output formats #

Command-line usage #

CLI Options #

Feed and Sitemap Discovery #

Deduplication #

Configuration #

Running Tests #

Development #

License #

Contributing #

Dart Port #

Key Dart Dependencies #

Original Project #

Citation #

← Metadata

Documentation

Publisher

Weekly Downloads

Metadata

License

Dependencies

More

trafilatura 1.0.0
trafilatura: ^1.0.0 copied to clipboard