trafilatura 1.0.1
Dart port of Trafilatura - A library for web scraping, text extraction, and metadata extraction from HTML documents. Ported from the original Python library by kamranxdev (Kamran Khan).
Trafilatura Dart: Discover and Extract Text Data on the Web #
Introduction #
Trafilatura Dart is a Dart package and command-line tool for gathering text on the Web and turning raw HTML into structured, meaningful data. It includes the discovery and text processing components needed for web crawling, downloads, scraping, and extraction of main texts, metadata and comments.
This is a Dart port of the popular Python trafilatura library, migrated by kamranxdev (Kamran Khan).
Features #
- Advanced web crawling and text discovery:
  - Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
  - Smart crawling and URL management (filtering and deduplication)
- Parallel processing of online and offline input:
  - Live URLs, efficient and polite processing of download queues
  - Previously downloaded HTML files and parsed HTML trees
- Robust and configurable extraction of key elements:
  - Main text (common patterns and generic algorithms)
  - Metadata (title, author, date, site name, categories and tags)
  - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks
  - Optional elements: comments, links, images, tables
- Multiple output formats:
  - TXT and Markdown
  - CSV
  - JSON
  - HTML, XML and XML-TEI
- Actively maintained Dart implementation:
  - Compatible with Dart 3.0+ and Flutter
  - Null-safe code
Platform Support #
Trafilatura Dart works across all Dart and Flutter platforms:
✅ Fully Supported Platforms #
| Platform | Status | Notes |
|---|---|---|
| Dart VM | ✅ Full support | All features available |
| Flutter Android | ✅ Full support | Mobile web scraping |
| Flutter iOS | ✅ Full support | Mobile web scraping |
| Flutter Web | ✅ Full support | Browser-based extraction |
| Flutter Desktop (Windows) | ✅ Full support | Native desktop apps |
| Flutter Desktop (macOS) | ✅ Full support | Native desktop apps |
| Flutter Desktop (Linux) | ✅ Full support | Native desktop apps |
| Command-line | ✅ Full support | Via dart pub global activate |
📦 Dependency Platform Support #
All dependencies are pure Dart packages with full cross-platform support:
| Package | Version | Platforms | Purpose |
|---|---|---|---|
| html | ^0.15.4 | All platforms | HTML parsing and DOM manipulation |
| xml | ^6.4.0 | All platforms | XML generation and parsing |
| http | ^1.1.0 | All platforms | HTTP client for web requests |
| crypto | ^3.0.3 | All platforms | Hashing (Simhash, MD5) |
| charset | ^2.0.1 | All platforms | Character encoding detection |
| intl | ^0.18.1 | All platforms | Date parsing & formatting |
| args | ^2.4.2 | All platforms | CLI argument parsing |
| convert | ^3.1.1 | All platforms | Data encoding/decoding |
| collection | ^1.18.0 | All platforms | Collection utilities |
| path | ^1.8.3 | All platforms | File path manipulation |
Note: No platform-specific or native dependencies are required. The package works identically across all platforms.
Installation #
As a dependency #
Add to your pubspec.yaml:
dependencies:
trafilatura: ^1.0.0
Then run:
dart pub get
Global CLI installation #
dart pub global activate trafilatura
Usage #
As a Dart library #
import 'package:trafilatura/trafilatura.dart';

void main() async {
  // Extract text from HTML string
  const html = '''
    <html>
      <body>
        <article>
          <h1>Article Title</h1>
          <p>This is the main content of the article.</p>
        </article>
      </body>
    </html>
  ''';
  final text = extract(html);
  print(text);

  // Extract with options
  final result = extract(
    html,
    includeFormatting: true,
    includeLinks: true,
    includeImages: true,
    outputFormat: 'xml',
  );

  // Extract metadata
  final metadata = extractMetadata(html);
  print('Title: ${metadata?.title}');
  print('Author: ${metadata?.author}');

  // Download and extract from URL
  final content = await fetchAndExtract('https://example.org');
  print(content);
}
Bare extraction (returns structured data) #
final result = bareExtraction(html);
print(result.text);
print(result.title);
print(result.author);
print(result.date);
Output formats #
// Plain text (default)
final text = extract(html);
// JSON output
final json = extract(html, outputFormat: 'json');
// XML output
final xml = extract(html, outputFormat: 'xml');
// XML-TEI output
final tei = extract(html, outputFormat: 'xmltei');
// CSV output
final csv = extract(html, outputFormat: 'csv');
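The JSON output can be consumed with `dart:convert`. The following is a minimal sketch, not part of the package: `readField` is a hypothetical helper, and the sample string with its `title` and `text` keys is an assumed stand-in for real output, since the exact field names are not documented here.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes a JSON document and returns the string stored
/// under [key], or null when the document is not a JSON object or the key is
/// missing or not a string. Throws a FormatException on malformed JSON.
String? readField(String jsonOutput, String key) {
  final decoded = jsonDecode(jsonOutput);
  if (decoded is! Map<String, dynamic>) return null;
  final value = decoded[key];
  return value is String ? value : null;
}

void main() {
  // Assumed sample; in practice this would be the string returned by
  // extract(html, outputFormat: 'json').
  const sample =
      '{"title": "Article Title", "text": "This is the main content of the article."}';

  print(readField(sample, 'title'));
  print(readField(sample, 'date')); // key absent in the sample
}
```

Returning `null` for absent keys keeps the caller's code the same whether or not a given page yielded that piece of metadata.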
Command-line usage #
# Extract from URL
trafilatura -u https://example.org
# Extract from file
trafilatura -i input.html
# Process directory
trafilatura --input-dir ./pages --output-dir ./output
# Include formatting
trafilatura -u https://example.org --formatting --links
# Output as JSON
trafilatura -u https://example.org --output-format json
# Discover feed URLs
trafilatura --feed https://example.org
# Discover sitemap URLs
trafilatura --sitemap https://example.org
# Crawl website
trafilatura --crawl https://example.org --limit 100
CLI Options #
-i, --input-file Name of input file for batch processing
--input-dir Read files from a specified directory
-u, --URL Custom URL download
--parallel Number of parallel downloads (default: 4)
-o, --output-dir Write results to specified directory
--feed Look for feeds and/or pass a feed URL
--sitemap Look for sitemaps for the given website
--crawl Crawl a fixed number of pages
-f, --fast Fast extraction without fallback
--formatting Include text formatting
--links Include links with targets
--images Include image sources
--no-comments Don't output comments
--no-tables Don't output table elements
--target-language Target language (ISO 639-1 code)
--output-format Output format (txt, json, xml, xmltei, csv)
Feed and Sitemap Discovery #
// Find feed URLs in HTML
final feeds = findFeedUrls(html, baseUrl);
// Extract URLs from feed
final feedUrls = extractFeedUrls(feedContent);
// Find sitemap URLs
final sitemaps = getSitemapUrls('https://example.org');
// Extract URLs from sitemap
final urls = extractSitemapUrls(sitemapContent);
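To illustrate what sitemap URL extraction involves, here is a self-contained sketch that uses only the Dart core library. It is not the package's implementation: `sketchSitemapUrls` is a hypothetical name, the XML is scanned with a regular expression instead of a real parser, and XML entities such as `&amp;` are left undecoded.

```dart
/// Hypothetical illustration: pulls candidate URLs out of sitemap content.
/// Handles XML sitemaps (<loc> elements) and TXT sitemaps (one URL per line).
List<String> sketchSitemapUrls(String content) {
  // XML case: capture the trimmed text between <loc> and </loc>.
  final locPattern = RegExp(r'<loc>\s*(.*?)\s*</loc>', dotAll: true);
  final fromXml =
      locPattern.allMatches(content).map((m) => m.group(1)!).toList();
  if (fromXml.isNotEmpty) return fromXml;

  // TXT case: keep only lines that look like absolute HTTP(S) URLs.
  return content
      .split('\n')
      .map((line) => line.trim())
      .where((line) =>
          line.startsWith('http://') || line.startsWith('https://'))
      .toList();
}

void main() {
  const xmlSitemap = '''
<urlset>
  <url><loc>https://example.org/a</loc></url>
  <url><loc> https://example.org/b </loc></url>
</urlset>
''';
  const txtSitemap = 'https://example.org/a\n\nhttps://example.org/b\n';

  print(sketchSitemapUrls(xmlSitemap));
  print(sketchSitemapUrls(txtSitemap));
}
```

A sitemap index file also uses `<loc>` elements, so the same scan returns the nested sitemap URLs, which would then need to be fetched and processed in turn.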
Deduplication #
// Content store for duplicate detection
final store = ContentStore(threshold: 0.9);
store.add('Content of first document');
if (store.isDuplicate('Similar content')) {
  print('Duplicate detected');
}
// URL deduplication
final uniqueUrls = deduplicateUrls(urlList);
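The dependency table mentions Simhash, a fingerprinting technique for near-duplicate detection. The sketch below shows the general idea in plain Dart; it is an assumption-laden illustration, not the code behind `ContentStore`. The names `fnv1a32`, `simhash`, and `similarity` are hypothetical, the token hash is an FNV-1a-style stand-in, and the 32-bit fingerprint size is chosen for brevity. It is written for the Dart VM, where integers are 64 bits wide.

```dart
/// FNV-1a-style 32-bit hash over UTF-16 code units (stand-in token hash).
int fnv1a32(String input) {
  var hash = 0x811c9dc5;
  for (final unit in input.codeUnits) {
    hash ^= unit;
    hash = (hash * 0x01000193) & 0xffffffff;
  }
  return hash;
}

/// Builds a 32-bit Simhash fingerprint: each token votes +1 or -1 on every
/// bit position, and the sign of the total decides the fingerprint bit.
int simhash(String text) {
  final votes = List<int>.filled(32, 0);
  final tokens =
      text.toLowerCase().split(RegExp(r'\W+')).where((t) => t.isNotEmpty);
  for (final token in tokens) {
    final h = fnv1a32(token);
    for (var bit = 0; bit < 32; bit++) {
      votes[bit] += ((h >> bit) & 1) == 1 ? 1 : -1;
    }
  }
  var fingerprint = 0;
  for (var bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint;
}

/// Similarity in [0, 1]: one minus the normalized Hamming distance.
double similarity(int a, int b) {
  var diff = a ^ b;
  var distance = 0;
  while (diff != 0) {
    distance += diff & 1;
    diff >>= 1;
  }
  return 1 - distance / 32;
}

void main() {
  final first = simhash('The quick brown fox jumps over the lazy dog');
  final same = simhash('The quick brown fox jumps over the lazy dog');
  final other = simhash('Sitemaps and feeds list the pages of a website');

  // Identical inputs produce identical fingerprints, hence similarity 1.0.
  print(similarity(first, same));
  // Unrelated texts usually score noticeably lower.
  print(similarity(first, other));
}
```

A threshold such as the `0.9` passed to `ContentStore` above would then correspond to treating two documents as duplicates when their similarity score is at or above that value, assuming the store compares fingerprints in a comparable way.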
Configuration #
// Use custom configuration
final config = Extractor(
  minOutputSize: 100,
  minExtractedSize: 50,
  includeComments: true,
  includeTables: true,
  includeFormatting: true,
);
final result = extract(html, config: config);
Running Tests #
# Run all tests
dart test
# Run specific test file
dart test test/unit_test.dart
# Run with coverage
dart test --coverage=coverage
Development #
# Get dependencies
dart pub get
# Analyze code
dart analyze
# Format code
dart format lib/ bin/ test/
# Run the CLI
dart run bin/trafilatura.dart --help
License #
This package is distributed under the Apache 2.0 license.
Contributing #
Contributions of all kinds are welcome! Please read the Contributing guide for more information.
Dart Port #
This Dart port was created and is maintained by kamranxdev (Kamran Khan).
The port migrates the original Python implementation to Dart, leveraging Dart-native packages and idioms while preserving the core extraction algorithms and functionality.
Key Dart Dependencies #
| Package | Purpose |
|---|---|
| html | HTML parsing |
| xml | XML generation and parsing |
| http | HTTP client for downloads |
| crypto | Hashing for deduplication |
| args | Command-line argument parsing |
| intl | Internationalization |
| charset | Character encoding detection |
Original Project #
Based on the original Python trafilatura library by Adrien Barbaresi.
Citation #
If you use this library in academic work, please cite the original paper and this Dart port:
@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

@software{khan-trafilatura-dart,
  title = {{Trafilatura Dart: A Dart Port of Trafilatura}},
  author = "Khan, Kamran",
  url = "https://github.com/kamranxdev/trafilatura",
  year = 2026,
  version = {1.0.0},
}