Dart Web Crawler

A powerful, parallel web crawler for Dart, inspired by Scrapy. Define crawlers as WebCrawler subclasses to scrape websites concurrently, then export the scraped data through pipelines in multiple formats.

Features

  • Parallel Web Crawlers: Scrape multiple pages concurrently for high efficiency.
  • Customizable Pipelines: Export scraped data to CSV, XML, or JSON.
  • File Downloading: Built-in support for downloading files during scraping.
  • Flexible Parsing: Easily extract and process data from web pages.

Installation

Add the package to your pubspec.yaml:

dependencies:
  dart_web_crawler: latest_version # replace with the latest published version

Then, run:

dart pub get

Usage

Define a Web Crawler

import 'package:dart_web_crawler/dart_web_crawler.dart';

class PastebinItem implements JsonItem {
  final String title;
  final Uri link;
  final DateTime createdAt;

  PastebinItem({
    required this.title,
    required this.link,
    required this.createdAt,
  });

  @override
  Map<String, dynamic> toJson() {
    return {
      "createdAt": createdAt.toIso8601String(),
      "title": title,
      "link": link.toString(),
    };
  }
}

class PastebinCrawler extends StaticWebCrawler<PastebinItem> {
  PastebinCrawler() : super(name: "PasteBin");

  @override
  Stream<CrawlRequest> getUrls() async* {
    yield CrawlRequest.get(Uri.parse("https://pastebin.com/archive"));
  }

  @override
  parseResponse(CrawlResponse response, CrawlDocument crawlDocument) async* {
    if (crawlDocument is! HTMLDocument) return;

    final document = crawlDocument.document;
    // the first row contains the column headers, so we skip it
    final tableRows = document.querySelectorAll("tbody > tr").skip(1);

    for (final row in tableRows) {
      // only the first cell (title + link) is used here
      final [titleCell, _, _] = row.querySelectorAll("td");

      final href = titleCell.querySelector("a")!.attributes["href"]!;

      yield ParsedData(
        PastebinItem(
          // archive links are relative (e.g. "/abc123"), so resolve them
          link: Uri.parse("https://pastebin.com").resolve(href),
          title: titleCell.text.trim(),
          // the page only shows relative ages, so record the scrape time
          createdAt: DateTime.now(),
        ),
      );
    }
  }
}

Future<void> main() async {
  final crawler = PastebinCrawler();

  final pipelines = {
    PastebinItem: [JSONPipeline<PastebinItem>(outputFile: "pastebin.json")],
  };

  final scraper = Girasol(crawlers: [crawler], pipelines: pipelines);

  await scraper.execute();
}
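
Because getUrls returns a stream, a crawler can enqueue any number of requests and the engine fetches them in parallel. A minimal sketch of paginated crawling, assuming a site with numbered archive pages (the example.com URLs are placeholders, not a real endpoint):

@override
Stream<CrawlRequest> getUrls() async* {
  // One request per page; the crawler fetches them concurrently.
  for (var page = 1; page <= 5; page++) {
    yield CrawlRequest.get(Uri.parse("https://example.com/archive/$page"));
  }
}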

Export Data

Pipelines are registered per item type, and each type can be exported to several formats at once:

final pipelines = {
  PastebinItem: [
    CSVFilePipeline<PastebinItem>(outputFile: "output.csv"),
    JSONPipeline<PastebinItem>(outputFile: "output.json"),
  ],
};
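
The same map can route different item types, produced by different crawlers, to their own pipelines. A sketch, where UserItem stands in for any second item class you define (it is not part of this package):

final pipelines = {
  PastebinItem: [JSONPipeline<PastebinItem>(outputFile: "pastes.json")],
  UserItem: [CSVFilePipeline<UserItem>(outputFile: "users.csv")],
};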

Download Files

Register FileDownloadPipeline like any other pipeline to store files fetched during a crawl:

final pipelines = {
  // pipelines are keyed by the item type they consume; Directory is from dart:io
  PastebinItem: [FileDownloadPipeline(storageFolder: Directory("downloads/"))],
};
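
Since pipelines for one item type run side by side, a single crawl can export metadata and download files in one pass. A sketch, under the assumption that FileDownloadPipeline composes with the export pipelines this way:

final pipelines = {
  PastebinItem: [
    JSONPipeline<PastebinItem>(outputFile: "pastebin.json"),
    FileDownloadPipeline(storageFolder: Directory("downloads/")),
  ],
};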

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

MIT License
