girasol 0.0.2

A fully asynchronous web crawler written in Dart

Dart Web Crawler #

A powerful and parallel web crawler for Dart, inspired by Scrapy. This package allows you to scrape websites efficiently using WebCrawler instances and export data in multiple formats.

Features #

  • Parallel Web Crawlers: Scrape multiple pages concurrently for high efficiency.
  • Customizable Pipelines: Export scraped data to CSV, XML, or JSON.
  • File Downloading: Built-in support for downloading files during scraping.
  • Flexible Parsing: Easily extract and process data from web pages.

Installation #

Add the package to your pubspec.yaml:

dependencies:
  girasol: ^0.0.2

Then, run:

dart pub get

Usage #

Define a Web Crawler #

import 'package:girasol/girasol.dart';

class PastebinItem implements JsonItem {
  final String title;
  final Uri link;
  final DateTime createdAt;

  PastebinItem({
    required this.title,
    required this.link,
    required this.createdAt,
  });

  @override
  Map<String, dynamic> toJson() {
    return {
      "createdAt": createdAt.toIso8601String(),
      "title": title,
      "link": link.toString(),
    };
  }
}

class PastebinCrawler extends StaticWebCrawler<PastebinItem> {
  PastebinCrawler() : super(name: "PasteBin");

  @override
  Stream<CrawlRequest> getUrls() async* {
    yield CrawlRequest.get(Uri.parse("https://pastebin.com/archive"));
  }

  @override
  parseResponse(CrawlResponse response, CrawlDocument crawlDocument) async* {
    if (crawlDocument is! HTMLDocument) return;

    final document = crawlDocument.document;
    // The first row of the table is the header row, so skip it.
    final tableRows = document.querySelectorAll("tbody > tr").skip(1);

    for (final row in tableRows) {
      // Only the first cell (title and link) is used here; the other
      // two cells are ignored.
      final [titleCell, _, _] = row.querySelectorAll("td");

      yield ParsedData(
        PastebinItem(
          // hrefs on the archive page are relative, so resolve them
          // against the site root.
          link: Uri.parse("https://pastebin.com")
              .resolve(titleCell.querySelector("a")!.attributes["href"]!),
          title: titleCell.text,
          // The archive lists only a relative age, so record the crawl time.
          createdAt: DateTime.now(),
        ),
      );
    }
    }
  }
}

Future<void> main() async {
  final crawler = PastebinCrawler();

  final pipelines = {
    PastebinItem: [JSONPipeline<PastebinItem>(outputFile: "pastebin.json")],
  };

  final scraper = Girasol(crawlers: [crawler], pipelines: pipelines);

  await scraper.execute();
}
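The `JSONPipeline` above serializes each `PastebinItem` through its `toJson()` method, so an entry in `pastebin.json` would look roughly like this (the values are illustrative, and whether the file holds a single JSON array or one object per line depends on the pipeline implementation):

```json
{
  "createdAt": "2025-03-20T12:00:00.000",
  "title": "example paste",
  "link": "https://pastebin.com/AbCdEfGh"
}
```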

Export Data #

Configure pipelines to export scraped data:

final pipelines = {
  PastebinItem: [
    CSVFilePipeline<PastebinItem>(outputFile: "output.csv"), 
    JSONPipeline<PastebinItem>(outputFile: "output.json")
  ],
};

Download Files #

final pipelines = {
  // Directory comes from dart:io.
  FileDownloadPipeline(storageFolder: Directory("downloads/"))
};

Contributing #

Contributions are welcome! Feel free to open issues or submit pull requests.

License #

MIT License
