WebScraper class

A web scraper that uses proxies to avoid detection and blocking

Available extensions

WebScraperExtension
WebScraperIntelligentScrapingExtension
WebScraperPerformanceExtension

Constructors

WebScraper.new({required ProxyManager proxyManager, ProxyHttpClient? httpClient, String? defaultUserAgent, Map<String, String>? defaultHeaders, int defaultTimeout = 30000, int maxRetries = 3, AdaptiveScrapingStrategy? adaptiveStrategy, SiteReputationTracker? reputationTracker, ScrapingLogger? logger, RobotsTxtHandler? robotsTxtHandler, StreamingHtmlParser? streamingParser, ContentValidator? contentValidator, StructuredDataValidator? structuredDataValidator, SelectorValidator? selectorValidator, RateLimiter? rateLimiter, RequestQueue? requestQueue, StructuredDataExtractor? structuredDataExtractor, ContentDetector? contentDetector, TextExtractor? textExtractor, HeadlessBrowser? headlessBrowser, LazyLoadDetector? lazyLoadDetector, LazyLoadHandler? lazyLoadHandler, PaginationHandler? paginationHandler, bool respectRobotsTxt = true})
Creates a new WebScraper with the given parameters
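
A minimal construction sketch: only proxyManager is required; every other parameter falls back to the defaults shown in the signature. How a ProxyManager is configured is not covered on this page, so `myProxyManager` below is a placeholder for an already-configured instance.

```dart
// Sketch only: `myProxyManager` stands in for a configured ProxyManager
// (its setup is documented elsewhere).
final scraper = WebScraper(
  proxyManager: myProxyManager,
  defaultUserAgent: 'Mozilla/5.0 (compatible; MyScraper/1.0)',
  defaultTimeout: 30000, // milliseconds, same as the default
  maxRetries: 3,
  respectRobotsTxt: true,
);
```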

Properties

contentDetector → ContentDetector
Gets the content detector
no setter
hashCode → int
The hash code for this object.
no setter
inherited
headlessBrowser → HeadlessBrowser
Gets the headless browser
no setter
lazyLoadHandler → LazyLoadHandler
Gets the lazy load handler
no setter
logger → ScrapingLogger
Gets the scraping logger
no setter
paginationHandler → PaginationHandler
Gets the pagination handler
no setter
proxyManager → ProxyManager
The proxy manager for getting proxies
final
rateLimiter → RateLimiter
Gets the rate limiter
no setter
reputationTracker → SiteReputationTracker
Gets the site reputation tracker
no setter
requestQueue → RequestQueue
Gets the request queue
no setter
runtimeType → Type
A representation of the runtime type of the object.
no setter
inherited
textExtractor → TextExtractor
Gets the text extractor
no setter

Methods

close() → void
Closes the HTTP client and other resources
createCacheManager({String namespace = 'web_scraper', Logger? logger}) → DataCacheManager

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Creates a data cache manager for caching scraping results
createDataChunker({int chunkSize = DataChunker.defaultChunkSize, Logger? logger}) → DataChunker

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Creates a data chunker for handling large datasets
createTaskScheduler({TaskSchedulerConfig? config, ResourceMonitor? resourceMonitor, Logger? logger}) → TaskScheduler

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Creates a task scheduler for parallel scraping
detectMainContent(String html) → ContentDetectionResult

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Detects the main content area of a webpage
extractArticleInfo(String html) → List<Map<String, dynamic>>
Extracts article information from HTML
extractContentWithPagination({required String url, required PaginationConfig paginationConfig, LazyLoadConfig? lazyLoadConfig, TextExtractionOptions textExtractionOptions = const TextExtractionOptions(), Map<String, String>? headers, int? timeout, int? retries}) → Future<List<TextExtractionResult>>

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Extracts the main content from multiple pages with pagination
extractData({required String html, required String selector, String? attribute, bool asText = true, bool validateContent = true, bool validateSelector = true}) → List<String>
Parses HTML content and extracts data using CSS selectors
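
For example, assuming `html` holds a fetched page, element text or attribute values can be pulled out like this (the selectors are illustrative, not from any real site):

```dart
// Extract the text of every element matching the selector.
final titles = scraper.extractData(
  html: html,
  selector: 'h2.title', // hypothetical selector
);

// Extract an attribute value instead of element text.
final links = scraper.extractData(
  html: html,
  selector: 'a.item-link', // hypothetical selector
  attribute: 'href',
  asText: false,
);
```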
extractDataStream({required String url, required String selector, String? attribute, bool asText = true, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<String>
Extracts data from a URL using streaming for memory efficiency
extractProductInfo(String html) → List<Map<String, dynamic>>
Extracts product information from HTML
extractSchemaType({required String html, required String schemaType, List<StructuredDataType> preferredTypes = const [StructuredDataType.jsonLd, StructuredDataType.microdata, StructuredDataType.rdfa]}) → Map<String, dynamic>?
Extracts data of a specific schema type from HTML
extractStructuredData({required String html, required Map<String, String> selectors, Map<String, String?>? attributes, bool validateContent = true, bool validateSelectors = true, List<String> requiredFields = const []}) → List<Map<String, String>>
Parses HTML content and extracts structured data using CSS selectors
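
A sketch of extracting one record per matched element, on the assumption that the keys of the selectors map become the keys of each result map (field names and selectors here are illustrative):

```dart
final products = scraper.extractStructuredData(
  html: html,
  selectors: {
    'name': '.product .name',   // hypothetical selectors
    'price': '.product .price',
  },
  requiredFields: ['name'], // skip records with no name
);
```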
extractStructuredDataStream({required String url, required Map<String, String> selectors, Map<String, String?>? attributes, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<Map<String, String>>
Extracts structured data from a URL using streaming for memory efficiency
extractStructuredMetadata({required String html, StructuredDataType? type}) → List<StructuredDataExtractionResult>
Extracts structured data from HTML using JSON-LD, Microdata, RDFa, etc.
extractText(String html, {TextExtractionOptions options = const TextExtractionOptions()}) → TextExtractionResult

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Extracts clean, readable text from HTML
fetchFromProblematicSite({required String url, Map<String, String>? headers, int? timeout = 60000, int? retries = 5}) → Future<String>

Available on WebScraper, provided by the WebScraperExtension extension

Fetches HTML content from a problematic site using specialized techniques
fetchHtml({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<String>
Fetches HTML content from the given URL
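
Typical usage, assuming a constructed `scraper`; when `timeout` and `retries` are omitted they fall back to the constructor defaults (the URL is illustrative):

```dart
final html = await scraper.fetchHtml(
  url: 'https://example.com/catalog', // illustrative URL
  headers: {'Accept-Language': 'en-US'},
  timeout: 15000,
  retries: 2,
);
```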
fetchHtmlStream({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Stream<List<int>>>
Fetches HTML content as a stream from the given URL
fetchHtmlWithCache({required String url, required DataCacheManager cacheManager, Map<String, String>? headers, int? timeout, int? retries, DataCacheOptions cacheOptions = const DataCacheOptions()}) → Future<String>

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Fetches HTML with caching
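
A sketch pairing this with createCacheManager from the performance extension (namespace and URL are illustrative):

```dart
final cache = scraper.createCacheManager(namespace: 'catalog');
// Repeated fetches of the same URL can then be served from the cache.
final html = await scraper.fetchHtmlWithCache(
  url: 'https://example.com/catalog', // illustrative URL
  cacheManager: cache,
);
```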
fetchHtmlWithLazyLoading({required String url, LazyLoadConfig config = const LazyLoadConfig(), Map<String, String>? headers}) → Future<LazyLoadResult>

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Fetches HTML content with lazy loading support
fetchHtmlWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<String>
Fetches HTML content with rate limiting
fetchHtmlWithRetry({required String url, Map<String, String>? headers, int? timeout, int? retries, int initialBackoffMs = 500, double backoffMultiplier = 1.5, int maxBackoffMs = 10000}) → Future<String>

Available on WebScraper, provided by the WebScraperExtension extension

Fetches HTML content with enhanced error handling and retry logic
fetchJson({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>>
Fetches JSON content from the given URL
fetchJsonWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>>
Fetches JSON content with rate limiting
noSuchMethod(Invocation invocation) → dynamic
Invoked when a nonexistent method or property is accessed.
inherited
scrapeInParallel<T>({required List<String> urls, required Future<T> extractor(String html, String url), required TaskScheduler scheduler, Map<String, String>? headers, int? timeout, int? retries, TaskPriority priority = TaskPriority.normal, int maxRetries = 3}) → Future<List<T>>

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Scrapes multiple URLs in parallel
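
A sketch combining this with createTaskScheduler from the same extension; the extractor callback runs once per fetched page (URLs and selector are illustrative):

```dart
final scheduler = scraper.createTaskScheduler();
final allTitles = await scraper.scrapeInParallel<List<String>>(
  urls: [
    'https://example.com/page/1', // illustrative URLs
    'https://example.com/page/2',
  ],
  // Extract the titles from each page as it arrives.
  extractor: (html, url) async =>
      scraper.extractData(html: html, selector: 'h2.title'),
  scheduler: scheduler,
);
```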
scrapeWithChunking<T>({required String url, required DataChunker dataChunker, required FutureOr<T> processor(String chunk, T? previousResult), Map<String, String>? headers, int? timeout, int? retries, T? initialResult}) → Future<T>

Available on WebScraper, provided by the WebScraperPerformanceExtension extension

Scrapes a URL with chunked processing for large HTML documents
scrapeWithLazyLoadingAndPagination<T>({required String url, required PaginationConfig paginationConfig, required LazyLoadConfig lazyLoadConfig, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>>

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Fetches HTML content with both lazy loading and pagination support
scrapeWithPagination<T>({required String url, required PaginationConfig config, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>>

Available on WebScraper, provided by the WebScraperIntelligentScrapingExtension extension

Scrapes multiple pages with pagination
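
A sketch of the call shape; PaginationConfig's own parameters are not shown on this page, so `myPaginationConfig` stands in for a pre-built configuration describing how to find the next page:

```dart
final result = await scraper.scrapeWithPagination<List<String>>(
  url: 'https://example.com/list', // illustrative URL
  config: myPaginationConfig, // assumed pre-built PaginationConfig
  // Runs once per page; results for all pages are collected
  // in the returned PaginationResult.
  extractor: (html, pageUrl) async =>
      scraper.extractData(html: html, selector: '.item'),
);
```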
setDomainRateLimit({required String domain, int? requestsPerMinute, int? requestsPerHour, int? requestsPerDay}) → void
Sets the rate limit for a specific domain
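
For instance, capping requests to one domain and then fetching through the rate-limited path (the domain and limits are illustrative):

```dart
scraper.setDomainRateLimit(
  domain: 'example.com',
  requestsPerMinute: 10,
  requestsPerHour: 300,
);
// Subsequent rate-limited fetches to example.com honor these caps.
final html = await scraper.fetchHtmlWithRateLimiting(
  url: 'https://example.com/data',
);
```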
toString() → String
A string representation of this object.
inherited

Operators

operator ==(Object other) → bool
The equality operator.
inherited