WebScraper class
A web scraper that uses proxies to avoid detection and blocking.
Available extensions: WebScraperExtension, WebScraperIntelligentScrapingExtension, WebScraperPerformanceExtension
Constructors
- WebScraper.new({required ProxyManager proxyManager, ProxyHttpClient? httpClient, String? defaultUserAgent, Map<String, String>? defaultHeaders, int defaultTimeout = 30000, int maxRetries = 3, AdaptiveScrapingStrategy? adaptiveStrategy, SiteReputationTracker? reputationTracker, ScrapingLogger? logger, RobotsTxtHandler? robotsTxtHandler, StreamingHtmlParser? streamingParser, ContentValidator? contentValidator, StructuredDataValidator? structuredDataValidator, SelectorValidator? selectorValidator, RateLimiter? rateLimiter, RequestQueue? requestQueue, StructuredDataExtractor? structuredDataExtractor, ContentDetector? contentDetector, TextExtractor? textExtractor, HeadlessBrowser? headlessBrowser, LazyLoadDetector? lazyLoadDetector, LazyLoadHandler? lazyLoadHandler, PaginationHandler? paginationHandler, bool respectRobotsTxt = true}) - Creates a new WebScraper with the given parameters.
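A minimal construction sketch. The import path and the ProxyManager setup are assumptions (they depend on how the package is installed and configured); the remaining arguments use the defaults documented above.

```dart
// Illustrative import; substitute the package's actual library import.
// import 'package:your_scraping_package/your_scraping_package.dart';

void main() async {
  // Assumption: ProxyManager has a zero-argument constructor and is
  // configured elsewhere (proxy sources, rotation policy, etc.).
  final proxyManager = ProxyManager();

  final scraper = WebScraper(
    proxyManager: proxyManager,
    defaultUserAgent: 'MyScraper/1.0',
    defaultTimeout: 30000, // milliseconds
    maxRetries: 3,
    respectRobotsTxt: true,
  );

  // ... use the scraper ...

  scraper.close(); // release the HTTP client and other resources
}
```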
Properties
- contentDetector → ContentDetector - Gets the content detector. no setter
- hashCode → int - The hash code for this object. no setter, inherited
- headlessBrowser → HeadlessBrowser - Gets the headless browser. no setter
- lazyLoadHandler → LazyLoadHandler - Gets the lazy load handler. no setter
- logger → ScrapingLogger - Gets the scraping logger. no setter
- paginationHandler → PaginationHandler - Gets the pagination handler. no setter
- proxyManager → ProxyManager - The proxy manager for getting proxies. final
- rateLimiter → RateLimiter - Gets the rate limiter. no setter
- reputationTracker → SiteReputationTracker - Gets the site reputation tracker. no setter
- requestQueue → RequestQueue - Gets the request queue. no setter
- runtimeType → Type - A representation of the runtime type of the object. no setter, inherited
- textExtractor → TextExtractor - Gets the text extractor. no setter
Methods
- close() → void - Closes the HTTP client and other resources.
- createCacheManager({String namespace = 'web_scraper', Logger? logger}) → DataCacheManager - Creates a data cache manager for caching scraping results. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- createDataChunker({int chunkSize = DataChunker.defaultChunkSize, Logger? logger}) → DataChunker - Creates a data chunker for handling large datasets. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- createTaskScheduler({TaskSchedulerConfig? config, ResourceMonitor? resourceMonitor, Logger? logger}) → TaskScheduler - Creates a task scheduler for parallel scraping. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- detectMainContent(String html) → ContentDetectionResult - Detects the main content area of a webpage. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- extractArticleInfo(String html) → List<Map<String, dynamic>> - Extracts article information from HTML.
- extractContentWithPagination({required String url, required PaginationConfig paginationConfig, LazyLoadConfig? lazyLoadConfig, TextExtractionOptions textExtractionOptions = const TextExtractionOptions(), Map<String, String>? headers, int? timeout, int? retries}) → Future<List<TextExtractionResult>> - Extracts the main content from multiple pages with pagination. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- extractData({required String html, required String selector, String? attribute, bool asText = true, bool validateContent = true, bool validateSelector = true}) → List<String> - Parses HTML content and extracts data using CSS selectors.
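A usage sketch for extractData; the URL, selector, and attribute are placeholders for whatever the target page actually uses.

```dart
Future<void> extractHeadlines(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com');

  // Text content of each matching element.
  final titles = scraper.extractData(html: html, selector: 'a.headline');

  // Attribute values instead of text; assumption: `attribute` takes
  // precedence over `asText` when both could apply.
  final links = scraper.extractData(
    html: html,
    selector: 'a.headline',
    attribute: 'href',
  );

  for (var i = 0; i < titles.length; i++) {
    print('${titles[i]} -> ${links[i]}');
  }
}
```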
- extractDataStream({required String url, required String selector, String? attribute, bool asText = true, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<String> - Extracts data from a URL using streaming for memory efficiency.
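A streaming sketch: matching items are consumed one at a time rather than held in a list. The URL and selector are placeholders.

```dart
Future<void> streamItems(WebScraper scraper) async {
  final stream = scraper.extractDataStream(
    url: 'https://example.com/large-page',
    selector: 'div.item',
    chunkSize: 512 * 1024, // smaller chunks for tighter memory use
  );

  await for (final item in stream) {
    print(item); // process each extracted value as it arrives
  }
}
```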
- extractProductInfo(String html) → List<Map<String, dynamic>> - Extracts product information from HTML.
- extractSchemaType({required String html, required String schemaType, List<StructuredDataType> preferredTypes = const [StructuredDataType.jsonLd, StructuredDataType.microdata, StructuredDataType.rdfa]}) → Map<String, dynamic>? - Extracts data of a specific schema type from HTML.
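A sketch using a schema.org type name ('Product' is an illustration); the returned field names depend on the page's markup.

```dart
void productSchema(WebScraper scraper, String html) {
  // JSON-LD is tried first by default, then Microdata, then RDFa.
  final product = scraper.extractSchemaType(html: html, schemaType: 'Product');

  if (product != null) {
    print(product['name']); // field names mirror the page's schema markup
  }
}
```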
- extractStructuredData({required String html, required Map<String, String> selectors, Map<String, String?>? attributes, bool validateContent = true, bool validateSelectors = true, List<String> requiredFields = const []}) → List<Map<String, String>> - Parses HTML content and extracts structured data using CSS selectors.
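A sketch mapping result field names to CSS selectors; all selectors here are placeholders.

```dart
Future<void> scrapeListings(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com/listings');

  final rows = scraper.extractStructuredData(
    html: html,
    selectors: {
      'title': 'h2.title',
      'price': 'span.price',
      'link': 'a.details',
    },
    attributes: {'link': 'href'}, // take href for 'link', text for the rest
    requiredFields: ['title'],    // drop rows without a title
  );

  for (final row in rows) {
    print('${row['title']}: ${row['price']} (${row['link']})');
  }
}
```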
- extractStructuredDataStream({required String url, required Map<String, String> selectors, Map<String, String?>? attributes, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<Map<String, String>> - Extracts structured data from a URL using streaming for memory efficiency.
- extractStructuredMetadata({required String html, StructuredDataType? type}) → List<StructuredDataExtractionResult> - Extracts structured data from HTML using JSON-LD, Microdata, RDFa, etc.
- extractText(String html, {TextExtractionOptions options = const TextExtractionOptions()}) → TextExtractionResult - Extracts clean, readable text from HTML. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
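A readability sketch; the URL is a placeholder, and `result.text` is an assumed field name on TextExtractionResult.

```dart
Future<void> readableText(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com/article');

  final result = scraper.extractText(html);
  // Assumption: TextExtractionResult exposes the extracted text as `text`.
  print(result.text);
}
```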
- fetchFromProblematicSite({required String url, Map<String, String>? headers, int? timeout = 60000, int? retries = 5}) → Future<String> - Fetches HTML content from a problematic site using specialized techniques. (Available on WebScraper via the WebScraperExtension extension.)
- fetchHtml({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<String> - Fetches HTML content from the given URL.
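A basic fetch sketch using only the parameters listed above; the URL and header values are placeholders.

```dart
Future<void> fetchExample(WebScraper scraper) async {
  final html = await scraper.fetchHtml(
    url: 'https://example.com/page',
    headers: {'Accept-Language': 'en-US'},
    timeout: 15000, // override the 30s default for this request
    retries: 2,
  );
  print('Fetched ${html.length} characters');
}
```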
- fetchHtmlStream({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Stream<List<int>>> - Fetches HTML content as a stream from the given URL.
- fetchHtmlWithCache({required String url, required DataCacheManager cacheManager, Map<String, String>? headers, int? timeout, int? retries, DataCacheOptions cacheOptions = const DataCacheOptions()}) → Future<String> - Fetches HTML with caching. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
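A caching sketch combining createCacheManager and fetchHtmlWithCache; the namespace and URL are placeholders.

```dart
Future<void> cachedFetch(WebScraper scraper) async {
  final cache = scraper.createCacheManager(namespace: 'news_site');

  // First call hits the network; repeat calls can be served from the cache.
  final html = await scraper.fetchHtmlWithCache(
    url: 'https://example.com/news',
    cacheManager: cache,
  );
  print(html.length);
}
```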
- fetchHtmlWithLazyLoading({required String url, LazyLoadConfig config = const LazyLoadConfig(), Map<String, String>? headers}) → Future<LazyLoadResult> - Fetches HTML content with lazy loading support. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
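A lazy-loading sketch; the URL is a placeholder, and `result.html` is an assumed field name on LazyLoadResult.

```dart
Future<void> lazyFetch(WebScraper scraper) async {
  final result = await scraper.fetchHtmlWithLazyLoading(
    url: 'https://example.com/infinite-scroll',
    config: const LazyLoadConfig(), // defaults; tune scrolling via the config
  );
  // Assumption: LazyLoadResult exposes the fully loaded HTML as `html`.
  print(result.html.length);
}
```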
- fetchHtmlWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<String> - Fetches HTML content with rate limiting.
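A rate-limited fetch sketch; the URL is a placeholder, and `RequestPriority.high` assumes the enum defines a value above `normal`.

```dart
Future<void> politeFetch(WebScraper scraper) async {
  final html = await scraper.fetchHtmlWithRateLimiting(
    url: 'https://example.com/search',
    // Assumption: RequestPriority defines `high` alongside `normal`.
    priority: RequestPriority.high,
  );
  print(html.length);
}
```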
- fetchHtmlWithRetry({required String url, Map<String, String>? headers, int? timeout, int? retries, int initialBackoffMs = 500, double backoffMultiplier = 1.5, int maxBackoffMs = 10000}) → Future<String> - Fetches HTML content with enhanced error handling and retry logic. (Available on WebScraper via the WebScraperExtension extension.)
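A retry sketch with exponential backoff; the URL is a placeholder and the backoff values are illustrative overrides of the defaults above.

```dart
Future<void> resilientFetch(WebScraper scraper) async {
  final html = await scraper.fetchHtmlWithRetry(
    url: 'https://flaky.example.com',
    retries: 5,
    initialBackoffMs: 500,
    backoffMultiplier: 2.0, // 500ms, 1s, 2s, 4s, ... capped at maxBackoffMs
    maxBackoffMs: 10000,
  );
  print(html.length);
}
```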
- fetchJson({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>> - Fetches JSON content from the given URL.
- fetchJsonWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>> - Fetches JSON content with rate limiting.
- noSuchMethod(Invocation invocation) → dynamic - Invoked when a nonexistent method or property is accessed. inherited
- scrapeInParallel<T>({required List<String> urls, required Future<T> extractor(String html, String url), required TaskScheduler scheduler, Map<String, String>? headers, int? timeout, int? retries, TaskPriority priority = TaskPriority.normal, int maxRetries = 3}) → Future<List<T>> - Scrapes multiple URLs in parallel. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
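A parallel-scraping sketch combining createTaskScheduler and scrapeInParallel; the selector is a placeholder.

```dart
Future<void> parallelScrape(WebScraper scraper, List<String> urls) async {
  final scheduler = scraper.createTaskScheduler(); // default config

  final titles = await scraper.scrapeInParallel<String>(
    urls: urls,
    scheduler: scheduler,
    extractor: (html, url) async {
      final matches = scraper.extractData(html: html, selector: 'title');
      return matches.isEmpty ? url : matches.first;
    },
  );

  titles.forEach(print);
}
```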
- scrapeWithChunking<T>({required String url, required DataChunker dataChunker, required FutureOr<T> processor(String chunk, T? previousResult), Map<String, String>? headers, int? timeout, int? retries, T? initialResult}) → Future<T> - Scrapes a URL with chunked processing for large HTML documents. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
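A chunked-processing sketch: the processor folds over chunks instead of holding the whole document in memory. The URL is a placeholder.

```dart
Future<void> countChars(WebScraper scraper) async {
  final chunker = scraper.createDataChunker(); // DataChunker.defaultChunkSize

  final totalChars = await scraper.scrapeWithChunking<int>(
    url: 'https://example.com/huge-page',
    dataChunker: chunker,
    // Accumulate across chunks; `previous` is null on the first chunk.
    processor: (chunk, previous) => (previous ?? 0) + chunk.length,
  );
  print('Document is $totalChars characters long');
}
```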
- scrapeWithLazyLoadingAndPagination<T>({required String url, required PaginationConfig paginationConfig, required LazyLoadConfig lazyLoadConfig, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>> - Fetches HTML content with both lazy loading and pagination support. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- scrapeWithPagination<T>({required String url, required PaginationConfig config, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>> - Scrapes multiple pages with pagination. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
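A pagination sketch; the URL and selector are placeholders, `PaginationConfig()` assumes a default constructor that auto-detects "next" links, and `result.results` is an assumed field name on PaginationResult.

```dart
Future<void> paginatedScrape(WebScraper scraper) async {
  final result = await scraper.scrapeWithPagination<List<String>>(
    url: 'https://example.com/products?page=1',
    config: PaginationConfig(), // assumption: defaults detect next-page links
    extractor: (html, pageUrl) async =>
        scraper.extractData(html: html, selector: 'h2.product-name'),
  );

  // Assumption: PaginationResult<T> exposes per-page values as `results`.
  for (final page in result.results) {
    page.forEach(print);
  }
}
```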
- setDomainRateLimit({required String domain, int? requestsPerMinute, int? requestsPerHour, int? requestsPerDay}) → void - Sets the rate limit for a specific domain.
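A small sketch capping request volume for one host before scraping it; the domain and limits are placeholders.

```dart
void configureLimits(WebScraper scraper) {
  scraper.setDomainRateLimit(
    domain: 'example.com',
    requestsPerMinute: 10,
    requestsPerHour: 300,
  );
}
```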
- toString() → String - A string representation of this object. inherited
Operators
- operator ==(Object other) → bool - The equality operator. inherited