WebScraper class
A web scraper that uses proxies to avoid detection and blocking.
Available extensions: WebScraperExtension, WebScraperIntelligentScrapingExtension, WebScraperPerformanceExtension
Constructors
- WebScraper.new({required ProxyManager proxyManager, ProxyHttpClient? httpClient, String? defaultUserAgent, Map<String, String>? defaultHeaders, int defaultTimeout = 30000, int maxRetries = 3, AdaptiveScrapingStrategy? adaptiveStrategy, SiteReputationTracker? reputationTracker, ScrapingLogger? logger, RobotsTxtHandler? robotsTxtHandler, StreamingHtmlParser? streamingParser, ContentValidator? contentValidator, StructuredDataValidator? structuredDataValidator, SelectorValidator? selectorValidator, RateLimiter? rateLimiter, RequestQueue? requestQueue, StructuredDataExtractor? structuredDataExtractor, ContentDetector? contentDetector, TextExtractor? textExtractor, HeadlessBrowser? headlessBrowser, LazyLoadDetector? lazyLoadDetector, LazyLoadHandler? lazyLoadHandler, PaginationHandler? paginationHandler, bool respectRobotsTxt = true}) - Creates a new WebScraper with the given parameters.
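A minimal construction sketch. The import path and the ProxyManager setup are assumptions (they depend on how the package is installed and configured); the remaining arguments use the defaults documented above.

```dart
// Illustrative import; substitute the package's actual library import.
// import 'package:your_scraping_package/your_scraping_package.dart';

void main() async {
  // Assumption: ProxyManager has a zero-argument constructor and is
  // configured elsewhere (proxy sources, rotation policy, etc.).
  final proxyManager = ProxyManager();

  final scraper = WebScraper(
    proxyManager: proxyManager,
    defaultUserAgent: 'MyScraper/1.0',
    defaultTimeout: 30000, // milliseconds
    maxRetries: 3,
    respectRobotsTxt: true,
  );

  // ... use the scraper ...

  scraper.close(); // release the HTTP client and other resources
}
```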
Properties
- contentDetector → ContentDetector - Gets the content detector. no setter
- hashCode → int - The hash code for this object. no setter, inherited
- headlessBrowser → HeadlessBrowser - Gets the headless browser. no setter
- lazyLoadHandler → LazyLoadHandler - Gets the lazy load handler. no setter
- logger → ScrapingLogger - Gets the scraping logger. no setter
- paginationHandler → PaginationHandler - Gets the pagination handler. no setter
- proxyManager → ProxyManager - The proxy manager for getting proxies. final
- rateLimiter → RateLimiter - Gets the rate limiter. no setter
- reputationTracker → SiteReputationTracker - Gets the site reputation tracker. no setter
- requestQueue → RequestQueue - Gets the request queue. no setter
- runtimeType → Type - A representation of the runtime type of the object. no setter, inherited
- textExtractor → TextExtractor - Gets the text extractor. no setter
Methods
- close() → void - Closes the HTTP client and other resources.
- createCacheManager({String namespace = 'web_scraper', Logger? logger}) → DataCacheManager - Creates a data cache manager for caching scraping results. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- createDataChunker({int chunkSize = DataChunker.defaultChunkSize, Logger? logger}) → DataChunker - Creates a data chunker for handling large datasets. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- createTaskScheduler({TaskSchedulerConfig? config, ResourceMonitor? resourceMonitor, Logger? logger}) → TaskScheduler - Creates a task scheduler for parallel scraping. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
- detectMainContent(String html) → ContentDetectionResult - Detects the main content area of a webpage. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- extractArticleInfo(String html) → List<Map<String, dynamic>> - Extracts article information from HTML.
- extractContentWithPagination({required String url, required PaginationConfig paginationConfig, LazyLoadConfig? lazyLoadConfig, TextExtractionOptions textExtractionOptions = const TextExtractionOptions(), Map<String, String>? headers, int? timeout, int? retries}) → Future<List<TextExtractionResult>> - Extracts the main content from multiple pages with pagination. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- extractData({required String html, required String selector, String? attribute, bool asText = true, bool validateContent = true, bool validateSelector = true}) → List<String> - Parses HTML content and extracts data using CSS selectors.
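A usage sketch for extractData; the URL, selector, and attribute are placeholders for whatever the target page actually uses.

```dart
Future<void> extractHeadlines(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com');

  // Text content of each matching element.
  final titles = scraper.extractData(html: html, selector: 'a.headline');

  // Attribute values instead of text; assumption: `attribute` takes
  // precedence over `asText` when both could apply.
  final links = scraper.extractData(
    html: html,
    selector: 'a.headline',
    attribute: 'href',
  );

  for (var i = 0; i < titles.length; i++) {
    print('${titles[i]} -> ${links[i]}');
  }
}
```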
- extractDataStream({required String url, required String selector, String? attribute, bool asText = true, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<String> - Extracts data from a URL using streaming for memory efficiency.
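A streaming sketch: matching items are consumed one at a time rather than held in a list. The URL and selector are placeholders.

```dart
Future<void> streamItems(WebScraper scraper) async {
  final stream = scraper.extractDataStream(
    url: 'https://example.com/large-page',
    selector: 'div.item',
    chunkSize: 512 * 1024, // smaller chunks for tighter memory use
  );

  await for (final item in stream) {
    print(item); // process each extracted value as it arrives
  }
}
```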
- extractProductInfo(String html) → List<Map<String, dynamic>> - Extracts product information from HTML.
- extractSchemaType({required String html, required String schemaType, List<StructuredDataType> preferredTypes = const [StructuredDataType.jsonLd, StructuredDataType.microdata, StructuredDataType.rdfa]}) → Map<String, dynamic>? - Extracts data of a specific schema type from HTML.
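A sketch using a schema.org type name ('Product' is an illustration); the returned field names depend on the page's markup.

```dart
void productSchema(WebScraper scraper, String html) {
  // JSON-LD is tried first by default, then Microdata, then RDFa.
  final product = scraper.extractSchemaType(html: html, schemaType: 'Product');

  if (product != null) {
    print(product['name']); // field names mirror the page's schema markup
  }
}
```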
- extractStructuredData({required String html, required Map<String, String> selectors, Map<String, String?>? attributes, bool validateContent = true, bool validateSelectors = true, List<String> requiredFields = const []}) → List<Map<String, String>> - Parses HTML content and extracts structured data using CSS selectors.
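A sketch mapping result field names to CSS selectors; all selectors here are placeholders.

```dart
Future<void> scrapeListings(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com/listings');

  final rows = scraper.extractStructuredData(
    html: html,
    selectors: {
      'title': 'h2.title',
      'price': 'span.price',
      'link': 'a.details',
    },
    attributes: {'link': 'href'}, // take href for 'link', text for the rest
    requiredFields: ['title'],    // drop rows without a title
  );

  for (final row in rows) {
    print('${row['title']}: ${row['price']} (${row['link']})');
  }
}
```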
- extractStructuredDataStream({required String url, required Map<String, String> selectors, Map<String, String?>? attributes, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false, int chunkSize = 1024 * 1024}) → Stream<Map<String, String>> - Extracts structured data from a URL using streaming for memory efficiency.
- extractStructuredMetadata({required String html, StructuredDataType? type}) → List<StructuredDataExtractionResult> - Extracts structured data from HTML using JSON-LD, Microdata, RDFa, etc.
- extractText(String html, {TextExtractionOptions options = const TextExtractionOptions()}) → TextExtractionResult - Extracts clean, readable text from HTML. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
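A readability sketch; the URL is a placeholder, and `result.text` is an assumed field name on TextExtractionResult.

```dart
Future<void> readableText(WebScraper scraper) async {
  final html = await scraper.fetchHtml(url: 'https://example.com/article');

  final result = scraper.extractText(html);
  // Assumption: TextExtractionResult exposes the extracted text as `text`.
  print(result.text);
}
```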
- fetchFromProblematicSite({required String url, Map<String, String>? headers, int? timeout = 60000, int? retries = 5}) → Future<String> - Fetches HTML content from a problematic site using specialized techniques. (Available on WebScraper via the WebScraperExtension extension.)
- fetchHtml({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<String> - Fetches HTML content from the given URL.
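A basic fetch sketch using only the parameters listed above; the URL and header values are placeholders.

```dart
Future<void> fetchExample(WebScraper scraper) async {
  final html = await scraper.fetchHtml(
    url: 'https://example.com/page',
    headers: {'Accept-Language': 'en-US'},
    timeout: 15000, // override the 30s default for this request
    retries: 2,
  );
  print('Fetched ${html.length} characters');
}
```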
- fetchHtmlStream({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Stream<List<int>>> - Fetches HTML content as a stream from the given URL.
- fetchHtmlWithCache({required String url, required DataCacheManager cacheManager, Map<String, String>? headers, int? timeout, int? retries, DataCacheOptions cacheOptions = const DataCacheOptions()}) → Future<String> - Fetches HTML with caching. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
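A caching sketch combining createCacheManager and fetchHtmlWithCache; the namespace and URL are placeholders.

```dart
Future<void> cachedFetch(WebScraper scraper) async {
  final cache = scraper.createCacheManager(namespace: 'news_site');

  // First call hits the network; repeat calls can be served from the cache.
  final html = await scraper.fetchHtmlWithCache(
    url: 'https://example.com/news',
    cacheManager: cache,
  );
  print(html.length);
}
```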
- fetchHtmlWithLazyLoading({required String url, LazyLoadConfig config = const LazyLoadConfig(), Map<String, String>? headers}) → Future<LazyLoadResult> - Fetches HTML content with lazy loading support. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
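A lazy-loading sketch; the URL is a placeholder, and `result.html` is an assumed field name on LazyLoadResult.

```dart
Future<void> lazyFetch(WebScraper scraper) async {
  final result = await scraper.fetchHtmlWithLazyLoading(
    url: 'https://example.com/infinite-scroll',
    config: const LazyLoadConfig(), // defaults; tune scrolling via the config
  );
  // Assumption: LazyLoadResult exposes the fully loaded HTML as `html`.
  print(result.html.length);
}
```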
- fetchHtmlWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<String> - Fetches HTML content with rate limiting.
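A rate-limited fetch sketch; the URL is a placeholder, and `RequestPriority.high` assumes the enum defines a value above `normal`.

```dart
Future<void> politeFetch(WebScraper scraper) async {
  final html = await scraper.fetchHtmlWithRateLimiting(
    url: 'https://example.com/search',
    // Assumption: RequestPriority defines `high` alongside `normal`.
    priority: RequestPriority.high,
  );
  print(html.length);
}
```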
- fetchHtmlWithRetry({required String url, Map<String, String>? headers, int? timeout, int? retries, int initialBackoffMs = 500, double backoffMultiplier = 1.5, int maxBackoffMs = 10000}) → Future<String> - Fetches HTML content with enhanced error handling and retry logic. (Available on WebScraper via the WebScraperExtension extension.)
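A retry sketch with exponential backoff; the URL is a placeholder and the backoff values are illustrative overrides of the defaults above.

```dart
Future<void> resilientFetch(WebScraper scraper) async {
  final html = await scraper.fetchHtmlWithRetry(
    url: 'https://flaky.example.com',
    retries: 5,
    initialBackoffMs: 500,
    backoffMultiplier: 2.0, // 500ms, 1s, 2s, 4s, ... capped at maxBackoffMs
    maxBackoffMs: 10000,
  );
  print(html.length);
}
```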
- fetchJson({required String url, Map<String, String>? headers, int? timeout, int? retries, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>> - Fetches JSON content from the given URL.
- fetchJsonWithRateLimiting({required String url, Map<String, String>? headers, int? timeout, int? retries, RequestPriority priority = RequestPriority.normal, bool ignoreRobotsTxt = false}) → Future<Map<String, dynamic>> - Fetches JSON content with rate limiting.
- noSuchMethod(Invocation invocation) → dynamic - Invoked when a nonexistent method or property is accessed. inherited
- scrapeInParallel<T>({required List<String> urls, required Future<T> extractor(String html, String url), required TaskScheduler scheduler, Map<String, String>? headers, int? timeout, int? retries, TaskPriority priority = TaskPriority.normal, int maxRetries = 3}) → Future<List<T>> - Scrapes multiple URLs in parallel. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
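A parallel-scraping sketch combining createTaskScheduler and scrapeInParallel; the selector is a placeholder.

```dart
Future<void> parallelScrape(WebScraper scraper, List<String> urls) async {
  final scheduler = scraper.createTaskScheduler(); // default config

  final titles = await scraper.scrapeInParallel<String>(
    urls: urls,
    scheduler: scheduler,
    extractor: (html, url) async {
      final matches = scraper.extractData(html: html, selector: 'title');
      return matches.isEmpty ? url : matches.first;
    },
  );

  titles.forEach(print);
}
```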
- scrapeWithChunking<T>({required String url, required DataChunker dataChunker, required FutureOr<T> processor(String chunk, T? previousResult), Map<String, String>? headers, int? timeout, int? retries, T? initialResult}) → Future<T> - Scrapes a URL with chunked processing for large HTML documents. (Available on WebScraper via the WebScraperPerformanceExtension extension.)
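A chunked-processing sketch: the processor folds over chunks instead of holding the whole document in memory. The URL is a placeholder.

```dart
Future<void> countChars(WebScraper scraper) async {
  final chunker = scraper.createDataChunker(); // DataChunker.defaultChunkSize

  final totalChars = await scraper.scrapeWithChunking<int>(
    url: 'https://example.com/huge-page',
    dataChunker: chunker,
    // Accumulate across chunks; `previous` is null on the first chunk.
    processor: (chunk, previous) => (previous ?? 0) + chunk.length,
  );
  print('Document is $totalChars characters long');
}
```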
- scrapeWithLazyLoadingAndPagination<T>({required String url, required PaginationConfig paginationConfig, required LazyLoadConfig lazyLoadConfig, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>> - Fetches HTML content with both lazy loading and pagination support. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
- scrapeWithPagination<T>({required String url, required PaginationConfig config, required Future<T> extractor(String html, String pageUrl), Map<String, String>? headers, int? timeout, int? retries}) → Future<PaginationResult<T>> - Scrapes multiple pages with pagination. (Available on WebScraper via the WebScraperIntelligentScrapingExtension extension.)
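A pagination sketch; the URL and selector are placeholders, `PaginationConfig()` assumes a default constructor that auto-detects "next" links, and `result.results` is an assumed field name on PaginationResult.

```dart
Future<void> paginatedScrape(WebScraper scraper) async {
  final result = await scraper.scrapeWithPagination<List<String>>(
    url: 'https://example.com/products?page=1',
    config: PaginationConfig(), // assumption: defaults detect next-page links
    extractor: (html, pageUrl) async =>
        scraper.extractData(html: html, selector: 'h2.product-name'),
  );

  // Assumption: PaginationResult<T> exposes per-page values as `results`.
  for (final page in result.results) {
    page.forEach(print);
  }
}
```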
- setDomainRateLimit({required String domain, int? requestsPerMinute, int? requestsPerHour, int? requestsPerDay}) → void - Sets the rate limit for a specific domain.
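A small sketch capping request volume for one host before scraping it; the domain and limits are placeholders.

```dart
void configureLimits(WebScraper scraper) {
  scraper.setDomainRateLimit(
    domain: 'example.com',
    requestsPerMinute: 10,
    requestsPerHour: 300,
  );
}
```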
- toString() → String - A string representation of this object. inherited
Operators
- operator ==(Object other) → bool - The equality operator. inherited