parse method

Future<Map<String, Object>> parse({
  required Data scrapedData,
  required ScraperConfig scraperConfig,
  bool debug = false,
  ProxyAPIConfig? overrideProxyAPIConfig,
})

Main entry point for parsing scraped HTML data.

This method orchestrates the entire parsing process:

  1. Builds a parent-to-children relationship map from all parsers
  2. Identifies root parsers (those with '_root' as parent)
  3. Executes parsers in hierarchical order
  4. Applies transformations and cleaning to extracted data
  5. Returns the final structured data
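
The first three steps can be sketched in isolation. This is a minimal illustration, not the library's code: `SketchParser`, `buildParentToChildrenMap`, and `executionOrder` are hypothetical names, the `id`/`parents` fields are assumed, and the depth-first order shown is only one plausible reading of "hierarchical order".

```dart
// Hypothetical stand-in for the library's Parser; only the fields this
// sketch needs (an ID and the IDs of its parents) are modeled.
class SketchParser {
  final String id;
  final List<String> parents;
  const SketchParser(this.id, this.parents);
}

// Step 1: group parsers under every parent ID they declare.
Map<String, List<SketchParser>> buildParentToChildrenMap(
  List<SketchParser> parsers,
) {
  final Map<String, List<SketchParser>> map = {};
  for (final parser in parsers) {
    for (final parent in parser.parents) {
      map.putIfAbsent(parent, () => <SketchParser>[]).add(parser);
    }
  }
  return map;
}

// Step 3 (assumed depth-first): visit each parser, then its children.
List<String> executionOrder(
  Map<String, List<SketchParser>> parentToChildren,
  List<SketchParser> parsers,
) {
  final List<String> order = [];
  for (final parser in parsers) {
    order.add(parser.id);
    final children = parentToChildren[parser.id] ?? const <SketchParser>[];
    order.addAll(executionOrder(parentToChildren, children));
  }
  return order;
}

void main() {
  const parsers = [
    SketchParser('product', ['_root']),
    SketchParser('title', ['product']),
    SketchParser('price', ['product']),
  ];
  final map = buildParentToChildrenMap(parsers);
  // Step 2: root parsers are the children of the special '_root' key.
  final roots = map['_root'] ?? const <SketchParser>[];
  print(executionOrder(map, roots)); // [product, title, price]
}
```

Keying the map by parent ID means the children of any parser, including the roots, can be looked up in a single map access while walking the hierarchy.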

Parameters:

  • scrapedData: The scraped data to parse, as a Data object containing the source URL and the HTML Document
  • scraperConfig: Configuration containing the parser definitions
  • debug: Enables debug logging for troubleshooting (defaults to false)
  • overrideProxyAPIConfig: Optional custom proxy API configuration; when provided, it overrides the one used for HTTP parser requests

Returns:

  • A Future that completes with a Map of extracted data keyed by parser ID; a 'url' entry holding the source URL is always present
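
Because the values are typed as Object, callers usually need a type check before using them. The sketch below shows one way to read such a map; the keys `title` and `tags`, the sample values, and the helper `readString` are hypothetical and stand in for whatever parser IDs your ScraperConfig defines.

```dart
// Hypothetical helper: returns the value for [key] only if it is a String.
String? readString(Map<String, Object> data, String key) {
  final Object? value = data[key];
  return value is String ? value : null;
}

void main() {
  // Hypothetical result; real keys depend on the configured parser IDs.
  final Map<String, Object> result = {
    'url': 'https://example.com/item/1',
    'title': 'Example item',
    'tags': ['a', 'b'],
  };

  print(readString(result, 'title')); // Example item
  print(readString(result, 'tags')); // null (value is a List, not a String)

  // Values of other types need their own checks.
  final Object? tags = result['tags'];
  if (tags is List) {
    print(tags.length); // 2
  }
}
```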

Implementation

Future<Map<String, Object>> parse({
  required Data scrapedData,
  required ScraperConfig scraperConfig,
  bool debug = false,
  ProxyAPIConfig? overrideProxyAPIConfig,
}) async {
  /// Start performance monitoring
  final Stopwatch stopwatch = Stopwatch()..start();

  printLog('Parser: Using scraper config...', debug, color: LogColor.blue);

  /// Get all parsers from the configuration
  final List<Parser> allParsers = scraperConfig.parsers.toList();

  /// Build parent-to-children relationship map for hierarchical parsing
  final Map<String, List<Parser>> parentToChildren =
      _buildParentToChildrenMap(allParsers);

  /// Identify root parsers (those that start the parsing chain)
  final List<Parser> rootParsers = parentToChildren['_root']?.toList() ?? [];

  /// Initialize with the source URL
  extractedData['url'] = scrapedData.url;

  /// Execute the parsing hierarchy starting with root parsers
  final Map<String, Object> parsedData = await _distributeParsers(
    parentToChildren: parentToChildren,
    parsers: rootParsers,
    parentData: scrapedData,
    overrideProxyAPIConfig: overrideProxyAPIConfig,
    debug: debug,
  );

  /// Ensure URL is always present in the final result
  parsedData.putIfAbsent('url', () => scrapedData.url.toString());

  /// Log parsing performance
  stopwatch.stop();
  printLog(
    'Parsing took ${stopwatch.elapsedMilliseconds} ms.',
    debug,
    color: LogColor.green,
  );

  return parsedData;
}