AI WebScraper


A powerful AI-powered web scraper for Dart that combines traditional web scraping with AI-based content extraction. Extract structured data from websites using OpenAI GPT or Google Gemini.

Features

  • 🤖 Multiple AI Providers: Support for OpenAI GPT and Google Gemini
  • 🌐 Smart Web Scraping: HTTP requests with HTML parsing and JavaScript rendering fallback
  • 📋 Schema-Based Extraction: Define JSON structures for consistent data extraction
  • ⚡ Batch Processing: Process multiple URLs concurrently with configurable limits
  • 🛡️ Type Safety: Full Dart type safety with comprehensive error handling
  • 🔄 Automatic Fallback: Falls back from HTTP to JavaScript scraping when needed

Quick Start

Installation

Add ai_webscraper to your pubspec.yaml:

dependencies:
  ai_webscraper: ^0.1.0

Then run:

dart pub get

Basic Usage

import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Initialize the scraper with OpenAI
  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: 'your-openai-api-key',
  );

  // Define what data you want to extract
  final schema = {
    'title': 'string',
    'description': 'string',
    'price': 'number',
  };

  // Extract data from a single URL
  final result = await scraper.extractFromUrl(
    url: 'https://example-store.com/product/123',
    schema: schema,
  );

  if (result.success) {
    print('Extracted data: ${result.data}');
    print('Scraping took: ${result.scrapingTime.inMilliseconds}ms');
  } else {
    print('Error: ${result.error}');
  }
}

Advanced Usage

Using Google Gemini

final scraper = AIWebScraper(
  aiProvider: AIProvider.gemini,
  apiKey: 'your-gemini-api-key',
);

Batch Processing

final urls = [
  'https://store1.com/product/1',
  'https://store2.com/product/2',
  'https://store3.com/product/3',
];

final results = await scraper.extractFromUrls(
  urls: urls,
  schema: {
    'name': 'string',
    'price': 'number',
    'availability': 'boolean',
  },
  concurrency: 3, // Process 3 URLs simultaneously
);

for (final result in results) {
  if (result.success) {
    print('${result.url}: ${result.data}');
  } else {
    print('Failed ${result.url}: ${result.error}');
  }
}

JavaScript-Heavy Websites

For websites that require JavaScript rendering:

final result = await scraper.extractFromUrl(
  url: 'https://spa-website.com',
  schema: schema,
  useJavaScript: true, // Forces JavaScript rendering
);

Custom Prompts

Enhance extraction accuracy with custom prompts tailored to your specific use case:

final result = await scraper.extractFromUrl(
  url: 'https://ecommerce-site.com/product',
  schema: {
    'productName': 'string',
    'price': 'number',
    'inStock': 'boolean',
    'reviews': 'array',
  },
  customInstructions: '''
  Focus on e-commerce product information:
  - Extract the main product title, not category names
  - Look for current price, ignore crossed-out old prices
  - Check availability status or stock information
  - Extract customer review summaries or ratings
  - Ignore shipping or return policy information
  ''',
);

You can customize prompts for different domains:

// For event websites
customInstructions: '''
Extract event details with focus on:
- Event title and description
- Date, time, and venue information
- Ticket prices and registration links
- Organizer or speaker information
''',

// For job listings
customInstructions: '''
Focus on job posting information:
- Job title and company name
- Salary range and benefits
- Required skills and experience
- Application deadline and process
''',

// For news articles
customInstructions: '''
Extract news article content:
- Headline and article summary
- Publication date and author
- Main content without ads or navigation
- Related tags or categories
''',

Error Handling

try {
  final result = await scraper.extractFromUrl(
    url: 'https://example.com',
    schema: {'title': 'string'},
  );

  if (result.success) {
    // Handle successful extraction
    print('Data: ${result.data}');
  } else {
    // Handle extraction failure
    print('Extraction failed: ${result.error}');
  }
} catch (e) {
  // Handle unexpected errors
  print('Unexpected error: $e');
}

Schema Types

Define your data extraction schema using these supported types:

  • string - Text content
  • number - Numeric values (int or double)
  • boolean - True/false values
  • array - Lists of items
  • object - Nested objects
  • date - Date/time values
  • url - Web URLs
  • email - Email addresses

Complex Schema Example

final schema = {
  'title': 'string',
  'price': 'number',
  'inStock': 'boolean',
  'images': 'array',
  'specifications': 'object',
  'publishDate': 'date',
  'contactEmail': 'email',
  'productUrl': 'url',
};

Examples

E-commerce Product Scraping

final result = await scraper.extractFromUrl(
  url: 'https://example-store.com/product/123',
  schema: {
    'name': 'string',
    'price': 'number',
    'description': 'string',
    'inStock': 'boolean',
    'rating': 'number',
    'images': 'array',
  },
);

if (result.success) {
  final product = result.data!;
  print('Product: ${product['name']}');
  print('Price: \$${product['price']}');
  print('Available: ${product['inStock']}');
}

News Article Extraction

final result = await scraper.extractFromUrl(
  url: 'https://news-site.com/article/123',
  schema: {
    'headline': 'string',
    'author': 'string',
    'publishDate': 'date',
    'content': 'string',
    'tags': 'array',
  },
  useJavaScript: true,
);

Real Estate Listings

final results = await scraper.extractFromUrls(
  urls: propertyUrls,
  schema: {
    'address': 'string',
    'price': 'number',
    'bedrooms': 'number',
    'bathrooms': 'number',
    'squareFeet': 'number',
    'description': 'string',
    'images': 'array',
  },
  concurrency: 2,
);

Configuration

Timeout Settings

final scraper = AIWebScraper(
  aiProvider: AIProvider.openai,
  apiKey: 'your-api-key',
  timeout: Duration(seconds: 60), // Custom timeout
);

AI Provider Comparison

| Feature     | OpenAI GPT    | Google Gemini   |
|-------------|---------------|-----------------|
| Speed       | Fast          | Fast            |
| Accuracy    | High          | High            |
| Cost        | Pay per token | Pay per request |
| Rate Limits | High          | Moderate        |

Error Handling

The package provides comprehensive error handling:

  • Network Errors: Timeout, connection issues
  • AI API Errors: Invalid keys, rate limits, service unavailable
  • Parsing Errors: Invalid HTML, malformed responses
  • Schema Errors: Invalid schema definitions
  • JavaScript Errors: Puppeteer failures, rendering issues
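Transient failures such as timeouts and rate limits can be worth retrying with backoff. The sketch below wraps `extractFromUrl` from the examples above; the `ExtractionResult` type name and the detection of retryable errors by message matching are assumptions for illustration, not part of the package API.

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

/// Retries an extraction on transient errors with exponential backoff.
/// `ExtractionResult` and message-based error classification are
/// illustrative assumptions, not guaranteed package behavior.
Future<ExtractionResult> extractWithRetry(
  AIWebScraper scraper, {
  required String url,
  required Map<String, String> schema,
  int maxAttempts = 3,
}) async {
  var delay = const Duration(seconds: 1);
  late ExtractionResult result;
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    result = await scraper.extractFromUrl(url: url, schema: schema);
    if (result.success) return result;

    // Only retry errors that look transient (network or rate-limit).
    final error = result.error?.toLowerCase() ?? '';
    final transient = error.contains('timeout') || error.contains('rate');
    if (!transient || attempt == maxAttempts) return result;

    await Future.delayed(delay);
    delay *= 2; // Exponential backoff: 1s, 2s, 4s, ...
  }
  return result;
}
```

Schema or parsing errors are returned immediately, since retrying them without changing the input will not help.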

Performance Tips

  1. Use appropriate concurrency: Start with 2-3 concurrent requests
  2. Batch similar requests: Group URLs from the same domain
  3. Choose the right AI provider: OpenAI for speed, Gemini for cost-effectiveness
  4. Use HTTP scraping first: Only use JavaScript rendering when necessary
  5. Implement caching: Cache results for frequently accessed URLs
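For tip 5, a minimal in-memory cache with a time-to-live can avoid repeated AI calls for the same URL. This is a sketch around the API shown in the examples above; the `ExtractionResult` type name is an assumption, and a production cache would also bound its size.

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

/// Caches successful extraction results per URL for a fixed TTL.
/// Illustrative only; `ExtractionResult` is an assumed type name.
class CachedScraper {
  CachedScraper(this._scraper, {this.ttl = const Duration(minutes: 10)});

  final AIWebScraper _scraper;
  final Duration ttl;
  final _cache = <String, ({ExtractionResult result, DateTime at})>{};

  Future<ExtractionResult> extract({
    required String url,
    required Map<String, String> schema,
  }) async {
    final hit = _cache[url];
    if (hit != null && DateTime.now().difference(hit.at) < ttl) {
      return hit.result; // Serve from cache while still fresh.
    }
    final result = await _scraper.extractFromUrl(url: url, schema: schema);
    if (result.success) {
      _cache[url] = (result: result, at: DateTime.now());
    }
    return result;
  }
}
```

Only successful results are cached, so a transient failure does not poison the cache for the TTL window.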

Requirements

  • Dart SDK: >=3.0.0 <4.0.0
  • Platform: Server-side Dart applications
  • APIs: OpenAI API key and/or Google AI API key

Getting API Keys

OpenAI API Key

  1. Visit OpenAI Platform
  2. Create an account or sign in
  3. Navigate to API Keys section
  4. Create a new API key

Google Gemini API Key

  1. Visit Google AI Studio
  2. Create a project or select existing one
  3. Generate an API key
  4. Enable the Generative AI API
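Rather than hard-coding API keys as in the snippets above, you can read them from an environment variable using Dart's standard library. The variable name `OPENAI_API_KEY` is a common convention, not something the package requires:

```dart
import 'dart:io';

import 'package:ai_webscraper/ai_webscraper.dart';

void main() {
  // Read the key from the environment instead of embedding it in source.
  final apiKey = Platform.environment['OPENAI_API_KEY'];
  if (apiKey == null || apiKey.isEmpty) {
    stderr.writeln('Set OPENAI_API_KEY before running.');
    exitCode = 64; // EX_USAGE
    return;
  }

  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: apiKey,
  );
  // ...use scraper as in the examples above.
}
```

Keeping keys out of source control also makes it easy to switch providers per environment (for example, `GEMINI_API_KEY` with `AIProvider.gemini`).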

Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Changelog

See CHANGELOG.md for a detailed list of changes and versions.


Made with ❤️ for the Dart community
