# AI WebScraper

A powerful AI-powered web scraper for Dart that combines traditional web scraping with AI-based content extraction, featuring response caching and enhanced logging. Extract structured data from websites using OpenAI GPT or Google Gemini.
## Features
- 🤖 Multiple AI Providers: Support for OpenAI GPT and Google Gemini
- 🌐 Smart Web Scraping: HTTP requests with HTML parsing and JavaScript rendering fallback
- 📋 Schema-Based Extraction: Define JSON structures for consistent data extraction
- ⚡ Batch Processing: Process multiple URLs concurrently with configurable limits
- 🛡️ Type Safety: Full Dart type safety with comprehensive error handling
- 🔄 Automatic Fallback: Falls back from HTTP to JavaScript scraping when needed
## Quick Start

### Installation

Add `ai_webscraper` to your `pubspec.yaml`:

```yaml
dependencies:
  ai_webscraper: ^0.2.1
```

Then run:

```shell
dart pub get
```
### Basic Usage

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Initialize the scraper with OpenAI
  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: 'your-openai-api-key',
  );

  // Define what data you want to extract
  final schema = {
    'title': 'string',
    'description': 'string',
    'price': 'number',
  };

  // Extract data from a single URL
  final result = await scraper.extractFromUrl(
    url: 'https://example-store.com/product/123',
    schema: schema,
  );

  if (result.success) {
    print('Extracted data: ${result.data}');
    print('Scraping took: ${result.scrapingTime.inMilliseconds}ms');
  } else {
    print('Error: ${result.error}');
  }
}
```
## Advanced Usage

### Using Google Gemini

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.gemini,
  apiKey: 'your-gemini-api-key',
);
```
### Batch Processing

```dart
final urls = [
  'https://store1.com/product/1',
  'https://store2.com/product/2',
  'https://store3.com/product/3',
];

final results = await scraper.extractFromUrls(
  urls: urls,
  schema: {
    'name': 'string',
    'price': 'number',
    'availability': 'boolean',
  },
  concurrency: 3, // Process 3 URLs simultaneously
);

for (final result in results) {
  if (result.success) {
    print('${result.url}: ${result.data}');
  } else {
    print('Failed ${result.url}: ${result.error}');
  }
}
```
### JavaScript-Heavy Websites

For websites that require JavaScript rendering:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://spa-website.com',
  schema: schema,
  useJavaScript: true, // Forces JavaScript rendering
);
```
### Custom Prompts

Enhance extraction accuracy with custom prompts tailored to your specific use case:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://ecommerce-site.com/product',
  schema: {
    'productName': 'string',
    'price': 'number',
    'inStock': 'boolean',
    'reviews': 'array',
  },
  customInstructions: '''
Focus on e-commerce product information:
- Extract the main product title, not category names
- Look for current price, ignore crossed-out old prices
- Check availability status or stock information
- Extract customer review summaries or ratings
- Ignore shipping or return policy information
''',
);
```
You can customize prompts for different domains:
```dart
// For event websites
customInstructions: '''
Extract event details with focus on:
- Event title and description
- Date, time, and venue information
- Ticket prices and registration links
- Organizer or speaker information
''',

// For job listings
customInstructions: '''
Focus on job posting information:
- Job title and company name
- Salary range and benefits
- Required skills and experience
- Application deadline and process
''',

// For news articles
customInstructions: '''
Extract news article content:
- Headline and article summary
- Publication date and author
- Main content without ads or navigation
- Related tags or categories
''',
```
### Error Handling

```dart
try {
  final result = await scraper.extractFromUrl(
    url: 'https://example.com',
    schema: {'title': 'string'},
  );

  if (result.success) {
    // Handle successful extraction
    print('Data: ${result.data}');
  } else {
    // Handle extraction failure
    print('Extraction failed: ${result.error}');
  }
} catch (e) {
  // Handle unexpected errors
  print('Unexpected error: $e');
}
```
## Schema Types

Define your data extraction schema using these supported types:

- `string` - Text content
- `number` - Numeric values (int or double)
- `boolean` - True/false values
- `array` - Lists of items
- `object` - Nested objects
- `date` - Date/time values
- `url` - Web URLs
- `email` - Email addresses
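Since a schema is just a `Map<String, String>` whose values come from the list above, a quick pre-flight check can catch typos before any request is made. A minimal sketch in plain Dart (the `invalidSchemaEntries` helper is illustrative only, not part of the package):

```dart
// Hypothetical helper, not part of ai_webscraper: returns every schema
// entry whose value is not one of the supported type names.
const supportedTypes = {
  'string', 'number', 'boolean', 'array', 'object', 'date', 'url', 'email',
};

List<String> invalidSchemaEntries(Map<String, String> schema) {
  return [
    for (final entry in schema.entries)
      if (!supportedTypes.contains(entry.value)) '${entry.key}: ${entry.value}',
  ];
}

void main() {
  final schema = {'title': 'string', 'price': 'float'}; // 'float' is invalid
  print(invalidSchemaEntries(schema)); // [price: float]
}
```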
### Complex Schema Example

```dart
final schema = {
  'title': 'string',
  'price': 'number',
  'inStock': 'boolean',
  'images': 'array',
  'specifications': 'object',
  'publishDate': 'date',
  'contactEmail': 'email',
  'productUrl': 'url',
};
```
## Examples

### E-commerce Product Scraping

```dart
final result = await scraper.extractFromUrl(
  url: 'https://example-store.com/product/123',
  schema: {
    'name': 'string',
    'price': 'number',
    'description': 'string',
    'inStock': 'boolean',
    'rating': 'number',
    'images': 'array',
  },
);

if (result.success) {
  final product = result.data!;
  print('Product: ${product['name']}');
  print('Price: \$${product['price']}');
  print('Available: ${product['inStock']}');
}
```
### News Article Extraction

```dart
final result = await scraper.extractFromUrl(
  url: 'https://news-site.com/article/123',
  schema: {
    'headline': 'string',
    'author': 'string',
    'publishDate': 'date',
    'content': 'string',
    'tags': 'array',
  },
  useJavaScript: true,
);
```
### Real Estate Listings

```dart
final results = await scraper.extractFromUrls(
  urls: propertyUrls,
  schema: {
    'address': 'string',
    'price': 'number',
    'bedrooms': 'number',
    'bathrooms': 'number',
    'squareFeet': 'number',
    'description': 'string',
    'images': 'array',
  },
  concurrency: 2,
);
```
## Configuration

### Timeout Settings

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.openai,
  apiKey: 'your-api-key',
  timeout: Duration(seconds: 60), // Custom timeout
);
```
## AI Provider Comparison

| Feature | OpenAI GPT | Google Gemini |
|---------|------------|---------------|
| Speed | Fast | Fast |
| Accuracy | High | High |
| Cost | Pay per token | Pay per request |
| Rate Limits | High | Moderate |
## Error Handling
The package provides comprehensive error handling:
- Network Errors: Timeout, connection issues
- AI API Errors: Invalid keys, rate limits, service unavailable
- Parsing Errors: Invalid HTML, malformed responses
- Schema Errors: Invalid schema definitions
- JavaScript Errors: Puppeteer failures, rendering issues
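Network and rate-limit errors are typically transient and worth retrying. A minimal, package-independent sketch of a retry loop with exponential backoff (the `retry` helper below is illustrative, not part of `ai_webscraper`):

```dart
import 'dart:async';

/// Hypothetical helper: retries [operation] up to [maxAttempts] times,
/// doubling the delay after each failed attempt.
Future<T> retry<T>(
  Future<T> Function() operation, {
  int maxAttempts = 3,
  Duration initialDelay = const Duration(milliseconds: 200),
}) async {
  var delay = initialDelay;
  for (var attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (e) {
      if (attempt >= maxAttempts) rethrow; // Give up after the last attempt
      await Future<void>.delayed(delay);
      delay *= 2;
    }
  }
}

void main() async {
  var calls = 0;
  // Fake flaky operation: fails twice, then succeeds.
  final value = await retry(() async {
    calls++;
    if (calls < 3) throw TimeoutException('simulated timeout');
    return 'ok';
  });
  print('$value after $calls attempts'); // ok after 3 attempts
}
```

In real use you would wrap a `scraper.extractFromUrl(...)` call in the same way, retrying only when the failure looks transient rather than, say, an invalid API key.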
## Performance Tips
- Use appropriate concurrency: Start with 2-3 concurrent requests
- Batch similar requests: Group URLs from the same domain
- Choose the right AI provider: OpenAI for speed, Gemini for cost-effectiveness
- Use HTTP scraping first: Only use JavaScript rendering when necessary
- Implement caching: Cache results for frequently accessed URLs
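The caching tip can be as simple as an in-memory map with a time-to-live, so that repeated lookups of the same URL skip the scrape entirely. A minimal sketch in plain Dart (the `TtlCache` class is illustrative, not part of the package):

```dart
/// Hypothetical in-memory TTL cache: entries expire [ttl] after insertion.
class TtlCache<K, V> {
  TtlCache(this.ttl);
  final Duration ttl;
  final _entries = <K, ({V value, DateTime expires})>{};

  /// Returns the cached value for [key] if still fresh; otherwise calls
  /// [fetch], stores the result, and returns it.
  Future<V> getOrFetch(K key, Future<V> Function() fetch) async {
    final hit = _entries[key];
    if (hit != null && DateTime.now().isBefore(hit.expires)) {
      return hit.value;
    }
    final value = await fetch();
    _entries[key] = (value: value, expires: DateTime.now().add(ttl));
    return value;
  }
}

void main() async {
  final cache = TtlCache<String, String>(const Duration(minutes: 10));
  var fetches = 0;
  Future<String> fakeScrape() async {
    fetches++;
    return 'data';
  }

  await cache.getOrFetch('https://example.com', fakeScrape);
  await cache.getOrFetch('https://example.com', fakeScrape);
  print(fetches); // 1 — the second call was served from the cache
}
```

In practice the `fetch` callback would be a `scraper.extractFromUrl(...)` call, and the key could include the schema as well as the URL.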
## Requirements

- Dart SDK: `>=3.0.0 <4.0.0`
- Platform: Server-side Dart applications
- APIs: OpenAI API key and/or Google AI API key
## Getting API Keys

### OpenAI API Key

1. Visit OpenAI Platform
2. Create an account or sign in
3. Navigate to the API Keys section
4. Create a new API key
### Google Gemini API Key

1. Visit Google AI Studio
2. Create a project or select an existing one
3. Generate an API key
4. Enable the Generative AI API
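Wherever the key comes from, avoid hardcoding it in source. Reading it from an environment variable is a common approach; a sketch in plain Dart (the variable name `OPENAI_API_KEY` is a convention for this example, not something the package requires):

```dart
import 'dart:io';

void main() {
  // Read the key from the environment instead of embedding it in code.
  final apiKey = Platform.environment['OPENAI_API_KEY'];
  if (apiKey == null || apiKey.isEmpty) {
    stderr.writeln('Set OPENAI_API_KEY before running.');
    exit(1);
  }

  // The key can then be passed to the scraper, e.g.:
  // final scraper = AIWebScraper(
  //   aiProvider: AIProvider.openai,
  //   apiKey: apiKey,
  // );
  print('API key loaded (${apiKey.length} characters).');
}
```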
## Contributing
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
- 📧 Email: support@aiwebscraper.dev
- 🐛 Issues: GitHub Issues
- 📖 Documentation: API Documentation
## Changelog
See CHANGELOG.md for a detailed list of changes and versions.
Made with ❤️ for the Dart community