ai_webscraper 0.2.1

A comprehensive AI-powered web scraper for Dart with OpenAI and Google Gemini integration, featuring response caching and enhanced logging

AI WebScraper #

pub package License: MIT

A powerful AI-powered web scraper for Dart that combines traditional web scraping with AI-based content extraction. Extract structured data from websites using OpenAI GPT or Google Gemini.

Features #

  • 🤖 Multiple AI Providers: Support for OpenAI GPT and Google Gemini
  • 🌐 Smart Web Scraping: HTTP requests with HTML parsing and JavaScript rendering fallback
  • 📋 Schema-Based Extraction: Define JSON structures for consistent data extraction
  • ⚡ Batch Processing: Process multiple URLs concurrently with configurable limits
  • 🛡️ Type Safety: Full Dart type safety with comprehensive error handling
  • 🔄 Automatic Fallback: Falls back from HTTP to JavaScript scraping when needed

Quick Start #

Installation #

Add ai_webscraper to your pubspec.yaml:

dependencies:
  ai_webscraper: ^0.2.1

Then run:

dart pub get

Basic Usage #

import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Initialize the scraper with OpenAI
  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: 'your-openai-api-key',
  );

  // Define what data you want to extract
  final schema = {
    'title': 'string',
    'description': 'string',
    'price': 'number',
  };

  // Extract data from a single URL
  final result = await scraper.extractFromUrl(
    url: 'https://example-store.com/product/123',
    schema: schema,
  );

  if (result.success) {
    print('Extracted data: ${result.data}');
    print('Scraping took: ${result.scrapingTime.inMilliseconds}ms');
  } else {
    print('Error: ${result.error}');
  }
}

Advanced Usage #

Using Google Gemini #

final scraper = AIWebScraper(
  aiProvider: AIProvider.gemini,
  apiKey: 'your-gemini-api-key',
);

Batch Processing #

final urls = [
  'https://store1.com/product/1',
  'https://store2.com/product/2',
  'https://store3.com/product/3',
];

final results = await scraper.extractFromUrls(
  urls: urls,
  schema: {
    'name': 'string',
    'price': 'number',
    'availability': 'boolean',
  },
  concurrency: 3, // Process 3 URLs simultaneously
);

for (final result in results) {
  if (result.success) {
    print('${result.url}: ${result.data}');
  } else {
    print('Failed ${result.url}: ${result.error}');
  }
}

JavaScript-Heavy Websites #

For websites that require JavaScript rendering:

final result = await scraper.extractFromUrl(
  url: 'https://spa-website.com',
  schema: schema,
  useJavaScript: true, // Forces JavaScript rendering
);

Custom Prompts #

Enhance extraction accuracy with custom prompts tailored to your specific use case:

final result = await scraper.extractFromUrl(
  url: 'https://ecommerce-site.com/product',
  schema: {
    'productName': 'string',
    'price': 'number',
    'inStock': 'boolean',
    'reviews': 'array',
  },
  customInstructions: '''
  Focus on e-commerce product information:
  - Extract the main product title, not category names
  - Look for current price, ignore crossed-out old prices
  - Check availability status or stock information
  - Extract customer review summaries or ratings
  - Ignore shipping or return policy information
  ''',
);

You can customize prompts for different domains:

// For event websites
customInstructions: '''
Extract event details with focus on:
- Event title and description
- Date, time, and venue information
- Ticket prices and registration links
- Organizer or speaker information
''',

// For job listings
customInstructions: '''
Focus on job posting information:
- Job title and company name
- Salary range and benefits
- Required skills and experience
- Application deadline and process
''',

// For news articles
customInstructions: '''
Extract news article content:
- Headline and article summary
- Publication date and author
- Main content without ads or navigation
- Related tags or categories
''',

Error Handling #

try {
  final result = await scraper.extractFromUrl(
    url: 'https://example.com',
    schema: {'title': 'string'},
  );

  if (result.success) {
    // Handle successful extraction
    print('Data: ${result.data}');
  } else {
    // Handle extraction failure
    print('Extraction failed: ${result.error}');
  }
} catch (e) {
  // Handle unexpected errors
  print('Unexpected error: $e');
}

Schema Types #

Define your data extraction schema using these supported types:

  • string - Text content
  • number - Numeric values (int or double)
  • boolean - True/false values
  • array - Lists of items
  • object - Nested objects
  • date - Date/time values
  • url - Web URLs
  • email - Email addresses

Complex Schema Example #

final schema = {
  'title': 'string',
  'price': 'number',
  'inStock': 'boolean',
  'images': 'array',
  'specifications': 'object',
  'publishDate': 'date',
  'contactEmail': 'email',
  'productUrl': 'url',
};

Examples #

E-commerce Product Scraping #

final result = await scraper.extractFromUrl(
  url: 'https://example-store.com/product/123',
  schema: {
    'name': 'string',
    'price': 'number',
    'description': 'string',
    'inStock': 'boolean',
    'rating': 'number',
    'images': 'array',
  },
);

if (result.success) {
  final product = result.data!;
  print('Product: ${product['name']}');
  print('Price: \$${product['price']}');
  print('Available: ${product['inStock']}');
}

News Article Extraction #

final result = await scraper.extractFromUrl(
  url: 'https://news-site.com/article/123',
  schema: {
    'headline': 'string',
    'author': 'string',
    'publishDate': 'date',
    'content': 'string',
    'tags': 'array',
  },
  useJavaScript: true,
);

Real Estate Listings #

final results = await scraper.extractFromUrls(
  urls: propertyUrls,
  schema: {
    'address': 'string',
    'price': 'number',
    'bedrooms': 'number',
    'bathrooms': 'number',
    'squareFeet': 'number',
    'description': 'string',
    'images': 'array',
  },
  concurrency: 2,
);

Configuration #

Timeout Settings #

final scraper = AIWebScraper(
  aiProvider: AIProvider.openai,
  apiKey: 'your-api-key',
  timeout: Duration(seconds: 60), // Custom timeout
);

AI Provider Comparison #

| Feature     | OpenAI GPT    | Google Gemini   |
|-------------|---------------|-----------------|
| Speed       | Fast          | Fast            |
| Accuracy    | High          | High            |
| Cost        | Pay per token | Pay per request |
| Rate Limits | High          | Moderate        |
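If you want to switch providers without code changes, one option is to pick the provider from whichever API key is configured. This is a minimal sketch, not part of the package; the environment variable names `OPENAI_API_KEY` and `GEMINI_API_KEY` are assumptions:

```dart
/// Picks a provider name based on which API key is present,
/// preferring OpenAI when both are configured.
/// The variable names are example conventions, not package requirements.
String pickProvider(Map<String, String> env) {
  if ((env['OPENAI_API_KEY'] ?? '').isNotEmpty) return 'openai';
  if ((env['GEMINI_API_KEY'] ?? '').isNotEmpty) return 'gemini';
  throw StateError('Set OPENAI_API_KEY or GEMINI_API_KEY');
}

void main() {
  // In a real app you would pass Platform.environment instead of a literal.
  print(pickProvider({'GEMINI_API_KEY': 'abc'})); // prints "gemini"
}
```

The returned string can then be mapped onto the `AIProvider` enum when constructing `AIWebScraper`.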

Error Handling #

The package provides comprehensive error handling:

  • Network Errors: Timeout, connection issues
  • AI API Errors: Invalid keys, rate limits, service unavailable
  • Parsing Errors: Invalid HTML, malformed responses
  • Schema Errors: Invalid schema definitions
  • JavaScript Errors: Puppeteer failures, rendering issues
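Transient failures such as rate limits or timeouts are often worth retrying. The package does not ship a retry helper; the following is a generic sketch you can wrap around any call, including `extractFromUrl`:

```dart
/// Retries an async operation with exponential backoff.
/// Generic sketch: rethrows the last error once [maxAttempts] is exhausted.
Future<T> withRetry<T>(
  Future<T> Function() operation, {
  int maxAttempts = 3,
  Duration initialDelay = const Duration(seconds: 1),
}) async {
  var delay = initialDelay;
  for (var attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (e) {
      if (attempt >= maxAttempts) rethrow;
      await Future<void>.delayed(delay);
      delay *= 2; // double the wait before the next attempt
    }
  }
}
```

Usage: `final result = await withRetry(() => scraper.extractFromUrl(url: url, schema: schema));`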

Performance Tips #

  1. Use appropriate concurrency: Start with 2-3 concurrent requests
  2. Batch similar requests: Group URLs from the same domain
  3. Choose the right AI provider: OpenAI for speed, Gemini for cost-effectiveness
  4. Use HTTP scraping first: Only use JavaScript rendering when necessary
  5. Implement caching: Cache results for frequently accessed URLs
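For tip 5, a small in-memory TTL cache keyed by URL is often enough. This is an app-level sketch; the package's own response caching (mentioned in the description) may work differently:

```dart
/// A minimal in-memory TTL cache for scrape results, keyed by URL.
/// Entries older than [ttl] are treated as misses and evicted on read.
class TtlCache<V> {
  TtlCache(this.ttl);
  final Duration ttl;
  final _entries = <String, ({V value, DateTime storedAt})>{};

  V? operator [](String key) {
    final entry = _entries[key];
    if (entry == null) return null;
    if (DateTime.now().difference(entry.storedAt) > ttl) {
      _entries.remove(key); // expired
      return null;
    }
    return entry.value;
  }

  void operator []=(String key, V value) =>
      _entries[key] = (value: value, storedAt: DateTime.now());
}
```

Check the cache before calling `extractFromUrl`, and store `result.data` on success.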

Requirements #

  • Dart SDK: >=3.0.0 <4.0.0
  • Platform: Server-side Dart applications
  • APIs: OpenAI API key and/or Google AI API key

Getting API Keys #

OpenAI API Key #

  1. Visit OpenAI Platform
  2. Create an account or sign in
  3. Navigate to API Keys section
  4. Create a new API key

Google Gemini API Key #

  1. Visit Google AI Studio
  2. Create a project or select existing one
  3. Generate an API key
  4. Enable the Generative AI API
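Rather than hardcoding keys as in the snippets above, you can read them from the environment. A minimal helper using `dart:io`; the variable name passed in is an example convention, not something the package mandates:

```dart
import 'dart:io';

/// Reads a required API key from the environment, failing fast
/// with a clear error when it is missing.
String requireApiKey(String name) {
  final key = Platform.environment[name];
  if (key == null || key.isEmpty) {
    throw StateError('Missing environment variable: $name');
  }
  return key;
}
```

Usage: `apiKey: requireApiKey('OPENAI_API_KEY')` when constructing the scraper.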

Contributing #

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

License #

This project is licensed under the MIT License - see the LICENSE file for details.

Support #

Changelog #

See CHANGELOG.md for a detailed list of changes and versions.


Made with ❤️ for the Dart community

