# AI WebScraper

An AI-powered web scraper for Dart that combines traditional web scraping with AI-based content extraction. Extract structured data from websites using OpenAI GPT or Google Gemini.
## Features

- 🤖 **Multiple AI Providers**: Support for OpenAI GPT and Google Gemini
- 🌐 **Smart Web Scraping**: HTTP requests with HTML parsing and JavaScript rendering fallback
- 📋 **Schema-Based Extraction**: Define JSON structures for consistent data extraction
- ⚡ **Batch Processing**: Process multiple URLs concurrently with configurable limits
- 🛡️ **Type Safety**: Full Dart type safety with comprehensive error handling
- 🔄 **Automatic Fallback**: Falls back from HTTP to JavaScript scraping when needed
## Quick Start

### Installation

Add `ai_webscraper` to your `pubspec.yaml`:

```yaml
dependencies:
  ai_webscraper: ^0.1.0
```

Then run:

```bash
dart pub get
```
### Basic Usage

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Initialize the scraper with OpenAI
  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: 'your-openai-api-key',
  );

  // Define what data you want to extract
  final schema = {
    'title': 'string',
    'description': 'string',
    'price': 'number',
  };

  // Extract data from a single URL
  final result = await scraper.extractFromUrl(
    url: 'https://example-store.com/product/123',
    schema: schema,
  );

  if (result.success) {
    print('Extracted data: ${result.data}');
    print('Scraping took: ${result.scrapingTime.inMilliseconds}ms');
  } else {
    print('Error: ${result.error}');
  }
}
```
## Advanced Usage

### Using Google Gemini

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.gemini,
  apiKey: 'your-gemini-api-key',
);
```
### Batch Processing

```dart
final urls = [
  'https://store1.com/product/1',
  'https://store2.com/product/2',
  'https://store3.com/product/3',
];

final results = await scraper.extractFromUrls(
  urls: urls,
  schema: {
    'name': 'string',
    'price': 'number',
    'availability': 'boolean',
  },
  concurrency: 3, // Process 3 URLs simultaneously
);

for (final result in results) {
  if (result.success) {
    print('${result.url}: ${result.data}');
  } else {
    print('Failed ${result.url}: ${result.error}');
  }
}
```
### JavaScript-Heavy Websites

For websites that require JavaScript rendering:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://spa-website.com',
  schema: schema,
  useJavaScript: true, // Forces JavaScript rendering
);
```
### Custom Prompts

Enhance extraction accuracy with custom prompts tailored to your specific use case:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://ecommerce-site.com/product',
  schema: {
    'productName': 'string',
    'price': 'number',
    'inStock': 'boolean',
    'reviews': 'array',
  },
  customInstructions: '''
Focus on e-commerce product information:
- Extract the main product title, not category names
- Look for current price, ignore crossed-out old prices
- Check availability status or stock information
- Extract customer review summaries or ratings
- Ignore shipping or return policy information
''',
);
```
You can customize prompts for different domains:

```dart
// For event websites
customInstructions: '''
Extract event details with focus on:
- Event title and description
- Date, time, and venue information
- Ticket prices and registration links
- Organizer or speaker information
''',

// For job listings
customInstructions: '''
Focus on job posting information:
- Job title and company name
- Salary range and benefits
- Required skills and experience
- Application deadline and process
''',

// For news articles
customInstructions: '''
Extract news article content:
- Headline and article summary
- Publication date and author
- Main content without ads or navigation
- Related tags or categories
''',
```
### Error Handling

```dart
try {
  final result = await scraper.extractFromUrl(
    url: 'https://example.com',
    schema: {'title': 'string'},
  );

  if (result.success) {
    // Handle successful extraction
    print('Data: ${result.data}');
  } else {
    // Handle extraction failure
    print('Extraction failed: ${result.error}');
  }
} catch (e) {
  // Handle unexpected errors
  print('Unexpected error: $e');
}
```
## Schema Types

Define your data extraction schema using these supported types:

- `string` - Text content
- `number` - Numeric values (int or double)
- `boolean` - True/false values
- `array` - Lists of items
- `object` - Nested objects
- `date` - Date/time values
- `url` - Web URLs
- `email` - Email addresses
### Complex Schema Example

```dart
final schema = {
  'title': 'string',
  'price': 'number',
  'inStock': 'boolean',
  'images': 'array',
  'specifications': 'object',
  'publishDate': 'date',
  'contactEmail': 'email',
  'productUrl': 'url',
};
```
## Examples

### E-commerce Product Scraping

```dart
final result = await scraper.extractFromUrl(
  url: 'https://example-store.com/product/123',
  schema: {
    'name': 'string',
    'price': 'number',
    'description': 'string',
    'inStock': 'boolean',
    'rating': 'number',
    'images': 'array',
  },
);

if (result.success) {
  final product = result.data!;
  print('Product: ${product['name']}');
  print('Price: \$${product['price']}');
  print('Available: ${product['inStock']}');
}
```
### News Article Extraction

```dart
final result = await scraper.extractFromUrl(
  url: 'https://news-site.com/article/123',
  schema: {
    'headline': 'string',
    'author': 'string',
    'publishDate': 'date',
    'content': 'string',
    'tags': 'array',
  },
  useJavaScript: true,
);
```
### Real Estate Listings

```dart
final results = await scraper.extractFromUrls(
  urls: propertyUrls,
  schema: {
    'address': 'string',
    'price': 'number',
    'bedrooms': 'number',
    'bathrooms': 'number',
    'squareFeet': 'number',
    'description': 'string',
    'images': 'array',
  },
  concurrency: 2,
);
```
## Configuration

### Timeout Settings

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.openai,
  apiKey: 'your-api-key',
  timeout: Duration(seconds: 60), // Custom timeout
);
```
### AI Provider Comparison

| Feature     | OpenAI GPT    | Google Gemini   |
|-------------|---------------|-----------------|
| Speed       | Fast          | Fast            |
| Accuracy    | High          | High            |
| Cost        | Pay per token | Pay per request |
| Rate Limits | High          | Moderate        |
## Error Handling

The package provides comprehensive error handling:

- **Network Errors**: Timeout, connection issues
- **AI API Errors**: Invalid keys, rate limits, service unavailable
- **Parsing Errors**: Invalid HTML, malformed responses
- **Schema Errors**: Invalid schema definitions
- **JavaScript Errors**: Puppeteer failures, rendering issues
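Transient failures such as timeouts or rate limits are often worth retrying. The sketch below shows one way to wrap the API from the examples above in an exponential-backoff loop; the `ScrapingResult` type name and the `extractWithRetry` helper are assumptions for illustration, not part of the documented API:

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

/// Retries a scrape up to [maxAttempts] times with exponential backoff.
/// Hypothetical helper; assumes the result API shown in the examples above.
Future<ScrapingResult> extractWithRetry(
  AIWebScraper scraper, {
  required String url,
  required Map<String, String> schema,
  int maxAttempts = 3,
}) async {
  var delay = const Duration(seconds: 1);
  late ScrapingResult result;
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    result = await scraper.extractFromUrl(url: url, schema: schema);
    if (result.success) return result;
    if (attempt < maxAttempts) {
      await Future.delayed(delay);
      delay *= 2; // back off before the next attempt
    }
  }
  return result; // last failed result after exhausting all attempts
}
```

For permanent failures (invalid API key, bad schema) retrying will not help, so in practice you may want to inspect `result.error` and bail out early.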
## Performance Tips

- **Use appropriate concurrency**: Start with 2-3 concurrent requests
- **Batch similar requests**: Group URLs from the same domain
- **Choose the right AI provider**: OpenAI for speed, Gemini for cost-effectiveness
- **Use HTTP scraping first**: Only use JavaScript rendering when necessary
- **Implement caching**: Cache results for frequently accessed URLs
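The caching tip can be sketched with a simple in-memory map keyed by URL. The `ScrapingResult` type name and the `CachedScraper` wrapper below are illustrative assumptions based on the examples above, not part of the package:

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

/// Naive in-memory cache keyed by URL. A production version would add
/// entry expiry and a size limit; this is only an illustrative sketch.
class CachedScraper {
  CachedScraper(this._scraper);

  final AIWebScraper _scraper;
  final _cache = <String, ScrapingResult>{};

  Future<ScrapingResult> extract(
    String url,
    Map<String, String> schema,
  ) async {
    final cached = _cache[url];
    if (cached != null) return cached; // serve repeat requests from memory

    final result = await _scraper.extractFromUrl(url: url, schema: schema);
    if (result.success) {
      _cache[url] = result; // only cache successful extractions
    }
    return result;
  }
}
```

Caching avoids both the network round-trip and the AI API call, which is usually the slowest and most expensive part of each extraction.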
## Requirements

- **Dart SDK**: `>=3.0.0 <4.0.0`
- **Platform**: Server-side Dart applications
- **APIs**: OpenAI API key and/or Google AI API key
## Getting API Keys

### OpenAI API Key

1. Visit the OpenAI Platform
2. Create an account or sign in
3. Navigate to the API Keys section
4. Create a new API key

### Google Gemini API Key

1. Visit Google AI Studio
2. Create a project or select an existing one
3. Generate an API key
4. Enable the Generative AI API
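Rather than hard-coding keys as in the snippets above, you can read them from environment variables with `dart:io` (available in the server-side Dart environments this package targets); the `OPENAI_API_KEY` variable name here is just a common convention:

```dart
import 'dart:io';

import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Read the key from the environment instead of committing it to source.
  final apiKey = Platform.environment['OPENAI_API_KEY'];
  if (apiKey == null || apiKey.isEmpty) {
    stderr.writeln('Set OPENAI_API_KEY before running.');
    exit(1);
  }

  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: apiKey,
  );
  // ... use scraper as shown in the examples above
}
```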
## Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Support

- 📧 Email: support@aiwebscraper.dev
- 🐛 Issues: GitHub Issues
- 📖 Documentation: API Documentation
## Changelog

See CHANGELOG.md for a detailed list of changes and versions.
*Made with ❤️ for the Dart community*