# AI WebScraper

A powerful AI-powered web scraper for Dart that combines traditional web scraping with AI-based content extraction, featuring response caching and enhanced logging. Extract structured data from websites using OpenAI GPT or Google Gemini.
## Features
- 🤖 Multiple AI Providers: Support for OpenAI GPT and Google Gemini
- 🌐 Smart Web Scraping: HTTP requests with HTML parsing and JavaScript rendering fallback
- 📋 Schema-Based Extraction: Define JSON structures for consistent data extraction
- ⚡ Batch Processing: Process multiple URLs concurrently with configurable limits
- 🛡️ Type Safety: Full Dart type safety with comprehensive error handling
- 🔄 Automatic Fallback: Falls back from HTTP to JavaScript scraping when needed
## Quick Start

### Installation

Add `ai_webscraper` to your `pubspec.yaml`:

```yaml
dependencies:
  ai_webscraper: ^0.2.1
```

Then run:

```shell
dart pub get
```
### Basic Usage

```dart
import 'package:ai_webscraper/ai_webscraper.dart';

void main() async {
  // Initialize the scraper with OpenAI
  final scraper = AIWebScraper(
    aiProvider: AIProvider.openai,
    apiKey: 'your-openai-api-key',
  );

  // Define what data you want to extract
  final schema = {
    'title': 'string',
    'description': 'string',
    'price': 'number',
  };

  // Extract data from a single URL
  final result = await scraper.extractFromUrl(
    url: 'https://example-store.com/product/123',
    schema: schema,
  );

  if (result.success) {
    print('Extracted data: ${result.data}');
    print('Scraping took: ${result.scrapingTime.inMilliseconds}ms');
  } else {
    print('Error: ${result.error}');
  }
}
```
## Advanced Usage

### Using Google Gemini

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.gemini,
  apiKey: 'your-gemini-api-key',
);
```
### Batch Processing

```dart
final urls = [
  'https://store1.com/product/1',
  'https://store2.com/product/2',
  'https://store3.com/product/3',
];

final results = await scraper.extractFromUrls(
  urls: urls,
  schema: {
    'name': 'string',
    'price': 'number',
    'availability': 'boolean',
  },
  concurrency: 3, // Process 3 URLs simultaneously
);

for (final result in results) {
  if (result.success) {
    print('${result.url}: ${result.data}');
  } else {
    print('Failed ${result.url}: ${result.error}');
  }
}
```
### JavaScript-Heavy Websites

For websites that require JavaScript rendering:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://spa-website.com',
  schema: schema,
  useJavaScript: true, // Forces JavaScript rendering
);
```
### Custom Prompts

Enhance extraction accuracy with custom prompts tailored to your specific use case:

```dart
final result = await scraper.extractFromUrl(
  url: 'https://ecommerce-site.com/product',
  schema: {
    'productName': 'string',
    'price': 'number',
    'inStock': 'boolean',
    'reviews': 'array',
  },
  customInstructions: '''
Focus on e-commerce product information:
- Extract the main product title, not category names
- Look for current price, ignore crossed-out old prices
- Check availability status or stock information
- Extract customer review summaries or ratings
- Ignore shipping or return policy information
''',
);
```
You can customize prompts for different domains:
```dart
// For event websites
customInstructions: '''
Extract event details with focus on:
- Event title and description
- Date, time, and venue information
- Ticket prices and registration links
- Organizer or speaker information
''',

// For job listings
customInstructions: '''
Focus on job posting information:
- Job title and company name
- Salary range and benefits
- Required skills and experience
- Application deadline and process
''',

// For news articles
customInstructions: '''
Extract news article content:
- Headline and article summary
- Publication date and author
- Main content without ads or navigation
- Related tags or categories
''',
```
### Error Handling

```dart
try {
  final result = await scraper.extractFromUrl(
    url: 'https://example.com',
    schema: {'title': 'string'},
  );

  if (result.success) {
    // Handle successful extraction
    print('Data: ${result.data}');
  } else {
    // Handle extraction failure
    print('Extraction failed: ${result.error}');
  }
} catch (e) {
  // Handle unexpected errors
  print('Unexpected error: $e');
}
```
## Schema Types

Define your data extraction schema using these supported types:

- `string` - Text content
- `number` - Numeric values (int or double)
- `boolean` - True/false values
- `array` - Lists of items
- `object` - Nested objects
- `date` - Date/time values
- `url` - Web URLs
- `email` - Email addresses
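Since a schema is just a `Map<String, String>` whose values come from the list above, a quick pre-flight check can catch typos before any request is made. A minimal sketch in plain Dart (the `invalidSchemaEntries` helper is illustrative only, not part of the package):

```dart
// Hypothetical helper, not part of ai_webscraper: returns every schema
// entry whose value is not one of the supported type names.
const supportedTypes = {
  'string', 'number', 'boolean', 'array', 'object', 'date', 'url', 'email',
};

List<String> invalidSchemaEntries(Map<String, String> schema) {
  return [
    for (final entry in schema.entries)
      if (!supportedTypes.contains(entry.value)) '${entry.key}: ${entry.value}',
  ];
}

void main() {
  final schema = {'title': 'string', 'price': 'float'}; // 'float' is invalid
  print(invalidSchemaEntries(schema)); // [price: float]
}
```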
### Complex Schema Example

```dart
final schema = {
  'title': 'string',
  'price': 'number',
  'inStock': 'boolean',
  'images': 'array',
  'specifications': 'object',
  'publishDate': 'date',
  'contactEmail': 'email',
  'productUrl': 'url',
};
```
## Examples

### E-commerce Product Scraping

```dart
final result = await scraper.extractFromUrl(
  url: 'https://example-store.com/product/123',
  schema: {
    'name': 'string',
    'price': 'number',
    'description': 'string',
    'inStock': 'boolean',
    'rating': 'number',
    'images': 'array',
  },
);

if (result.success) {
  final product = result.data!;
  print('Product: ${product['name']}');
  print('Price: \$${product['price']}');
  print('Available: ${product['inStock']}');
}
```
### News Article Extraction

```dart
final result = await scraper.extractFromUrl(
  url: 'https://news-site.com/article/123',
  schema: {
    'headline': 'string',
    'author': 'string',
    'publishDate': 'date',
    'content': 'string',
    'tags': 'array',
  },
  useJavaScript: true,
);
```
### Real Estate Listings

```dart
final results = await scraper.extractFromUrls(
  urls: propertyUrls,
  schema: {
    'address': 'string',
    'price': 'number',
    'bedrooms': 'number',
    'bathrooms': 'number',
    'squareFeet': 'number',
    'description': 'string',
    'images': 'array',
  },
  concurrency: 2,
);
```
## Configuration

### Timeout Settings

```dart
final scraper = AIWebScraper(
  aiProvider: AIProvider.openai,
  apiKey: 'your-api-key',
  timeout: Duration(seconds: 60), // Custom timeout
);
```
## AI Provider Comparison

| Feature | OpenAI GPT | Google Gemini |
|---------|------------|---------------|
| Speed | Fast | Fast |
| Accuracy | High | High |
| Cost | Pay per token | Pay per request |
| Rate Limits | High | Moderate |
## Error Handling
The package provides comprehensive error handling:
- Network Errors: Timeout, connection issues
- AI API Errors: Invalid keys, rate limits, service unavailable
- Parsing Errors: Invalid HTML, malformed responses
- Schema Errors: Invalid schema definitions
- JavaScript Errors: Puppeteer failures, rendering issues
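Network and rate-limit errors are typically transient and worth retrying. A minimal, package-independent sketch of a retry loop with exponential backoff (the `retry` helper below is illustrative, not part of `ai_webscraper`):

```dart
import 'dart:async';

/// Hypothetical helper: retries [operation] up to [maxAttempts] times,
/// doubling the delay after each failed attempt.
Future<T> retry<T>(
  Future<T> Function() operation, {
  int maxAttempts = 3,
  Duration initialDelay = const Duration(milliseconds: 200),
}) async {
  var delay = initialDelay;
  for (var attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (e) {
      if (attempt >= maxAttempts) rethrow; // Give up after the last attempt
      await Future<void>.delayed(delay);
      delay *= 2;
    }
  }
}

void main() async {
  var calls = 0;
  // Fake flaky operation: fails twice, then succeeds.
  final value = await retry(() async {
    calls++;
    if (calls < 3) throw TimeoutException('simulated timeout');
    return 'ok';
  });
  print('$value after $calls attempts'); // ok after 3 attempts
}
```

In real use you would wrap a `scraper.extractFromUrl(...)` call in the same way, retrying only when the failure looks transient rather than, say, an invalid API key.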
## Performance Tips
- Use appropriate concurrency: Start with 2-3 concurrent requests
- Batch similar requests: Group URLs from the same domain
- Choose the right AI provider: OpenAI for speed, Gemini for cost-effectiveness
- Use HTTP scraping first: Only use JavaScript rendering when necessary
- Implement caching: Cache results for frequently accessed URLs
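The caching tip can be as simple as an in-memory map with a time-to-live, so that repeated lookups of the same URL skip the scrape entirely. A minimal sketch in plain Dart (the `TtlCache` class is illustrative, not part of the package):

```dart
/// Hypothetical in-memory TTL cache: entries expire [ttl] after insertion.
class TtlCache<K, V> {
  TtlCache(this.ttl);
  final Duration ttl;
  final _entries = <K, ({V value, DateTime expires})>{};

  /// Returns the cached value for [key] if still fresh; otherwise calls
  /// [fetch], stores the result, and returns it.
  Future<V> getOrFetch(K key, Future<V> Function() fetch) async {
    final hit = _entries[key];
    if (hit != null && DateTime.now().isBefore(hit.expires)) {
      return hit.value;
    }
    final value = await fetch();
    _entries[key] = (value: value, expires: DateTime.now().add(ttl));
    return value;
  }
}

void main() async {
  final cache = TtlCache<String, String>(const Duration(minutes: 10));
  var fetches = 0;
  Future<String> fakeScrape() async {
    fetches++;
    return 'data';
  }

  await cache.getOrFetch('https://example.com', fakeScrape);
  await cache.getOrFetch('https://example.com', fakeScrape);
  print(fetches); // 1 — the second call was served from the cache
}
```

In practice the `fetch` callback would be a `scraper.extractFromUrl(...)` call, and the key could include the schema as well as the URL.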
## Requirements

- Dart SDK: `>=3.0.0 <4.0.0`
- Platform: Server-side Dart applications
- APIs: OpenAI API key and/or Google AI API key
## Getting API Keys

### OpenAI API Key

1. Visit OpenAI Platform
2. Create an account or sign in
3. Navigate to the API Keys section
4. Create a new API key
### Google Gemini API Key

1. Visit Google AI Studio
2. Create a project or select an existing one
3. Generate an API key
4. Enable the Generative AI API
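Wherever the key comes from, avoid hardcoding it in source. Reading it from an environment variable is a common approach; a sketch in plain Dart (the variable name `OPENAI_API_KEY` is a convention for this example, not something the package requires):

```dart
import 'dart:io';

void main() {
  // Read the key from the environment instead of embedding it in code.
  final apiKey = Platform.environment['OPENAI_API_KEY'];
  if (apiKey == null || apiKey.isEmpty) {
    stderr.writeln('Set OPENAI_API_KEY before running.');
    exit(1);
  }

  // The key can then be passed to the scraper, e.g.:
  // final scraper = AIWebScraper(
  //   aiProvider: AIProvider.openai,
  //   apiKey: apiKey,
  // );
  print('API key loaded (${apiKey.length} characters).');
}
```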
## Contributing
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
- 📧 Email: support@aiwebscraper.dev
- 🐛 Issues: GitHub Issues
- 📖 Documentation: API Documentation
## Changelog
See CHANGELOG.md for a detailed list of changes and versions.
Made with ❤️ for the Dart community