
ScrapeGen 🚀 is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.
- 🤖 AI-Powered Data Extraction: Utilizes Google's Gemini models for intelligent parsing
- ⚙️ Configurable Web Scraping: Supports depth control and flexible extraction rules
- 📊 Structured Data Modeling: Uses Pydantic for well-defined data structures
- 🛡️ Robust Error Handling: Implements retry mechanisms and detailed error reporting
- 🔧 Customizable Scraping Configurations: Adjust settings dynamically based on needs
- 🌐 Comprehensive URL Handling: Supports both relative and absolute URLs
- 📦 Modular Architecture: Ensures clear separation of concerns for maintainability
Install ScrapeGen via pip:

pip install scrapegen  # package name on PyPI may vary

ScrapeGen requires:

- Python 3.7+
- A Google API key (for Gemini models)
- The following Python packages:
  - requests
  - beautifulsoup4
  - langchain
  - langchain-google-genai
  - pydantic
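If the dependencies are not pulled in automatically, they can be installed directly (assuming the standard PyPI package names listed above):

pip install requests beautifulsoup4 langchain langchain-google-genai pydantic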
The quick-start example below scrapes a site and prints the structured results:

from scrapegen import ScrapeGen, CompaniesInfo
# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")
# Define target URL and custom prompt
url = "https://example.com"
custom_prompt = """
Analyze the website content and extract:
- Company name
- Core technologies
- Industry focus areas
- Key product features
"""
# Scrape with custom prompt and model
companies_data = scraper.scrape(
    url=url,
    prompt=custom_prompt,
    base_model=CompaniesInfo
)
# Display extracted data
for company in companies_data.companies:
    print(f"🏢 {company.company_name}")
    print(f"🔧 Technologies: {', '.join(company.core_technologies)}")
    print(f"🌐 Focus Areas: {', '.join(company.industry_focus)}")
Scraping behavior is controlled through a ScrapeConfig object:

from scrapegen import ScrapeConfig
config = ScrapeConfig(
    max_pages=20,    # Max pages to scrape per depth level
    max_subpages=2,  # Max subpages to scrape per page
    max_depth=1,     # Max depth to follow links
    timeout=30,      # Request timeout in seconds
    retries=3,       # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None     # Additional HTTP headers
)
scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config, verbose=False)
# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)
Define Pydantic models to structure extracted data:
from pydantic import BaseModel
from typing import Optional, List
class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None  # optional fields should default to None
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]
# Scrape using the custom model
data = scraper.scrape(url=url, base_model=CustomDataCollection)
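The returned value is a validated Pydantic instance, so the usual Pydantic APIs apply. A minimal sketch, assuming Pydantic v2:

# Iterate the validated items and serialize the whole collection to JSON.
for item in data.items:
    print(item.title, item.tags)
print(data.model_dump_json(indent=2))  # on Pydantic v1, use data.json(indent=2)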
ScrapeGen supports the following Gemini models:

- gemini-1.5-flash-8b
- gemini-1.5-pro
- gemini-2.0-flash-exp
- gemini-1.5-flash
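Any of these names can be passed as the model argument at construction time, for example to favor latency over depth of analysis:

# The flash variants are generally faster and cheaper; pro is more thorough.
fast_scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-flash")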
basic_prompt = """
Extract the following details from the content:
- Company name
- Founding year
- Headquarters location
- Main products/services
"""
tech_prompt = """
Identify and categorize technologies mentioned in the content:
1. AI/ML Technologies
2. Cloud Infrastructure
3. Data Analytics Tools
4. Cybersecurity Solutions
Include version numbers and implementation details where available.
"""
multi_level_prompt = """
Perform hierarchical extraction:
1. Company Overview:
- Name
- Mission statement
- Key executives
2. Technical Capabilities:
- Core technologies
- Development stack
- Infrastructure
3. Market Position:
- Competitors
- Market share
- Growth metrics
"""
competitor_prompt = """
Identify and compare key competitors:
- Competitor names
- Feature differentiators
- Pricing models
- Market positioning
Output as a comparison table with relative strengths.
"""
green_prompt = """
Extract environmental sustainability information:
1. Green initiatives
2. Carbon reduction targets
3. Eco-friendly technologies
4. Sustainability certifications
5. Renewable energy usage
Prioritize quantitative metrics and timelines.
"""
innovation_prompt = """
Analyze R&D activities and innovations:
- Recent patents filed
- Research partnerships
- New product launches (last 24 months)
- Technology breakthroughs
- Investment in R&D (% of revenue)
"""
Tips for writing effective prompts:

- Be Specific: Clearly define required fields and formats, e.g. "Format output as JSON with 'company_name', 'employees', 'revenue' keys"
- Add Context: e.g. "Analyze content from CEO interviews for strategic priorities"
- Define Output Structure: e.g. "Categorize findings under 'basic_info', 'tech_stack', 'growth_metrics'"
- Set Priorities: e.g. "Focus on technical specifications over marketing content"
ScrapeGen provides specific exception classes for detailed error handling:
- ❌ ScrapeGenError: Base exception class
- ⚙️ ConfigurationError: Errors related to scraper configuration
- 🕷️ ScrapingError: Issues encountered during web scraping
- 🔍 ExtractionError: Problems with AI-driven data extraction
Example usage:
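The snippet below assumes a MarketAnalysis model and a complex_prompt defined along the lines of the earlier examples, and that the exception classes are importable from the package root (an assumption; they may live in a submodule):

from typing import List, Optional
from pydantic import BaseModel
from scrapegen import ExtractionError, ScrapingError  # assumed import location

class MarketAnalysis(BaseModel):  # illustrative model, not shipped with ScrapeGen
    competitors: List[str]
    market_share: Optional[str] = None
    growth_metrics: List[str]

complex_prompt = multi_level_prompt  # reuse the hierarchical prompt from above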
try:
    data = scraper.scrape(
        url=url,
        prompt=complex_prompt,
        base_model=MarketAnalysis
    )
except ExtractionError as e:
    print(f"🔍 Extraction failed with custom prompt: {e}")
    print(f"🔧 Prompt used: {complex_prompt}")
except ScrapingError as e:
    print(f"🌐 Scraping error: {e}")
ScrapeGen follows a modular design for scalability and maintainability:
- 🕷️ WebsiteScraper: Handles core web scraping logic
- 🔍 InfoExtractorAi: Performs AI-driven content extraction
- 🤖 LlmManager: Manages interactions with language models
- 🔗 UrlParser: Parses and normalizes URLs
- 📥 ContentExtractor: Extracts structured data from HTML elements
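Taken together, these presumably compose into a pipeline (an inference from the component names, not documented behavior): UrlParser normalizes the seed URL, WebsiteScraper crawls pages within the configured limits, ContentExtractor pulls structured text from each page's HTML, LlmManager sends that text plus the prompt to the Gemini model, and InfoExtractorAi validates the response against the supplied Pydantic base_model.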
- ⏳ Use delays between requests (see the sketch after this list)
- 📜 Respect robots.txt guidelines
- ⚙️ Configure max_pages and max_depth responsibly
- 🔒 Wrap scraping operations in try-except blocks
- 📝 Implement proper logging for debugging
- 🔁 Handle network timeouts and retries effectively
- 🖥️ Monitor memory usage for large-scale operations
- 📄 Implement pagination for large datasets
- ⏱️ Adjust timeout settings based on expected response times
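A minimal sketch of polite crawling, using plain requests and the standard-library robots.txt parser (independent of ScrapeGen's internals):

import time
import urllib.robotparser

import requests

USER_AGENT = "Mozilla/5.0 (compatible; ScrapeGen/1.0)"

# Check robots.txt once per site before fetching any pages.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

for page_url in ["https://example.com/", "https://example.com/about"]:
    if not robots.can_fetch(USER_AGENT, page_url):
        continue  # skip pages the site disallows
    response = requests.get(page_url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(1.0)  # pause between requests to avoid overloading the server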
Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.