ScrapeGen


ScrapeGen πŸš€ is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.

✨ Features

  • πŸ€– AI-Powered Data Extraction: Utilizes Google's Gemini models for intelligent parsing
  • βš™οΈ Configurable Web Scraping: Supports depth control and flexible extraction rules
  • πŸ“Š Structured Data Modeling: Uses Pydantic for well-defined data structures
  • πŸ›‘οΈ Robust Error Handling: Implements retry mechanisms and detailed error reporting
  • πŸ”§ Customizable Scraping Configurations: Adjust settings dynamically based on needs
  • 🌐 Comprehensive URL Handling: Supports both relative and absolute URLs
  • πŸ“¦ Modular Architecture: Ensures clear separation of concerns for maintainability

πŸ“₯ Installation

pip install scrapegen

πŸ“Œ Requirements

  • Python 3.7+
  • Google API Key (for Gemini models)
  • Required Python packages:
    • requests
    • beautifulsoup4
    • langchain
    • langchain-google-genai
    • pydantic

πŸš€ Quick Start with Custom Prompts

from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define target URL and custom prompt
url = "https://example.com"
custom_prompt = """
Analyze the website content and extract:
- Company name
- Core technologies
- Industry focus areas
- Key product features
"""

# Scrape with custom prompt and model
companies_data = scraper.scrape(
    url=url,
    prompt=custom_prompt,
    base_model=CompaniesInfo
)

# Display extracted data
for company in companies_data.companies:
    print(f"🏒 {company.company_name}")
    print(f"πŸ”§ Technologies: {', '.join(company.core_technologies)}")
    print(f"πŸ“ˆ Focus Areas: {', '.join(company.industry_focus)}")

βš™οΈ Configuration

πŸ”Ή ScrapeConfig Options

from scrapegen import ScrapeConfig

config = ScrapeConfig(
    max_pages=20,      # Max pages to scrape per depth level
    max_subpages=2,    # Max subpages to scrape per page
    max_depth=1,       # Max depth to follow links
    timeout=30,        # Request timeout in seconds
    retries=3,         # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None       # Additional HTTP headers
)

πŸ”„ Updating Configuration

scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config, verbose=False)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)

πŸ“Œ Custom Data Models

Define Pydantic models to structure extracted data:

from pydantic import BaseModel
from typing import Optional, List

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None  # default needed so the field is truly optional
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, base_model=CustomDataCollection)

πŸ€– Supported Gemini Models

  • gemini-1.5-flash-8b
  • gemini-1.5-pro
  • gemini-2.0-flash-exp
  • gemini-1.5-flash

πŸ†• Custom Prompt Engineering Guide

1️⃣ Basic Prompt Structure

basic_prompt = """
Extract the following details from the content:
- Company name
- Founding year
- Headquarters location
- Main products/services
"""

2️⃣ Tech-Focused Extraction

tech_prompt = """
Identify and categorize technologies mentioned in the content:
1. AI/ML Technologies
2. Cloud Infrastructure
3. Data Analytics Tools
4. Cybersecurity Solutions
Include version numbers and implementation details where available.
"""

3️⃣ Multi-Level Extraction

multi_level_prompt = """
Perform hierarchical extraction:
1. Company Overview:
   - Name
   - Mission statement
   - Key executives
2. Technical Capabilities:
   - Core technologies
   - Development stack
   - Infrastructure
3. Market Position:
   - Competitors
   - Market share
   - Growth metrics
"""

πŸ“Œ Specialized Prompt Examples

πŸ” Competitive Analysis Prompt

competitor_prompt = """
Identify and compare key competitors:
- Competitor names
- Feature differentiators
- Pricing models
- Market positioning
Output as a comparison table with relative strengths.
"""

🌱 Sustainability Focused Prompt

green_prompt = """
Extract environmental sustainability information:
1. Green initiatives
2. Carbon reduction targets
3. Eco-friendly technologies
4. Sustainability certifications
5. Renewable energy usage
Prioritize quantitative metrics and timelines.
"""

πŸ’‘ Innovation Tracking Prompt

innovation_prompt = """
Analyze R&D activities and innovations:
- Recent patents filed
- Research partnerships
- New product launches (last 24 months)
- Technology breakthroughs
- Investment in R&D (% of revenue)
"""

πŸ› οΈ Prompt Optimization Tips

  1. Be Specific: Clearly define required fields and formats

    "Format output as JSON with 'company_name', 'employees', 'revenue' keys"
  2. Add Context:

    "Analyze content from CEO interviews for strategic priorities"
  3. Define Output Structure:

    "Categorize findings under 'basic_info', 'tech_stack', 'growth_metrics'"
  4. Set Priorities:

    "Focus on technical specifications over marketing content"
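Putting the four tips together, a single prompt might look like the sketch below. This is only an illustrative string, not output from or input required by the library:

```python
# Hypothetical prompt combining all four tips: specific fields, added
# context, a defined output structure, and explicit priorities.
combined_prompt = """
Analyze content from product pages and CEO interviews.
Format output as JSON with keys 'basic_info', 'tech_stack', 'growth_metrics':
- 'basic_info': 'company_name', 'employees', 'revenue'
- 'tech_stack': technologies with version numbers where available
- 'growth_metrics': year-over-year figures with sources
Focus on technical specifications over marketing content.
"""
```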

⚠️ Error Handling

ScrapeGen provides specific exception classes for detailed error handling:

  • ❗ ScrapeGenError: Base exception class
  • βš™οΈ ConfigurationError: Errors related to scraper configuration
  • πŸ•·οΈ ScrapingError: Issues encountered during web scraping
  • πŸ” ExtractionError: Problems with AI-driven data extraction

Example usage:

try:
    data = scraper.scrape(
        url=url,
        prompt=complex_prompt,
        base_model=MarketAnalysis
    )
except ExtractionError as e:
    print(f"πŸ” Extraction failed with custom prompt: {e}")
    print(f"🧠 Prompt used: {complex_prompt}")
except ScrapingError as e:
    print(f"🌐 Scraping error: {e}")

πŸ—οΈ Architecture

ScrapeGen follows a modular design for scalability and maintainability:

  1. πŸ•·οΈ WebsiteScraper: Handles core web scraping logic
  2. πŸ“‘ InfoExtractorAi: Performs AI-driven content extraction
  3. πŸ€– LlmManager: Manages interactions with language models
  4. πŸ”— UrlParser: Parses and normalizes URLs
  5. πŸ“₯ ContentExtractor: Extracts structured data from HTML elements
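The flow between these components can be sketched conceptually as below. The class names come from the list above, but every function, method, and signature here is hypothetical, purely to illustrate how data might move through the pipeline:

```python
# Conceptual sketch only -- not ScrapeGen's actual API.
def scrape_pipeline(url: str, prompt: str) -> dict:
    # UrlParser: normalize the URL (trailing-slash stripping as a stand-in)
    normalized = url.rstrip("/")
    # WebsiteScraper would fetch the page; a placeholder document stands in
    html = "<p>placeholder page body</p>"
    # ContentExtractor: pull structured text out of the HTML
    text = html.replace("<p>", "").replace("</p>", "")
    # InfoExtractorAi + LlmManager: send text and prompt to Gemini, then
    # validate the response against a Pydantic model (omitted here)
    return {"url": normalized, "text": text, "prompt": prompt}
```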

βœ… Best Practices

1️⃣ Rate Limiting

  • ⏳ Use delays between requests
  • πŸ“œ Respect robots.txt guidelines
  • βš–οΈ Configure max_pages and max_depth responsibly

2️⃣ Error Handling

  • πŸ”„ Wrap scraping operations in try-except blocks
  • πŸ“‹ Implement proper logging for debugging
  • πŸ” Handle network timeouts and retries effectively

3️⃣ Resource Management

  • πŸ–₯️ Monitor memory usage for large-scale operations
  • πŸ“š Implement pagination for large datasets
  • ⏱️ Adjust timeout settings based on expected response times
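For the pagination advice above, a generic batching helper lets results be processed and released incrementally instead of held in memory all at once. This is plain Python, not part of ScrapeGen's API:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive fixed-size batches from any iterable; the
    final batch may be shorter."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```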

🀝 Contributing

Contributions are welcome! πŸŽ‰ Feel free to submit a Pull Request to improve ScrapeGen.

