Contributing to Scraper MCP

Thank you for your interest in contributing to Scraper MCP! This document provides guidelines and instructions for contributing to the project.

Code of Conduct

We are committed to providing a welcoming and inclusive environment. Please be respectful and constructive in all interactions.

Getting Started

Prerequisites

  • Python 3.12 or higher
  • uv package manager
  • Docker and Docker Compose (for testing deployment)
  • Git

Fork and Clone

  1. Fork the repository on GitHub

  2. Clone your fork locally:

    git clone git@github.com:YOUR_USERNAME/scraper-mcp.git
    cd scraper-mcp

  3. Add the upstream repository:

    git remote add upstream git@github.com:carrotly-ai/scraper-mcp.git

Development Setup

Install Dependencies

# Install the package with development dependencies
uv pip install -e ".[dev]"

Run the Server Locally

# Run with default settings (stdio transport)
python -m scraper_mcp

# Run with HTTP transport
python -m scraper_mcp streamable-http 0.0.0.0 8000

Access the Dashboard

With the server running on the HTTP transport (port 8000 in the command above), open http://localhost:8000/ in your browser to access the monitoring dashboard, playground, and configuration interface.

Development Workflow

Create a Feature Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix

Make Your Changes

  1. Write your code following the Code Standards
  2. Add or update tests as needed
  3. Update documentation (README, docstrings, comments)
  4. Run tests and linting locally

Keep Your Branch Updated

git fetch upstream
git rebase upstream/main

Code Standards

Python Style

We use Ruff for linting and formatting:

# Check for linting issues
ruff check .

# Auto-fix linting issues
ruff check . --fix

# Format code
ruff format .

Type Hints

We use type hints throughout the codebase. Run type checking with:

mypy src/

All public functions and methods should have complete type annotations.
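
As an illustration of what "complete type annotations" means in practice, here is a minimal sketch. The function `normalize_url` and its parameters are hypothetical and are not part of the Scraper MCP codebase; only the annotation and docstring style is the point.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, default_scheme: str = "https") -> str:
    """Return ``url`` with a scheme and a lowercase host.

    Hypothetical helper, used only to illustrate annotation style.
    """
    # Prepend the default scheme when the caller passed a bare host.
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    # Lowercase only the network location; paths can be case-sensitive.
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))
```

Every parameter and the return value carry an annotation, so `mypy src/` can check callers without inferring anything.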

Code Organization

  • Provider Pattern: New scraping backends should implement the ScraperProvider interface
  • Utilities: HTML processing utilities go in src/scraper_mcp/utils.py
  • Models: Use Pydantic v2 models for all data structures
  • Async/Await: Use async patterns consistently throughout
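
The provider pattern above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's actual interface: the `ScraperProvider` base class shown here is a hypothetical reconstruction (only the `scrape` method and the `status_code` attribute appear elsewhere in this guide), `StaticProvider` and `ScrapeResult` are invented for the example, and a standard-library dataclass stands in for the Pydantic v2 model the project would use. Check the real base class in the source before implementing a backend.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeResult:
    """Stand-in result type; the project itself uses Pydantic v2 models."""

    url: str
    status_code: int
    content: str


class ScraperProvider(ABC):
    """Hypothetical interface; the real base class may differ."""

    @abstractmethod
    async def scrape(self, url: str) -> ScrapeResult:
        """Fetch ``url`` and return the scraped result."""


class StaticProvider(ScraperProvider):
    """Toy backend that serves canned pages instead of doing network I/O."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def scrape(self, url: str) -> ScrapeResult:
        # Yield to the event loop once, as a real backend would while waiting on I/O.
        await asyncio.sleep(0)
        if url not in self._pages:
            return ScrapeResult(url=url, status_code=404, content="")
        return ScrapeResult(url=url, status_code=200, content=self._pages[url])


async def main() -> None:
    provider = StaticProvider({"https://example.com": "<h1>Hello</h1>"})
    result = await provider.scrape("https://example.com")
    print(result.status_code, result.content)


if __name__ == "__main__":
    asyncio.run(main())
```

Keeping every backend behind one async method means callers never need to know which backend produced a result.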

Documentation

  • Add docstrings to all public functions, classes, and methods
  • Update README.md for user-facing changes
  • Update CLAUDE.md for development guidance changes
  • Include inline comments for complex logic

Commit Messages

Use Conventional Commits format:

feat: add support for JavaScript rendering
fix: resolve timeout issue with slow sites
docs: update proxy configuration examples
refactor: simplify retry logic
test: add tests for batch operations
chore: update dependencies

Keep commits focused and atomic. Each commit should represent a single logical change.
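
If you want to check a commit header before pushing, a small script along these lines can help. It is a sketch only: `is_conventional` is a hypothetical helper, not something shipped with the project, and the allowed types are limited to those in the examples above even though the Conventional Commits specification permits others.

```python
import re

# Types taken from the examples above; the Conventional Commits spec allows others.
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

# type, optional "(scope)", optional "!" for breaking changes, then ": description".
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$"
)


def is_conventional(header: str) -> bool:
    """Return True if the first line of a commit message matches the format."""
    match = HEADER_RE.match(header)
    return match is not None and match.group("type") in ALLOWED_TYPES
```

For example, `is_conventional("fix: resolve timeout issue with slow sites")` passes, while a header such as `"Added stuff"` does not.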

Testing

Run Tests

# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_server.py

# Run specific test class
pytest tests/test_server.py::TestScrapeUrlTool

# Run with verbose output
pytest -v

# Run without coverage report
pytest --no-cov

Writing Tests

  • Use pytest with pytest-asyncio for async tests
  • Use pytest-mock for mocking
  • Place test fixtures in tests/conftest.py
  • Aim for >90% code coverage
  • Test both success and error cases
  • Test edge cases and boundary conditions

Test Structure

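from unittest.mock import Mock, patch

import pytest

# RequestsProvider must also be imported from the scraper_mcp package; the
# module path is not shown here. The `provider` fixture is expected to come
# from tests/conftest.py.


@pytest.mark.asyncio
async def test_feature_name(provider: RequestsProvider) -> None:
    """Test description."""
    # Arrange: build a fake HTTP response
    mock_response = Mock()
    mock_response.status_code = 200

    # Act: replace the session's get() so no real request is made
    with patch.object(provider.session, "get", return_value=mock_response):
        result = await provider.scrape("https://example.com")

    # Assert
    assert result.status_code == 200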
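
The example above exercises the success path. An error-path test can follow the same Arrange/Act/Assert shape, as in this sketch. Everything here is assumed for illustration: `TinyProvider` is a hypothetical stand-in rather than the project's `RequestsProvider`, and returning the failure as data is one possible design, not necessarily how Scraper MCP reports errors. The sketch uses only the standard library; in the project's test suite you would add the `@pytest.mark.asyncio` decorator.

```python
import asyncio
from collections.abc import Awaitable, Callable
from unittest.mock import AsyncMock

Fetch = Callable[[str], Awaitable[tuple[int, str]]]


class TinyProvider:
    """Hypothetical stand-in for a provider; not the project's RequestsProvider."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch  # async callable: url -> (status_code, body)

    async def scrape(self, url: str) -> dict[str, object]:
        try:
            status_code, body = await self._fetch(url)
        except TimeoutError as exc:
            # Report the failure as data so callers can handle it uniformly.
            return {"url": url, "status_code": None, "error": str(exc)}
        return {"url": url, "status_code": status_code, "content": body}


async def test_scrape_reports_timeout() -> None:
    """Error path: the fetcher raises, the provider returns an error result."""
    # Arrange
    fetch = AsyncMock(side_effect=TimeoutError("timed out"))
    provider = TinyProvider(fetch)

    # Act
    result = await provider.scrape("https://example.com")

    # Assert
    assert result["status_code"] is None
    assert result["error"] == "timed out"
    fetch.assert_awaited_once_with("https://example.com")
```

Outside pytest, the coroutine can be run directly with `asyncio.run(test_scrape_reports_timeout())`.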

Submitting Changes

Before Submitting

  1. Run all checks:

    # Format code
    ruff format .
    
    # Fix linting issues
    ruff check . --fix
    
    # Type check
    mypy src/
    
    # Run tests
    pytest

  2. Update documentation if needed

  3. Add tests for new features or bug fixes

  4. Rebase on latest main:

    git fetch upstream
    git rebase upstream/main

Create a Pull Request

  1. Push your branch to your fork:

    git push origin feature/your-feature-name

  2. Go to the repository on GitHub

  3. Click "New Pull Request"

  4. Select your fork and branch

  5. Fill out the PR template:

    • Title: Brief description (50 chars max)
    • Description:
      • What does this PR do?
      • Why is this change needed?
      • How was it tested?
      • Any breaking changes?
    • Link related issues: Use "Closes #123" or "Fixes #123"

PR Review Process

  • Maintainers will review your PR
  • Address feedback and push updates
  • Once approved, maintainers will merge your PR
  • Your contribution will be credited in the release notes

Areas for Contribution

We welcome contributions in these areas:

High Priority

  • New Scraping Providers: Implement ScraperProvider for Playwright, Selenium, or Scrapy
  • Performance Optimizations: Improve caching, concurrency, or memory usage
  • Documentation: Improve examples, tutorials, or API documentation
  • Test Coverage: Add tests for edge cases or untested code paths

Feature Ideas

  • Authentication Support: Add support for authenticated requests (OAuth, cookies, headers)
  • Screenshot Capture: Add tools for capturing page screenshots
  • Rate Limiting: Implement per-domain rate limiting
  • Request Pooling: Connection pooling for improved performance
  • Webhook Support: Trigger scrapes via webhooks
  • Scheduled Scraping: Cron-like scheduling for periodic scrapes
  • Export Formats: Add JSON, XML, or CSV export options
  • Browser Fingerprinting: Advanced anti-detection techniques
  • Sitemap Support: Parse and scrape from XML sitemaps
  • Mobile User Agents: Better mobile scraping support

Bug Fixes

  • Check the Issues page for open bugs
  • Look for issues labeled good first issue or help wanted

Documentation

  • Improve README examples
  • Add tutorials or guides
  • Document common use cases
  • Translate documentation (if applicable)

Questions?

  • Open an Issue for questions
  • Check existing issues and discussions first
  • Be specific and provide context

License

By contributing, you agree that your contributions will be licensed under the MIT License.


Thank you for contributing to Scraper MCP! 🎉