Thank you for your interest in contributing to Scraper MCP! This document provides guidelines and instructions for contributing to the project.
- Code of Conduct
- Getting Started
- Development Setup
- Development Workflow
- Code Standards
- Testing
- Submitting Changes
- Areas for Contribution
We are committed to providing a welcoming and inclusive environment. Please be respectful and constructive in all interactions.
- Python 3.12 or higher
- uv package manager
- Docker and Docker Compose (for testing deployment)
- Git
- Fork the repository on GitHub
- Clone your fork locally:

      git clone git@github.com:YOUR_USERNAME/scraper-mcp.git
      cd scraper-mcp

- Add the upstream repository:

      git remote add upstream git@github.com:carrotly-ai/scraper-mcp.git

# Install the package with development dependencies
uv pip install -e ".[dev]"
# Run with default settings (stdio transport)
python -m scraper_mcp
# Run with HTTP transport
python -m scraper_mcp streamable-http 0.0.0.0 8000
With the HTTP transport running, open http://localhost:8000/ in your browser to access the monitoring dashboard, playground, and configuration interface.
git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix
- Write your code following the Code Standards
- Add or update tests as needed
- Update documentation (README, docstrings, comments)
- Run tests and linting locally
git fetch upstream
git rebase upstream/main
We use Ruff for linting and formatting:
# Check for linting issues
ruff check .
# Auto-fix linting issues
ruff check . --fix
# Format code
ruff format .
We use type hints throughout the codebase. Run type checking with:
mypy src/
All public functions and methods should have complete type annotations.
- Provider Pattern: New scraping backends should implement the ScraperProvider interface
- Utilities: HTML processing utilities go in src/scraper_mcp/utils.py
- Models: Use Pydantic v2 models for all data structures
- Async/Await: Use async patterns consistently throughout
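As a rough illustration of the provider pattern and the async convention, a new backend might look like the sketch below. The `ScraperProvider` base class here is a stand-in written for this example, not the project's actual interface, and `ScrapeResult` and `StaticProvider` are hypothetical names. Only the `scrape(url)` call and the `status_code` attribute mirror the test example elsewhere in this guide. A dataclass is used to keep the sketch dependency-free; in the project itself the result type would be a Pydantic v2 model.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeResult:
    """Hypothetical result type; the project itself uses Pydantic v2 models."""

    url: str
    status_code: int
    content: str


class ScraperProvider(ABC):
    """Stand-in for the project's interface; the real signatures may differ."""

    @abstractmethod
    async def scrape(self, url: str) -> ScrapeResult:
        """Fetch a URL and return its content."""


class StaticProvider(ScraperProvider):
    """Toy backend that serves canned pages instead of using the network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def scrape(self, url: str) -> ScrapeResult:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        if url not in self._pages:
            return ScrapeResult(url=url, status_code=404, content="")
        return ScrapeResult(url=url, status_code=200, content=self._pages[url])


async def main() -> None:
    provider = StaticProvider({"https://example.com": "<h1>Hello</h1>"})
    result = await provider.scrape("https://example.com")
    print(result.status_code, result.content)


if __name__ == "__main__":
    asyncio.run(main())
```

Keeping every backend behind one async method means callers can swap providers without changing their own code, which is presumably the point of the guideline.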
- Add docstrings to all public functions, classes, and methods
- Update README.md for user-facing changes
- Update CLAUDE.md for development guidance changes
- Include inline comments for complex logic
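To make the type annotation and docstring expectations concrete, here is a small sketch of a fully annotated public function. The function `extract_domain` is hypothetical and not part of the codebase, and the Google-style docstring layout is an assumption; follow whatever style the surrounding module already uses.

```python
from urllib.parse import urlsplit


def extract_domain(url: str, *, strip_www: bool = True) -> str:
    """Return the lowercase host name of a URL.

    Args:
        url: Absolute URL, for example "https://www.example.com/page".
        strip_www: Drop a leading "www." from the host when True.

    Returns:
        The host name without port or credentials.

    Raises:
        ValueError: If the URL has no host component.
    """
    host = urlsplit(url).hostname  # already lowercased by urllib
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    return host
```

Every parameter and the return value are annotated, so `mypy src/` can check callers, and the docstring records the one error case a caller needs to handle.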
Use Conventional Commits format:
feat: add support for JavaScript rendering
fix: resolve timeout issue with slow sites
docs: update proxy configuration examples
refactor: simplify retry logic
test: add tests for batch operations
chore: update dependencies
Keep commits focused and atomic. Each commit should represent a single logical change.
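If you want to check a subject line before committing, the format above can be approximated with a short regular expression. This is only a sketch and not part of the project's tooling: the type list is taken from the examples above, the optional scope is an assumption borrowed from the Conventional Commits convention, and the names `COMMIT_TYPES`, `SUBJECT_PATTERN`, and `is_conventional` are hypothetical.

```python
import re

# Types taken from the examples above; the Conventional Commits convention
# allows additional types and a "!" marker for breaking changes.
COMMIT_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

SUBJECT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[\w-]+)\))?"  # optional scope, e.g. fix(cache)
    r": (?P<description>\S.*)$"  # colon, one space, non-empty description
)


def is_conventional(subject: str) -> bool:
    """Return True if a commit subject line matches the pattern above."""
    return SUBJECT_PATTERN.match(subject) is not None
```

For example, `is_conventional("feat: add support for JavaScript rendering")` passes, while a subject such as `"Added stuff"` or `"feat:missing space"` does not.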
# Run all tests with coverage
pytest
# Run specific test file
pytest tests/test_server.py
# Run specific test class
pytest tests/test_server.py::TestScrapeUrlTool
# Run with verbose output
pytest -v
# Run without coverage report
pytest --no-cov
- Use pytest with pytest-asyncio for async tests
- Use pytest-mock for mocking
- Place test fixtures in tests/conftest.py
- Aim for >90% code coverage
- Test both success and error cases
- Test edge cases and boundary conditions
import pytest
from unittest.mock import Mock, patch

# RequestsProvider must also be imported from the package; its module path is
# not shown here. The provider fixture is expected to come from tests/conftest.py.


@pytest.mark.asyncio
async def test_feature_name(provider: RequestsProvider) -> None:
    """Test description."""
    # Arrange: build a fake HTTP response
    mock_response = Mock()
    mock_response.status_code = 200

    # Act: replace the session's get() so no real request is made
    with patch.object(provider.session, "get", return_value=mock_response):
        result = await provider.scrape("https://example.com")

    # Assert
    assert result.status_code == 200

- Run all checks:

      # Format code
      ruff format .
      # Fix linting issues
      ruff check . --fix
      # Type check
      mypy src/
      # Run tests
      pytest

- Update documentation if needed
- Add tests for new features or bug fixes
- Rebase on latest main:

      git fetch upstream
      git rebase upstream/main

- Push your branch to your fork:

      git push origin feature/your-feature-name

- Go to the repository on GitHub
- Click "New Pull Request"
- Select your fork and branch
- Fill out the PR template:
  - Title: Brief description (50 chars max)
  - Description:
    - What does this PR do?
    - Why is this change needed?
    - How was it tested?
    - Any breaking changes?
  - Link related issues: Use "Closes #123" or "Fixes #123"
- Maintainers will review your PR
- Address feedback and push updates
- Once approved, maintainers will merge your PR
- Your contribution will be credited in the release notes
We welcome contributions in these areas:
- New Scraping Providers: Implement ScraperProvider for Playwright, Selenium, or Scrapy
- Performance Optimizations: Improve caching, concurrency, or memory usage
- Documentation: Improve examples, tutorials, or API documentation
- Test Coverage: Add tests for edge cases or untested code paths
- Authentication Support: Add support for authenticated requests (OAuth, cookies, headers)
- Screenshot Capture: Add tools for capturing page screenshots
- Rate Limiting: Implement per-domain rate limiting
- Request Pooling: Connection pooling for improved performance
- Webhook Support: Trigger scrapes via webhooks
- Scheduled Scraping: Cron-like scheduling for periodic scrapes
- Export Formats: Add JSON, XML, or CSV export options
- Browser Fingerprinting: Advanced anti-detection techniques
- Sitemap Support: Parse and scrape from XML sitemaps
- Mobile User Agents: Better mobile scraping support
- Check the Issues page for open bugs
- Look for issues labeled "good first issue" or "help wanted"
- Improve README examples
- Add tutorials or guides
- Document common use cases
- Translate documentation (if applicable)
- Open an Issue for questions
- Check existing issues and discussions first
- Be specific and provide context
By contributing, you agree that your contributions will be licensed under the MIT License.
Thank you for contributing to Scraper MCP! 🎉