A robust, thread-safe sitemap generator that crawls websites and creates XML sitemaps. Supports both local execution and AWS Lambda deployment.
- Thread-safe crawling with configurable worker threads
- Breadth-first crawling for better performance
- Robust error handling with comprehensive logging
- Environment-based configuration (no more hardcoded values)
- AWS S3 integration for cloud deployment
- Robots.txt compliance with fallback handling
- URL normalization and deduplication (see the sketch after this list)
- Time-based filtering for content freshness
- Request timeout handling to prevent hanging
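The URL normalization and deduplication step can be illustrated with a short sketch (illustrative only; it assumes normalization means lowercasing the host, dropping fragments, and trimming default ports, and the script's actual rules may differ):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Illustrative normalization: lowercase the host, drop fragments and default ports."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    # Strip default ports so http://host:80/ and http://host/ count as the same page.
    if (parts.scheme == "http" and netloc.endswith(":80")) or (
        parts.scheme == "https" and netloc.endswith(":443")
    ):
        netloc = netloc.rsplit(":", 1)[0]
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))

# Deduplication then reduces to keeping a set of normalized URLs.
seen: set[str] = set()
for candidate in ("https://Example.com/page#top", "https://example.com:443/page"):
    seen.add(normalize_url(candidate))
print(seen)  # {'https://example.com/page'}
```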
- Python 3.11 or higher
- Dependencies listed in `requirements.txt`
- Clone the repository:

  ```bash
  git clone https://github.com/cnkang/sitemap-generator.git
  cd sitemap-generator
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your settings
  ```

The generator uses environment variables for configuration. Copy `.env.example` to `.env` and modify:
- `DOMAIN`: Target domain (e.g., 'www.example.com')
- `START_URL`: Starting URL for crawling
- `OUTPUT_FILENAME`: Output sitemap filename
- `MAX_DEPTH`: Maximum crawling depth (default: 10)
- `MAX_WORKERS`: Number of concurrent threads (default: 5)
- `USE_TIME_FILTER`: Filter by modification time (true/false)
- `RESPECT_ROBOTS_TXT`: Follow robots.txt rules (true/false)
- `REQUEST_TIMEOUT`: HTTP request timeout in seconds (default: 30)
- `USER_AGENT`: Custom user agent string
- `RUN_LOCALLY`: Set to 'false' for S3 upload
- `S3_BUCKET`: Target S3 bucket name
- `S3_KEY`: S3 object key for the sitemap
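A minimal sketch of how these variables might be read at startup, assuming plain `os.environ` access (the boolean defaults below are assumptions, and the real script may load `.env` through a helper first):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    """Interpret a true/false environment variable."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

# Illustrative loading that mirrors the variable list above.
DOMAIN = os.environ.get("DOMAIN", "www.example.com")
START_URL = os.environ.get("START_URL", f"https://{DOMAIN}")
OUTPUT_FILENAME = os.environ.get("OUTPUT_FILENAME", "sitemap.xml")
MAX_DEPTH = int(os.environ.get("MAX_DEPTH", "10"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "5"))
REQUEST_TIMEOUT = int(os.environ.get("REQUEST_TIMEOUT", "30"))
USE_TIME_FILTER = env_bool("USE_TIME_FILTER", False)       # assumption: off by default
RESPECT_ROBOTS_TXT = env_bool("RESPECT_ROBOTS_TXT", True)  # assumption: on by default
RUN_LOCALLY = env_bool("RUN_LOCALLY", True)                # assumption: local by default
```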
Run locally with:

```bash
python sitemap_generator.py
```

The script includes a `lambda_handler` function for AWS Lambda deployment. Set `RUN_LOCALLY=false` to enable S3 upload.
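The S3 upload path could look roughly like the sketch below, using `boto3` (available in the Lambda Python runtime). The `(event, context)` signature is the standard Lambda convention; `generate_sitemap()` is a hypothetical helper standing in for the script's actual crawl-and-build step:

```python
import os
import boto3  # AWS SDK for Python

def lambda_handler(event, context):
    """Illustrative Lambda entry point: build the sitemap, then upload it to S3."""
    sitemap_xml = generate_sitemap()  # hypothetical helper returning the XML string

    if os.environ.get("RUN_LOCALLY", "true").lower() == "false":
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket=os.environ["S3_BUCKET"],
            Key=os.environ["S3_KEY"],
            Body=sitemap_xml.encode("utf-8"),
            ContentType="application/xml",
        )
    return {"statusCode": 200, "body": "sitemap generated"}
```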
Example configuration:

```
# Basic setup for crawling example.com
DOMAIN=example.com
START_URL=https://example.com
OUTPUT_FILENAME=sitemap.xml
MAX_DEPTH=5
MAX_WORKERS=3
```

- Start small: Begin with `MAX_DEPTH=3` and `MAX_WORKERS=3` for testing
- Respect rate limits: Don't set `MAX_WORKERS` too high, to avoid overwhelming servers
- Monitor logs: The script provides detailed logging for troubleshooting
- Test robots.txt: Ensure your crawler respects the target site's crawling policies
- Use time filters: Enable `USE_TIME_FILTER` to focus on recently updated content (a sketch follows this list)
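How the time filter decides what counts as recently updated is up to the script; one common approach, sketched here with the `requests` library, is to compare a page's `Last-Modified` header against a cutoff (the threshold and error handling here are illustrative, not the script's exact logic):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

import requests

def is_recent(url: str, max_age_days: int = 30, timeout: int = 30) -> bool:
    """Illustrative freshness check based on the Last-Modified response header."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        last_modified = resp.headers.get("Last-Modified")
        if last_modified is None:
            return True  # keep pages that do not report a modification time
        modified_at = parsedate_to_datetime(last_modified)
        cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
        return modified_at >= cutoff
    except (requests.RequestException, TypeError, ValueError):
        return True  # on errors, keep the URL rather than silently drop it
```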
This version includes several improvements over the original:
- Fixed syntax errors and improved code structure
- Modern Python 3.11+ features with union syntax and built-in generics
- Added comprehensive error handling with proper logging
- Implemented environment-based configuration instead of hardcoded values
- Improved thread safety and performance
- Added URL normalization to prevent duplicates
- Better robots.txt handling with fallback options (see the sketch below)
- Enhanced AWS integration with proper error handling
- Type-safe code with modern type annotations
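For the robots.txt compliance and fallback handling, a minimal sketch using the standard library's `urllib.robotparser` (illustrative; the default user agent is a placeholder and the script's actual fallback policy may differ):

```python
from urllib import robotparser
from urllib.parse import urlparse

def can_fetch(url: str, user_agent: str = "SitemapGenerator") -> bool:
    """Illustrative robots.txt check that falls back to allowing the URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable: fall back to allowing the crawl
    return parser.can_fetch(user_agent, url)
```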