Sitemap Generator

A robust, thread-safe sitemap generator that crawls a domain recursively to a configurable depth and creates an XML sitemap, optionally limited to pages modified after a given date. It uses multi-threading for performance, suits SEO and site analysis, and supports both local execution and AWS Lambda deployment.
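
The generated file follows the standard sitemaps.org <urlset> format. As a rough illustration (not the project's actual code), producing such a document needs only the standard library:

import xml.etree.ElementTree as ET
from datetime import date


def build_sitemap(urls: list[str]) -> bytes:
    """Serialize URLs into a minimal sitemaps.org <urlset> document."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # A real run would use each page's actual modification time here.
        ET.SubElement(entry, "lastmod").text = date.today().isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)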

Features

  • Thread-safe crawling with configurable worker threads
  • Breadth-first crawling for better performance (a sketch of this pattern follows the list)
  • Robust error handling with comprehensive logging
  • Environment-based configuration (no more hardcoded values)
  • AWS S3 integration for cloud deployment
  • Robots.txt compliance with fallback handling
  • URL normalization and deduplication
  • Time-based filtering for content freshness
  • Request timeout handling to prevent hanging
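
The first two features combine in the crawl loop: each BFS depth layer is fetched concurrently on a thread pool, and deduplication happens between layers. This is a minimal sketch of that pattern, not the project's actual code; it assumes the third-party requests library, and the names LinkParser, fetch_links, and crawl are illustrative:

from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests  # third-party dependency, assumed here for brevity


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]


def fetch_links(url: str, timeout: int = 30) -> list[str]:
    """Fetch one page and return the absolute URLs it links to."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return []  # skip unreachable pages instead of crashing the crawl
    parser = LinkParser()
    parser.feed(resp.text)
    return [urljoin(url, href) for href in parser.links]


def crawl(start_url: str, domain: str, max_depth: int = 10, max_workers: int = 5) -> set[str]:
    """Breadth-first crawl: fetch each depth layer concurrently."""
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        # Pages in one layer are fetched in parallel; the shared 'seen'
        # set is updated only in the main thread, which sidesteps locking.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(fetch_links, frontier))
        frontier = []
        for links in results:
            for link in links:
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        if not frontier:
            break
    return seen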

Requirements

  • Python 3.11 or higher
  • Dependencies listed in requirements.txt

Installation

  1. Clone the repository:
git clone https://github.com/cnkang/sitemap-generator.git
cd sitemap-generator
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables:
cp .env.example .env
# Edit .env with your settings

Configuration

The generator uses environment variables for configuration. Copy .env.example to .env and modify the values described below (a sketch of how the script might read them follows the AWS section):

Basic Configuration

  • DOMAIN: Target domain (e.g., 'www.example.com')
  • START_URL: Starting URL for crawling
  • OUTPUT_FILENAME: Output sitemap filename
  • MAX_DEPTH: Maximum crawling depth (default: 10)
  • MAX_WORKERS: Number of concurrent threads (default: 5)

Advanced Options

  • USE_TIME_FILTER: Filter by modification time (true/false)
  • RESPECT_ROBOTS_TXT: Follow robots.txt rules (true/false)
  • REQUEST_TIMEOUT: HTTP request timeout in seconds (default: 30)
  • USER_AGENT: Custom user agent string

AWS Configuration (for Lambda deployment)

  • RUN_LOCALLY: Set to 'false' to upload the sitemap to S3 instead of writing it locally
  • S3_BUCKET: Target S3 bucket name
  • S3_KEY: S3 object key for the sitemap
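
One plausible way the script can read these variables (the env_bool helper and any defaults not documented above are assumptions, not the project's exact code):

import os


def env_bool(name: str, default: str = "false") -> bool:
    """Interpret a 'true'/'false' environment variable."""
    return os.getenv(name, default).strip().lower() == "true"


DOMAIN = os.getenv("DOMAIN", "www.example.com")
START_URL = os.getenv("START_URL", f"https://{DOMAIN}")
OUTPUT_FILENAME = os.getenv("OUTPUT_FILENAME", "sitemap.xml")
MAX_DEPTH = int(os.getenv("MAX_DEPTH", "10"))       # documented default
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "5"))    # documented default
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
USER_AGENT = os.getenv("USER_AGENT", "sitemap-generator")  # fallback string assumed
USE_TIME_FILTER = env_bool("USE_TIME_FILTER")
RESPECT_ROBOTS_TXT = env_bool("RESPECT_ROBOTS_TXT", "true")  # default assumed
RUN_LOCALLY = env_bool("RUN_LOCALLY", "true")
S3_BUCKET = os.getenv("S3_BUCKET", "")
S3_KEY = os.getenv("S3_KEY", "sitemap.xml")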

Usage

Local Execution

python sitemap_generator.py

AWS Lambda Deployment

The script includes a lambda_handler function for AWS Lambda deployment. Set RUN_LOCALLY=false to enable S3 upload.
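
The handler's exact body isn't reproduced here, but the upload side of such a function typically reduces to a boto3 put_object call. A sketch, where generate_sitemap() is a hypothetical stand-in for the crawl-and-build step:

import os

import boto3


def lambda_handler(event, context):
    """Crawl, then write the sitemap locally or upload it to S3."""
    xml_bytes = generate_sitemap()  # hypothetical helper for the crawl/build step
    if os.getenv("RUN_LOCALLY", "true").lower() == "true":
        with open(os.getenv("OUTPUT_FILENAME", "sitemap.xml"), "wb") as f:
            f.write(xml_bytes)
    else:
        boto3.client("s3").put_object(
            Bucket=os.environ["S3_BUCKET"],
            Key=os.getenv("S3_KEY", "sitemap.xml"),
            Body=xml_bytes,
            ContentType="application/xml",
        )
    return {"statusCode": 200}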

Example Configuration

# Basic setup for crawling example.com
DOMAIN=example.com
START_URL=https://example.com
OUTPUT_FILENAME=sitemap.xml
MAX_DEPTH=5
MAX_WORKERS=3

Best Practices

  • Start small: Begin with MAX_DEPTH=3 and MAX_WORKERS=3 for testing
  • Respect rate limits: Keep MAX_WORKERS modest to avoid overwhelming target servers
  • Monitor logs: The script provides detailed logging for troubleshooting
  • Test robots.txt: Ensure your crawler respects the target site's crawling policies (see the sketch after this list)
  • Use time filters: Enable USE_TIME_FILTER to focus on recently updated content
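
For the robots.txt point above, the standard library's urllib.robotparser is enough to perform the check. This sketch (the can_fetch wrapper name is illustrative) mirrors the "fallback handling" feature by treating an unreadable robots.txt as permissive:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits crawling url (or can't be read)."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    try:
        parser.read()
    except OSError:
        return True  # fallback: treat an unreadable robots.txt as permissive
    return parser.can_fetch(user_agent, url)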

Improvements Made

This version includes several fixes and improvements over the original:

  • Fixed syntax errors and improved code structure
  • Modern Python 3.11+ features with union syntax and built-in generics
  • Added comprehensive error handling with proper logging
  • Implemented environment-based configuration instead of hardcoded values
  • Improved thread safety and performance
  • Added URL normalization to prevent duplicates (sketched after this list)
  • Better robots.txt handling with fallback options
  • Enhanced AWS integration with proper error handling
  • Type-safe code with modern type annotations
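
As an illustration of the URL normalization point (a sketch with an illustrative name, not the project's code), canonicalizing URLs keeps trivially different spellings from being crawled twice:

from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings deduplicate."""
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    netloc = parts.netloc.lower()
    if (parts.scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname or netloc  # drop the default port
    path = parts.path or "/"
    if len(path) > 1:
        path = path.rstrip("/")  # treat /docs/ and /docs as the same page
    # Drop the fragment; it never changes the fetched document.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

For example, normalize_url("HTTPS://Example.com:443/docs/#intro") and normalize_url("https://example.com/docs") both return "https://example.com/docs", so the page is visited once.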
