A concurrent, high-performance web crawler built in Go, designed for scalability and resilience. This project features a sophisticated architecture using Redis for caching and queuing, Cassandra for metadata storage, and Prometheus for real-time performance monitoring.
This crawler is built on a distributed, microservice-inspired architecture designed for high throughput. The core of the system is a concurrent worker pool that processes tasks from a multi-priority queue in Redis. To ensure politeness and avoid overwhelming servers, a token-bucket rate limiter (implemented as a Redis Lua script) is used.
The storage layer is designed for scale, using Cassandra for metadata, S3 for raw content, and a Bloom Filter in Redis to prevent re-crawling duplicate URLs. The entire system is instrumented with Prometheus metrics, which can be visualized in Grafana to provide real-time insight into the crawler's performance.
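As a flavor of the instrumentation, here is a minimal sketch of exposing a `/metrics` endpoint with the official Prometheus Go client. The counter name and helper below are hypothetical, not the project's actual metric set; the real endpoint and port are listed in the quick-start steps further down.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pagesCrawled is a hypothetical counter; the real project defines its own metric set.
var pagesCrawled = promauto.NewCounter(prometheus.CounterOpts{
	Name: "crawler_pages_crawled_total",
	Help: "Total number of pages fetched by the workers.",
})

// recordPage would be called by a worker after a successful fetch.
func recordPage() { pagesCrawled.Inc() }

func main() {
	recordPage() // placeholder increment so the metric appears when scraped

	// Expose every registered metric; Prometheus scrapes this endpoint and
	// Grafana visualizes the resulting time series.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```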
🏛️ Detailed Architecture & Design Decisions
The core logic is separated into two main processes: fetching and storing, each handled by a dedicated type of worker.
*(Diagrams: Fetch Process Flow and Store Process Flow.)*
The key design decisions were made to ensure scalability, resilience, and high performance:
- **High Concurrency with a Worker Pool:** The core of the crawler is a pool of goroutine workers. This design maximizes CPU utilization by processing multiple pages in parallel and allows the application to be scaled horizontally by simply increasing the number of worker instances.
- **Distributed Caching & Queuing with Redis:** Redis was chosen for its high-performance, in-memory data structures.
  - **Task Queues:** Redis Streams are used to implement a durable, multi-priority task queue. This allows for reliable task distribution and ensures that high-priority work (like processing sitemaps) is handled first (see the queue sketch after this list).
  - **Duplicate Prevention:** A Bloom filter, a memory-efficient probabilistic data structure, keeps track of all visited URLs. This dramatically reduces redundant work and saves storage resources (see the Bloom-filter sketch after this list).
- **Scalable Storage Layer:** The storage system is designed to handle a massive volume of write operations.
  - **Metadata (Cassandra):** A Cassandra cluster was chosen for storing page metadata. Its masterless architecture and high write throughput are ideal for a write-heavy application like a web crawler.
  - **Content (S3):** The raw HTML content of each page is saved to an S3-compatible blob store, which provides cheap, durable, and highly available storage for large objects.
- **Polite Crawling:** To ensure the crawler is a good citizen of the web, a politeness manager respects `robots.txt` rules (cached in Redis) and uses a token-bucket rate-limiting algorithm. The algorithm is implemented in a Redis Lua script to guarantee atomic check-and-decrement operations, preventing race conditions in a highly concurrent environment.
- **Real-Time Monitoring:** The application exposes detailed performance metrics via a `/metrics` endpoint. This allows for real-time monitoring with Prometheus and visualization in Grafana, providing crucial insight into the crawler's health and performance.
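To make the queueing decision concrete, here is a minimal sketch of a Redis Streams priority queue using `github.com/redis/go-redis/v9`. The stream names, task fields, and consumer group are illustrative assumptions, not the project's actual schema.

```go
// Sketch of a multi-priority task queue on Redis Streams.
package queue

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Illustrative priority order, matching the fallback order described below.
var priorities = []string{"tasks:high", "tasks:medium", "tasks:store", "tasks:retry"}

// Enqueue appends a task to the stream for its priority level.
func Enqueue(ctx context.Context, rdb *redis.Client, stream, url, taskType string) error {
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream,
		Values: map[string]interface{}{"url": url, "type": taskType},
	}).Err()
}

// Dequeue blocks on a single stream through a consumer group so each task is
// delivered to exactly one worker and can be acknowledged after processing.
// It assumes the group was created at startup (XGROUP CREATE ... MKSTREAM).
func Dequeue(ctx context.Context, rdb *redis.Client, stream, consumer string) ([]redis.XMessage, error) {
	res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    "crawler-workers",
		Consumer: consumer,
		Streams:  []string{stream, ">"},
		Count:    1,
		Block:    0, // block until a task arrives
	}).Result()
	if err != nil {
		return nil, err
	}
	if len(res) == 0 {
		return nil, nil
	}
	return res[0].Messages, nil
}
```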
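And a similarly hedged sketch of the Bloom-filter duplicate check, assuming the RedisBloom module and an illustrative key name; the real crawler may size and name its filter differently.

```go
// Sketch of duplicate-URL detection with a RedisBloom filter.
// The filter itself would be created once at startup, e.g.
// BF.RESERVE crawler:seen_urls 0.001 10000000.
package dedupe

import (
	"context"

	"github.com/redis/go-redis/v9"
)

const seenKey = "crawler:seen_urls"

// Seen marks url as visited and reports whether it had been seen before.
// BF.ADD returns 1 when the element was newly added and 0 when it was
// (probably) already present, so a single round trip covers both cases.
func Seen(ctx context.Context, rdb *redis.Client, url string) (bool, error) {
	added, err := rdb.Do(ctx, "BF.ADD", seenKey, url).Int()
	if err != nil {
		return false, err
	}
	return added == 0, nil
}
```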
The worker pool uses an intelligent, priority-based scheduling system to ensure that high-priority tasks are handled first, while also maximizing worker utilization. Each worker is initialized with a primary queue to listen to, but the pool can dynamically assign tasks from other queues.
The logic is as follows: When a worker requests a new task, the worker pool first attempts to pull a task from the worker's assigned primary queue. If that queue is empty, the pool will scan all queues in a fixed fallback order (High > Medium > Store > Retry) and assign the first available task it finds. This "work-stealing" approach ensures that no worker sits idle as long as there is work to be done anywhere in the system.
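A minimal sketch of that selection logic, with an illustrative `Queue` interface and queue names standing in for the project's actual types:

```go
// Sketch of the priority-based task selection described above: try the
// worker's primary queue first, then fall back through a fixed order.
package pool

// Queue abstracts one task queue; Pop returns nil when the queue is empty.
type Queue interface {
	Name() string
	Pop() *Task
}

type Task struct {
	URL  string
	Type string
}

// Fixed fallback order: High > Medium > Store > Retry.
var fallbackOrder = []string{"high", "medium", "store", "retry"}

// NextTask implements the "work-stealing" selection: the worker's primary
// queue is checked first, then every queue in the fixed fallback order.
func NextTask(primary Queue, all map[string]Queue) *Task {
	if t := primary.Pop(); t != nil {
		return t
	}
	for _, name := range fallbackOrder {
		q, ok := all[name]
		if !ok || q.Name() == primary.Name() {
			continue // skip missing queues and the primary we already tried
		}
		if t := q.Pop(); t != nil {
			return t
		}
	}
	return nil // no work anywhere; the worker can block or back off
}
```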
To ensure the crawler is a good citizen of the web, it uses an event-driven politeness flow with a clean separation of concerns. The worker, not the politeness manager, is responsible for initiating the fetch of a new robots.txt file.
The flow is as follows: A worker asks the politeness manager for permission to crawl a URL. The manager executes a single, atomic Lua script on Redis that checks the rate-limit tokens and returns the cached robots.txt data if available. If the Redis key for the domain does not exist, the manager returns a redis.Nil error. The worker interprets this error as a signal to create a new, high-priority fetch_rules task and sends it back to the queue. This decoupled, event-driven design prevents the politeness manager from making network calls itself and keeps the system robust.
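The sketch below illustrates this flow with `go-redis`. The Lua script body, key layout, and simplified token accounting (no refill step) are assumptions; only the overall shape and the `redis.Nil` handling mirror the description above.

```go
// Sketch of the event-driven politeness check described above.
package politeness

import (
	"context"
	"errors"

	"github.com/redis/go-redis/v9"
)

// The script runs atomically in Redis: if the per-domain key is missing it
// returns nil (seen as redis.Nil in Go); otherwise it consumes one token and
// returns the cached robots.txt rules.
var allowScript = redis.NewScript(`
if redis.call("EXISTS", KEYS[1]) == 0 then
  return nil
end
local tokens = tonumber(redis.call("HGET", KEYS[1], "tokens") or "0")
if tokens <= 0 then
  return {0, ""}
end
redis.call("HINCRBY", KEYS[1], "tokens", -1)
return {1, redis.call("HGET", KEYS[1], "robots") or ""}
`)

// ErrRulesMissing signals the worker to enqueue a high-priority fetch_rules task.
var ErrRulesMissing = errors.New("politeness: no cached rules for domain")

func Allow(ctx context.Context, rdb *redis.Client, domain string) (bool, string, error) {
	res, err := allowScript.Run(ctx, rdb, []string{"politeness:" + domain}).Result()
	if errors.Is(err, redis.Nil) {
		// No cached robots.txt: the caller creates a fetch_rules task.
		return false, "", ErrRulesMissing
	}
	if err != nil {
		return false, "", err
	}
	vals := res.([]interface{})
	allowed := vals[0].(int64) == 1
	robots, _ := vals[1].(string)
	return allowed, robots, nil
}
```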
The crawler is capable of processing thousands of pages in minutes on a single machine. The primary bottleneck is intentionally the network I/O and politeness delays, not the application's processing logic.
Note on performance: these numbers come from running the crawler on a single machine against a diverse set of websites. Actual results will vary with your hardware, network conditions, and the websites you choose to crawl.
This project is fully containerized using Docker and Docker Compose for easy setup.
Prerequisites: Docker and Docker Compose.

- Clone the repository:

  ```bash
  git clone https://github.com/NesterovYehor/Crawler.git
  cd Crawler
  ```

- Build and run the services:

  ```bash
  docker compose up --build
  ```

  This will start the crawler, Redis, Cassandra, Prometheus, and Grafana containers.

- View the metrics:
  - The crawler's metrics are exposed at http://localhost:2112/metrics.
  - The Prometheus UI is available at http://localhost:9090.
  - The Grafana dashboard is available at http://localhost:3000 (log in with `admin`/`admin`).
The application is configured via the `config.yaml` file. Key options include worker pool size, concurrency limits, and database connection details. Environment variables are used within the `docker-compose.yml` file to wire the services together.
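As a rough illustration of how such a file can be mapped onto Go types (the field names are hypothetical; see `config.yaml` in the repository for the real keys):

```go
// Sketch of loading config.yaml into a typed struct with gopkg.in/yaml.v3.
package config

import (
	"os"

	"gopkg.in/yaml.v3"
)

type Config struct {
	WorkerPoolSize int      `yaml:"worker_pool_size"` // number of concurrent workers
	MaxConcurrency int      `yaml:"max_concurrency"`  // per-domain concurrency limit
	RedisAddr      string   `yaml:"redis_addr"`
	CassandraHosts []string `yaml:"cassandra_hosts"`
	S3Bucket       string   `yaml:"s3_bucket"`
}

// Load reads and decodes the YAML config file at path.
func Load(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```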

