A Go-based web scraping service that analyzes webpages and provides detailed information about their structure and content.
The API provides endpoints for analyzing webpages and retrieving metrics.
- GET /api/v1/analyze: Analyze a webpage by providing a URL
- GET /api/v1/system/metrics: Get Prometheus metrics
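For orientation, here is a minimal Gin sketch of how these routes could be wired up; the handler body and the Prometheus wiring are assumptions for illustration, not the project's actual code:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// analyzeHandler is a stand-in for the real handler in handlers/.
func analyzeHandler(c *gin.Context) {
	url := c.Query("url")
	if url == "" {
		c.JSON(http.StatusBadRequest, gin.H{"error": "url query parameter is required"})
		return
	}
	// ... run the scraper and return its analysis ...
	c.JSON(http.StatusOK, gin.H{"url": url})
}

func main() {
	r := gin.Default()
	v1 := r.Group("/api/v1")
	v1.GET("/analyze", analyzeHandler)
	v1.GET("/system/metrics", gin.WrapH(promhttp.Handler()))
	r.Run(":8080")
}
```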
Prerequisites:

- Go 1.24.3 - The Go programming language
- Google Chrome - Required for headless browser automation
- Docker (optional) - For running the complete setup with monitoring and logging
- Node.js (optional) - For commitlint and other development tools
Tech stack:

- Go 1.24.3 - Programming language
- Gin - Web framework
- go-rod - Browser automation library for web scraping
- Zap - Structured logging
- Viper - Configuration management
- OpenTelemetry - Distributed tracing (WIP)
- Gin-contrib/cache - Response caching
- Docker - Containerization
- Prometheus - Metrics collection
- Grafana - Metrics and logs visualization
- Air - Live reload for development
- The frontend client is available in a separate repository and is included in the Docker Compose setup
Getting started:

- Clone the repository:

  ```bash
  git clone https://github.com/mrmihi/web-analyzer.git
  cd web-analyzer
  ```

- Run the development server with hot reloading:

  ```bash
  make dev
  ```
- Start the complete setup with Docker Compose, depending on your OS:

  ```bash
  make sandbox-linux
  make sandbox-windows
  ```
- Stop all services:

  ```bash
  make teardown-linux
  make teardown-windows
  ```
- Access the services:
  - Client: http://localhost:5173
  - Scraper API: http://localhost:8080
  - Grafana: http://localhost:3000
Send a GET request to /api/v1/analyze/ with a URL query parameter:
```bash
curl --location 'http://localhost:8080/api/v1/analyze/?url=https://mrmihi.dev'
```

Example response:

```json
{
  "html_version": "HTML 5",
  "title": "Example Domain",
  "headings": {
    "h1": 1,
    "h2": 0,
    "h3": 0,
    "h4": 0,
    "h5": 0,
    "h6": 0
  },
  "internal_links": 1,
  "external_links": 1,
  "inaccessible_links": 0,
  "login_form": false
}
```
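The response maps naturally onto a small DTO. A minimal sketch, with field names inferred from the JSON keys (the actual definitions in dto/ may differ):

```go
package dto

// AnalysisResponse mirrors the example JSON returned by /api/v1/analyze.
// The shape is inferred from the response above, not copied from the repo.
type AnalysisResponse struct {
	HTMLVersion       string         `json:"html_version"`
	Title             string         `json:"title"`
	Headings          map[string]int `json:"headings"`
	InternalLinks     int            `json:"internal_links"`
	ExternalLinks     int            `json:"external_links"`
	InaccessibleLinks int            `json:"inaccessible_links"`
	LoginForm         bool           `json:"login_form"`
}
```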
Project structure:

- cmd/ - HTTP server initialization
- common/ - Common utilities and error handling
- config/ - Configuration management
- dto/ - Data Transfer Objects
- handlers/ - HTTP handlers and controllers
- infra/ - Infrastructure configuration
  - docker-compose.yml - Docker Compose configuration
- internal/ - Internal packages
  - logger/ - Logging utilities
  - scraper/ - Web scraping functionality
- middleware/ - Middleware functions
- routes/ - HTTP routes and server setup
- services/ - Business logic services
- tests/ - Test cases
Features:

- Webpage Analysis (see the go-rod sketch after this list)
  - HTML version detection
  - Page title extraction
  - Heading counts (h1-h6)
  - Internal and external link counting
  - Login form detection
- Monitoring and Observability
  - Prometheus metrics
  - Grafana dashboards
  - Structured logging with Zap
  - Distributed tracing with OpenTelemetry (WIP)
- Robust Error Handling
  - Request validation
  - Timeout handling
  - Graceful shutdown
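A minimal go-rod sketch of a few of the analysis features above (title, heading counts, login form detection). The selectors and the password-input heuristic are assumptions for illustration; the real logic lives in internal/scraper/:

```go
package main

import (
	"fmt"

	"github.com/go-rod/rod"
)

func main() {
	// Connect to a headless Chrome instance managed by go-rod.
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage("https://example.com").MustWaitLoad()

	// Page title.
	fmt.Println("title:", page.MustInfo().Title)

	// Heading counts (h1-h6).
	headings := map[string]int{}
	for _, tag := range []string{"h1", "h2", "h3", "h4", "h5", "h6"} {
		headings[tag] = len(page.MustElements(tag))
	}
	fmt.Println("headings:", headings)

	// Heuristic: a password input suggests a login form.
	loginForm := len(page.MustElements("input[type=password]")) > 0
	fmt.Println("login_form:", loginForm)
}
```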
Challenges and solutions:

- Problem: Controlling a headless browser for web scraping can be resource-intensive and prone to timeouts.
- Solution: Implemented resource blocking for non-essential content (images, stylesheets, etc.) and added a 120-second timeout for analysis operations to ensure the API remains responsive (sketched below).
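A sketch of what this can look like with go-rod's request hijacking; the blocked resource types and the navigation flow are illustrative, not the project's exact code:

```go
package main

import (
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/proto"
)

func main() {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage()

	// Block non-essential resources before navigating.
	router := page.HijackRequests()
	router.MustAdd("*", func(h *rod.Hijack) {
		switch h.Request.Type() {
		case proto.NetworkResourceTypeImage,
			proto.NetworkResourceTypeStylesheet,
			proto.NetworkResourceTypeFont,
			proto.NetworkResourceTypeMedia:
			h.Response.Fail(proto.NetworkErrorReasonBlockedByClient)
		default:
			h.ContinueRequest(&proto.FetchContinueRequest{})
		}
	})
	go router.Run()

	// Bound the whole analysis so a slow page cannot hang the API.
	page.Timeout(120 * time.Second).
		MustNavigate("https://example.com").
		MustWaitLoad()
}
```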
- Problem: Analyzing all links on a webpage could lead to excessive resource usage for pages with many links.
- Solution: Implemented concurrent processing of links using goroutines to improve performance while maintaining accuracy (sketched below).
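A sketch of the fan-out, assuming a HEAD-based reachability check and a concurrency cap of 10 (both illustrative choices, not necessarily the project's):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// countInaccessible probes each link concurrently and counts failures.
func countInaccessible(links []string) int {
	var (
		wg           sync.WaitGroup
		mu           sync.Mutex
		inaccessible int
	)
	sem := make(chan struct{}, 10) // cap concurrent requests
	client := &http.Client{Timeout: 5 * time.Second}

	for _, link := range links {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			sem <- struct{}{} // acquire a slot
			defer func() { <-sem }()

			resp, err := client.Head(url)
			if err != nil {
				mu.Lock()
				inaccessible++
				mu.Unlock()
				return
			}
			resp.Body.Close()
			if resp.StatusCode >= 400 {
				mu.Lock()
				inaccessible++
				mu.Unlock()
			}
		}(link)
	}
	wg.Wait()
	return inaccessible
}

func main() {
	links := []string{"https://example.com", "https://example.com/missing"}
	fmt.Println("inaccessible:", countInaccessible(links))
}
```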
Future improvements:

- Authentication: Add authentication for API access.
- More Analysis Features: Extend the webpage analysis with:
  - Image analysis
  - SEO metrics
  - Performance metrics
  - Accessibility checks
This project is licensed under the MIT License - see the LICENSE file for details.