Scraper Service

A Go-based web scraping service that analyzes webpages and provides detailed information about their structure and content.

API Documentation

The API provides endpoints for analyzing webpages and retrieving metrics.

Main Endpoints:

GET /api/v1/analyze: Analyze the webpage given in the url query parameter
GET /api/v1/system/metrics: Get Prometheus metrics

Prerequisites

Go 1.24.3 - The Go programming language
Google Chrome - Required for headless browser automation
Docker (optional) - For running the complete setup with monitoring and logging
Node.js (optional) - For commitlint and other development tools

Technologies Used

Backend

Go 1.24.3 - Programming language
Gin - Web framework
go-rod - Browser automation library for web scraping
Zap - Structured logging
Viper - Configuration management
OpenTelemetry - Distributed tracing (WIP)
Gin-contrib/cache - Response caching

DevOps & Monitoring

Docker - Containerization
Prometheus - Metrics collection
Grafana - Metrics and logs visualization
Air - Live reload for development

Frontend

The frontend client lives in a separate repository and is included in the Docker Compose setup.

Getting Started

Installation

Clone the repository:

```
git clone https://github.com/mrmihi/web-analyzer.git
cd web-analyzer
```

Development

Run the development server with hot reloading:
```
make dev
```

Docker Setup

Start the complete setup with Docker Compose, using the target that matches your OS:
```
# Linux
make sandbox-linux
# Windows
make sandbox-windows
```

Stop all services:

```
# Linux
make teardown-linux
# Windows
make teardown-windows
```

Access the services:
- Client: http://localhost:5173
- Scraper API: http://localhost:8080
- Grafana: http://localhost:3000

Usage

Analyzing a Webpage

Send a GET request to /api/v1/analyze/ with a URL query parameter:

```
curl --location 'http://localhost:8080/api/v1/analyze/?url=https://mrmihi.dev'
```

Example response (values are illustrative):

```
{
  "html_version": "HTML 5",
  "title": "Example Domain",
  "headings": {
    "h1": 1,
    "h2": 0,
    "h3": 0,
    "h4": 0,
    "h5": 0,
    "h6": 0
  },
  "internal_links": 1,
  "external_links": 1,
  "inaccessible_links": 0,
  "login_form": false
}
```

Project Structure

cmd/ - HTTP server initialization
common/ - Common utilities and error handling
config/ - Configuration management
dto/ - Data Transfer Objects
handlers/ - HTTP handlers and controllers
infra/ - Infrastructure configuration
  - docker-compose.yml - Docker Compose configuration
internal/ - Internal packages
  - logger/ - Logging utilities
  - scraper/ - Web scraping functionality
middleware/ - Middleware functions
routes/ - HTTP routes and server setup
services/ - Business logic services
tests/ - Test cases

Main Features

Webpage Analysis
- HTML version detection
- Page title extraction
- Heading counts (h1-h6)
- Internal and external link counting
- Login form detection
Monitoring and Observability
- Prometheus metrics
- Grafana dashboards
- Structured logging with Zap
- Distributed tracing with OpenTelemetry (WIP)
Robust Error Handling
- Request validation
- Timeout handling
- Graceful shutdown
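
As an illustration of the internal and external link counting listed under Webpage Analysis, one plausible rule is to resolve each `href` against the page URL and compare hostnames. The sketch below is an assumption about how such a rule could look, not the service's actual logic; `classifyLink` is a hypothetical helper, and treating subdomains as external is a simplification.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// classifyLink returns "internal" when href resolves to the same host as
// the page, "external" when it resolves to a different host, and "skipped"
// for non-HTTP schemes (mailto:, javascript:, ...) or unparsable input.
func classifyLink(pageURL, href string) string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "skipped"
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "skipped"
	}
	// Resolve relative links such as "/about" or "#top" against the page.
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "skipped"
	}
	if strings.EqualFold(abs.Hostname(), base.Hostname()) {
		return "internal"
	}
	return "external"
}

func main() {
	page := "https://example.com/docs/"
	hrefs := []string{"/about", "#top", "https://other.org/x", "mailto:a@example.com"}
	for _, h := range hrefs {
		fmt.Printf("%-24s %s\n", h, classifyLink(page, h))
	}
}
```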

Challenges and Solutions

Challenge 1: Headless Browser Automation

Problem: Controlling a headless browser for web scraping can be resource-intensive and prone to timeouts.
Solution: Implemented resource blocking for non-essential content (images, stylesheets, etc.) and added a 120-second timeout for analysis operations to ensure the API remains responsive.

Challenge 2: Link Analysis

Problem: Analyzing all links on a webpage could lead to excessive resource usage for pages with many links.
Solution: Implemented concurrent processing of links using goroutines to improve performance while maintaining accuracy.

Possible Improvements

Authentication: Add authentication for API access.
More Analysis Features: Extend the analysis with additional checks, such as:
- Image analysis
- SEO metrics
- Performance metrics
- Accessibility checks

License

This project is licensed under the MIT License - see the LICENSE file for details.
