Hacker News

Quick Start

The Docker Compose file defines the garage-meta and garage-data volumes as external volumes, so they need to be created manually once before starting the stack:

docker volume create garage-meta
docker volume create garage-data
docker compose build hn-producer
docker compose up -d

Before launching a notebook, create your access key, secret key, and buckets in the Garage UI.
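
The notebook needs these credentials to reach the buckets through Garage's S3-compatible API. A minimal sketch of how they might be handed to Spark follows. The environment variable names (GARAGE_ACCESS_KEY, GARAGE_SECRET_KEY, GARAGE_ENDPOINT) and the default endpoint http://garage:3900 are assumptions for illustration and are not taken from this repository; only the fs.s3a.* keys are standard Hadoop S3A options.

```python
import os


def s3a_spark_conf(env=None):
    """Build Spark settings for an S3-compatible store such as Garage.

    The variable names and the default endpoint are hypothetical;
    adjust them to match your Docker Compose file.
    """
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("GARAGE_ENDPOINT", "http://garage:3900"),
        "spark.hadoop.fs.s3a.access.key": env["GARAGE_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["GARAGE_SECRET_KEY"],
        # Path-style addressing is the usual choice for self-hosted S3 endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
```

Each key/value pair can then be applied with SparkSession.builder.config(key, value) before the session is created.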


Garage UI

Open in browser: http://localhost:3909/

Spark UI

Open in browser: http://localhost:8080/ui

Kafka UI

Open in browser: http://localhost:8082

  • View topics: hn-stories, hn-comments
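
As a rough illustration of how items could end up in these two topics, the sketch below routes one HN API item by its type field. The topic names come from this README; the routing rule, the choice of the item id as message key, and the function name to_kafka_record are assumptions, not the producer's actual code.

```python
import json

# Topic names come from the README; the routing rule is an assumption.
TOPIC_BY_TYPE = {"story": "hn-stories", "comment": "hn-comments"}


def to_kafka_record(item):
    """Map one HN API item to (topic, key, value), or None if it is not routed."""
    topic = TOPIC_BY_TYPE.get(item.get("type"))
    if topic is None:
        return None  # jobs, polls, etc. are skipped in this sketch
    key = str(item["id"]).encode("utf-8")
    value = json.dumps(item, separators=(",", ":")).encode("utf-8")
    return topic, key, value


record = to_kafka_record({"id": 1, "type": "story", "title": "Hello"})
print(record[0])  # hn-stories
```

Keying by item id would keep every update for the same item in one partition, which preserves per-item ordering.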

Query Delta Lake (Jupyter Notebook)

jupyter notebook explore_data.ipynb

πŸ—οΈ Architecture

HN API → Kafka Producer → Kafka Topics
                             ↓
                    ┌────────────────┐
                    │  BRONZE Layer  │  ← Spark + Delta Lake
                    │  (Raw Data)    │     • Kafka → Delta
                    └────────────────┘     • ACID writes
                             ↓
                    ┌────────────────┐
                    │  SILVER Layer  │  ← Spark + Delta Lake
                    │  (Clean Data)  │     • HTML cleaning
                    └────────────────┘     • Quality scoring

Data Schema

Bronze Layer (Raw from Kafka)

Stories: id, by, title, url, score, descendants, time, type, text, kids, _kafka_offset, _kafka_partition, _bronze_ingested_at

Comments: id, by, parent, story_id, text, time, type, kids, deleted, dead, _kafka_offset, _kafka_partition, _bronze_ingested_at
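
The three underscore-prefixed columns are ingestion metadata rather than HN fields. A plain-Python sketch of how one Kafka message might become a Bronze row is shown below. The actual pipeline uses Spark, so the function to_bronze_row and the ISO 8601 string used for the ingestion timestamp are illustrative assumptions.

```python
import json
from datetime import datetime, timezone


def to_bronze_row(value, offset, partition, now=None):
    """Parse a Kafka message value and attach the Bronze metadata columns."""
    row = json.loads(value)
    row["_kafka_offset"] = offset
    row["_kafka_partition"] = partition
    # Ingestion time is recorded in UTC; the string format is an assumption.
    row["_bronze_ingested_at"] = (now or datetime.now(timezone.utc)).isoformat()
    return row
```

Keeping the offset and partition alongside each row makes it possible to trace a Bronze record back to the exact Kafka message it came from.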

Silver Layer (Cleaned)

Stories: id, author, title, url, score, comment_count, timestamp, text_raw, text_clean, has_url, has_text, type

Comments: id, author, story_id, parent, timestamp, text_raw, text_clean, has_text, word_count, char_count, has_replies, is_deleted, is_dead, quality_score, type
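
To make the Bronze-to-Silver mapping for comments concrete, here is a plain-Python sketch of the column derivations. The cleaning rules (strip tags, decode entities, collapse whitespace), the use of the cleaned text for the counts, and the function names are assumptions; the real job runs in Spark. quality_score is left out because this README does not describe how it is computed.

```python
import html
import re
from datetime import datetime, timezone

TAG_RE = re.compile(r"<[^>]+>")


def clean_html(text):
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text or "")
    return " ".join(html.unescape(no_tags).split())


def to_silver_comment(bronze):
    """Derive the Silver comment columns from one Bronze comment row."""
    text_raw = bronze.get("text")
    text_clean = clean_html(text_raw)
    return {
        "id": bronze["id"],
        "author": bronze.get("by"),
        "story_id": bronze.get("story_id"),
        "parent": bronze.get("parent"),
        # HN reports time as Unix seconds.
        "timestamp": datetime.fromtimestamp(bronze["time"], tz=timezone.utc),
        "text_raw": text_raw,
        "text_clean": text_clean,
        "has_text": bool(text_clean),
        "word_count": len(text_clean.split()),
        "char_count": len(text_clean),
        "has_replies": bool(bronze.get("kids")),
        "is_deleted": bool(bronze.get("deleted")),
        "is_dead": bool(bronze.get("dead")),
        "type": bronze.get("type"),
    }
```

For example, a Bronze text of "Hi<p>there &amp; bye" would yield a text_clean of "Hi there & bye" with a word_count of 4.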