Feat/simple gold (#17)

Angry-Jay · h00dieB0y · web-flow · commit 8959d294cba2 · 2026-01-22T15:36:18.000+01:00
* feat: Add Silver to Gold transformation notebook with Spark NLP integration

Co-authored-by: Angry-Jay <jinozebian@gmail.com>

* feat: Enhance Silver to Gold transformation with keyword extraction and improved NER pipeline

Co-authored-by: Angry-Jay <jinozebian@gmail.com>

* feat: Enhance documentation in Silver to Gold notebook with detailed objectives and explanations for each pipeline

Co-authored-by: Angry-Jay <jinozebian@gmail.com>

* feat: Simplify and enhance documentation in Silver to Gold notebook for clarity and conciseness

* refactor: Remove comments_enriched from Silver layer (YAGNI)

- Remove redundant join in Silver (comments + stories)
- Join is now done in Gold SparkSQL when needed
- Simplify Silver to only 2 tables: stories, comments

* feat: Update Docker configuration to change Spark service condition and enhance notebook dependencies

* feat: Update Garage access keys and enhance notebook metadata for consistency

* feat: Add reminder for creating access key, secret key, and buckets in Garage UI before launching a notebook

* refactor: Consolidate Spark session configuration and credentials in Silver to Gold notebook

* feat: Refactor Spark session configuration and enhance sentiment analysis with Universal Sentence Encoder

* fix: Update Garage access keys for consistency and remove unused savefig calls

* fix: Replace size function with length for keyword filtering in notebook

* feat: Consolidate Spark NLP imports and add author ranking analysis with window function

* chore: Add TODO comments for replacing Garage credentials in notebooks

* Update README.md

* Update notebooks/03_silver_to_gold.ipynb

* Update notebooks/03_silver_to_gold.ipynb

* Update notebooks/03_silver_to_gold.ipynb

* Update notebooks/03_silver_to_gold.ipynb

* Update notebooks/03_silver_to_gold.ipynb
---------

Co-authored-by: Manne Emile KITSOUKOU <emilemannekitsoukou@gmail.com>
Co-authored-by: Angry-Jay <jinozebian@gmail.com>
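
Review note on the `comments_enriched` refactor: the commit message says the comments/stories join now happens in Gold SparkSQL when needed, but the Gold notebook body is not shown in this diff. The sketch below is only one way that on-demand join could look. The helper names `build_comments_enriched_sql` and `register_and_join` and the view names are hypothetical; the column aliases mirror the join removed from the Silver notebook.

```python
def build_comments_enriched_sql(comments_view="comments", stories_view="stories"):
    """Build a Spark SQL query equivalent to the removed Silver join.

    The view names are assumptions; the aliases (story_id, story_title,
    story_score, story_domain) come from the deleted Silver cell.
    """
    return f"""
        SELECT c.*,
               s.id     AS story_id,
               s.title  AS story_title,
               s.score  AS story_score,
               s.domain AS story_domain
        FROM {comments_view} AS c
        LEFT JOIN {stories_view} AS s
          ON c.parent = s.id
    """


def register_and_join(spark, silver_path="s3a://silver/hackernews"):
    """Register the two Silver tables as temp views and run the join.

    `spark` is an existing SparkSession configured for Delta and S3A,
    as in the notebooks; nothing is imported from pyspark here.
    """
    spark.read.format("delta").load(f"{silver_path}/stories") \
        .createOrReplaceTempView("stories")
    spark.read.format("delta").load(f"{silver_path}/comments") \
        .createOrReplaceTempView("comments")
    return spark.sql(build_comments_enriched_sql())
```

Keeping the query in a small builder makes it easy to reuse in Gold without persisting a third Silver table, which is the point of the refactor.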
diff --git a/README.md b/README.md
@@ -9,6 +9,7 @@ docker volume create garage-data
 docker-compose up -d
 ```
 
+Don't forget to create your access key, secret key, and buckets in the Garage UI before launching a notebook!
 
 ---
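
Review note on the credentials reminder: the notebooks in this PR hardcode the Garage keys and carry a TODO asking each user to replace them. A minimal sketch of an alternative, assuming the keys are exported as environment variables, is shown below. The function `load_garage_credentials` and the variable names `GARAGE_ACCESS_KEY`, `GARAGE_SECRET_KEY` and `GARAGE_ENDPOINT` are hypothetical; they simply mirror the notebook constants.

```python
import os


def load_garage_credentials(env=None):
    """Read Garage credentials from the environment instead of the notebook.

    `env` defaults to os.environ; passing a dict makes the function testable.
    The variable names are assumptions that mirror the notebook constants.
    """
    env = os.environ if env is None else env
    required = ("GARAGE_ACCESS_KEY", "GARAGE_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing Garage credentials: " + ", ".join(missing)
            + ". Create them in the Garage UI and export them first."
        )
    return {
        # Default endpoint taken from the notebooks in this PR.
        "endpoint": env.get("GARAGE_ENDPOINT", "http://garage:3900"),
        "access_key": env["GARAGE_ACCESS_KEY"],
        "secret_key": env["GARAGE_SECRET_KEY"],
    }
```

In a notebook, the returned values could be passed to the existing `hadoop_conf.set("fs.s3a.access.key", ...)` calls, so that no key is committed with the notebook.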
 
diff --git a/docker-compose.yml b/docker-compose.yml
@@ -62,7 +62,7 @@ services:
       kafka:
         condition: service_healthy
       spark:
-        condition: service_healthy
+        condition: service_started
     ports:
       - "8888:8888"
       - "4040:4040"
@@ -86,20 +86,14 @@ services:
     ports:
       - '8080:8080'
       - '7077:7077'
-    healthcheck:
-      test: ["CMD-SHELL", "nc -z localhost 7077 || exit 1"]
-      interval: 10s
-      timeout: 5s
-      retries: 12
-      start_period: 30s
     networks:
       - hackernews-network
 
   spark-worker:
     image: grosinosky/spark:3.5.3
     depends_on:
       spark:
-        condition: service_healthy
+        condition: service_started
     environment:
       - SPARK_MODE=worker
       - SPARK_MASTER_URL=spark://spark:7077
@@ -164,4 +158,4 @@ volumes:
 
 networks:
   hackernews-network:
-    driver: bridge
+    driver: bridge
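
Review note on the `service_started` change: with the healthcheck removed, Compose only waits for the `spark` container to start, not for the master to listen on port 7077. If a notebook runs before the master is ready, a small wait loop can play the role of the removed `nc -z localhost 7077` probe. The sketch below is an assumption about how that could be done on the client side; `wait_for_port` and its default timings are hypothetical and not part of this PR.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout` seconds
    have elapsed. This is a plain TCP check, like `nc -z`.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible use in a notebook before building the SparkSession:
#   if not wait_for_port("spark", 7077):
#       raise RuntimeError("Spark master is not reachable on spark:7077")
```

This only confirms that the port accepts connections; it does not prove that the master is ready to schedule jobs.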
diff --git a/notebooks/01_kafka_to_bronze.ipynb b/notebooks/01_kafka_to_bronze.ipynb
@@ -26,10 +26,11 @@
     "from pyspark.sql import SQLContext\n",
     "\n",
     "# Configuration\n",
+    "# TODO: Replace with your own Garage credentials\n",
     "KAFKA_SERVERS = \"kafka:9092\"\n",
     "GARAGE_ENDPOINT = \"http://garage:3900\"\n",
-    "GARAGE_ACCESS_KEY = \"GKa25124b4fd82613c063217f3\"\n",
-    "GARAGE_SECRET_KEY = \"008126399688f9b1efc3a3093079b066e4c6471fa256b52788da0c927194147e\"\n",
+    "GARAGE_ACCESS_KEY = \"GK907b22f51dc0d0c5164474f2\"\n",
+    "GARAGE_SECRET_KEY = \"6cf587853042d92d2cf6bb85b7c46a6a2400a47822e9baae32f9be0b7c5c9663\"\n",
     "\n",
     "# Spark config avec cluster (inspiré TP8)\n",
     "conf = SparkConf() \\\n",
@@ -292,13 +293,21 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
    "name": "python",
-   "version": "3.12.0"
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
   }
  },
  "nbformat": 4,
diff --git a/notebooks/02_bronze_to_silver.ipynb b/notebooks/02_bronze_to_silver.ipynb
@@ -3,143 +3,240 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "# Bronze → Silver Layer"
+   "source": [
+    "# Bronze → Silver Layer"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 1. Spark Configuration"
+   "source": [
+    "## 1. Spark Configuration"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "from pyspark.sql import SparkSession\n\nGARAGE_ENDPOINT = \"http://garage:3900\"\nGARAGE_ACCESS_KEY = \"GKa25124b4fd82613c063217f3\"\nGARAGE_SECRET_KEY = \"008126399688f9b1efc3a3093079b066e4c6471fa256b52788da0c927194147e\"\n\nBRONZE_PATH = \"s3a://bronze/hackernews\"\nSILVER_PATH = \"s3a://silver/hackernews\"\n\nspark = SparkSession.builder \\\n    .appName(\"BronzeToSilver\") \\\n    .master(\"spark://spark:7077\") \\\n    .config(\"spark.jars.packages\", \n            \"org.apache.hadoop:hadoop-aws:3.3.4,\"\n            \"com.amazonaws:aws-java-sdk-bundle:1.12.262,\"\n            \"io.delta:delta-spark_2.12:3.3.0\") \\\n    .config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n    .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n    .config(\"spark.hadoop.fs.s3a.multiobjectdelete.enable\", \"false\") \\\n    .config(\"spark.sql.shuffle.partitions\", \"10\") \\\n    .getOrCreate()\n\nhadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()\nhadoop_conf.set(\"fs.s3a.endpoint\", GARAGE_ENDPOINT)\nhadoop_conf.set(\"fs.s3a.access.key\", GARAGE_ACCESS_KEY)\nhadoop_conf.set(\"fs.s3a.secret.key\", GARAGE_SECRET_KEY)\nhadoop_conf.set(\"fs.s3a.endpoint.region\", \"garage\")\nhadoop_conf.set(\"fs.s3a.path.style.access\", \"true\")\nhadoop_conf.set(\"fs.s3a.impl\", \"org.apache.hadoop.fs.s3a.S3AFileSystem\")\nhadoop_conf.set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
+   "source": [
+    "from pyspark.sql import SparkSession\n",
+    "\n",
+    "# TODO: Replace with your own Garage credentials\n",
+    "GARAGE_ENDPOINT = \"http://garage:3900\"\n",
+    "GARAGE_ACCESS_KEY = \"GK907b22f51dc0d0c5164474f2\"\n",
+    "GARAGE_SECRET_KEY = \"6cf587853042d92d2cf6bb85b7c46a6a2400a47822e9baae32f9be0b7c5c9663\"\n",
+    "\n",
+    "BRONZE_PATH = \"s3a://bronze/hackernews\"\n",
+    "SILVER_PATH = \"s3a://silver/hackernews\"\n",
+    "\n",
+    "spark = SparkSession.builder \\\n",
+    "    .appName(\"BronzeToSilver\") \\\n",
+    "    .master(\"spark://spark:7077\") \\\n",
+    "    .config(\"spark.jars.packages\", \n",
+    "            \"org.apache.hadoop:hadoop-aws:3.3.4,\"\n",
+    "            \"com.amazonaws:aws-java-sdk-bundle:1.12.262,\"\n",
+    "            \"io.delta:delta-spark_2.12:3.3.0\") \\\n",
+    "    .config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n",
+    "    .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n",
+    "    .config(\"spark.hadoop.fs.s3a.multiobjectdelete.enable\", \"false\") \\\n",
+    "    .config(\"spark.sql.shuffle.partitions\", \"10\") \\\n",
+    "    .getOrCreate()\n",
+    "\n",
+    "hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()\n",
+    "hadoop_conf.set(\"fs.s3a.endpoint\", GARAGE_ENDPOINT)\n",
+    "hadoop_conf.set(\"fs.s3a.access.key\", GARAGE_ACCESS_KEY)\n",
+    "hadoop_conf.set(\"fs.s3a.secret.key\", GARAGE_SECRET_KEY)\n",
+    "hadoop_conf.set(\"fs.s3a.endpoint.region\", \"garage\")\n",
+    "hadoop_conf.set(\"fs.s3a.path.style.access\", \"true\")\n",
+    "hadoop_conf.set(\"fs.s3a.impl\", \"org.apache.hadoop.fs.s3a.S3AFileSystem\")\n",
+    "hadoop_conf.set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 2. Silver Bucket Creation"
+   "source": [
+    "## 2. Silver Bucket Creation"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "# Create the \"silver\" bucket manually via the Garage CLI/WebUI if needed"
+   "source": [
+    "# Create the \"silver\" bucket manually via the Garage CLI/WebUI if needed"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 3. Reading Bronze"
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": "stories_bronze = spark.read.format(\"delta\").load(f\"{BRONZE_PATH}/stories\")\ncomments_bronze = spark.read.format(\"delta\").load(f\"{BRONZE_PATH}/comments\")\n\nprint(f\"Stories: {stories_bronze.count()}, Comments: {comments_bronze.count()}\")"
+   "source": [
+    "## 3. Reading Bronze"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "stories_bronze.printSchema()"
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": "## 4. Cleaning Functions"
+   "source": [
+    "stories_bronze = spark.read.format(\"delta\").load(f\"{BRONZE_PATH}/stories\")\n",
+    "comments_bronze = spark.read.format(\"delta\").load(f\"{BRONZE_PATH}/comments\")\n",
+    "\n",
+    "print(f\"Stories: {stories_bronze.count()}, Comments: {comments_bronze.count()}\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "from pyspark.sql.functions import col, when, regexp_replace, regexp_extract, length, trim, coalesce, lit\n\ndef clean_html(column):\n    c = col(column)\n    c = regexp_replace(c, r\"<[^>]+>\", \" \")\n    c = regexp_replace(c, r\"\\s+\", \" \")\n\n    html_entities = {\n        r\"&#x27;\": \"'\",\n        r\"&#x2F;\": \"/\",\n        r\"&quot;\": '\"',\n        r\"&amp;\": \"&\",\n        r\"&lt;\": \"<\",\n        r\"&gt;\": \">\"\n    }\n    for k, v in html_entities.items():\n        c = regexp_replace(c, k, v)\n\n    return when(col(column).isNull(), lit(\"\")).otherwise(trim(c))\n\ndef extract_domain(column):\n    return regexp_extract(col(column), r\"https?://(?:www\\.)?([^/]+)\", 1)"
+   "source": [
+    "stories_bronze.printSchema()"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 5. Cleaning Stories"
+   "source": [
+    "## 4. Cleaning Functions"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "stories_silver = stories_bronze \\\n    .filter(col(\"id\").isNotNull()) \\\n    .dropDuplicates([\"id\"]) \\\n    .withColumn(\"text_clean\", clean_html(\"text\")) \\\n    .withColumn(\"domain\", extract_domain(\"url\")) \\\n    .select(\"id\", \"by\", \"title\", \"url\", \"domain\", \"score\", \"descendants\", \n            \"text_clean\", \"timestamp\", \"_ingested_at\")\n\nstories_silver.show(3, truncate=40)"
+   "source": [
+    "from pyspark.sql.functions import col, when, regexp_replace, regexp_extract, length, trim, coalesce, lit\n",
+    "\n",
+    "def clean_html(column):\n",
+    "    c = col(column)\n",
+    "    c = regexp_replace(c, r\"<[^>]+>\", \" \")\n",
+    "    c = regexp_replace(c, r\"\\s+\", \" \")\n",
+    "\n",
+    "    html_entities = {\n",
+    "        r\"&#x27;\": \"'\",\n",
+    "        r\"&#x2F;\": \"/\",\n",
+    "        r\"&quot;\": '\"',\n",
+    "        r\"&amp;\": \"&\",\n",
+    "        r\"&lt;\": \"<\",\n",
+    "        r\"&gt;\": \">\"\n",
+    "    }\n",
+    "    for k, v in html_entities.items():\n",
+    "        c = regexp_replace(c, k, v)\n",
+    "\n",
+    "    return when(col(column).isNull(), lit(\"\")).otherwise(trim(c))\n",
+    "\n",
+    "def extract_domain(column):\n",
+    "    return regexp_extract(col(column), r\"https?://(?:www\\.)?([^/]+)\", 1)"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 6. Cleaning Comments"
+   "source": [
+    "## 5. Cleaning Stories"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "comments_silver = comments_bronze \\\n    .filter(col(\"id\").isNotNull()) \\\n    .filter(coalesce(col(\"deleted\"), lit(False)) == False) \\\n    .filter(coalesce(col(\"dead\"), lit(False)) == False) \\\n    .dropDuplicates([\"id\"]) \\\n    .withColumn(\"text_clean\", clean_html(\"text\")) \\\n    .filter(length(col(\"text_clean\")) > 0) \\\n    .select(\"id\", \"by\", \"parent\", \"text_clean\", \"timestamp\", \"_ingested_at\")\n\ncomments_silver.show(3, truncate=40)"
+   "source": [
+    "stories_silver = stories_bronze \\\n",
+    "    .filter(col(\"id\").isNotNull()) \\\n",
+    "    .dropDuplicates([\"id\"]) \\\n",
+    "    .withColumn(\"text_clean\", clean_html(\"text\")) \\\n",
+    "    .withColumn(\"domain\", extract_domain(\"url\")) \\\n",
+    "    .select(\"id\", \"by\", \"title\", \"url\", \"domain\", \"score\", \"descendants\", \n",
+    "            \"text_clean\", \"timestamp\", \"_ingested_at\")\n",
+    "\n",
+    "stories_silver.show(3, truncate=40)"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 7. Joining Comments + Stories"
+   "source": [
+    "## 6. Cleaning Comments"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "stories_for_join = stories_silver.select(\n    col(\"id\").alias(\"story_id\"),\n    col(\"title\").alias(\"story_title\"),\n    col(\"score\").alias(\"story_score\"),\n    col(\"domain\").alias(\"story_domain\")\n)\n\ncomments_enriched = comments_silver.join(\n    stories_for_join,\n    comments_silver[\"parent\"] == stories_for_join[\"story_id\"],\n    \"left\"\n)\n\ncomments_enriched.show(3, truncate=30)"
+   "source": [
+    "comments_silver = comments_bronze \\\n",
+    "    .filter(col(\"id\").isNotNull()) \\\n",
+    "    .filter(coalesce(col(\"deleted\"), lit(False)) == False) \\\n",
+    "    .filter(coalesce(col(\"dead\"), lit(False)) == False) \\\n",
+    "    .dropDuplicates([\"id\"]) \\\n",
+    "    .withColumn(\"text_clean\", clean_html(\"text\")) \\\n",
+    "    .filter(length(col(\"text_clean\")) > 0) \\\n",
+    "    .select(\"id\", \"by\", \"parent\", \"text_clean\", \"timestamp\", \"_ingested_at\")\n",
+    "\n",
+    "comments_silver.show(3, truncate=40)"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 8. Writing Silver"
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": "stories_silver.write.format(\"delta\").mode(\"overwrite\").save(f\"{SILVER_PATH}/stories\")"
+   "source": [
+    "## 7. Writing Silver"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "comments_silver.write.format(\"delta\").mode(\"overwrite\").save(f\"{SILVER_PATH}/comments\")"
+   "source": [
+    "stories_silver.write.format(\"delta\").mode(\"overwrite\").save(f\"{SILVER_PATH}/stories\")"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "comments_enriched.write.format(\"delta\").mode(\"overwrite\").save(f\"{SILVER_PATH}/comments_enriched\")"
+   "source": [
+    "comments_silver.write.format(\"delta\").mode(\"overwrite\").save(f\"{SILVER_PATH}/comments\")"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 9. Verification"
+   "source": [
+    "## 8. Verification"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "spark.read.format(\"delta\").load(f\"{SILVER_PATH}/stories\").show(3, truncate=30)\nspark.read.format(\"delta\").load(f\"{SILVER_PATH}/comments\").show(3, truncate=30)\nspark.read.format(\"delta\").load(f\"{SILVER_PATH}/comments_enriched\").show(3, truncate=30)"
+   "source": [
+    "spark.read.format(\"delta\").load(f\"{SILVER_PATH}/stories\").show(3, truncate=30)\n",
+    "spark.read.format(\"delta\").load(f\"{SILVER_PATH}/comments\").show(3, truncate=30)"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "spark.read.format(\"delta\").load(f\"{SILVER_PATH}/stories\") \\\n    .filter(col(\"domain\") != \"\") \\\n    .groupBy(\"domain\").count() \\\n    .orderBy(col(\"count\").desc()) \\\n    .show(5)"
+   "source": [
+    "spark.read.format(\"delta\").load(f\"{SILVER_PATH}/stories\") \\\n",
+    "    .filter(col(\"domain\") != \"\") \\\n",
+    "    .groupBy(\"domain\").count() \\\n",
+    "    .orderBy(col(\"count\").desc()) \\\n",
+    "    .show(5)"
+   ]
   },
   {
    "cell_type": "code",
@@ -158,10 +255,18 @@
    "name": "python3"
   },
   "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
    "name": "python",
-   "version": "3.11.0"
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.11"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
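
Review note on the cleaning functions: `clean_html` and `extract_domain` in the Silver notebook are built from Spark column expressions, which are hard to check without a cluster. The sketch below is a plain-Python approximation using the same regular expressions, meant only for quick local checks; it is not the notebook's code. One deliberate difference, stated here as an assumption about the desired behaviour: `&amp;` is decoded last, because the notebook decodes it before `&lt;` and `&gt;`, so a text like `&amp;lt;` appears to be decoded twice and end up as `<`.

```python
import re

# Same entities as the notebook, but with "&amp;" moved to the end so that
# an escaped entity such as "&amp;lt;" is decoded only once.
_ENTITIES = [
    ("&#x27;", "'"),
    ("&#x2F;", "/"),
    ("&quot;", '"'),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&amp;", "&"),
]


def clean_html(text):
    """Strip tags, collapse whitespace, decode a few entities, then trim."""
    if text is None:
        return ""  # the notebook also maps NULL to an empty string
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    for entity, char in _ENTITIES:
        text = text.replace(entity, char)
    return text.strip()


def extract_domain(url):
    """Return the host part of an http(s) URL without a leading "www."."""
    if url is None:
        return None  # Spark's regexp_extract returns NULL for NULL input
    match = re.search(r"https?://(?:www\.)?([^/]+)", url)
    return match.group(1) if match else ""
```

For example, `extract_domain("https://www.example.com/path")` gives `"example.com"`, and an input with no URL gives an empty string, which is why the verification cell filters on `domain != ""`.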
diff --git a/notebooks/03_silver_to_gold.ipynb b/notebooks/03_silver_to_gold.ipynb
diff --git a/notebooks/Dockerfile b/notebooks/Dockerfile