UoA-eResearch
diff --git a/notebooks/03-exploratory-data-analysis.ipynb b/notebooks/03-exploratory-data-analysis.ipynb
Lines changed: 73 additions & 18 deletions
@@ -1,34 +1,76 @@
 {
- "nbformat": 4,
- "nbformat_minor": 5,
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "name": "python",
-   "version": "3.10.0"
-  }
- },
  "cells": [
   {
    "cell_type": "markdown",
+   "id": "9ae22037",
    "metadata": {},
-   "source": "# Notebook 03 — Extraction and Precision\n\n\n\n**Duration:** ~30 minutes\n\n\n\nIn notebooks 01 and 02 you connected to an LLM and explored how it behaves. Now we give it real data for the first time.\n\n\n\nThe dataset is the **Privacy Act 2020** — a New Zealand Act of Parliament published as XML on the government's legislation website. We will fetch it live, parse it into sections, and then try two different methods for extracting structured information from those sections:\n\n\n\n1. A **rule-based method** (regular expressions) — deterministic, precise, but rigid\n\n2. An **LLM-based method** — flexible, but probabilistic and sometimes wrong\n\n\n\nBy the end of this notebook you will have measured the gap between these two approaches on real data — and started thinking about what that means for research at scale."
+   "source": [
+    "# Notebook 03 — Extraction and Precision\n",
+    "\n",
+    "\n",
+    "\n",
+    "**Duration:** ~30 minutes\n",
+    "\n",
+    "\n",
+    "\n",
+    "In notebooks 01 and 02 you connected to an LLM and explored how it behaves. Now we give it real data for the first time.\n",
+    "\n",
+    "\n",
+    "\n",
+    "The dataset is the **Privacy Act 2020** — a New Zealand Act of Parliament published as XML on the government's legislation website. We will fetch it live, parse it into sections, and then try two different methods for extracting structured information from those sections:\n",
+    "\n",
+    "\n",
+    "\n",
+    "1. A **rule-based method** (regular expressions) — deterministic, precise, but rigid\n",
+    "\n",
+    "2. An **LLM-based method** — flexible, but probabilistic and sometimes wrong\n",
+    "\n",
+    "\n",
+    "\n",
+    "By the end of this notebook you will have measured the gap between these two approaches on real data — and started thinking about what that means for research at scale."
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## Setup\n\nRun this cell first. If you stored your Groq API key in Colab Secrets (notebook 01), it will load automatically."
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Run this cell first. If you stored your Groq API key in Colab Secrets (notebook 01), it will load automatically."
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "# ============================================================\n# SETUP CELL — Run this once at the start of every notebook\n# ============================================================\n\n!pip install groq requests lxml Pillow\nimport os, json, base64, requests, io\nfrom groq import Groq\nfrom lxml import etree\nfrom PIL import Image\nfrom IPython.display import Image as IPImage, display\n\n# Load API key from Colab Secrets (set up in notebook 01)\ntry:\n    from google.colab import userdata\n    os.environ[\"GROQ_API_KEY\"] = userdata.get(\"GROQ_API_KEY\")\n    print(\"API key loaded from Colab Secrets.\")\nexcept Exception:\n    os.environ[\"GROQ_API_KEY\"] = \"paste_your_key_here\"   # <-- fallback: paste key here\n    print(\"Could not load from Secrets. Paste your key in the line above.\")\n\nclient = Groq(api_key=os.environ[\"GROQ_API_KEY\"])\nTEXT_MODEL = \"llama-3.3-70b-versatile\"\nVISION_MODEL = \"meta-llama/llama-4-maverick-17b-128e-instruct\"\n\nprint(\"Setup complete.\")"
+   "source": [
+    "# ============================================================\n",
+    "# SETUP CELL — Run this once at the start of every notebook\n",
+    "# ============================================================\n",
+    "\n",
+    "!pip install groq requests lxml Pillow\n",
+    "import os, json, base64, requests, io\n",
+    "from groq import Groq\n",
+    "from lxml import etree\n",
+    "from PIL import Image\n",
+    "from IPython.display import Image as IPImage, display\n",
+    "\n",
+    "# Load API key from Colab Secrets (set up in notebook 01)\n",
+    "try:\n",
+    "    from google.colab import userdata\n",
+    "    os.environ[\"GROQ_API_KEY\"] = userdata.get(\"GROQ_API_KEY\")\n",
+    "    print(\"API key loaded from Colab Secrets.\")\n",
+    "except Exception:\n",
+    "    os.environ[\"GROQ_API_KEY\"] = \"paste_your_key_here\"   # <-- fallback: paste key here\n",
+    "    print(\"Could not load from Secrets. Paste your key in the line above.\")\n",
+    "\n",
+    "client = Groq(api_key=os.environ[\"GROQ_API_KEY\"])\n",
+    "TEXT_MODEL = \"llama-3.3-70b-versatile\"\n",
+    "VISION_MODEL = \"meta-llama/llama-4-maverick-17b-128e-instruct\"\n",
+    "\n",
+    "print(\"Setup complete.\")"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -494,5 +536,18 @@
     "**Next up:** Notebook 04 moves from extraction to analysis. Instead of pulling out cross-references, you will ask the LLM to generate themes from the same corpus — and then test whether those themes hold up under scrutiny."
    ]
   }
- ]
-}
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
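
Review note: apart from the new cell `id` on the first cell and the move of `metadata`, `nbformat`, and `nbformat_minor` to the end of the file, the hunks shown here appear to change only how each cell's `source` is serialized (one string becomes a list of lines that keep their trailing `\n`). The sketch below is one way to check that equivalence. It assumes plain Python with no notebook tooling, and `to_source_lines` is a hypothetical helper, not something from this repository or from `nbformat`.

```python
import json


def to_source_lines(source):
    """Return a cell source as a list of lines that keep their line endings.

    Hypothetical helper for illustration only. A source that is already a
    list is returned unchanged.
    """
    if isinstance(source, list):
        return source
    # keepends=True preserves each "\n", so joining the pieces restores the input.
    return source.splitlines(keepends=True)


# Old form of the "Setup" markdown cell (single string), copied from the diff.
old_source = (
    "## Setup\n\nRun this cell first. If you stored your Groq API key in "
    "Colab Secrets (notebook 01), it will load automatically."
)

new_source = to_source_lines(old_source)
print(json.dumps(new_source, indent=1))

# The two forms carry the same text: joining the list gives back the string.
assert "".join(new_source) == old_source
```

If the joined list equals the old string for every cell, the cell content is unchanged and only the on-disk layout differs. Note that `str.splitlines` also splits on a few less common line boundaries (for example `\r`), so the list may not match another tool's output element for element even though the joined text is identical.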