Commit 9437012

Refactor code structure for improved readability and maintainability
1 parent fc241e6 commit 9437012

3 files changed

Lines changed: 750 additions & 185 deletions

notebooks/03-exploratory-data-analysis.ipynb

Lines changed: 73 additions & 18 deletions
@@ -1,34 +1,76 @@
  {
- "nbformat": 4,
- "nbformat_minor": 5,
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "name": "python",
- "version": "3.10.0"
- }
- },
  "cells": [
  {
  "cell_type": "markdown",
+ "id": "9ae22037",
  "metadata": {},
- "source": "# Notebook 03 — Extraction and Precision\n\n\n\n**Duration:** ~30 minutes\n\n\n\nIn notebooks 01 and 02 you connected to an LLM and explored how it behaves. Now we give it real data for the first time.\n\n\n\nThe dataset is the **Privacy Act 2020** — a New Zealand Act of Parliament published as XML on the government's legislation website. We will fetch it live, parse it into sections, and then try two different methods for extracting structured information from those sections:\n\n\n\n1. A **rule-based method** (regular expressions) — deterministic, precise, but rigid\n\n2. An **LLM-based method** — flexible, but probabilistic and sometimes wrong\n\n\n\nBy the end of this notebook you will have measured the gap between these two approaches on real data — and started thinking about what that means for research at scale."
+ "source": [
+ "# Notebook 03 — Extraction and Precision\n",
+ "\n",
+ "\n",
+ "\n",
+ "**Duration:** ~30 minutes\n",
+ "\n",
+ "\n",
+ "\n",
+ "In notebooks 01 and 02 you connected to an LLM and explored how it behaves. Now we give it real data for the first time.\n",
+ "\n",
+ "\n",
+ "\n",
+ "The dataset is the **Privacy Act 2020** — a New Zealand Act of Parliament published as XML on the government's legislation website. We will fetch it live, parse it into sections, and then try two different methods for extracting structured information from those sections:\n",
+ "\n",
+ "\n",
+ "\n",
+ "1. A **rule-based method** (regular expressions) — deterministic, precise, but rigid\n",
+ "\n",
+ "2. An **LLM-based method** — flexible, but probabilistic and sometimes wrong\n",
+ "\n",
+ "\n",
+ "\n",
+ "By the end of this notebook you will have measured the gap between these two approaches on real data — and started thinking about what that means for research at scale."
+ ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": "## Setup\n\nRun this cell first. If you stored your Groq API key in Colab Secrets (notebook 01), it will load automatically."
+ "source": [
+ "## Setup\n",
+ "\n",
+ "Run this cell first. If you stored your Groq API key in Colab Secrets (notebook 01), it will load automatically."
+ ]
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
- "source": "# ============================================================\n# SETUP CELL — Run this once at the start of every notebook\n# ============================================================\n\n!pip install groq requests lxml Pillow\nimport os, json, base64, requests, io\nfrom groq import Groq\nfrom lxml import etree\nfrom PIL import Image\nfrom IPython.display import Image as IPImage, display\n\n# Load API key from Colab Secrets (set up in notebook 01)\ntry:\n from google.colab import userdata\n os.environ[\"GROQ_API_KEY\"] = userdata.get(\"GROQ_API_KEY\")\n print(\"API key loaded from Colab Secrets.\")\nexcept Exception:\n os.environ[\"GROQ_API_KEY\"] = \"paste_your_key_here\" # <-- fallback: paste key here\n print(\"Could not load from Secrets. Paste your key in the line above.\")\n\nclient = Groq(api_key=os.environ[\"GROQ_API_KEY\"])\nTEXT_MODEL = \"llama-3.3-70b-versatile\"\nVISION_MODEL = \"meta-llama/llama-4-maverick-17b-128e-instruct\"\n\nprint(\"Setup complete.\")"
+ "source": [
+ "# ============================================================\n",
+ "# SETUP CELL — Run this once at the start of every notebook\n",
+ "# ============================================================\n",
+ "\n",
+ "!pip install groq requests lxml Pillow\n",
+ "import os, json, base64, requests, io\n",
+ "from groq import Groq\n",
+ "from lxml import etree\n",
+ "from PIL import Image\n",
+ "from IPython.display import Image as IPImage, display\n",
+ "\n",
+ "# Load API key from Colab Secrets (set up in notebook 01)\n",
+ "try:\n",
+ " from google.colab import userdata\n",
+ " os.environ[\"GROQ_API_KEY\"] = userdata.get(\"GROQ_API_KEY\")\n",
+ " print(\"API key loaded from Colab Secrets.\")\n",
+ "except Exception:\n",
+ " os.environ[\"GROQ_API_KEY\"] = \"paste_your_key_here\" # <-- fallback: paste key here\n",
+ " print(\"Could not load from Secrets. Paste your key in the line above.\")\n",
+ "\n",
+ "client = Groq(api_key=os.environ[\"GROQ_API_KEY\"])\n",
+ "TEXT_MODEL = \"llama-3.3-70b-versatile\"\n",
+ "VISION_MODEL = \"meta-llama/llama-4-maverick-17b-128e-instruct\"\n",
+ "\n",
+ "print(\"Setup complete.\")"
+ ]
  },
  {
  "cell_type": "markdown",
@@ -494,5 +536,18 @@
  "**Next up:** Notebook 04 moves from extraction to analysis. Instead of pulling out cross-references, you will ask the LLM to generate themes from the same corpus — and then test whether those themes hold up under scrutiny."
  ]
  }
- ]
- }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
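
Note on the shape of this change: in the hunks shown, each cell's "source" moves from one long string to a list of lines, and the top-level keys end up in alphabetical order (cells, metadata, nbformat, nbformat_minor). The cell text itself is unchanged. The commit does not say which tool produced this layout. The sketch below is only an illustration, using the Python standard library, of how the same transformation could be reproduced. The function names split_source, normalize_notebook and dump_notebook are hypothetical, the indent=1 setting is an assumption, and the sketch does not generate the cell "id" field that the diff adds.

```python
import json


def split_source(source):
    """Return a cell source as a list of lines, each keeping its trailing newline."""
    if isinstance(source, list):
        return source  # already in list-of-lines form
    parts = source.split("\n")  # split on "\n" only, not on other Unicode line breaks
    lines = [part + "\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # final line carries no trailing newline
    return lines


def normalize_notebook(nb):
    """Return a shallow copy of a notebook dict with every cell source in list form."""
    cells = [
        dict(cell, source=split_source(cell.get("source", "")))
        for cell in nb.get("cells", [])
    ]
    return dict(nb, cells=cells)


def dump_notebook(nb):
    """Serialize with sorted keys, which yields the key order seen in the new file."""
    # indent=1 is an assumption about the on-disk layout, not something the diff confirms.
    return json.dumps(normalize_notebook(nb), indent=1, sort_keys=True, ensure_ascii=False) + "\n"
```

Running dump_notebook over the parent revision's JSON and comparing the result with the committed file is one way to check that a diff like this is formatting-only.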
