
Commit f903487 (parent 7a70926)

improve the normalization chapter

File tree

1 file changed: +162 -19 lines


book/30-schema-design/055-normalization.ipynb

Lines changed: 162 additions & 19 deletions
@@ -308,6 +308,149 @@
 "```"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## DataJoint's Workflow Perspective\n",
+"\n",
+"A fundamental insight underlying DataJoint's normalization approach: **databases are workflows where downstream data depends on the integrity of upstream data**.\n",
+"\n",
+"This workflow-centric view fundamentally shapes normalization principles and explains why DataJoint emphasizes immutability and avoidance of updates.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Databases as Data Dependency Graphs\n",
+"\n",
+"**Traditional database thinking** emphasizes:\n",
+"- Transactions (e.g., moving money between accounts)\n",
+"- Current state (e.g., what is the balance now?)\n",
+"- Updates to reflect real-world changes\n",
+"\n",
+"**DataJoint's workflow thinking** emphasizes:\n",
+"- Data pipelines (derive results from source data through computational steps)\n",
+"- Data provenance (what upstream data produced these results?)\n",
+"- Immutable facts (each record represents a fact at a specific point in time)\n",
+"\n",
+"In DataJoint, tables form a **directed acyclic graph (DAG)** of dependencies, much like a computational workflow.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Example: Neuroscience Pipeline\n",
+"\n",
+"Consider a typical neuroscience workflow:\n",
+"\n",
+"```\n",
+"Session ← Manual: Experimenter enters session info\n",
+" ↓ (foreign key)\n",
+"Recording ← Imported: Data acquisition writes raw signals\n",
+" ↓ (foreign key)\n",
+"FilteredRecording ← Computed: applies filters to Recording\n",
+" ↓ (foreign key)\n",
+"SpikeSorting ← Computed: detects spikes in FilteredRecording\n",
+" ↓ (foreign key)\n",
+"NeuronStatistics ← Computed: analyzes SpikeSorting results\n",
+"```\n",
+"\n",
+"**Each downstream table depends on upstream data:**\n",
+"- `FilteredRecording` is computed FROM `Recording` data\n",
+"- `SpikeSorting` is computed FROM `FilteredRecording` data\n",
+"- `NeuronStatistics` is computed FROM `SpikeSorting` data\n",
+"\n",
+"**Critical implication**: If upstream data changes, all downstream results become invalid.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Why Updates Break Workflows\n",
+"\n",
+"In a workflow/pipeline view, **updates to upstream data silently invalidate downstream results**:\n",
+"\n",
+"**Scenario**: You discover the sampling rate in `Recording` for `{'session': 42}` was recorded incorrectly.\n",
+"\n",
+"**If you UPDATE:**\n",
+"```python\n",
+"# Fix the sampling rate\n",
+"Recording.update1({'session': 42, 'sampling_rate': 30000})  # Was 20000, should be 30000\n",
+"\n",
+"# But now:\n",
+"# - FilteredRecording(42) was computed using sampling_rate=20000\n",
+"# - SpikeSorting(42) was computed from FilteredRecording with wrong rate\n",
+"# - NeuronStatistics(42) was computed from SpikeSorting with wrong rate\n",
+"#\n",
+"# All downstream results are INVALID, but the database doesn't know!\n",
+"# No error, no warning, no indication of the problem.\n",
+"# The data looks fine but the science is wrong.\n",
+"```\n",
+"\n",
+"**If you use DELETE (forced by immutability):**\n",
+"```python\n",
+"# Try to fix by deleting and reinserting\n",
+"(Recording & {'session': 42}).delete()\n",
+"# Propagates the delete for session=42 to all downstream tables that depend on it:\n",
+"# - FilteredRecording\n",
+"# - SpikeSorting\n",
+"# - NeuronStatistics\n",
+"\n",
+"# Reinsert with correct data\n",
+"Recording.insert1({'session': 42, 'sampling_rate': 30000, ...})\n",
+"\n",
+"# Recompute entire pipeline\n",
+"FilteredRecording.populate({'session': 42})\n",
+"SpikeSorting.populate({'session': 42})\n",
+"NeuronStatistics.populate({'session': 42})\n",
+"\n",
+"# Now ALL results are consistent and scientifically valid\n",
+"```\n",
+"\n",
+"The dependency chain is **explicit** and **enforced**.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### How Workflow Thinking Leads to Normalization Principles\n",
+"\n",
+"The workflow perspective directly motivates DataJoint's normalization principles:\n",
+"\n",
+"**1. Immutability (INSERT/DELETE, not UPDATE)**\n",
+"- **Why**: Updates hide broken dependencies in the workflow\n",
+"- **Workflow view**: Upstream data is \"input\" to downstream computations—changing input invalidates output\n",
+"- **Solution**: DELETE forces explicit handling of all dependent data\n",
+"\n",
+"**2. Separate Changeable Attributes (Rule 3)**\n",
+"- **Why**: Time-varying properties represent different states in the workflow\n",
+"- **Workflow view**: Each state is a distinct input that produces distinct outputs\n",
+"- **Solution**: Model states as separate records (INSERTs), not updates to existing records\n",
+"\n",
+"**3. Entities with Only Intrinsic Properties (Rules 1 & 2)**\n",
+"- **Why**: Properties of different entities are at different nodes in the dependency graph\n",
+"- **Workflow view**: Each entity type represents a distinct stage or data type in the pipeline\n",
+"- **Solution**: Separate entity types into separate tables to make dependencies explicit\n",
+"\n",
+"### The Key Insight\n",
+"\n",
+"> **In workflow-centric databases, referential integrity isn't just about preventing orphaned records—it's about ensuring computational validity.**\n",
+"\n",
+"Foreign keys don't just link data; they represent **data provenance**:\n",
+"- \"This result was computed FROM this input\"\n",
+"- \"This analysis is BASED ON this measurement\"\n",
+"- \"This conclusion DEPENDS ON this observation\"\n",
+"\n",
+"When you UPDATE input data but leave outputs unchanged, you break the provenance chain. The outputs claim to be based on inputs that no longer exist (in their original form).\n",
+"\n",
+"**DataJoint's normalization principles ensure that data dependencies remain explicit and enforceable, making workflows scientifically reproducible and computationally sound.**\n"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -476,19 +619,19 @@
 "# ❌ Using UPDATE - silent inconsistency:\n",
 "RawImage.update1({'image_id': 42, 'brightness': 1.5})\n",
 "# No error! But...\n",
-"# - PreprocessedImage(42) was computed with brightness=1.0\n",
-"# - SegmentedCells(42) was computed from OLD preprocessed image\n",
-"# - CellActivity(42) was computed from OLD segmentation\n",
+"# - PreprocessedImage was computed with brightness=1.0\n",
+"# - SegmentedCells was computed from OLD preprocessed image\n",
+"# - CellActivity was computed from OLD segmentation\n",
 "# All downstream results are now invalid but database doesn't know!\n",
 "\n",
 "# ✅ Using DELETE - explicit dependency handling:\n",
 "(RawImage & {'image_id': 42}).delete()\n",
-"# ERROR: PreprocessedImage(42) references this image!\n",
-"# Must delete downstream first:\n",
-"(CellActivity & {'image_id': 42}).delete()\n",
-"(SegmentedCells & {'image_id': 42}).delete()\n",
-"(PreprocessedImage & {'image_id': 42}).delete()\n",
-"(RawImage & {'image_id': 42}).delete()\n",
+"# This delete will cascade to the dependent tables for image_id=42,\n",
+"# in reverse order of dependency:\n",
+"# CellActivity\n",
+"# SegmentedCells\n",
+"# PreprocessedImage\n",
+"# RawImage\n",
 "\n",
 "# Now reinsert and recompute entire pipeline\n",
 "RawImage.insert1({'image_id': 42, 'brightness': 1.5, 'contrast': 1.0})\n",
@@ -926,14 +1069,14 @@
 "\n",
 "When these principles are followed:\n",
 "\n",
-"✅ **Data Integrity**: Each fact stored in exactly one place\n",
-"✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n",
-"✅ **Consistency**: Changes propagate correctly through foreign key relationships\n",
-"✅ **Maintainability**: Changes are localized to specific tables\n",
-"✅ **Clear Structure**: Schema reflects domain entities intuitively\n",
-"✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n",
-"✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n",
-"✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n",
+"- ✅ **Data Integrity**: Each fact stored in exactly one place\n",
+"- ✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n",
+"- ✅ **Consistency**: Changes propagate correctly through foreign key relationships\n",
+"- ✅ **Maintainability**: Changes are localized to specific tables\n",
+"- ✅ **Clear Structure**: Schema reflects domain entities intuitively\n",
+"- ✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n",
+"- ✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n",
+"- ✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n",
 "\n",
 "### Practical Application\n",
 "\n",
@@ -942,8 +1085,8 @@
 "1. **Identify entity types** in your domain\n",
 "2. **For each entity**, determine its permanent, intrinsic properties\n",
 "3. **Separate entities** into distinct tables\n",
-"4. **Model relationships** with foreign keys\n",
-"5. **Extract time-varying properties** into separate tables\n",
+"4. **Model relationships** with foreign keys and association entities\n",
+"5. **Extract time-varying properties** into separate entities\n",
 "6. **Verify** each table passes the three-question test\n",
 "\n",
 "This entity-centric approach achieves the same rigor as classical normalization but is much more intuitive and practical, especially for complex scientific and computational workflows.\n"
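The cascading-delete behavior the chapter describes (deleting an upstream record propagates to all dependents in reverse dependency order) can be modeled without a database. The sketch below is a simplified, hypothetical stand-in, not the DataJoint API: the `Pipeline` class and its methods are invented for illustration, and only the table names come from the chapter's example.

```python
# Hypothetical, simplified model of a pipeline's dependency DAG.
# NOT the DataJoint API -- it only illustrates how a delete cascades
# downstream in reverse dependency order, as the chapter describes.

class Pipeline:
    def __init__(self):
        self.children = {}  # table name -> list of downstream tables
        self.rows = {}      # table name -> set of primary-key values

    def add_table(self, name, parent=None):
        self.children[name] = []
        self.rows[name] = set()
        if parent is not None:
            self.children[parent].append(name)

    def insert(self, table, key):
        self.rows[table].add(key)

    def delete(self, table, key, log=None):
        """Delete a key, first cascading to every dependent table."""
        for child in self.children[table]:
            self.delete(child, key, log)
        if key in self.rows[table]:
            self.rows[table].discard(key)
            if log is not None:
                log.append(table)  # records deletion order


# Build the chapter's example DAG: Recording -> ... -> NeuronStatistics
p = Pipeline()
p.add_table("Recording")
p.add_table("FilteredRecording", parent="Recording")
p.add_table("SpikeSorting", parent="FilteredRecording")
p.add_table("NeuronStatistics", parent="SpikeSorting")

for t in ("Recording", "FilteredRecording", "SpikeSorting", "NeuronStatistics"):
    p.insert(t, 42)

order = []
p.delete("Recording", 42, log=order)
print(order)
# → ['NeuronStatistics', 'SpikeSorting', 'FilteredRecording', 'Recording']
```

The deepest dependents are removed first, so no record ever points at upstream data that no longer exists, which is exactly the invariant the delete-then-repopulate pattern relies on.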
