
Commit f903487 (parent 7a70926)

improve the normalization chapter

File tree

1 file changed: +162 -19 lines


book/30-schema-design/055-normalization.ipynb

Lines changed: 162 additions & 19 deletions
@@ -308,6 +308,149 @@
 "```"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## DataJoint's Workflow Perspective\n",
+"\n",
+"A fundamental insight underlying DataJoint's normalization approach: **databases are workflows where downstream data depends on the integrity of upstream data**.\n",
+"\n",
+"This workflow-centric view fundamentally shapes normalization principles and explains why DataJoint emphasizes immutability and avoidance of updates.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Databases as Data Dependency Graphs\n",
+"\n",
+"**Traditional database thinking** emphasizes:\n",
+"- Transactions (e.g., moving money between accounts)\n",
+"- Current state (e.g., what is the balance now?)\n",
+"- Updates to reflect real-world changes\n",
+"\n",
+"**DataJoint's workflow thinking** emphasizes:\n",
+"- Data pipelines (derive results from source data through computational steps)\n",
+"- Data provenance (what upstream data produced these results?)\n",
+"- Immutable facts (each record represents a fact at a specific point in time)\n",
+"\n",
+"In DataJoint, tables form a **directed acyclic graph (DAG)** of dependencies, much like a computational workflow.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Example: Neuroscience Pipeline\n",
+"\n",
+"Consider a typical neuroscience workflow:\n",
+"\n",
+"```\n",
+"Session ← Manual: Experimenter enters session info\n",
+" ↓ (foreign key)\n",
+"Recording ← Imported: Data acquisition writes raw signals\n",
+" ↓ (foreign key)\n",
+"FilteredRecording ← Computed: applies filters to Recording\n",
+" ↓ (foreign key)\n",
+"SpikeSorting ← Computed: detects spikes in FilteredRecording\n",
+" ↓ (foreign key)\n",
+"NeuronStatistics ← Computed: analyzes SpikeSorting results\n",
+"```\n",
+"\n",
+"**Each downstream table depends on upstream data:**\n",
+"- `FilteredRecording` is computed FROM `Recording` data\n",
+"- `SpikeSorting` is computed FROM `FilteredRecording` data\n",
+"- `NeuronStatistics` is computed FROM `SpikeSorting` data\n",
+"\n",
+"**Critical implication**: If upstream data changes, all downstream results become invalid.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Why Updates Break Workflows\n",
+"\n",
+"In a workflow/pipeline view, **updates to upstream data silently invalidate downstream results**:\n",
+"\n",
+"**Scenario**: You discover the sampling rate in `Recording` for `{'session': 42}` was recorded incorrectly.\n",
+"\n",
+"**If you UPDATE:**\n",
+"```python\n",
+"# Fix the sampling rate\n",
+"Recording.update1({'session': 42, 'sampling_rate': 30000})  # Was 20000, should be 30000\n",
+"\n",
+"# But now:\n",
+"# - FilteredRecording(42) was computed using sampling_rate=20000\n",
+"# - SpikeSorting(42) was computed from FilteredRecording with wrong rate\n",
+"# - NeuronStatistics(42) was computed from SpikeSorting with wrong rate\n",
+"#\n",
+"# All downstream results are INVALID, but the database doesn't know!\n",
+"# No error, no warning, no indication of the problem.\n",
+"# The data looks fine but the science is wrong.\n",
+"```\n",
+"\n",
+"**If you use DELETE (forced by immutability):**\n",
+"```python\n",
+"# Try to fix by deleting and reinserting\n",
+"(Recording & {'session': 42}).delete()\n",
+"# Propagates the delete for session=42 to all downstream tables that depend on it:\n",
+"# - FilteredRecording\n",
+"# - SpikeSorting\n",
+"# - NeuronStatistics\n",
+"\n",
+"# Reinsert with correct data\n",
+"Recording.insert1({'session': 42, 'sampling_rate': 30000, ...})\n",
+"\n",
+"# Recompute entire pipeline\n",
+"FilteredRecording.populate({'session': 42})\n",
+"SpikeSorting.populate({'session': 42})\n",
+"NeuronStatistics.populate({'session': 42})\n",
+"\n",
+"# Now ALL results are consistent and scientifically valid\n",
+"```\n",
+"\n",
+"The dependency chain is **explicit** and **enforced**.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### How Workflow Thinking Leads to Normalization Principles\n",
+"\n",
+"The workflow perspective directly motivates DataJoint's normalization principles:\n",
+"\n",
+"**1. Immutability (INSERT/DELETE, not UPDATE)**\n",
+"- **Why**: Updates hide broken dependencies in the workflow\n",
+"- **Workflow view**: Upstream data is \"input\" to downstream computations—changing input invalidates output\n",
+"- **Solution**: DELETE forces explicit handling of all dependent data\n",
+"\n",
+"**2. Separate Changeable Attributes (Rule 3)**\n",
+"- **Why**: Time-varying properties represent different states in the workflow\n",
+"- **Workflow view**: Each state is a distinct input that produces distinct outputs\n",
+"- **Solution**: Model states as separate records (INSERTs), not updates to existing records\n",
+"\n",
+"**3. Entities with Only Intrinsic Properties (Rules 1 & 2)**\n",
+"- **Why**: Properties of different entities are at different nodes in the dependency graph\n",
+"- **Workflow view**: Each entity type represents a distinct stage or data type in the pipeline\n",
+"- **Solution**: Separate entity types into separate tables to make dependencies explicit\n",
+"\n",
+"### The Key Insight\n",
+"\n",
+"> **In workflow-centric databases, referential integrity isn't just about preventing orphaned records—it's about ensuring computational validity.**\n",
+"\n",
+"Foreign keys don't just link data; they represent **data provenance**:\n",
+"- \"This result was computed FROM this input\"\n",
+"- \"This analysis is BASED ON this measurement\"\n",
+"- \"This conclusion DEPENDS ON this observation\"\n",
+"\n",
+"When you UPDATE input data but leave outputs unchanged, you break the provenance chain. The outputs claim to be based on inputs that no longer exist (in their original form).\n",
+"\n",
+"**DataJoint's normalization principles ensure that data dependencies remain explicit and enforceable, making workflows scientifically reproducible and computationally sound.**\n"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -476,19 +619,19 @@
 "# ❌ Using UPDATE - silent inconsistency:\n",
 "RawImage.update1({'image_id': 42, 'brightness': 1.5})\n",
 "# No error! But...\n",
-"# - PreprocessedImage(42) was computed with brightness=1.0\n",
-"# - SegmentedCells(42) was computed from OLD preprocessed image\n",
-"# - CellActivity(42) was computed from OLD segmentation\n",
+"# - PreprocessedImage was computed with brightness=1.0\n",
+"# - SegmentedCells was computed from OLD preprocessed image\n",
+"# - CellActivity was computed from OLD segmentation\n",
 "# All downstream results are now invalid but database doesn't know!\n",
 "\n",
 "# ✅ Using DELETE - explicit dependency handling:\n",
 "(RawImage & {'image_id': 42}).delete()\n",
-"# ERROR: PreprocessedImage(42) references this image!\n",
-"# Must delete downstream first:\n",
-"(CellActivity & {'image_id': 42}).delete()\n",
-"(SegmentedCells & {'image_id': 42}).delete()\n",
-"(PreprocessedImage & {'image_id': 42}).delete()\n",
-"(RawImage & {'image_id': 42}).delete()\n",
+"# This delete will cascade to the dependent tables for image_id=42,\n",
+"# in reverse order of dependency:\n",
+"# CellActivity\n",
+"# SegmentedCells\n",
+"# PreprocessedImage\n",
+"# RawImage\n",
 "\n",
 "# Now reinsert and recompute entire pipeline\n",
 "RawImage.insert1({'image_id': 42, 'brightness': 1.5, 'contrast': 1.0})\n",
@@ -926,14 +1069,14 @@
 "\n",
 "When these principles are followed:\n",
 "\n",
-"✅ **Data Integrity**: Each fact stored in exactly one place\n",
-"✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n",
-"✅ **Consistency**: Changes propagate correctly through foreign key relationships\n",
-"✅ **Maintainability**: Changes are localized to specific tables\n",
-"✅ **Clear Structure**: Schema reflects domain entities intuitively\n",
-"✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n",
-"✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n",
-"✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n",
+"- ✅ **Data Integrity**: Each fact stored in exactly one place\n",
+"- ✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n",
+"- ✅ **Consistency**: Changes propagate correctly through foreign key relationships\n",
+"- ✅ **Maintainability**: Changes are localized to specific tables\n",
+"- ✅ **Clear Structure**: Schema reflects domain entities intuitively\n",
+"- ✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n",
+"- ✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n",
+"- ✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n",
 "\n",
 "### Practical Application\n",
 "\n",
@@ -942,8 +1085,8 @@
 "1. **Identify entity types** in your domain\n",
 "2. **For each entity**, determine its permanent, intrinsic properties\n",
 "3. **Separate entities** into distinct tables\n",
-"4. **Model relationships** with foreign keys\n",
-"5. **Extract time-varying properties** into separate tables\n",
+"4. **Model relationships** with foreign keys and association entities\n",
+"5. **Extract time-varying properties** into separate entities\n",
 "6. **Verify** each table passes the three-question test\n",
 "\n",
 "This entity-centric approach achieves the same rigor as classical normalization but is much more intuitive and practical, especially for complex scientific and computational workflows.\n"
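The cascading-delete behavior the chapter describes (deleting an upstream record propagates to all dependents in reverse dependency order) can be modeled without a database. The sketch below is a simplified, hypothetical stand-in, not the DataJoint API: the `Pipeline` class and its methods are invented for illustration, and only the table names come from the chapter's example.

```python
# Hypothetical, simplified model of a pipeline's dependency DAG.
# NOT the DataJoint API -- it only illustrates how a delete cascades
# downstream in reverse dependency order, as the chapter describes.

class Pipeline:
    def __init__(self):
        self.children = {}  # table name -> list of downstream tables
        self.rows = {}      # table name -> set of primary-key values

    def add_table(self, name, parent=None):
        self.children[name] = []
        self.rows[name] = set()
        if parent is not None:
            self.children[parent].append(name)

    def insert(self, table, key):
        self.rows[table].add(key)

    def delete(self, table, key, log=None):
        """Delete a key, first cascading to every dependent table."""
        for child in self.children[table]:
            self.delete(child, key, log)
        if key in self.rows[table]:
            self.rows[table].discard(key)
            if log is not None:
                log.append(table)  # records deletion order


# Build the chapter's example DAG: Recording -> ... -> NeuronStatistics
p = Pipeline()
p.add_table("Recording")
p.add_table("FilteredRecording", parent="Recording")
p.add_table("SpikeSorting", parent="FilteredRecording")
p.add_table("NeuronStatistics", parent="SpikeSorting")

for t in ("Recording", "FilteredRecording", "SpikeSorting", "NeuronStatistics"):
    p.insert(t, 42)

order = []
p.delete("Recording", 42, log=order)
print(order)
# → ['NeuronStatistics', 'SpikeSorting', 'FilteredRecording', 'Recording']
```

The deepest dependents are removed first, so no record ever points at upstream data that no longer exists, which is exactly the invariant the delete-then-repopulate pattern relies on.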
