Commit eee6aa3

expand normalization
1 parent 9ce9595 commit eee6aa3

File tree

1 file changed

+145
-8
lines changed

book/30-schema-design/055-normalization.ipynb

Lines changed: 145 additions & 8 deletions
@@ -12,7 +12,7 @@
1212
"\n",
1313
"1. **Traditional Normalization**: Based on Codd's normal forms, rooted in predicate calculus and functional dependencies\n",
1414
"2. **Entity Normalization**: Based on Chen's Entity-Relationship Model, focused on identifying well-defined entity types\n",
15-
"3. **Workflow Normalization**: Based on DataJoint's Entity-Workflow Model, emphasizing temporal workflow execution\n",
15+
"3. **Workflow Normalization**: Based on DataJoint's Entity-Workflow Model, emphasizing steps in workflow execution\n",
1616
"\n",
1717
"Each approach provides a different lens for understanding what makes a schema well-designed, yet all converge on the same practical principles.\n"
1818
]
@@ -285,7 +285,7 @@
285285
"source": [
286286
"## Approach 2: Entity Normalization (Chen's Entity-Relationship Model)\n",
287287
"\n",
288-
"In 1976, Peter Chen introduced the Entity-Relationship Model [@10.1145/320434.320440], which revolutionized how we think about database design. Rather than starting with attributes and functional dependencies, Chen proposed starting with **entities** and **relationships**.\n",
288+
"In 1976, Peter Chen introduced the Entity-Relationship Model [@10.1145/320434.320440], which revolutionized how we think about database design. Rather than starting with predicates, attributes, and functional dependencies, Chen proposed starting with **entities** and **relationships**.\n",
289289
"\n",
290290
"### The Entity-Centric Foundation\n",
291291
"\n",
@@ -448,7 +448,7 @@
448448
"source": [
449449
"## Approach 3: Workflow Normalization (DataJoint's Entity-Workflow Model)\n",
450450
"\n",
451-
"DataJoint extends entity normalization with a temporal dimension: the **Entity-Workflow Model**. While traditional ERM focuses on **what entities exist**, DataJoint emphasizes **when and how entities are created** through workflow execution.\n",
451+
"DataJoint extends entity normalization with a sequential dimension: the **Entity-Workflow Model**. While traditional ERM focuses on **what entities exist** and how they relate to each other, DataJoint emphasizes **when and how entities are created** through workflow execution. Foreign keys define not only referential integrity but also the order of operations.\n",
452452
"\n",
453453
"### The Workflow-Temporal Foundation\n",
454454
"\n",
@@ -640,6 +640,138 @@
640640
"- **Data populated at different times are separate**: Each workflow step has its own table\n"
641641
]
642642
},
643+
{
644+
"cell_type": "markdown",
645+
"metadata": {},
646+
"source": [
647+
"### Why Workflow Normalization is Stricter\n",
648+
"\n",
649+
"**Key insight**: Workflow normalization enforces temporal separation that entity normalization doesn't require. A table can be perfectly normalized under traditional and entity normalization yet still violate workflow normalization principles.\n",
650+
"\n",
651+
"**Example: E-commerce Order Processing**\n",
652+
"\n",
653+
"Consider an Order table that tracks the complete lifecycle of an order:\n",
654+
"\n",
655+
"```\n",
656+
"Order table\n",
657+
"┌──────────┬──────────────┬──────────────┬───────────────┬──────────────┬────────────────┐\n",
658+
"│order_id* │ product_id │ payment_date │ payment_method│ shipment_date│ delivery_date │\n",
659+
"├──────────┼──────────────┼──────────────┼───────────────┼──────────────┼────────────────┤\n",
660+
"│ 1001 │ WIDGET-A │ 2024-10-15 │ Credit Card │ 2024-10-16 │ 2024-10-18 │\n",
661+
"│ 1002 │ GADGET-B │ 2024-10-15 │ PayPal │ 2024-10-17 │ NULL │\n",
662+
"│ 1003 │ TOOL-C │ NULL │ NULL │ NULL │ NULL │\n",
663+
"└──────────┴──────────────┴──────────────┴───────────────┴──────────────┴────────────────┘\n",
664+
"```\n",
665+
"\n",
666+
"**Traditional normalization analysis:**\n",
667+
"- ✅ **1NF**: All attributes are atomic\n",
668+
"- ✅ **2NF**: No composite key, so no partial dependencies\n",
669+
"- ✅ **3NF**: All non-key attributes depend directly on `order_id`\n",
670+
"\n",
671+
"**Verdict**: Perfectly normalized!\n",
672+
"\n",
673+
"**Entity normalization analysis:**\n",
674+
"- **What entity type does this represent?** An Order\n",
675+
"- **Do all attributes describe the order?** Yes—payment details, shipment details, delivery details are all properties of this order\n",
676+
"- ✅ All attributes describe the Order entity\n",
677+
"- ✅ No transitive dependencies through other entity types\n",
678+
"\n",
679+
"**Verdict**: Perfectly normalized!\n",
680+
"\n",
681+
"**Workflow normalization analysis:**\n",
682+
"- **When is each attribute created?**\n",
683+
" - `product_id`: When order is **placed** (workflow step 1)\n",
684+
" - `payment_date`, `payment_method`: When payment is **processed** (workflow step 2)\n",
685+
" - `shipment_date`: When order is **shipped** (workflow step 3)\n",
686+
" - `delivery_date`: When order is **delivered** (workflow step 4)\n",
687+
"\n",
688+
"**Problems identified:**\n",
689+
"1. **Mixes workflow steps**: Table contains data created at four different times\n",
690+
"2. **Temporal sequence not enforced**: Nothing prevents `shipment_date` before `payment_date`\n",
691+
"3. **NULLs indicate incomplete workflow**: Row 1003 has many NULLs because workflow hasn't progressed\n",
692+
"4. **Requires UPDATEs**: As workflow progresses, must UPDATE the row multiple times\n",
693+
"5. **Lost workflow history**: When payment method changes, old value is lost\n",
694+
"6. **No workflow dependencies**: Database doesn't know that payment must precede shipment\n",
695+
"\n",
696+
"**Workflow normalization requires:**\n",
697+
"\n",
698+
"```python\n",
699+
"@schema\n",
700+
"class Order(dj.Manual):\n",
701+
" definition = \"\"\"\n",
702+
" order_id : int\n",
703+
" ---\n",
704+
" order_date : datetime\n",
705+
" -> Product\n",
706+
" customer_id : int\n",
707+
" \"\"\"\n",
708+
"\n",
709+
"@schema\n",
710+
"class Payment(dj.Manual):\n",
711+
" definition = \"\"\"\n",
712+
" -> Order # Can't pay before ordering\n",
713+
" ---\n",
714+
" payment_date : datetime\n",
715+
" payment_method : enum('Credit Card', 'PayPal', 'Bank Transfer')\n",
716+
" amount : decimal(10,2)\n",
717+
" \"\"\"\n",
718+
"\n",
719+
"@schema\n",
720+
"class Shipment(dj.Manual):\n",
721+
" definition = \"\"\"\n",
722+
" -> Payment # Can't ship before payment\n",
723+
" ---\n",
724+
" shipment_date : datetime\n",
725+
" carrier : varchar(50)\n",
726+
" tracking_number : varchar(100)\n",
727+
" \"\"\"\n",
728+
"\n",
729+
"@schema\n",
730+
"class Delivery(dj.Manual):\n",
731+
" definition = \"\"\"\n",
732+
" -> Shipment # Can't deliver before shipping\n",
733+
" ---\n",
734+
" delivery_date : datetime\n",
735+
" recipient_signature : varchar(100)\n",
736+
" \"\"\"\n",
737+
"\n",
738+
"@schema \n",
739+
"class DeliveryConfirmation(dj.Manual):\n",
740+
" definition = \"\"\"\n",
741+
" -> Delivery # Can't confirm before delivery\n",
742+
" ---\n",
743+
" confirmation_date : datetime\n",
744+
" confirmation_method : enum('Email', 'SMS', 'App')\n",
745+
" \"\"\"\n",
746+
"```\n",
747+
"\n",
748+
"**Workflow structure (enforced by foreign keys):**\n",
749+
"\n",
750+
"```\n",
751+
"Order ← Step 1: Customer places order\n",
752+
" ↓ (must exist before payment)\n",
753+
"Payment ← Step 2: Payment processed\n",
754+
" ↓ (must exist before shipment)\n",
755+
"Shipment ← Step 3: Order shipped\n",
756+
" ↓ (must exist before delivery)\n",
757+
"Delivery ← Step 4: Order delivered\n",
758+
" ↓ (must exist before confirmation)\n",
759+
"DeliveryConfirmation ← Step 5: Delivery confirmed\n",
760+
"```\n",
761+
"\n",
762+
"**Why this is better:**\n",
763+
"\n",
764+
"1. ✅ **Workflow sequence enforced**: Database prevents shipment before payment\n",
765+
"2. ✅ **No NULLs**: A row exists in each table only once its workflow step completes\n",
765+
"3. ✅ **Immutable artifacts**: Each workflow step creates a permanent record\n",
767+
"4. ✅ **Complete history**: Can see exactly when each step occurred\n",
768+
"5. ✅ **No UPDATE needed**: Workflow progression is INSERT operations\n",
769+
"6. ✅ **Explicit dependencies**: Schema IS the workflow diagram\n",
770+
"7. ✅ **Workflow state query**: \"Show all paid-but-not-shipped orders\" = `Payment - Shipment`\n",
771+
"\n",
772+
"**This demonstrates**: Workflow normalization is **stricter** than traditional or entity normalization. It requires separating data not just by entity type or functional dependencies, but by **when and how that data is created** in the workflow.\n"
773+
]
774+
},
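The ordering guarantee and the "paid-but-not-shipped" query described in this cell can be sketched without DataJoint. The following is a minimal illustration using Python's standard `sqlite3` module so that it runs without a database server. It is not DataJoint's API. The table is named `orders` because `ORDER` is a reserved word in SQL, the rows are illustrative (order 1002 is left unshipped here so the query returns something), and the `NOT IN` subquery is a plain-SQL stand-in for the `Payment - Shipment` expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# One table per workflow step; each step references the step before it.
conn.executescript("""
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    product_id TEXT NOT NULL
);
CREATE TABLE payment (
    order_id       INTEGER PRIMARY KEY REFERENCES orders(order_id),
    payment_date   TEXT NOT NULL,
    payment_method TEXT NOT NULL
);
CREATE TABLE shipment (
    order_id      INTEGER PRIMARY KEY REFERENCES payment(order_id),
    shipment_date TEXT NOT NULL
);
""")

# Illustrative rows: 1001 is paid and shipped, 1002 is paid only, 1003 is only placed.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1001, "WIDGET-A"), (1002, "GADGET-B"), (1003, "TOOL-C")])
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)",
                 [(1001, "2024-10-15", "Credit Card"), (1002, "2024-10-15", "PayPal")])
conn.execute("INSERT INTO shipment VALUES (1001, '2024-10-16')")

# Shipping an unpaid order is rejected: there is no payment row to reference.
try:
    conn.execute("INSERT INTO shipment VALUES (1003, '2024-10-19')")
    shipped_unpaid = True
except sqlite3.IntegrityError:
    shipped_unpaid = False

# Workflow state as an antijoin: payments with no matching shipment.
paid_not_shipped = [row[0] for row in conn.execute(
    "SELECT order_id FROM payment "
    "WHERE order_id NOT IN (SELECT order_id FROM shipment) "
    "ORDER BY order_id"
)]

print(shipped_unpaid, paid_not_shipped)
```

Progress through the workflow is recorded only by inserting rows, and no row carries NULL placeholders for steps that have not happened yet.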
643775
{
644776
"cell_type": "markdown",
645777
"metadata": {},
@@ -745,7 +877,7 @@
745877
"**When UPDATE is appropriate:**\n",
746878
"- ✅ Correcting data entry errors (e.g., mouse sex was recorded incorrectly) -- but only if it is known that none of the downstream data depends on the attribute that is being updated.\n",
747879
"\n",
748-
"In all other cases, it is more appropriate to delete the old record and to re-populate the downstream data from the primary data."
880+
"In all other cases, it is more appropriate to delete the old record and re-populate the downstream data from the primary data, taking the updated attributes into account."
749881
]
750882
},
751883
{
@@ -756,7 +888,7 @@
756888
"\n",
757889
"In a properly normalized schema:\n",
758890
"- **Permanent attributes** never change (they're intrinsic to the entity)\n",
759-
"- **Time-varying attributes** are in separate tables with date/time in the primary key\n",
891+
"- **Time-varying attributes** are in separate tables (separate steps in the workflow), often with date/time in the primary key to preserve history.\n",
760892
"- \"Changing\" means adding new records (INSERT) or removing invalid ones (DELETE)\n",
761893
"\n",
762894
"**Example: Mouse Weight Over Time**\n"
@@ -915,7 +1047,7 @@
9151047
"source": [
9161048
"### The Philosophy: Updates as a Design Smell\n",
9171049
"\n",
918-
"**Key insight**: In a well-designed DataJoint schema, regular operations flow naturally through INSERT and DELETE alone. If you need UPDATE as part of normal workflows, it's a signal that:\n",
1050+
"**Key insight**: In a workflow-normalized schema, regular operations flow naturally through INSERT and DELETE alone. If you need UPDATE as part of normal workflows, it's a signal that:\n",
9191051
"\n",
9201052
"1. **Changeable attributes weren't separated** → Violates Rule 3\n",
9211053
"2. **Entities are poorly defined** → Violates Rules 1 & 2 \n",
@@ -1405,7 +1537,7 @@
14051537
"| **Foundation** | Predicate calculus, functional dependencies | Entity types and their properties | Workflow steps creating entities |\n",
14061538
"| **Conceptual Model** | Relations as predicates | Entities and relationships | Workflow execution graph (DAG) |\n",
14071539
"| **Core Question** | \"What functional dependencies exist?\" | \"What entity types exist?\" | \"When/how are entities created?\" |\n",
1408-
"| **Design Method** | Identify dependencies, decompose | Identify entities, separate entity types | Identify workflow steps, separate by time |\n",
1540+
"| **Design Method** | Identify dependencies, decompose | Identify entities, separate entity types | Identify workflow steps, separate tables by step |\n",
14091541
"| **Reasoning Style** | Abstract, mathematical | Concrete, intuitive | Temporal, operational |\n",
14101542
"| **Primary Focus** | Attribute-level dependencies | Entity-level coherence | Workflow-level dependencies |\n",
14111543
"| **Foreign Keys** | Referential integrity | Entity relationships | Workflow dependencies + referential integrity |\n",
@@ -1454,7 +1586,12 @@
14541586
"- **Entity**: Might be acceptable (all describe Mouse entity)\n",
14551587
"- **Workflow**: Violates workflow normalization (created at different workflow steps)\n",
14561588
"\n",
1457-
"**This shows:** Workflow normalization is the **strictest** form, requiring temporal separation that the other approaches don't mandate.\n"
1589+
"**Problem: Order with payment, shipment, and delivery dates**\n",
1590+
"- **Traditional**: ✅ Satisfies 3NF (all attributes depend directly on order_id)\n",
1591+
"- **Entity**: ✅ Acceptable (all describe properties of the Order entity)\n",
1592+
"- **Workflow**: ❌ Violates workflow normalization (each date is created at a different workflow step)\n",
1593+
"\n",
1594+
"**This shows:** Workflow normalization is the **strictest** form, requiring temporal separation that the other approaches don't mandate. Tables can be perfectly normalized under traditional and entity approaches yet still require further decomposition under workflow normalization to enforce temporal sequences and workflow dependencies.\n"
14581595
]
14591596
},
14601597
{
