|
308 | 308 | "```" |
309 | 309 | ] |
310 | 310 | }, |
| 311 | + { |
| 312 | + "cell_type": "markdown", |
| 313 | + "metadata": {}, |
| 314 | + "source": [ |
| 315 | + "## DataJoint's Workflow Perspective\n", |
| 316 | + "\n", |
| 317 | + "A fundamental insight underlies DataJoint's normalization approach: **databases are workflows in which downstream data depends on the integrity of upstream data**.\n",
| 318 | + "\n",
| 319 | + "This workflow-centric view shapes DataJoint's normalization principles and explains why it emphasizes immutability and avoids in-place updates.\n"
| 320 | + ] |
| 321 | + }, |
| 322 | + { |
| 323 | + "cell_type": "markdown", |
| 324 | + "metadata": {}, |
| 325 | + "source": [ |
| 326 | + "### Databases as Data Dependency Graphs\n", |
| 327 | + "\n", |
| 328 | + "**Traditional database thinking** emphasizes:\n", |
| 329 | + "- Transactions (e.g., moving money between accounts)\n", |
| 330 | + "- Current state (e.g., what is the balance now?)\n", |
| 331 | + "- Updates to reflect real-world changes\n", |
| 332 | + "\n", |
| 333 | + "**DataJoint's workflow thinking** emphasizes:\n", |
| 334 | + "- Data pipelines (derive results from source data through computational steps)\n", |
| 335 | + "- Data provenance (what upstream data produced these results?)\n", |
| 336 | + "- Immutable facts (each record represents a fact at a specific point in time)\n", |
| 337 | + "\n", |
| 338 | + "In DataJoint, tables form a **directed acyclic graph (DAG)** of dependencies, much like a computational workflow.\n" |
| 339 | + ] |
| 340 | + }, |
| 341 | + { |
| 342 | + "cell_type": "markdown", |
| 343 | + "metadata": {}, |
| 344 | + "source": [ |
| 345 | + "### Example: Neuroscience Pipeline\n", |
| 346 | + "\n", |
| 347 | + "Consider a typical neuroscience workflow:\n", |
| 348 | + "\n", |
| 349 | + "```\n", |
| 350 | + "Session ← Manual: Experimenter enters session info\n", |
| 351 | + " ↓ (foreign key)\n", |
| 352 | + "Recording ← Imported: Data acquisition writes raw signals\n", |
| 353 | + " ↓ (foreign key)\n", |
| 354 | + "FilteredRecording ← Computed: applies filters to Recording\n", |
| 355 | + " ↓ (foreign key)\n", |
| 356 | + "SpikeSorting ← Computed: detects spikes in FilteredRecording\n", |
| 357 | + " ↓ (foreign key)\n", |
| 358 | + "NeuronStatistics ← Computed: analyzes SpikeSorting results\n", |
| 359 | + "```\n", |
| 360 | + "\n", |
| 361 | + "**Each downstream table depends on upstream data:**\n", |
| 362 | + "- `FilteredRecording` is computed FROM `Recording` data\n", |
| 363 | + "- `SpikeSorting` is computed FROM `FilteredRecording` data\n", |
| 364 | + "- `NeuronStatistics` is computed FROM `SpikeSorting` data\n", |
| 365 | + "\n", |
| 366 | + "**Critical implication**: If upstream data changes, all downstream results become invalid.\n" |
| 367 | + ] |
| 368 | + }, |
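The dependency chain above can be sketched as a DAG in plain Python (an illustrative in-memory model, not the DataJoint API; the table names are taken from the example). The standard-library `graphlib` module recovers a valid computation order from the declared dependencies, just as a pipeline engine would:

```python
from graphlib import TopologicalSorter

# Hypothetical model of the example pipeline: each table lists the
# upstream tables it depends on (mirroring foreign keys in the schema).
dependencies = {
    "Session": set(),
    "Recording": {"Session"},
    "FilteredRecording": {"Recording"},
    "SpikeSorting": {"FilteredRecording"},
    "NeuronStatistics": {"SpikeSorting"},
}

# Any valid computation order is a topological order of this DAG.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
# → ['Session', 'Recording', 'FilteredRecording', 'SpikeSorting', 'NeuronStatistics']
```

Because the example is a simple chain, the topological order is unique; in a real schema with branching dependencies, any order that places every parent before its children is valid.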
| 369 | + { |
| 370 | + "cell_type": "markdown", |
| 371 | + "metadata": {}, |
| 372 | + "source": [ |
| 373 | + "### Why Updates Break Workflows\n", |
| 374 | + "\n", |
| 375 | + "In a workflow/pipeline view, **updates to upstream data silently invalidate downstream results**:\n", |
| 376 | + "\n", |
| 377 | + "**Scenario**: You discover the sampling rate in `Recording` for `{'session': 42}` was recorded incorrectly.\n", |
| 378 | + "\n", |
| 379 | + "**If you UPDATE:**\n", |
| 380 | + "```python\n", |
| 381 | + "# Fix the sampling rate\n", |
| 382 | + "Recording.update1({'session': 42, 'sampling_rate': 30000}) # Was 20000, should be 30000\n", |
| 383 | + "\n", |
| 384 | + "# But now:\n", |
| 385 | + "# - FilteredRecording(42) was computed using sampling_rate=20000\n", |
| 386 | + "# - SpikeSorting(42) was computed from FilteredRecording with wrong rate\n", |
| 387 | + "# - NeuronStatistics(42) was computed from SpikeSorting with wrong rate\n", |
| 388 | + "#\n", |
| 389 | + "# All downstream results are INVALID, but the database doesn't know!\n", |
| 390 | + "# No error, no warning, no indication of the problem.\n", |
| 391 | + "# The data looks fine but the science is wrong.\n", |
| 392 | + "```\n", |
| 393 | + "\n", |
| 394 | + "**If you use DELETE (forced by immutability):**\n", |
| 395 | + "```python\n", |
| 396 | + "# Try to fix by deleting and reinserting\n", |
| 397 | + "(Recording & {'session': 42}).delete()\n", |
| 398 | + "# Propagates the delete for session=42 to all downstream tables that depend on it:\n", |
| 399 | + "# - FilteredRecording\n", |
| 400 | + "# - SpikeSorting\n", |
| 401 | + "# - NeuronStatistics\n", |
| 402 | + "\n", |
| 403 | + "# Reinsert with correct data\n", |
| 404 | + "Recording.insert1({'session': 42, 'sampling_rate': 30000, ...})\n", |
| 405 | + "\n", |
| 406 | + "# Recompute entire pipeline\n", |
| 407 | + "FilteredRecording.populate({'session': 42})\n", |
| 408 | + "SpikeSorting.populate({'session': 42})\n", |
| 409 | + "NeuronStatistics.populate({'session': 42})\n", |
| 410 | + "\n", |
| 411 | + "# Now ALL results are consistent and scientifically valid\n", |
| 412 | + "```\n", |
| 413 | + "\n", |
| 414 | + "The dependency chain is **explicit** and **enforced**.\n" |
| 415 | + ] |
| 416 | + }, |
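The cascade behavior can be modeled in a few lines of plain Python (a hypothetical in-memory sketch; in DataJoint the cascade is enforced by foreign keys in the database itself). Deleting an upstream key first removes every downstream row that depends on it, children before parents:

```python
# Hypothetical in-memory "tables": table name -> set of primary keys present.
tables = {
    "Recording": {42, 43},
    "FilteredRecording": {42, 43},
    "SpikeSorting": {42, 43},
    "NeuronStatistics": {42, 43},
}
# Child tables holding a foreign key into each parent (mirrors the example chain).
children = {
    "Recording": ["FilteredRecording"],
    "FilteredRecording": ["SpikeSorting"],
    "SpikeSorting": ["NeuronStatistics"],
    "NeuronStatistics": [],
}

def cascade_delete(table, key):
    """Delete a key, first removing every downstream row that depends on it."""
    for child in children[table]:
        cascade_delete(child, key)   # depth-first: dependents go before sources
    tables[table].discard(key)

cascade_delete("Recording", 42)
assert all(42 not in rows for rows in tables.values())  # session 42 fully removed
assert all(43 in rows for rows in tables.values())      # unrelated rows untouched
```

This is why DELETE is safe where UPDATE is not: the dependency graph forces every invalidated result to be removed and recomputed, rather than silently left stale.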
| 417 | + { |
| 418 | + "cell_type": "markdown", |
| 419 | + "metadata": {}, |
| 420 | + "source": [ |
| 421 | + "### How Workflow Thinking Leads to Normalization Principles\n", |
| 422 | + "\n", |
| 423 | + "The workflow perspective directly motivates DataJoint's normalization principles:\n", |
| 424 | + "\n", |
| 425 | + "**1. Immutability (INSERT/DELETE, not UPDATE)**\n", |
| 426 | + "- **Why**: Updates hide broken dependencies in the workflow\n", |
| 427 | + "- **Workflow view**: Upstream data is \"input\" to downstream computations—changing input invalidates output\n", |
| 428 | + "- **Solution**: DELETE forces explicit handling of all dependent data\n", |
| 429 | + "\n", |
| 430 | + "**2. Separate Changeable Attributes (Rule 3)** \n", |
| 431 | + "- **Why**: Time-varying properties represent different states in the workflow\n", |
| 432 | + "- **Workflow view**: Each state is a distinct input that produces distinct outputs\n", |
| 433 | + "- **Solution**: Model states as separate records (INSERTs), not updates to existing records\n", |
| 434 | + "\n", |
| 435 | + "**3. Entities with Only Intrinsic Properties (Rules 1 & 2)**\n", |
| 436 | + "- **Why**: Properties of different entities are at different nodes in the dependency graph\n", |
| 437 | + "- **Workflow view**: Each entity type represents a distinct stage or data type in the pipeline\n", |
| 438 | + "- **Solution**: Separate entity types into separate tables to make dependencies explicit\n", |
| 439 | + "\n", |
| 440 | + "### The Key Insight\n", |
| 441 | + "\n", |
| 442 | + "> **In workflow-centric databases, referential integrity isn't just about preventing orphaned records—it's about ensuring computational validity.**\n", |
| 443 | + "\n", |
| 444 | + "Foreign keys don't just link data; they represent **data provenance**:\n", |
| 445 | + "- \"This result was computed FROM this input\"\n", |
| 446 | + "- \"This analysis is BASED ON this measurement\" \n", |
| 447 | + "- \"This conclusion DEPENDS ON this observation\"\n", |
| 448 | + "\n", |
| 449 | + "When you UPDATE input data but leave outputs unchanged, you break the provenance chain. The outputs claim to be based on inputs that no longer exist (in their original form).\n", |
| 450 | + "\n", |
| 451 | + "**DataJoint's normalization principles ensure that data dependencies remain explicit and enforceable, making workflows scientifically reproducible and computationally sound.**\n" |
| 452 | + ] |
| 453 | + }, |
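The provenance idea can be illustrated with a small Python sketch (the helpers `digest` and `is_stale` are hypothetical, not part of DataJoint): if each computed result records a fingerprint of the input it was computed from, an in-place UPDATE of the input makes the downstream result detectably stale.

```python
import hashlib
import json

def digest(record):
    """Stable fingerprint of an upstream record (illustrative helper)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

# Hypothetical store: an upstream record and a result computed FROM it.
recording = {"session": 42, "sampling_rate": 20000}
filtered = {"session": 42, "result": "...", "input_digest": digest(recording)}

def is_stale(result, upstream):
    """A result is stale if its recorded provenance no longer matches its input."""
    return result["input_digest"] != digest(upstream)

assert not is_stale(filtered, recording)   # provenance chain intact

# An in-place UPDATE breaks the chain without touching the downstream row:
recording["sampling_rate"] = 30000
assert is_stale(filtered, recording)       # now detectably invalid
```

A plain UPDATE leaves no such fingerprint to check; foreign keys plus delete-and-recompute make the same guarantee structurally, without storing digests at all.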
311 | 454 | { |
312 | 455 | "cell_type": "markdown", |
313 | 456 | "metadata": {}, |
|
476 | 619 | "# ❌ Using UPDATE - silent inconsistency:\n", |
477 | 620 | "RawImage.update1({'image_id': 42, 'brightness': 1.5})\n", |
478 | 621 | "# No error! But...\n", |
479 | | - "# - PreprocessedImage(42) was computed with brightness=1.0\n", |
480 | | - "# - SegmentedCells(42) was computed from OLD preprocessed image\n", |
481 | | - "# - CellActivity(42) was computed from OLD segmentation\n", |
| 622 | + "# - PreprocessedImage was computed with brightness=1.0\n", |
| 623 | + "# - SegmentedCells was computed from OLD preprocessed image\n", |
| 624 | + "# - CellActivity was computed from OLD segmentation\n", |
482 | 625 | "# All downstream results are now invalid but database doesn't know!\n", |
483 | 626 | "\n", |
484 | 627 | "# ✅ Using DELETE - explicit dependency handling:\n", |
485 | 628 | "(RawImage & {'image_id': 42}).delete()\n", |
486 | | - "# ERROR: PreprocessedImage(42) references this image!\n", |
487 | | - "# Must delete downstream first:\n", |
488 | | - "(CellActivity & {'image_id': 42}).delete()\n", |
489 | | - "(SegmentedCells & {'image_id': 42}).delete()\n", |
490 | | - "(PreprocessedImage & {'image_id': 42}).delete()\n", |
491 | | - "(RawImage & {'image_id': 42}).delete()\n", |
| 629 | + "# This delete will cascade to the dependent tables for image_id=42, \n", |
| 630 | + "# in reverse order of dependency:\n", |
| 631 | + "# CellActivity \n", |
| 632 | + "# SegmentedCells\n", |
| 633 | + "# PreprocessedImage\n", |
| 634 | + "# RawImage\n", |
492 | 635 | "\n", |
493 | 636 | "# Now reinsert and recompute entire pipeline\n", |
494 | 637 | "RawImage.insert1({'image_id': 42, 'brightness': 1.5, 'contrast': 1.0})\n", |
|
926 | 1069 | "\n", |
927 | 1070 | "When these principles are followed:\n", |
928 | 1071 | "\n", |
929 | | - "✅ **Data Integrity**: Each fact stored in exactly one place\n", |
930 | | - "✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n", |
931 | | - "✅ **Consistency**: Changes propagate correctly through foreign key relationships\n", |
932 | | - "✅ **Maintainability**: Changes are localized to specific tables\n", |
933 | | - "✅ **Clear Structure**: Schema reflects domain entities intuitively\n", |
934 | | - "✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n", |
935 | | - "✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n", |
936 | | - "✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n", |
| 1072 | + "- ✅ **Data Integrity**: Each fact stored in exactly one place\n", |
| 1073 | + "- ✅ **No Anomalies**: Update, insertion, and deletion anomalies eliminated\n", |
| 1074 | + "- ✅ **Consistency**: Changes propagate correctly through foreign key relationships\n", |
| 1075 | + "- ✅ **Maintainability**: Changes are localized to specific tables\n", |
| 1076 | + "- ✅ **Clear Structure**: Schema reflects domain entities intuitively\n", |
| 1077 | + "- ✅ **Immutability**: Entities remain stable; changes are tracked explicitly\n", |
| 1078 | + "- ✅ **History Preservation**: Time-varying data naturally preserved in separate tables\n", |
| 1079 | + "- ✅ **Pipeline Integrity**: Data dependencies are explicit and enforced\n", |
937 | 1080 | "\n", |
938 | 1081 | "### Practical Application\n", |
939 | 1082 | "\n", |
|
942 | 1085 | "1. **Identify entity types** in your domain\n", |
943 | 1086 | "2. **For each entity**, determine its permanent, intrinsic properties\n", |
944 | 1087 | "3. **Separate entities** into distinct tables\n", |
945 | | - "4. **Model relationships** with foreign keys\n", |
946 | | - "5. **Extract time-varying properties** into separate tables\n", |
| 1088 | + "4. **Model relationships** with foreign keys and association entities\n", |
| 1089 | + "5. **Extract time-varying properties** into separate entities\n", |
947 | 1090 | "6. **Verify** each table passes the three-question test\n", |
948 | 1091 | "\n", |
949 | 1092 | "This entity-centric approach achieves the same rigor as classical normalization but is much more intuitive and practical, especially for complex scientific and computational workflows.\n" |
|