explain the philosophy of databases as workflows

dimitri-yatsenko · dimitri-yatsenko · commit be89cb5da92e · 2025-10-08T17:27:07.000-05:00
diff --git a/book/30-schema-design/035-diagrams.ipynb b/book/30-schema-design/035-diagrams.ipynb
@@ -909,6 +909,11 @@
    "metadata": {},
    "source": "### Observing Orange Dots in Action\n\nIn the diagram above, notice:\n\n* **Two orange dots** appear between `Neuron` and `Synapse`\n* Each orange dot represents a renamed foreign key reference\n* One dot represents the `presynaptic` reference\n* The other represents the `postsynaptic` reference\n* Both ultimately reference `Neuron.neuron_id` (the primary key)\n\n**This is how DataJoint visualizes multigraphs**: When two tables are connected by multiple foreign keys, each foreign key appears as a separate edge with its own orange dot (if renamed) or direct line (if not renamed).\n\n**Interactive tip**: In Jupyter notebooks, hover over the orange dots to see:\n- Which table is being referenced\n- The projection expression showing the rename\n- The attribute name in the child table"
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Database Schemas as Workflows: A Different Philosophy\n\nDataJoint espouses a fundamentally different view of database schemas compared to traditional Entity-Relationship modeling: **database schemas represent workflows**, not just static collections of entities and their relationships.\n\n### Traditional ERD View: Static Entity-Relationship Model\n\nTraditional ER diagrams focus on:\n- **Entities**: Things that exist (Customer, Product, Order)\n- **Relationships**: How entities relate (customer \"places\" order, order \"contains\" product)\n- **Cardinality**: How many of each entity participate in relationships\n\n**Conceptual Model**: The database is a collection of related entities\n\n**No workflow concept**: ERDs don't inherently suggest an order of operations. You can't look at an ERD and know:\n- Which tables to populate first\n- What sequence of operations the business follows\n- How data flows through the system\n\n```\nTraditional ERD (no inherent direction):\n\n    Customer \u2190\u2500\u2500places\u2500\u2500\u2192 Order \u2190\u2500\u2500contains\u2500\u2500\u2192 Product\n       \u2195                    \u2195                      \u2195\n   Department          Shipment              Inventory\n```\n\n### DataJoint View: Schemas as Operational Workflows\n\nDataJoint schemas represent **sequences of steps, operations, or transformations**:\n\n**Conceptual Model**: The database is a workflow\u2014a directed sequence of data transformations and dependencies\n\n**Built-in workflow concept**: Every DataJoint diagram shows:\n- Which entities can be created first (top of diagram)\n- What depends on what (arrows show dependencies)\n- The operational sequence (read top-to-bottom)\n\n```\nDataJoint Schema (directional workflow):\n\n  Customer      Product      \u2190 Independent entities (populate first)\n      \u2193            \u2193\n         Order                \u2190 Depends on customers and products\n            \u2193\n        OrderItem 
            \u2190 Depends on orders\n            \u2193\n       Shipment               \u2190 Depends on items being ready\n            \u2193\n       Delivery               \u2190 Final step in workflow\n```\n\n**Reading the workflow**: \n1. Start by creating customers and products\n2. Customers place orders referencing products\n3. Orders are broken into items\n4. Items are collected and shipped\n5. Shipments are delivered to customers\n\n### The DAG Structure Enables Workflow Interpretation\n\nThe **Directed Acyclic Graph (DAG)** structure is not just a technical constraint\u2014it's a fundamental design choice that enables workflow thinking:\n\n**Direction**:\n- All foreign keys point \"upstream\" in the workflow\n- Dependencies flow from top to bottom\n- Schemas naturally represent operational sequences\n\n**Acyclic (No Loops)**:\n- Prevents circular dependencies\n- Ensures there's always a valid starting point\n- Makes the workflow execution order unambiguous\n\n**Implications**:\n- You can read any DataJoint schema as an operational manual\n- The vertical position tells you when things happen in the workflow\n- Understanding the schema = understanding the business process\n\n### Real-World Workflow Examples\n\n#### Scientific Experiment Workflow\n\n```\n   Study                    \u2190 Design the study\n     \u2193\n   Subject                  \u2190 Recruit subjects\n     \u2193\n   Session                  \u2190 Conduct sessions\n     \u2193\n   Recording                \u2190 Acquire data\n     \u2193\n   PreprocessedData         \u2190 Clean and preprocess\n     \u2193\n   AnalysisResult           \u2190 Analyze\n     \u2193\n   Figure                   \u2190 Visualize results\n```\n\n**Reading as workflow**:\n1. Scientists design a study (top-level entity)\n2. Recruit subjects for that study\n3. Run experimental sessions with those subjects\n4. Record data during sessions\n5. Preprocess the raw recordings\n6. Run analysis on preprocessed data\n7. 
Generate figures from results\n\n**The schema IS the experimental protocol**\u2014each table represents a step in the scientific process.\n\n#### E-Commerce Workflow\n\n```\n   Product      Customer    \u2190 Base entities\n      \u2193            \u2193\n         Order               \u2190 Customer purchases products\n           \u2193\n      OrderItem              \u2190 Order broken into items\n           \u2193\n        Payment              \u2190 Payment processed\n           \u2193\n       Shipment              \u2190 Items shipped\n           \u2193\n       Delivery              \u2190 Delivery confirmed\n```\n\n**The schema IS the business process**\u2014from product catalog and customer registration through to delivery confirmation.\n\n### Computational Workflows\n\nDataJoint is particularly powerful for **computational data pipelines** where data undergoes transformations:\n\n```\n   RawData                  \u2190 Initial data acquisition\n      \u2193\n   Validated                \u2190 Quality control step\n      \u2193\n   Normalized               \u2190 Normalization step\n      \u2193\n   FeatureExtracted         \u2190 Feature extraction\n      \u2193\n   Model (lookup)           \u2190 Model parameters\n      \u2193\n   Prediction               \u2190 Apply model to features\n      \u2193\n   Evaluation               \u2190 Evaluate predictions\n```\n\n**Key insight**: Each downstream table represents:\n- A transformation of upstream data\n- A computation that depends on previous steps\n- A checkpoint in the processing pipeline\n\nThe schema design directly maps to the computational workflow, making it clear:\n- What computations depend on what inputs\n- In what order operations can be executed\n- Which results can be cached or recomputed\n\n### Relationships Through Converging Edges\n\nIn DataJoint, **relationships are established by converging edges**\u2014when a table has foreign keys to multiple upstream tables:\n\n```\n    TableA        
TableB\n       \u2193            \u2193\n          TableC            \u2190 Converging edges create relationship\n```\n\n**What this means**:\n- `TableC` requires matching entities from both `TableA` and `TableB`\n- To create an entry in `TableC`, you must find compatible entities upstream\n- The relationship is defined by the **matching** of upstream entities\n\n**Example: Assigning Employees to Projects**\n\n```\n   Employee      Project\n       \u2193            \u2193\n        Assignment          \u2190 Relates employees to projects\n```\n\nTo create an assignment:\n1. Find an employee (upstream in the workflow)\n2. Find a project (upstream in the workflow)\n3. Create the relationship (assignment) downstream\n\n**The schema enforces the workflow**: You can't assign employees to projects that don't exist yet. The structure guarantees the operational constraints.\n\n### Design for Efficient Queries\n\nDataJoint schemas are designed with **query efficiency** in mind:\n\n#### Solid Lines = Direct Join Paths\n\n```\n    Customer\n       \u2193 (solid)\n      Order\n       \u2193 (solid)\n    OrderItem\n```\n\n**Query benefit**: `Customer` and `OrderItem` can be joined directly:\n```python\n# The primary key cascade means this works.\n# Parentheses are required: Python's `*` binds tighter than `&`.\ncustomer_items = (Customer & {'customer_id': 42}) * OrderItem\n```\n\n**In SQL**, the solid line path means the join is simple:\n```sql\nSELECT *\nFROM customer\nJOIN order_item USING (customer_id)  -- customer_id propagated through solid lines\n```\n\n#### Dashed Lines = Must Include Intermediate Tables\n\n```\n    Product\n       \u2193 (dashed)\n      Order\n       \u2193 (solid)\n    OrderItem\n```\n\n**Query requirement**: To join `Product` and `OrderItem`, you must include `Order`:\n```python\n# Must include Order because of the dashed line.\n# Restrict Product first (parenthesized), then join through Order.\nproduct_items = (Product & {'product_id': 10}) * Order * OrderItem\n```\n\n**The diagram guides query design**: Follow the solid line paths for efficient joins.\n\n### Contrast with Traditional ERD Philosophy\n\n| 
Aspect | Traditional ERD | DataJoint DAG |\n|--------|----------------|---------------|\n| **Primary focus** | Entities and their relationships | Workflow and data dependencies |\n| **Direction** | No inherent direction | Explicit top-to-bottom flow |\n| **Interpretation** | \"What entities exist and how they relate\" | \"What sequence of operations to perform\" |\n| **Workflow** | Not represented | Central organizing principle |\n| **Query guidance** | Not a primary concern | Line styles guide join strategies |\n| **Cycles** | Allowed (e.g., circular references) | Prohibited (enforces workflow) |\n| **Time/sequence** | Not represented | Implicit in vertical position |\n| **Process mapping** | Requires separate documentation | Schema IS the process map |\n\n### Scientific Workflows: A Concrete Example\n\nIn neuroscience research using DataJoint:\n\n```\n   Experiment              \u2190 Experimental design (what we plan to do)\n       \u2193\n     Subject               \u2190 Recruit and prepare subjects\n       \u2193\n     Session               \u2190 Conduct experimental sessions\n       \u2193\n     Recording             \u2190 Acquire neural recordings\n       \u2193\n     SpikeDetection        \u2190 Process: detect neural spikes\n       \u2193\n     CellSegmentation      \u2190 Process: identify individual cells\n       \u2193\n     ResponseAnalysis      \u2190 Process: analyze cell responses\n       \u2193\n     StatisticalTest       \u2190 Process: run statistics\n       \u2193\n     Figure                \u2190 Generate publication figures\n```\n\n**The schema IS the research pipeline**:\n- Each table represents a concrete step in the scientific method\n- Dependencies are explicit (can't analyze responses before detecting spikes)\n- Computational steps are clearly marked\n- Anyone can understand the research workflow by reading the schema\n\n**Practical benefits**:\n- **Onboarding**: New lab members understand the pipeline by reading the schema\n- 
**Reproducibility**: The schema documents the exact workflow\n- **Collaboration**: Everyone works with the same workflow representation\n- **Automation**: Calling `populate()` on computed tables executes the workflow\n- **Debugging**: Data lineage can be traced through the workflow\n\n### How This Shapes Database Design\n\nWhen you design a DataJoint schema, you're not just answering:\n- \"What entities do I have?\"\n- \"How are they related?\"\n\nYou're answering:\n- **\"What is the workflow?\"**\n- **\"What are the steps in my process?\"**\n- **\"What depends on what?\"**\n- **\"What can be computed from what?\"**\n\nThis workflow-centric thinking leads to different design decisions:\n\n**Traditional thinking**: \n- \"Employee and Department are entities that relate to each other\"\n- Might allow: Employee \u2192 Department \u2192 Manager \u2192 Employee (cycle)\n\n**Workflow thinking**:\n- \"First we create employees, then we assign them to departments\"\n- Requires: Employee, Department (independent), then EmployeeDepartment (assignment)\n- No cycles\u2014clear operational sequence\n\n### The Workflow Paradigm in Practice\n\nWhen using DataJoint:\n\n1. **Design phase**: Think about your workflow\n   - What are the steps in your process?\n   - What inputs does each step need?\n   - What does each step produce?\n\n2. **Schema structure**: Map workflow to tables\n   - Independent entities at top\n   - Dependent steps cascade down\n   - Computational steps as computed tables\n\n3. **Execution**: Follow the workflow\n   - Populate independent entities first\n   - Call `populate()` on computed tables once their dependencies are ready\n   - Query results flow from the workflow structure\n\n4. 
**Understanding**: Read the schema as a process map\n   - Vertical position = when in the workflow\n   - Converging edges = where things come together\n   - Solid line paths = efficient query routes\n\n### Summary\n\nDataJoint's approach represents a paradigm shift:\n\n**From**: \"Database as a collection of related entities\" (ERD view)\n\n**To**: \"Database as an operational workflow\" (DataJoint view)\n\nThis shift has profound implications:\n- Schemas naturally represent business or scientific processes\n- Diagrams serve as executable documentation of workflows\n- Query patterns emerge naturally from workflow structure\n- Design thinking focuses on dependencies and sequences\n- The database becomes a workflow engine, not just a data store\n\nWhen you look at a DataJoint diagram, you're seeing:\n1. What data exists (entities)\n2. How data relates (relationships)\n3. **How work flows through the system** (workflow) \u2190 **DataJoint's unique contribution**"
+  },
   {
    "cell_type": "markdown",
    "metadata": {},