| 1 | +--- |
| 2 | +title: Data Integrity |
| 3 | +date: 2025-10-31 |
| 4 | +authors: |
| 5 | + - name: Dimitri Yatsenko |
| 6 | +--- |
| 7 | + |
| 8 | +# Why Data Integrity Matters |
| 9 | + |
| 10 | +Imagine a neuroscience lab where recording sessions are tracked in a database. Without proper safeguards, you might encounter: |
| 11 | +- An experiment record pointing to a non-existent mouse |
| 12 | +- Two different experiments claiming the same unique identifier |
| 13 | +- A recording session missing its timestamp |
| 14 | +- Concurrent processes writing conflicting data simultaneously |
| 15 | + |
| 16 | +Each scenario represents a failure of **data integrity** — the database's ability to maintain accurate, consistent, and reliable data that faithfully represents reality. |
| 17 | + |
| 18 | +```{card} The Challenge |
| 19 | +**Data Integrity** is the ability of a database to define, express, and enforce rules for valid data states and transformations. |
| 20 | +^^^ |
| 21 | + |
| 22 | +Scientific databases face unique challenges: |
| 23 | +- **Multiple users** entering data concurrently |
| 24 | +- **Long-running experiments** generating data over months or years |
| 25 | +- **Complex relationships** between experimental entities |
| 26 | +- **Evolving protocols** requiring schema updates |
| 27 | +- **Collaborative teams** with different data entry practices |
| 28 | + |
| 29 | +Without robust integrity mechanisms, these challenges lead to: |
| 30 | +- Invalid or incomplete data entry |
| 31 | +- Loss of data during updates |
| 32 | +- Unwarranted alteration of historical records |
| 33 | +- Misidentification or mismatch of experimental subjects |
| 34 | +- Data duplication across tables |
| 35 | +- Broken references between related datasets |
| 36 | +``` |
| 37 | + |
| 38 | +# From Real-World Rules to Database Constraints |
| 39 | + |
| 40 | +The core challenge of database design is translating organizational rules into enforceable constraints. Consider a simple example: |
| 41 | + |
| 42 | +**Lab Rule:** "Each mouse must have a unique ID, and every recording session must reference a valid mouse." |
| 43 | + |
| 44 | +**Database Implementation:** |
| 45 | +- Mouse table with **primary key** constraint (entity integrity) |
| 46 | +- RecordingSession table with **foreign key** to Mouse (referential integrity) |
| 47 | +- Mouse ID **cannot be null** (completeness) |
| 48 | +- Recording timestamp **must be datetime type** (domain integrity) |
| 49 | + |
| 50 | +Relational databases excel at expressing and enforcing such rules through **integrity constraints** — declarative rules that the database automatically enforces. |
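
The lab rule above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not DataJoint's own syntax: SQLite stands in for the database server, and the table and column names (`mouse`, `recording_session`, `recorded_at`) are hypothetical. SQLite does not enforce a strict datetime type, so the sketch covers the primary key, foreign key, and not-null rules only.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the lab's server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE mouse (
    mouse_id INTEGER NOT NULL PRIMARY KEY          -- entity integrity + completeness
);
CREATE TABLE recording_session (
    session_id  INTEGER NOT NULL PRIMARY KEY,
    mouse_id    INTEGER NOT NULL
                REFERENCES mouse (mouse_id),       -- referential integrity
    recorded_at TEXT NOT NULL                      -- stored as ISO-8601 text in SQLite
);
""")

conn.execute("INSERT INTO mouse (mouse_id) VALUES (1)")
conn.execute(
    "INSERT INTO recording_session VALUES (100, 1, '2025-10-31 09:00:00')"
)


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


# A second mouse with the same ID violates the primary key.
duplicate_mouse = try_insert("INSERT INTO mouse (mouse_id) VALUES (1)")
# A session for mouse 999 violates the foreign key: no such mouse exists.
orphan_session = try_insert(
    "INSERT INTO recording_session VALUES (101, 999, '2025-10-31 10:00:00')"
)
print(duplicate_mouse)
print(orphan_session)
```

Neither rejection needs a check in the calling code; the declarations alone are enough for the database to refuse both rows.
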
| 51 | + |
| 52 | +# Types of Data Integrity Constraints |
| 53 | + |
| 54 | +This section introduces six fundamental types of integrity constraints. Each will be covered in detail in subsequent chapters, with DataJoint implementation examples. |
| 55 | + |
| 56 | +## 1. Domain Integrity |
| 57 | +**Ensures values are within valid ranges and types.** |
| 58 | + |
| 59 | +Domain integrity restricts attribute values to predefined valid sets using: |
| 60 | +- **Data types**: `int`, `float`, `varchar`, `date`, `enum` |
| 61 | +- **Range and precision constraints**: `unsigned`, `decimal(10,2)` |
| 62 | +- **Pattern matching**: Regular expressions for formatted strings |
| 63 | + |
| 64 | +**Example:** Recording temperature must be between 20 and 25 °C. |
| 65 | + |
| 66 | +**Covered in:** [Tables](015-table.ipynb) — Data type specification |
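
One way to express the temperature example in generic SQL is a `CHECK` constraint. The sketch below uses Python's built-in `sqlite3` and is an assumption for illustration: the chapter does not list `CHECK` among its mechanisms, and the `recording` table and `temperature_c` column are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE recording (
    recording_id  INTEGER PRIMARY KEY,
    -- Domain rule: only temperatures from 20 to 25 degrees Celsius are valid.
    temperature_c REAL NOT NULL CHECK (temperature_c BETWEEN 20 AND 25)
)
""")


def accepts(temperature_c):
    """Return True if the database accepts a recording at this temperature."""
    try:
        conn.execute(
            "INSERT INTO recording (temperature_c) VALUES (?)", (temperature_c,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


in_range = accepts(22.5)  # inside the allowed domain
too_hot = accepts(31.0)   # outside the allowed domain
print(in_range, too_hot)
```
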
| 67 | + |
| 68 | +--- |
| 69 | + |
| 70 | +## 2. Completeness |
| 71 | +**Guarantees required data is present.** |
| 72 | + |
| 73 | +Completeness prevents missing values that could invalidate analyses: |
| 74 | +- **Required fields** cannot be left empty (non-nullable) |
| 75 | +- **Default values** provide sensible fallbacks |
| 76 | +- **NOT NULL constraints** enforce data presence |
| 77 | + |
| 78 | +**Example:** Every experiment must have a start date. |
| 79 | + |
| 80 | +**Covered in:** |
| 81 | +- [Tables](015-table.ipynb) — Required vs. optional attributes |
| 82 | +- [Default Values](020-default-values.ipynb) — Handling optional data |
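
The difference between a required field and a field with a default can be sketched as follows, again with `sqlite3` as a stand-in. The `experiment` table and its `notes` column are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    start_date    TEXT NOT NULL,            -- required: no value, no row
    notes         TEXT NOT NULL DEFAULT ''  -- optional: falls back to an empty string
)
""")

# Omitting `notes` is fine because the default fills it in.
conn.execute(
    "INSERT INTO experiment (experiment_id, start_date) VALUES (1, '2025-10-31')"
)
notes = conn.execute(
    "SELECT notes FROM experiment WHERE experiment_id = 1"
).fetchone()[0]

# Omitting `start_date` is rejected because it has no default and cannot be null.
try:
    conn.execute("INSERT INTO experiment (experiment_id) VALUES (2)")
    missing_date_rejected = False
except sqlite3.IntegrityError:
    missing_date_rejected = True

print(repr(notes), missing_date_rejected)
```
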
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## 3. Entity Integrity |
| 87 | +**Each entity has a unique, reliable identifier.** |
| 88 | + |
| 89 | +Entity integrity ensures a one-to-one mapping between database records and real-world entities: |
| 90 | +- **Primary keys** uniquely identify each row |
| 91 | +- **Uniqueness constraints** prevent duplicates |
| 92 | +- **Identification strategies** (auto-increment, UUIDs, natural keys) determine how key values are assigned |
| 93 | + |
| 94 | +**Example:** Each mouse has exactly one unique ID. |
| 95 | + |
| 96 | +**Covered in:** |
| 97 | +- [Primary Keys](025-primary-key.md) — Identification strategies |
| 98 | +- [UUID](030-uuid.ipynb) — Universally unique identifiers |
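
The three identification strategies named above can be compared side by side. This is a rough sketch using `sqlite3` and the standard `uuid` module; the table names and the ear-tag format `M-0042` are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE session_auto  (session_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE sample_uuid   (sample_id  TEXT PRIMARY KEY,    label TEXT NOT NULL);
CREATE TABLE mouse_natural (ear_tag    TEXT PRIMARY KEY,    sex   TEXT NOT NULL);
""")

# Auto-increment: the database assigns the next integer key.
auto_id = conn.execute(
    "INSERT INTO session_auto (label) VALUES ('first')"
).lastrowid

# UUID: the client generates a key that is unique without consulting the database.
sample_id = str(uuid.uuid4())
conn.execute("INSERT INTO sample_uuid VALUES (?, 'slice A')", (sample_id,))

# Natural key: an identifier that already exists in the lab, such as an ear tag.
conn.execute("INSERT INTO mouse_natural VALUES ('M-0042', 'F')")

# Whatever the strategy, the primary key rejects a second row with the same key.
try:
    conn.execute("INSERT INTO mouse_natural VALUES ('M-0042', 'M')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(auto_id, sample_id, duplicate_rejected)
```
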
| 99 | + |
| 100 | +--- |
| 101 | + |
| 102 | +## 4. Referential Integrity |
| 103 | +**Relationships between entities remain consistent.** |
| 104 | + |
| 105 | +Referential integrity maintains logical associations across tables: |
| 106 | +- **Foreign keys** link related records |
| 107 | +- **Cascade operations** propagate changes |
| 108 | +- **Referential constraints** prevent orphaned records |
| 109 | + |
| 110 | +**Example:** A recording session cannot reference a non-existent mouse. |
| 111 | + |
| 112 | +**Covered in:** |
| 113 | +- [Foreign Keys](035-foreign-keys.ipynb) — Cross-table relationships |
| 114 | +- [Relationships](050-relationships.ipynb) — Dependency patterns |
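
A cascade operation can be illustrated with a foreign key declared `ON DELETE CASCADE`. The sketch uses `sqlite3` with hypothetical `mouse` and `recording_session` tables, and it shows plain SQL cascading rather than DataJoint's own delete behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce foreign keys
conn.executescript("""
CREATE TABLE mouse (mouse_id INTEGER PRIMARY KEY);
CREATE TABLE recording_session (
    session_id INTEGER PRIMARY KEY,
    mouse_id   INTEGER NOT NULL
               REFERENCES mouse (mouse_id) ON DELETE CASCADE
);
INSERT INTO mouse VALUES (1), (2);
INSERT INTO recording_session VALUES (10, 1), (11, 1), (12, 2);
""")

# Deleting mouse 1 also deletes its sessions, so no orphaned rows remain.
conn.execute("DELETE FROM mouse WHERE mouse_id = 1")
remaining = conn.execute(
    "SELECT session_id, mouse_id FROM recording_session ORDER BY session_id"
).fetchall()
print(remaining)
```

Without the `ON DELETE CASCADE` clause, the same `DELETE` would be rejected while sessions still reference the mouse; either way, the database never holds a session that points at a missing mouse.
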
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## 5. Compositional Integrity |
| 119 | +**Complex entities remain complete with all parts.** |
| 120 | + |
| 121 | +Compositional integrity ensures multi-part entities are never partially stored: |
| 122 | +- **Transactions** bundle multiple operations |
| 123 | +- **Atomicity** guarantees all-or-nothing completion |
| 124 | +- **Part tables** maintain parent-child relationships |
| 125 | + |
| 126 | +**Example:** An imaging session's metadata and all acquired frames are stored together or not at all. |
| 127 | + |
| 128 | +**Covered in:** |
| 129 | +- [Part Tables](055-part-tables.ipynb) — Hierarchical compositions |
| 130 | +- [Transactions](../operations/045-transactions.ipynb) — Atomic operations |
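
The all-or-nothing behavior in the imaging example can be sketched with a transaction. This assumes `sqlite3`, where using the connection as a context manager commits on success and rolls back when an exception escapes; the `imaging_session` and `frame` tables and the `store_session` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE imaging_session (session_id INTEGER PRIMARY KEY, rig TEXT NOT NULL);
CREATE TABLE frame (
    session_id  INTEGER NOT NULL REFERENCES imaging_session (session_id),
    frame_index INTEGER NOT NULL,
    PRIMARY KEY (session_id, frame_index)
);
""")


def store_session(session_id, rig, frame_indices):
    """Insert a session and all of its frames, or nothing at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO imaging_session VALUES (?, ?)", (session_id, rig)
            )
            conn.executemany(
                "INSERT INTO frame VALUES (?, ?)",
                [(session_id, i) for i in frame_indices],
            )
        return True
    except sqlite3.IntegrityError:
        return False


stored = store_session(1, "rig-A", [0, 1, 2])
# The repeated frame index violates the primary key partway through the insert,
# so the session row and the frames already written are rolled back together.
failed = store_session(2, "rig-A", [0, 1, 1])

session_count = conn.execute("SELECT COUNT(*) FROM imaging_session").fetchone()[0]
frame_count = conn.execute("SELECT COUNT(*) FROM frame").fetchone()[0]
print(stored, failed, session_count, frame_count)
```
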
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## 6. Consistency |
| 135 | +**All users see the same valid data state.** |
| 136 | + |
| 137 | +Consistency provides a unified view during concurrent access: |
| 138 | +- **Isolation levels** control transaction visibility |
| 139 | +- **Locking mechanisms** prevent conflicting updates |
| 140 | +- **ACID properties** guarantee reliable state transitions |
| 141 | + |
| 142 | +**Example:** Two researchers inserting experiments simultaneously don't create duplicates. |
| 143 | + |
| 144 | +**Covered in:** |
| 145 | +- [Concurrency](../operations/050-concurrency.ipynb) — Multi-user operations |
| 146 | +- [Transactions](../operations/045-transactions.ipynb) — ACID guarantees |
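
The duplicate-experiment example can be approximated with two separate connections to one database file. This is a simplified sketch: the two "researchers" take turns rather than running truly in parallel, so it shows only that the uniqueness rule holds across connections, not how locking or isolation levels behave under real contention. The `claim` helper and the `experiment` table are hypothetical.

```python
import os
import sqlite3
import tempfile

# Two independent connections to the same file play the role of two researchers.
db_path = os.path.join(tempfile.mkdtemp(), "lab.db")
alice = sqlite3.connect(db_path)
bob = sqlite3.connect(db_path)

alice.execute(
    "CREATE TABLE experiment (experiment_id INTEGER PRIMARY KEY, owner TEXT NOT NULL)"
)
alice.commit()


def claim(conn, experiment_id, owner):
    """Try to register an experiment; return True if this connection succeeded."""
    try:
        with conn:  # one transaction per attempt
            conn.execute(
                "INSERT INTO experiment VALUES (?, ?)", (experiment_id, owner)
            )
        return True
    except sqlite3.IntegrityError:
        return False


alice_won = claim(alice, 1, "alice")
bob_won = claim(bob, 1, "bob")  # same ID: rejected by the primary key

# Both connections see the same single committed row.
rows_seen_by_alice = alice.execute("SELECT * FROM experiment").fetchall()
rows_seen_by_bob = bob.execute("SELECT * FROM experiment").fetchall()
print(alice_won, bob_won, rows_seen_by_bob)

alice.close()
bob.close()
```
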
| 147 | + |
| 148 | +# The Power of Declarative Constraints |
| 149 | + |
| 150 | +Unlike application-level validation (checking rules in Python code), database constraints are: |
| 151 | + |
| 152 | +1. **Always enforced** — Cannot be bypassed by any application |
| 153 | +2. **Automatically checked** — No developer implementation needed |
| 154 | +3. **Concurrent-safe** — Work correctly with multiple users |
| 155 | +4. **Self-documenting** — Schema explicitly declares rules |
| 156 | +5. **Performance-optimized** — Database engine enforces efficiently |
| 157 | + |
| 158 | +**Example Contrast:** |
| 159 | + |
| 160 | +```python |
| 161 | +# Application-level (fragile) |
| 162 | +if mouse_id not in existing_mice: |
| 163 | +    raise ValueError("Invalid mouse ID") |
| 164 | +# Can be bypassed by other applications |
| 165 | + |
| 166 | +# Database-level (robust) |
| 167 | +# RecordingSession.mouse → FOREIGN KEY → Mouse.mouse_id |
| 168 | +# Automatically enforced for all applications |
| 169 | +``` |
| 170 | + |
| 171 | +# DataJoint's Approach to Integrity |
| 172 | + |
| 173 | +DataJoint builds on SQL's integrity mechanisms with additional features: |
| 174 | + |
| 175 | +- **Automatic foreign keys** from table dependencies |
| 176 | +- **Cascading deletes** that respect data pipelines |
| 177 | +- **Transaction management** for atomic operations |
| 178 | +- **Schema validation** that catches errors before database creation |
| 179 | +- **Entity relationships** expressed in intuitive Python syntax |
| 180 | + |
| 181 | +As you progress through the following chapters, you'll see how DataJoint implements each integrity type through concise, expressive table declarations. |
| 182 | + |
| 183 | +--- |
| 184 | + |
| 185 | +```{admonition} Next Steps |
| 186 | +:class: tip |
| 187 | + |
| 188 | +Now that you understand *why* integrity matters, the following chapters show *how* to implement each constraint type: |
| 189 | + |
| 190 | +1. **[Tables](015-table.ipynb)** — Basic structure with domain integrity |
| 191 | +2. **[Primary Keys](025-primary-key.md)** — Entity integrity through unique identification |
| 192 | +3. **[Foreign Keys](035-foreign-keys.ipynb)** — Referential integrity across tables |
| 193 | + |
| 194 | +Each chapter builds on these foundational integrity concepts. |
| 195 | +``` |