expand Create Schemas

dimitri-yatsenko · dimitri-yatsenko · commit 2c02fcab293c · 2025-01-12T12:23:56.000-06:00
diff --git a/book/30-schema-design/010-schema.ipynb b/book/30-schema-design/010-schema.ipynb
@@ -46,8 +46,7 @@
     "This modular approach:\n",
     "* Separates tables into logical groups for better organization.\n",
     "* Avoids naming conflicts in large databases with multiple schemas.\n",
-    "\n",
-    "For more details on designing multi-schema databases, refer to the section on multi-schema designs."
+    "\n"
    ]
   },
   {
@@ -102,10 +101,133 @@
    ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": []
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Working with Multi-Schema Databases\n",
+    "\n",
+    "Organizing larger databases into multiple smaller schemas (or modules) enhances clarity, modularity, and maintainability. In DataJoint, schemas serve as namespaces that group related tables together, while Python modules provide a corresponding organizational structure for the database code.\n",
+    "\n",
+    "## Convention: One Database Schema = One Python Module\n",
+    "\n",
+    "DataJoint projects are typically organized with each database schema mapped to a single Python module (`.py` file). This convention:\n",
+    "\n",
+    "* Promotes modularity by grouping all tables of a schema within one module.\n",
+    "* Ensures clarity by maintaining a single schema object per module.\n",
+    "* Avoids naming conflicts and simplifies dependency management.\n",
+    "\n",
+    "Each module declares its own schema object and defines all associated tables. Downstream schemas explicitly import upstream schemas to manage dependencies.\n",
+    "\n",
+    "## Dependency Management and Acyclic Design\n",
+    "\n",
+    "In multi-schema databases, dependencies between tables and schemas must form a Directed Acyclic Graph (DAG). Cyclic dependencies are not allowed. This ensures:\n",
+    "* Foreign key constraints maintain logical order without forming loops.\n",
+    "* Python module imports align with the dependency structure of the database.\n",
+    "\n",
+    "**Key Principles**:\n",
+    "1. Tables can reference each other within a schema or across schemas using foreign keys.\n",
+    "2. Dependencies should be topologically sorted, ensuring upstream schemas are imported into downstream schemas.\n",
+    "\n",
+    "# Advantages of Multi-Schema Design\n",
+    "1. **Modularity**: Each schema focuses on a specific aspect of the pipeline (e.g., acquisition, processing, analysis).\n",
+    "2. **Separation of Concerns**: Clear boundaries between schemas simplify navigation and troubleshooting.\n",
+    "3. **Scalability**: Isolated schemas enable easier updates and scaling as projects grow.\n",
+    "4. **Collaboration**: Teams can work on separate modules independently without conflicts.\n",
+    "5. **Maintainability**: Modular design facilitates version control and debugging.\n",
+    "\n",
+    "# Defining Complex Databases with Multiple Schemas in DataJoint\n",
+    "\n",
+    "In DataJoint, defining **multiple schemas across separate Python modules** ensures that large, complex projects remain well-organized, modular, and maintainable. Each schema should be defined in a **dedicated Python module** to adhere to best practices. This structure ensures that every module maintains **only one `schema` object**, and **downstream schemas import upstream schemas** to manage dependencies correctly. This approach improves code clarity, enables better version control, and simplifies collaboration across teams.\n",
+    "\n",
+    "The database schema and its Python module usually have similar names, although they need not be identical. \n",
+    "\n",
+    "Tables can form foreign key dependencies both within modules and across modules.\n",
+    "In DataJoint, such dependencies must be acyclic within each schema: they cannot form closed loops, so the graph of dependencies is a DAG (directed acyclic graph).\n",
+    "The modules themselves also form a directed acyclic graph at a higher level: Python modules must never import each other cyclically, and their database schemas must be topologically ordered in the same way, so that no table has a foreign key into a table in a downstream schema.\n",
+    "\n",
+    "\n",
+    "## Why Use Multiple Schemas in Separate Modules?\n",
+    "\n",
+    "Using multiple schemas across separate modules offers the following benefits:\n",
+    "\n",
+    "1. **Modularity and Code Organization**: Each module contains only the tables relevant to a specific schema, making the codebase easier to manage and navigate.\n",
+    "2. **Clear Boundaries Between Schemas**: Ensures a separation of concerns, where each schema focuses on a specific aspect of the pipeline (e.g., acquisition, processing, analysis).\n",
+    "3. **Dependency Management**: Downstream schemas explicitly **import upstream schemas** to manage table dependencies and data flow.\n",
+    "4. **Collaboration**: Multiple developers or teams can work on separate modules without conflicts.\n",
+    "5. **Scalability and Maintainability**: Isolating schemas into modules simplifies future updates and troubleshooting.\n",
+    "\n",
+    "\n",
+    "## How to Structure Modules for Multiple Schemas\n",
+    "\n",
+    "Below is an example that demonstrates how to organize multiple schemas in separate Python modules.\n",
+    "\n",
+    "### Example Project Structure\n",
+    "\n",
+    "In this layout, each module defines one schema and depends only on the modules listed above it:\n",
+    "\n",
+    "```\n",
+    "my_pipeline/\n",
+    "│\n",
+    "├── subject.py      # Defines subject_management schema\n",
+    "├── acquisition.py  # Defines acquisition schema (depends on subject_management)\n",
+    "├── processing.py   # Defines processing schema (depends on acquisition)\n",
+    "└── analysis.py     # Defines analysis schema (depends on processing)\n",
+    "```\n",
+    "\n",
+    "## Step-by-Step Example\n",
+    "\n",
+    "1. `subject.py`:\n",
+    "   * Defines the `subject_management` schema.\n",
+    "   * Contains the `Subject` table and related entities.\n",
+    "2. `acquisition.py`:\n",
+    "   * Defines the `acquisition` schema.\n",
+    "   * Depends on `subject_management` for subject-related data.\n",
+    "3. `processing.py`:\n",
+    "   * Defines the `processing` schema.\n",
+    "   * Depends on `acquisition` for data to process.\n",
+    "4. `analysis.py`:\n",
+    "   * Defines the `analysis` schema.\n",
+    "   * Depends on `processing` for processed data to analyze.\n",
+    "\n",
+    "By adhering to these principles, large projects remain modular, scalable, and easy to maintain.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[0;32mimport\u001b[0m \u001b[0mdatajoint\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mdj\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0;31m# Define the subject management schema\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0mschema\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSchema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"subject_management\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0;34m@\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m\u001b[0;32mclass\u001b[0m \u001b[0mSubject\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mManual\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\n",
+      "\u001b[0;34m\u001b[0m    \u001b[0mdefinition\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"\"\"\u001b[0m\n",
+      "\u001b[0;34m    subject_id : int\u001b[0m\n",
+      "\u001b[0;34m    ---\u001b[0m\n",
+      "\u001b[0;34m    subject_name : varchar(50)\u001b[0m\n",
+      "\u001b[0;34m    species : varchar(50)\u001b[0m\n",
+      "\u001b[0;34m    \"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pycat code/subject.py"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -146,7 +268,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.14"
+   "version": "3.11.10"
   }
  },
  "nbformat": 4,