work on normalization

dimitri-yatsenko · dimitri-yatsenko · commit ef582c92a5b2 · 2025-09-18T21:32:04.000-05:00
diff --git a/book/30-schema-design/045-normalization.ipynb b/book/30-schema-design/045-normalization.ipynb
@@ -4,93 +4,35 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Database Normalization \n",
+    "# Database Normalization\n",
     "\n",
-    "# Normalization Principle\n",
+    "**Database normalization** is a set of principles for designing databases with clarity and logical rigor. A normalized design makes explicit how real-world entities map to their representations in the database.\n",
     "\n",
-    "A fundamental principle in database design is **normalization**—the practice of organizing data to minimize redundancy and dependency. One key aspect of normalization is that **each table should represent one distinct entity class**.\n",
+    "## Core Principle\n",
     "\n",
-    "## Why Separate Entity Types?\n",
+    "The fundamental principle of normalization is that **each table should represent one distinct entity class**.\n",
+    "\n",
+    "```{note}\n",
+    "In a normalized design, each row of a given table describes a distinct entity, and no two rows in that table represent different types of entities.\n",
+    "```\n",
+    "\n",
+    "## Why Normalization Matters\n",
     "\n",
     "Different entity types have different:\n",
     "- **Identification systems**: How they are uniquely identified\n",
-    "- **Attributes**: What properties they have\n",
+    "- **Attributes**: What properties they have  \n",
     "- **Relationships**: How they connect to other entities\n",
     "- **Business rules**: What constraints apply to them\n",
     "\n",
-    "## Example: Pet Shop Database\n",
-    "\n",
-    "Consider designing a database for a pet shop. You might be tempted to put everything in one table:\n",
-    "\n",
-    "```sql\n",
-    "-- BAD DESIGN: Mixing different entity types\n",
-    "CREATE TABLE pet_shop_data (\n",
-    "    id INT PRIMARY KEY,\n",
-    "    name VARCHAR(50),\n",
-    "    type VARCHAR(20),  -- 'cat', 'dog', 'employee', 'customer'\n",
-    "    breed VARCHAR(30),\n",
-    "    salary DECIMAL(10,2),\n",
-    "    phone VARCHAR(20),\n",
-    "    address TEXT\n",
-    ");\n",
-    "```\n",
-    "\n",
-    "This design violates the normalization principle because it mixes:\n",
-    "- **Pets** (cats, dogs) with attributes like breed\n",
-    "- **Employees** with attributes like salary\n",
-    "- **Customers** with attributes like phone and address\n",
-    "\n",
-    "## Better Design: Separate Tables\n",
-    "\n",
-    "```sql\n",
-    "-- GOOD DESIGN: Separate entity types\n",
-    "CREATE TABLE pet (\n",
-    "    pet_id INT PRIMARY KEY,\n",
-    "    name VARCHAR(50) NOT NULL,\n",
-    "    species ENUM('cat', 'dog', 'bird', 'fish') NOT NULL,\n",
-    "    breed VARCHAR(30),\n",
-    "    birth_date DATE,\n",
-    "    owner_id INT\n",
-    ");\n",
-    "\n",
-    "CREATE TABLE employee (\n",
-    "    employee_id INT PRIMARY KEY,\n",
-    "    name VARCHAR(50) NOT NULL,\n",
-    "    position VARCHAR(30) NOT NULL,\n",
-    "    salary DECIMAL(10,2) NOT NULL,\n",
-    "    hire_date DATE NOT NULL\n",
-    ");\n",
-    "\n",
-    "CREATE TABLE customer (\n",
-    "    customer_id INT PRIMARY KEY,\n",
-    "    name VARCHAR(50) NOT NULL,\n",
-    "    phone VARCHAR(20),\n",
-    "    email VARCHAR(100),\n",
-    "    address TEXT\n",
-    ");\n",
-    "```\n",
-    "\n",
-    "Each table now represents a distinct entity class with appropriate attributes and identification systems.\n",
-    "\n",
-    "**Database normalization** is a set of principles for designing databases with clarity and logical rigor. \n",
-    "Normalized designs communicate the mapping between real-world entities and their representations in database design. \n",
-    "\n",
-    "The term database normalization derives from relational database theory: \n",
-    "It applies to a data model where all data are represented as collections of related tables. \n",
-    "It may not apply equally to other data models.\n",
-    "\n",
-    "```{note}\n",
-    "In a normalized design, each row of a given table describes a distinct entity, and no two rows in that table represent different types of entities.\n",
-    "```\n",
+    "## Key Requirements\n",
     "\n",
-    "The table name (and its documentation) must clearly indicate what entity type is represented by the table's rows. We follow the convention whereby the table name must describe in singular form what each row represents. Thus a table describing database users might be named `User`.\n",
-    "Each table must have a primary key: the attributes that uniquely identify each entity in the table and in the real world. \n",
-    "Besides the primary key, each table may have secondary attributes. The secondary attributes must directly describe the entities of the table's class. In fully-normalized designs, the secondary attributes apply to each entity.\n",
+    "1. **Clear Entity Representation**: The table name must clearly indicate what entity type is represented by the table's rows (using singular form)\n",
+    "2. **Primary Key**: Each table must have a primary key that uniquely identifies each entity\n",
+    "3. **Relevant Attributes**: Secondary attributes must directly describe the entities of the table's class\n",
+    "4. **No Mixed Entities**: Avoid mixing different entity types in the same table\n",
     "\n",
-    "## Example of unnormalized designs\n",
-    "SQL does not enforce normalization and most database tutorials are full of unnormalized designs. For example, SQL allows defining tables with no primary key, which allows storing duplicate entries. DataJoint table definition syntax presumes the existence of a primary key: one must only indicate the separation between the primary attributes comprising the primary key and the secondary attributes. \n",
-    "Leaving SQL behind, I will show a few unnormalized designs using DataJoint table definition notation and then normalize the design. \n",
-    "For example, consider a table for representing items in a shopping cart for an e-commerce site.\n",
+    "## Example: E-commerce Shopping Cart\n",
+    "Let's examine a common unnormalized design using DataJoint notation. Consider a table for representing items in a shopping cart for an e-commerce site:\n",
     "\n",
     "```\n",
     ":: ShoppingCart\n",
@@ -115,30 +57,55 @@
     "The typical novice mistake is to put too much information in the same table, mixing information about different entities in the same table. This table contains information describing multiple entities: orders, items, and buyers, all in one. \n",
     "How would you fix this design?\n",
     "\n",
-    "## Fixing it\n",
-    "Database normalization requires splitting unnormalized tables into multiple tables where each table describes its separate entity type. We separate the representations of the order, items, and items in the order. We will also establish dependencies between them.\n",
-    "Then we describe items that might be included in different orders. We will assume that the item price is specific for each order and will omit it from the item table. Then the only secondary field is `item_description`. \n",
+    "## Normalized Solution\n",
     "\n",
+    "Database normalization requires splitting this table into multiple tables, each representing a distinct entity type:\n",
+    "\n",
+    "### 1. Item Table\n",
+    "(DataJoint)\n",
+    "```python\n",
+    "@schema\n",
+    "class Item(dj.Manual):\n",
+    "    definition = \"\"\"\n",
+    "    item : int\n",
+    "    ---\n",
+    "    item_description : varchar(1000)\n",
+    "    \"\"\"\n",
     "```\n",
-    "::Item \n",
-    "item : int\n",
-    "---\n",
-    "item_description : varchar(1000)\n",
+    "(Equivalent SQL)\n",
+    "```sql\n",
+    "CREATE TABLE item (\n",
+    "    item INT,\n",
+    "    item_description VARCHAR(1000) NOT NULL,\n",
+    "    PRIMARY KEY (item)\n",
+    ");\n",
     "```\n",
     "\n",
-    "Then let's represent the general information about the order, not pertaining to each item:\n",
-    "```\n",
-    "::Order\n",
-    "order_number : int\n",
-    "---\n",
-    "purchase_date : date\n",
-    "buyer_full_name : varchar(16)\n",
-    "buyer_address : varchar(1000)\n",
-    "buyer_email : varchar(120)\n",
-    "total_amount : numeric(8, 2)\n",
+    "### 2. Order Table\n",
+    "\n",
+    "(DataJoint)\n",
+    "```python\n",
+    "@schema\n",
+    "class Order(dj.Manual):\n",
+    "    definition = \"\"\"\n",
+    "    order_number : int\n",
+    "    ---\n",
+    "    purchase_date : date\n",
+    "    buyer_full_name : varchar(16)\n",
+    "    buyer_address : varchar(1000)\n",
+    "    buyer_email : varchar(120)\n",
+    "    total_amount : numeric(8, 2)\n",
+    "    \"\"\"\n",
     "```\n",
     "\n",
-    "Finally, we specify the items in the order in a separate table, OrderItem . This table associates each item, its price and quantity, to the order.\n",
+    "(Equivalent SQL)\n",
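+    "```sql\n",
+    "CREATE TABLE `order` (\n",
+    "    order_number INT PRIMARY KEY,\n",
+    "    purchase_date DATE NOT NULL,\n",
+    "    buyer_full_name VARCHAR(16) NOT NULL,\n",
+    "    buyer_address VARCHAR(1000) NOT NULL,\n",
+    "    buyer_email VARCHAR(120) NOT NULL,\n",
+    "    total_amount NUMERIC(8, 2) NOT NULL\n",
+    ");\n",
+    "```\n",
+    "\n",
+    "### 3. OrderItem Table (Junction Table)\n",
+    "\n",
+    "We specify the items in the order in a separate table, `OrderItem`. This table associates each item, along with its price and quantity, with the order.\n",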
     "\n",
     "```\n",
     "::OrderItem\n",
@@ -162,28 +129,30 @@
     "item_quantity : int\n",
     "```\n",
     "\n",
-    "We can now plot the schema diagram:\n",
+    "## Benefits of Normalized Design\n",
     "\n",
-    "## Is the normalized design better?\n",
+    "The normalized design provides several advantages:\n",
     "\n",
-    "## Relationship to the classical normal forms "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import datajoint as dj\n",
-    "schema = dj.schema('dimitri_university')"
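+    "1. **Eliminates Redundancy**: Each item description is stored once, and buyer information is stored once per order rather than once per item in the order\n",
+    "2. **Ensures Consistency**: A change to an item description is made in one place and is seen by every order that references the item\n",
+    "3. **Prevents Anomalies**: Inserting, updating, or deleting one entity cannot leave related records inconsistent\n",
+    "4. **Improves Performance**: Smaller tables with focused indexes, although queries that span entities now require joins\n",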
+    "5. **Enhances Maintainability**: Clear separation of concerns\n",
+    "\n",
+    "## Classical Normal Forms\n",
+    "\n",
+    "The normalization process follows specific rules called **normal forms**:\n",
+    "\n",
+    "- **First Normal Form (1NF)**: Eliminate repeating groups and ensure atomic values\n",
+    "- **Second Normal Form (2NF)**: Remove partial dependencies (all non-key attributes depend on the entire primary key)\n",
+    "- **Third Normal Form (3NF)**: Remove transitive dependencies (non-key attributes don't depend on other non-key attributes)\n",
+    "\n",
+    "The DataJoint approach encourages these principles by design: every table definition must declare a primary key, which makes many unnormalized designs harder to create."
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": []
   }
  ],
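
The benefits claimed for the normalized `Item` / `Order` / `OrderItem` split can be checked with a small sketch. It is not part of the notebook and makes several assumptions: it uses Python's built-in `sqlite3` as a stand-in for the MySQL backend behind DataJoint, the table name `order_item` and the column name `price` are hypothetical (the full `OrderItem` definition is not visible in this diff), and the buyer columns are left out for brevity.

```python
import sqlite3

# In-memory SQLite database; a stand-in for the real backend (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE item (
    item INTEGER PRIMARY KEY,
    item_description VARCHAR(1000) NOT NULL
);
CREATE TABLE "order" (                -- quoted because ORDER is a reserved word
    order_number INTEGER PRIMARY KEY,
    purchase_date DATE NOT NULL
);
CREATE TABLE order_item (             -- hypothetical name for the junction table
    order_number INTEGER NOT NULL REFERENCES "order" (order_number),
    item INTEGER NOT NULL REFERENCES item (item),
    price NUMERIC NOT NULL,           -- hypothetical column name
    item_quantity INTEGER NOT NULL,
    PRIMARY KEY (order_number, item)
);
""")

conn.execute("INSERT INTO item VALUES (1, 'blue mug')")
conn.executemany(
    'INSERT INTO "order" VALUES (?, ?)',
    [(100, "2025-09-01"), (101, "2025-09-02")],
)
conn.executemany(
    "INSERT INTO order_item VALUES (?, ?, ?, ?)",
    [(100, 1, 9.50, 2), (101, 1, 8.75, 1)],
)

# The description lives in one row, so one UPDATE is seen by every order.
conn.execute("UPDATE item SET item_description = 'blue ceramic mug' WHERE item = 1")
rows = conn.execute("""
    SELECT order_number, item_description, item_quantity
    FROM order_item JOIN item USING (item)
    ORDER BY order_number
""").fetchall()

# The composite primary key rejects a second row for the same (order, item) pair.
try:
    conn.execute("INSERT INTO order_item VALUES (100, 1, 9.50, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows)
print(duplicate_rejected)
```

Keeping the price in the junction table follows the notebook's stated assumption that an item's price is specific to each order, which is why the two orders can record different prices for the same item.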