4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# Auto-Populate\n", |
| 7 | + "# Computation as Workflow\n", |
8 | 8 | "\n", |
9 | | - "**This is a draft, work in progress**\n", |
| 9 | + "**Draft — subject to revision**\n", |
10 | 10 | "\n", |
11 | | - "DataJoint’s superpower is declarative computation via dependencies: tables express what depends on what, and `populate()` orchestrates how/when to compute.\n", |
| 11 | + "DataJoint reframes databases as *workflows*: each table advertises what it depends on, and the `populate()` method runs only the computations that are still missing. The [Blob-detection Pipeline](../80-examples/075-blob-detection.ipynb) from the examples chapter shows how this plays out in practice and meets a core demand of scientific reproducibility: a re-runnable path from primary data to presentable results.\n", |
12 | 12 | "\n", |
13 | | - "The `populate` mechanism in DataJoint is a cornerstone for automating data processing within pipelines. It enables users to execute computations for derived tables systematically, ensuring that all required data is processed, stored, and remains consistent with the upstream dependencies.\n", |
| 13 | + "## From Declarative Schema to Executable Pipeline\n", |
14 | 14 | "\n", |
15 | | - "## Overview of `populate`\n", |
| 15 | + "A DataJoint schema mixes several table roles:\n", |
16 | 16 | "\n", |
17 | | - "Derived tables in DataJoint are typically declared as `Computed` or `Imported` tables. These tables depend on upstream tables and are populated by executing computations that generate their content. The `populate` mechanism automates this process by:\n", |
| 17 | + "- **Manual / lookup tables** capture authoritative inputs and configuration options.\n", |
| 18 | + "- **Computed tables** declare derived data and embed the logic that produces it.\n", |
| 19 | + "- **Part tables** attach one-to-many detail that should always be inserted atomically with their parent.\n", |
18 | 20 | "\n", |
19 | | - "1. Identifying unprocessed entries in the upstream dependencies.\n", |
20 | | - "2. Executing the computation logic defined in the `make` method of the table.\n", |
21 | | - "3. Inserting the resulting data into the derived table.\n", |
| 21 | + "Because dependencies are explicit, `populate()` can explore the graph top-down: for every upstream key that has not been processed, it executes the table’s `make()` method; if anything fails, the transaction is rolled back.\n", |
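| | + "\n", |
| | + "The scheduling rule can be sketched in plain Python without a database. This is a toy analogy of the decision `populate()` makes, not DataJoint's implementation; the keys and values are made up:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "# Toy sketch of populate()'s rule: run make() for exactly the\n", |
| | + "# upstream keys that have no result yet; skip everything else.\n", |
| | + "upstream_keys = [{'image_id': 1}, {'image_id': 2}, {'image_id': 3}]\n", |
| | + "computed = [{'image_id': 1}]  # this key is already populated\n", |
| | + "\n", |
| | + "def make(key):\n", |
| | + "    return dict(key, result='done')\n", |
| | + "\n", |
| | + "todo = [k for k in upstream_keys if k not in computed]\n", |
| | + "results = [make(k) for k in todo]  # only keys 2 and 3 run\n", |
| | + "```\n", |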
22 | 22 | "\n", |
23 | | - "### Syntax\n", |
| 23 | + "## Case Study: Blob Detection\n", |
24 | 24 | "\n", |
25 | | - "```python\n", |
26 | | - "<Table>.populate(safemode=True, reserve_jobs=False, display_progress=False)\n", |
27 | | - "```\n", |
28 | | - "\n", |
29 | | - "### Parameters\n", |
30 | | - "\n", |
31 | | - "1. **`safemode`** *(default: True)*:\n", |
32 | | - " - Prompts for confirmation before populating the table.\n", |
33 | | - " - Set to `False` to skip the confirmation prompt.\n", |
34 | | - "2. **`reserve_jobs`** *(default: False)*:\n", |
35 | | - " - Enables reservation of jobs for distributed processing.\n", |
36 | | - "3. **`display_progress`** *(default: False)*:\n", |
37 | | - " - Displays a progress bar for monitoring the population process.\n", |
38 | | - "\n", |
39 | | - "## Declaring a Computed Table\n", |
40 | | - "\n", |
41 | | - "To use the `populate` mechanism, define a derived table with a `make` method. The `make` method contains the logic for processing and populating the table.\n", |
| 25 | + "The notebook `075-blob-detection.ipynb` assembles a compact image-analysis workflow:\n", |
42 | 26 | "\n", |
43 | | - "### Example\n", |
| 27 | + "1. **Store source imagery** – `Image` is a manual table with a `longblob` field. NumPy arrays loaded from `skimage`'s sample data are serialized automatically, illustrating the lecture’s warning that binary payloads need a serializer before they can be stored in a relational database.\n", |
| 28 | + "2. **Scan parameter space** – `BlobParamSet` is a lookup table of min/max sigma and threshold values for `skimage.feature.blob_doh`. Each combination represents an alternative experiment configuration, exactly the “experiment parameters” mindset stressed in class.\n", |
| 29 | + "3. **Compute detections** – `Detection` depends on both upstream tables. Its part table `Detection.Blob` holds every circle (x, y, radius) produced by the detector so that master and detail rows stay in sync.\n", |
44 | 30 | "\n", |
45 | 31 | "```python\n", |
46 | | - "import datajoint as dj\n", |
47 | | - "\n", |
48 | | - "schema = dj.Schema('example_schema')\n", |
49 | | - "\n", |
50 | 32 | "@schema\n", |
51 | | - "class Animal(dj.Manual):\n", |
| 33 | + "class Detection(dj.Computed):\n", |
52 | 34 | " definition = \"\"\"\n", |
53 | | - " animal_id: int # Unique identifier for the animal\n", |
| 35 | + " -> Image\n", |
| 36 | + " -> BlobParamSet\n", |
54 | 37 | " ---\n", |
55 | | - " species: varchar(64) # Species of the animal\n", |
56 | | - " age: int # Age of the animal in years\n", |
| 38 | + " nblobs : int\n", |
57 | 39 | " \"\"\"\n", |
58 | 40 | "\n", |
59 | | - "@schema\n", |
60 | | - "class AnimalSummary(dj.Computed):\n", |
61 | | - " definition = \"\"\"\n", |
62 | | - " -> Animal\n", |
63 | | - " ---\n", |
64 | | - " age_in_months: int # Age of the animal in months\n", |
65 | | - " \"\"\"\n", |
| 41 | + " class Blob(dj.Part):\n", |
| 42 | + " definition = \"\"\"\n", |
| 43 | + " -> master\n", |
| 44 | + " blob_id : int\n", |
| 45 | + " ---\n", |
| 46 | + " x : float\n", |
| 47 | + " y : float\n", |
| 48 | + " r : float\n", |
| 49 | + " \"\"\"\n", |
66 | 50 | "\n", |
67 | 51 | " def make(self, key):\n", |
68 | | - " # Fetch the source data\n", |
69 | | - " animal = (Animal & key).fetch1()\n", |
70 | | - " # Compute derived data\n", |
71 | | - " key['age_in_months'] = animal['age'] * 12\n", |
72 | | - " # Insert the result into the table\n", |
73 | | - " self.insert1(key)\n", |
74 | | - "\n", |
75 | | - "# Insert example data\n", |
76 | | - "Animal.insert([\n", |
77 | | - " {'animal_id': 1, 'species': 'Dog', 'age': 5},\n", |
78 | | - " {'animal_id': 2, 'species': 'Cat', 'age': 3}\n", |
79 | | - "])\n", |
80 | | - "\n", |
81 | | - "# Populate the AnimalSummary table\n", |
82 | | - "AnimalSummary.populate()\n", |
| 52 | + " img = (Image & key).fetch1(\"image\")\n", |
| 53 | + " params = (BlobParamSet & key).fetch1()\n", |
| 54 | + " blobs = blob_doh(img,\n", |
| 55 | + " min_sigma=params['min_sigma'],\n", |
| 56 | + " max_sigma=params['max_sigma'],\n", |
| 57 | + " threshold=params['threshold'])\n", |
| 58 | + " self.insert1(dict(key, nblobs=len(blobs)))\n", |
| | + "        # blob_doh returns rows as (row, col, sigma): y first, x second\n", |
| 59 | + "        self.Blob.insert(dict(key, blob_id=i, x=x, y=y, r=r)\n", |
| 60 | + "                         for i, (y, x, r) in enumerate(blobs))\n", |
83 | 61 | "```\n", |
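| | + "\n", |
| | + "As a quick sanity check outside the pipeline, the detector can be run on a synthetic image. This standalone sketch (independent of the notebook's data) also shows why unpacking order matters: `blob_doh` returns rows as `(row, col, sigma)`, y before x, with sigma approximating the radius:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "import numpy as np\n", |
| | + "from skimage.feature import blob_doh\n", |
| | + "\n", |
| | + "# One bright disk on a dark background, centered at row 40, col 60.\n", |
| | + "img = np.zeros((100, 100))\n", |
| | + "yy, xx = np.mgrid[:100, :100]\n", |
| | + "img[(yy - 40) ** 2 + (xx - 60) ** 2 <= 12 ** 2] = 1.0\n", |
| | + "\n", |
| | + "# Each returned row is (row, col, sigma): y first, x second.\n", |
| | + "blobs = blob_doh(img, min_sigma=5, max_sigma=30, threshold=0.005)\n", |
| | + "```\n", |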
84 | 62 | "\n", |
85 | | - "### Output\n", |
86 | | - "The `AnimalSummary` table will now contain:\n", |
| 63 | + "Running `Detection.populate(display_progress=True)` fans out over every `(image, paramset)` pair, creating six jobs in the demo notebook. Because each job runs in its own transaction, a failed job is rolled back and half-written results never leak: the atomicity guarantee from the lecture’s ACID recap.\n", |
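| | + "\n", |
| | + "That all-or-nothing behavior is not DataJoint-specific; any transactional engine provides it. A minimal stdlib `sqlite3` sketch (toy tables, not the notebook's schema) shows a failed job leaving nothing behind:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "import sqlite3\n", |
| | + "\n", |
| | + "db = sqlite3.connect(':memory:')\n", |
| | + "db.execute('CREATE TABLE detection (image_id INTEGER PRIMARY KEY, nblobs INT)')\n", |
| | + "db.execute('CREATE TABLE blob (image_id INT, blob_id INT, x REAL)')\n", |
| | + "\n", |
| | + "try:\n", |
| | + "    with db:  # one transaction: commit on success, roll back on exception\n", |
| | + "        db.execute('INSERT INTO detection VALUES (1, 2)')\n", |
| | + "        db.execute('INSERT INTO blob VALUES (1, 0, 10.5)')\n", |
| | + "        raise RuntimeError('simulated failure mid-make()')\n", |
| | + "except RuntimeError:\n", |
| | + "    pass\n", |
| | + "\n", |
| | + "# Neither the master row nor the part row survived the rollback.\n", |
| | + "masters = db.execute('SELECT COUNT(*) FROM detection').fetchone()[0]\n", |
| | + "parts = db.execute('SELECT COUNT(*) FROM blob').fetchone()[0]\n", |
| | + "```\n", |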
87 | 64 | "\n", |
88 | | - "```plaintext\n", |
89 | | - "animal_id | age_in_months\n", |
90 | | - "----------|---------------\n", |
91 | | - " 1 | 60\n", |
92 | | - " 2 | 36\n", |
93 | | - "```\n", |
94 | | - "\n", |
95 | | - "## Using `populate` with Restrictions\n", |
96 | | - "\n", |
97 | | - "The `populate` method can be restricted to process only specific entries.\n", |
| 65 | + "## Curate the Preferred Result\n", |
98 | 66 | "\n", |
99 | | - "### Example\n", |
100 | | - "\n", |
101 | | - "```python\n", |
102 | | - "# Populate only entries for a specific animal\n", |
103 | | - "AnimalSummary.populate({'animal_id': 1})\n", |
104 | | - "```\n", |
| 67 | + "After inspecting the plots, a small manual table `SelectDetection` records the “best” parameter set for each image. That drives a final visualization that renders only the chosen detections. This illustrates a common pattern for the final project: let automation explore the combinatorics, then capture human judgment in a concise manual table. In the presentation, this curated view is what you would surface through Dash, Streamlit, or another GUI toolkit.\n", |
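| | + "\n", |
| | + "The pattern can be sketched without a database. Here human judgment is mocked by a 'most blobs wins' rule; in the real workflow a person inspects the plots and inserts the choice into `SelectDetection` by hand:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "# Per-(image, paramset) results, as Detection would report them (made-up numbers).\n", |
| | + "detections = [\n", |
| | + "    {'image_id': 1, 'paramset': 1, 'nblobs': 4},\n", |
| | + "    {'image_id': 1, 'paramset': 2, 'nblobs': 9},\n", |
| | + "    {'image_id': 2, 'paramset': 1, 'nblobs': 7},\n", |
| | + "    {'image_id': 2, 'paramset': 2, 'nblobs': 3},\n", |
| | + "]\n", |
| | + "\n", |
| | + "# Stand-in for the curation step: keep one preferred paramset per image.\n", |
| | + "selection = {}\n", |
| | + "for row in detections:\n", |
| | + "    best = selection.get(row['image_id'])\n", |
| | + "    if best is None or row['nblobs'] > best['nblobs']:\n", |
| | + "        selection[row['image_id']] = row\n", |
| | + "\n", |
| | + "chosen = {img: row['paramset'] for img, row in selection.items()}\n", |
| | + "```\n", |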
105 | 68 | "\n", |
106 | | - "## Distributed Processing with `reserve_jobs`\n", |
107 | | - "\n", |
108 | | - "The `reserve_jobs` parameter facilitates distributed processing by reserving entries for parallel workers. This ensures that multiple workers do not process the same entry.\n", |
109 | | - "\n", |
110 | | - "### Example\n", |
111 | | - "\n", |
112 | | - "```python\n", |
113 | | - "AnimalSummary.populate(reserve_jobs=True)\n", |
114 | | - "```\n", |
| 69 | + "## Why It Matters for the Final Project\n", |
115 | 70 | "\n", |
116 | | - "## Best Practices\n", |
| 71 | + "- **Reproducibility** – rerunning `populate()` regenerates every derived table from raw inputs, satisfying the requirement for trustworthy analyses.\n", |
| 72 | + "- **Dependency-aware scheduling** – you do not need to script job order; DataJoint infers it from foreign keys, exactly as promised in lecture.\n", |
| 73 | + "- **Extensibility** – adding a new image or parameter set triggers only the necessary new jobs, so the pipeline scales to the “at least six tables” complexity target.\n", |
117 | 74 | "\n", |
118 | | - "1. **Define Robust `make` Methods**:\n", |
119 | | - " - Ensure that `make` handles all dependencies and edge cases.\n", |
120 | | - "2. **Use `populate` Incrementally**:\n", |
121 | | - " - Test your `make` logic with specific keys before populating the entire table.\n", |
122 | | - "3. **Monitor Progress**:\n", |
123 | | - " - Enable `display_progress` to track long-running population processes.\n", |
124 | | - "4. **Leverage Distributed Processing**:\n", |
125 | | - " - Use `reserve_jobs` for large-scale pipelines to distribute the workload.\n", |
126 | | - "5. **Restrict When Necessary**:\n", |
127 | | - " - Use restrictions to focus on specific entries during debugging or incremental processing.\n", |
| 75 | + "## Practical Tips\n", |
128 | 76 | "\n", |
129 | | - "## Summary\n", |
| 77 | + "- Develop `make()` logic with a restriction (e.g., `Detection.populate(key)` for a single key) before running it over the entire pipeline.\n", |
| 78 | + "- Use `display_progress=True` when you need visibility; use `reserve_jobs=True` when distributing work across multiple machines.\n", |
| 79 | + "- If your computed table writes both summary and detail rows, keep them in a part table so the transaction boundary protects them together.\n", |
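| | + "\n", |
| | + "The reservation idea behind `reserve_jobs=True` can be sketched as a shared claim set that workers consult before computing. This toy runs in one process; DataJoint's real jobs table lives in the database so claims work across machines:\n", |
| | + "\n", |
| | + "```python\n", |
| | + "pending = [{'image_id': i} for i in range(6)]\n", |
| | + "reserved = set()\n", |
| | + "\n", |
| | + "def claim(key):\n", |
| | + "    k = tuple(sorted(key.items()))\n", |
| | + "    if k in reserved:\n", |
| | + "        return False  # another worker already owns this key\n", |
| | + "    reserved.add(k)\n", |
| | + "    return True\n", |
| | + "\n", |
| | + "def worker(name, limit=None):\n", |
| | + "    out = []\n", |
| | + "    for key in pending:\n", |
| | + "        if limit is not None and len(out) >= limit:\n", |
| | + "            break  # simulate a worker stopping early\n", |
| | + "        if claim(key):\n", |
| | + "            out.append((name, key['image_id']))\n", |
| | + "    return out\n", |
| | + "\n", |
| | + "a = worker('worker_a', limit=3)  # claims the first three keys\n", |
| | + "b = worker('worker_b')           # sweeps, claims only what is left\n", |
| | + "```\n", |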
130 | 80 | "\n", |
131 | | - "The `populate` mechanism in DataJoint automates the process of filling derived tables, ensuring consistent and efficient computation across your pipeline. By defining clear `make` methods and leveraging the flexibility of `populate`, you can streamline data processing workflows and maintain the integrity of your derived data.\n", |
132 | | - "\n" |
| 81 | + "The blob-detection notebook is a self-contained template: swap in your own raw data source, adjust the parameter search, and you have the skeleton for an end-to-end computational database ready to feed a dashboard demo on presentation day.\n" |
133 | 82 | ] |
134 | 83 | }, |
135 | 84 | { |
|