The `make` call of a master table first inserts the master entity and then inserts
the matching part entities in the part tables.
None of the entities become visible to other processes until the entire `make` call
completes, at which point they all become visible.

### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the
**three-part make** that separates the `make` method into three distinct phases.
This pattern is essential for maintaining database performance and data integrity
during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()     # Fetch
    result = expensive_computation(data)    # Compute (could take hours)
    self.insert1(dict(key, result=result))  # Insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases,
allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables"""
    fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1())
    return fetched_data  # must be a sequence, e.g., a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform the expensive computation (outside a transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return (computed_result,)  # must be a sequence, e.g., a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert the results into the current table"""
    result, = computed_result  # unpack the sequence returned by make_compute
    self.insert1(dict(key, result=result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence:

```python
# Step 1: Fetch data and compute outside the transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: Begin a transaction and verify data consistency (pseudocode)
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # data changed during computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are used only briefly, for data transfer
3. **Memory Management**: Fetched data can be processed and released during the computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**.
Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with the insertion if the data hasn't changed

This prevents the "phantom read" problem where source data changes during long computations,
ensuring that results remain consistent with their inputs.
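
The deep comparison can be pictured as hashing a serialized snapshot of the fetched data and comparing the fingerprints. The following is a minimal illustrative sketch, not DataJoint's actual implementation; `deep_hash` and the sample records are hypothetical:

```python
import hashlib
import pickle

def deep_hash(fetched_data):
    """Fingerprint a sequence of fetched records (assumes picklable data)."""
    return hashlib.sha256(pickle.dumps(fetched_data)).hexdigest()

# Two identical snapshots produce the same fingerprint...
snapshot1 = ({"image_id": 1, "pixels": [0, 1, 2]},)
snapshot2 = ({"image_id": 1, "pixels": [0, 1, 2]},)
assert deep_hash(snapshot1) == deep_hash(snapshot2)

# ...while any change in the source data is detected
changed = ({"image_id": 1, "pixels": [9, 1, 2]},)
assert deep_hash(snapshot1) != deep_hash(changed)
```

Comparing short hashes rather than the full records keeps the in-transaction verification cheap even when the fetched data is large.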

#### Implementation Details

The pattern is implemented using Python generators in the `AutoPopulate` class:

```python
def make(self, key):
    # Step 1: Fetch data from parent tables
    fetched_data = self.make_fetch(key)
    computed_result = yield fetched_data

    # Step 2: Compute if a result was not provided by the caller
    if computed_result is None:
        computed_result = self.make_compute(key, *fetched_data)
        yield computed_result

    # Step 3: Insert the computed result
    self.make_insert(key, *computed_result)
    yield
```

Therefore, it is also possible to implement the three-part make pattern by overriding the `make` method as a generator that yields the fetched data and the computed result, as above.
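
To see how such a generator is consumed, here is a self-contained sketch of a driver loop following the same protocol. `DemoTable` is a hypothetical stand-in for a computed table, and the real `AutoPopulate` driver additionally wraps the verification and insert steps in a transaction:

```python
class DemoTable:
    """Hypothetical table whose make() follows the generator protocol."""
    def __init__(self):
        self.rows = []          # stands in for the database table
        self.source = {1: 10}   # stands in for the parent table

    def make(self, key):
        fetched_data = (self.source[key],)            # phase 1: fetch
        computed_result = yield fetched_data
        if computed_result is None:                   # phase 2: compute
            computed_result = (fetched_data[0] * 2,)
            yield computed_result
        self.rows.append((key, computed_result[0]))   # phase 3: insert
        yield

table = DemoTable()
key = 1

# Outside any transaction: fetch, then compute
gen = table.make(key)
fetched1 = next(gen)        # runs the fetch phase, pauses at the first yield
computed = gen.send(None)   # computed_result is None, so the compute phase runs

# Inside a (conceptual) transaction: re-fetch and verify before inserting
gen = table.make(key)
fetched2 = next(gen)
if fetched1 == fetched2:    # data unchanged during the computation
    gen.send(computed)      # skips the compute phase, runs the insert phase

print(table.rows)  # [(1, 20)]
```

Because the second generator receives the precomputed result via `send`, the expensive compute phase is skipped inside the transaction, which is exactly what keeps the transaction short.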

#### Use Cases

This pattern is particularly valuable for:

- **Machine learning model training**: Hours-long training sessions
- **Image processing pipelines**: Large-scale image analysis
- **Statistical computations**: Complex statistical analyses
- **Data transformations**: ETL processes with heavy computation
- **Simulation runs**: Time-consuming simulations

#### Example: Long-Running Image Analysis

Here's an example of how to implement the three-part make pattern for a
long-running image analysis task:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make_fetch(self, key):
        """Fetch the image data and parameters needed for the analysis"""
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        return (image_data, params)  # pack fetched_data

    def make_compute(self, key, image_data, params):
        """Perform the expensive image analysis outside a transaction"""
        import time
        start_time = time.time()

        # Expensive computation that could take hours
        result = complex_image_analysis(image_data, params)
        processing_time = time.time() - start_time
        return result, processing_time

    def make_insert(self, key, analysis_result, processing_time):
        """Insert the analysis results"""
        self.insert1(dict(key,
                          analysis_result=analysis_result,
                          processing_time=processing_time))
```

The same effect may be achieved by overriding the `make` method itself as a generator function, using `yield` statements to return the fetched data and the computed result:

```python
@schema
class ImageAnalysis(dj.Computed):
    definition = """
    # Complex image analysis results
    -> Image
    ---
    analysis_result : longblob
    processing_time : float
    """

    def make(self, key):
        image_data = (Image & key).fetch1('image')
        params = (Params & key).fetch1('params')
        computed_result = yield (image_data, params)  # pack fetched_data

        if computed_result is None:
            # Expensive computation that could take hours
            import time
            start_time = time.time()
            result = complex_image_analysis(image_data, params)
            processing_time = time.time() - start_time
            computed_result = result, processing_time  # pack
            yield computed_result

        result, processing_time = computed_result  # unpack
        self.insert1(dict(key,
                          analysis_result=result,
                          processing_time=processing_time))
        yield  # yield control back to the caller
```
We expect that most users will prefer the three-part implementation over the generator-function form, since the latter is conceptually more complex.