Fix concurrent batches deadlock #12335
Open
QMalcolm wants to merge 12 commits into main from qmalcolm--11420-fix-concurrent-batches-hanging
+103 −17
Conversation
Keeping track of these things will allow us to avoid executing "too many" microbatch model runners in the upcoming commits.
…ads, rounded down
Each `MicrobatchModelRunner` is essentially a batch orchestrator, scheduling `MicrobatchBatchRunner` instances. In a multi-threaded environment, if the number of `MicrobatchModelRunner`s running was equal to the number of threads, the run would lock up, because there'd be no threads available for running batches. By limiting the number of running `MicrobatchModelRunner` instances, we avoid deadlock.
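A minimal sketch of that cap, purely illustrative (the helper name is hypothetical, not dbt-core's actual internals): with `threads` worker threads, at most half of them, rounded down and never fewer than one, may be occupied by model runners.

```python
# Hypothetical helper illustrating the cap described above; dbt-core's real
# implementation may differ in detail.
def max_concurrent_model_runners(threads: int) -> int:
    # Reserve roughly half the threads for batch execution,
    # but always allow at least one model runner.
    return max(1, threads // 2)


assert max_concurrent_model_runners(1) == 1
assert max_concurrent_model_runners(4) == 2
assert max_concurrent_model_runners(5) == 2  # rounded down
assert max_concurrent_model_runners(64) == 32
```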
…ilable
It turns out `dbt run` (no threading) and `dbt run --threads=1` are not the same. The former is synchronous execution; the latter attempts asynchronous execution (but only has 1 thread). In the latter case, when running `dbt run --threads=1`, a `MicrobatchModelRunner` would occupy the only available thread and not be able to run any `MicrobatchBatchRunner`s, causing deadlock. This change makes it so that, if we're doing asynchronous execution (even if there is only one thread), we only submit the batch for asynchronous execution if there are threads available. If there are no threads available, the `MicrobatchBatchRunner` gets run synchronously on the thread of the `MicrobatchModelRunner`. This conveniently also means that the `MicrobatchModelRunner` doesn't "just" orchestrate batches in an asynchronous environment, but will also synchronously execute batches when threads are maxed out.
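A rough sketch of that fallback, under the assumption that the runner can see how many worker threads are currently in use (`execute_batch`, `threads_in_use`, and `total_threads` are hypothetical names, not dbt-core's actual API):

```python
from concurrent.futures import ThreadPoolExecutor


def execute_batch(batch):
    """Stand-in for the work a MicrobatchBatchRunner does for one batch."""
    return f"ran {batch}"


def run_batch(pool: ThreadPoolExecutor, threads_in_use: int, total_threads: int, batch):
    if threads_in_use < total_threads:
        # A worker thread is free: hand the batch to the pool and wait on it.
        return pool.submit(execute_batch, batch).result()
    # Every worker is busy (for example, held by other model runners), so run
    # the batch synchronously on this thread instead of waiting forever.
    return execute_batch(batch)
```

The key point is that the decision to go asynchronous is made per batch at submission time, so `--threads=1` degrades to inline execution rather than hanging.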
Codecov Report
@@ Coverage Diff @@
## main #12335 +/- ##
=======================================
Coverage 91.35% 91.35%
=======================================
Files 203 203
Lines 25044 25063 +19
=======================================
+ Hits 22878 22896 +18
- Misses 2166 2167 +1
MichelleArk reviewed Jan 9, 2026
MichelleArk reviewed Jan 9, 2026
MichelleArk reviewed Jan 9, 2026
… node_ids
We weren't doing this :face-palm:. This meant we were overcounting how many "microbatch" nodes were being run. This didn't cause any failures/deadlock, but it did "slow down" the process by making it think more microbatch models were running than there really were.
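Illustratively, the bookkeeping looks something like the following sketch (hypothetical names such as `_running_microbatch_node_ids` and `run_microbatch_model`, not dbt-core's actual code): record the node_id when the runner starts, and always remove it when the runner finishes, so the count used for scheduling stays accurate.

```python
import threading

# Hypothetical bookkeeping sketch; dbt-core's real tracking differs in detail.
_running_microbatch_node_ids = set()
_lock = threading.Lock()


def run_microbatch_model(node_id: str, execute) -> None:
    with _lock:
        _running_microbatch_node_ids.add(node_id)
    try:
        execute()
    finally:
        # Without this removal, finished models still look "running", and the
        # scheduler believes more microbatch models are active than there
        # really are.
        with _lock:
            _running_microbatch_node_ids.discard(node_id)
```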
Resolves #11420
Problem
When microbatch models were being run concurrently (only possible with snowflake currently), people were experiencing deadlock 💀 😬 This happened when the number of distinct microbatch models being run reached the number of threads available. That is, say I have a project with ~1000 models, 100 of them microbatch models, and I'm running with 64 threads. If all of those threads were executing distinct microbatch models, you'd suddenly have deadlock. This is because there are `MicrobatchModelRunner`s and `MicrobatchBatchRunner`s. In a multi-threaded environment, each `MicrobatchModelRunner` takes up a thread and acts as an orchestrator of `MicrobatchBatchRunner`s, which are run on separate threads (except for the first and last batch, which are always run synchronously). So if you had 64 threads held by `MicrobatchModelRunner`s that were each trying to run 3+ batches, there were no threads left to execute the actual batches (`MicrobatchBatchRunner`s) 🤦🏻
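The shape of this deadlock can be reproduced with a toy thread pool (purely illustrative, not dbt code; a timeout is used so the demo terminates instead of hanging):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

THREADS = 2
pool = ThreadPoolExecutor(max_workers=THREADS)


def batch(i: int) -> int:
    time.sleep(0.05)  # pretend to do batch work
    return i


def orchestrator(name: str) -> str:
    # Each "model runner" submits its batches to the same pool it runs on...
    batch_futures = [pool.submit(batch, i) for i in range(3)]
    # ...and then waits for them. With every worker occupied by an
    # orchestrator, the batches never start. The timeout only exists so this
    # demo returns instead of hanging forever.
    done, not_done = wait(batch_futures, timeout=2)
    return f"{name}: {len(done)} batches finished, {len(not_done)} stuck"


# Fill every worker thread with an orchestrator, mirroring the failure mode.
model_futures = [pool.submit(orchestrator, f"model_{i}") for i in range(THREADS)]
for f in model_futures:
    print(f.result())
pool.shutdown()
```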
Solution

- Limit `MicrobatchModelRunner` threads to half the number of possible threads, rounded down (minimum 1)
- Allow `MicrobatchModelRunner`s to execute `MicrobatchBatchRunner`s synchronously when all threads are currently held by other processes

Of note, I had taken a prior approach to this. It was similar in philosophy, but implemented differently. I don't have that code anymore; it was in a stash that I blew away. That first stab unfortunately caused the second-to-last batch to always hang. I'm not sure what I did differently this time, and I unfortunately don't have the code to compare. This implementation, though, does not suffer from that problem 🎉
Testing
Unfortunately, we don't/can't have an integration test for this currently, because our integration test suite uses postgres and the only adapter that currently supports concurrent batch execution is snowflake. However, I did manually test this, for what it's worth.
My testing process was:

Without this change:
a. ✅ `dbt run --single-threaded` (will work)
b. ❌ `dbt run --threads=1` (will deadlock)
c. ❌ `dbt run --threads=2` (will deadlock)
d. ✅ `dbt run --threads=3` (will work)

With this change:
a. ✅ `dbt run --single-threaded` (will work)
b. ✅ `dbt run --threads=1` (will work)
c. ✅ `dbt run --threads=2` (will work)
d. ✅ `dbt run --threads=3` (will work)

Checklist