
feat: microbatch strategy #404


Open
canbekley wants to merge 15 commits into main from feat/microbatch-strategy

Conversation

@canbekley (Contributor) commented Jan 6, 2025

Adds support for the incremental "microbatch" strategy (#398) that comes with dbt-core 1.9.0 (#403).
Currently this only works for non-distributed incremental models, as dbt-core has hard-coded config.materialized = 'incremental' into its microbatching condition. Maybe this can be patched by the adapter, though?
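For illustration, a minimal sketch of what a microbatch model could look like with this adapter. The model name, column names, begin date, and source ref are made up; `event_time`, `begin`, and `batch_size` are the standard dbt-core 1.9 microbatch configs, and `unique_key` is what this implementation requires:

```sql
-- models/events_daily.sql (hypothetical model)
-- event_time slices the data into batches, begin sets the earliest batch
-- to (back)fill, and batch_size picks the window; unique_key is required
-- by this adapter since each batch runs as a delete+insert.
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_time',
        begin='2025-01-01',
        batch_size='day',
        unique_key='event_id'
    )
}}

select event_id, event_time, payload
from {{ ref('raw_events') }}
```

With microbatch, dbt filters upstream refs that declare an `event_time` to the current batch window automatically, so the model needs no `is_incremental()` block.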

Checklist

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG

@canbekley (Contributor, Author) commented:

@BentsiLeviav May I ask for a review on this? It would be an important addition to our pipelines.

@mshustov mshustov requested a review from BentsiLeviav January 20, 2025 16:51
@BentsiLeviav (Contributor) commented:

Hi @canbekley

I'll review this in the next few days.
This feature depends on #403, so feel free to answer there once you have time.
In addition, could you please add an indication of this feature to the README.md file? An explanation of how it was implemented would be appreciated.

@pheepa (Contributor) commented Feb 11, 2025

Hi, that's a great feature - thanks for implementing it!

I don't quite understand the logic behind requiring partition_by for this method if microbatching is handled via a series of delete_insert strategy operations.

Wouldn't it make more sense to either remove this restriction or implement it using insert_overwrite, similar to how it's done in dbt-bigquery? Reference.

In Snowflake, delete_insert is used without requiring partition_by: Reference.

Since ClickHouse struggles when handling a large number of complex mutations (in my experience), I assume it would be better to go with insert_overwrite.
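For comparison, a rough sketch of what one batch could look like under an insert_overwrite-style approach in ClickHouse. Table and column names are invented, and this is not the adapter's actual generated SQL:

```sql
-- insert_overwrite sketch: stage the batch, then swap the matching
-- partition. Requires the target to be partitioned along the batch
-- boundary (here: partitioned by toDate(event_time)).
CREATE TABLE events__staging AS events;

INSERT INTO events__staging
SELECT event_id, event_time, payload
FROM raw_events
WHERE toDate(event_time) = toDate('2025-02-10');

ALTER TABLE events REPLACE PARTITION '2025-02-10' FROM events__staging;
DROP TABLE events__staging;
```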

@canbekley canbekley force-pushed the feat/microbatch-strategy branch from 4f99c6e to 4041b02 Compare February 18, 2025 13:58
@canbekley (Contributor, Author) commented:

@pheepa I don't see where the microbatch strategy requires partition_by. You can also check the integration test model, which doesn't have a partition_by configuration but a required unique_key instead. I have refactored some of the existing incremental strategy validations by moving them to an adapter method, including some of the insert_overwrite validations. Maybe that part was confusing?

I opted for delete_insert here because I think it is more flexible: it doesn't require a partitioned table, and we can have arbitrary batch sizes independent of existing partition keys.
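Concretely, each microbatch under the delete_insert approach boils down to something like the following per batch window (again with invented names, and daily batches assumed):

```sql
-- delete+insert sketch: replace one batch window on any table,
-- partitioned or not, for an arbitrary batch size.
ALTER TABLE events DELETE
WHERE event_time >= toDateTime('2025-02-10 00:00:00')
  AND event_time <  toDateTime('2025-02-11 00:00:00');

INSERT INTO events
SELECT event_id, event_time, payload
FROM raw_events
WHERE event_time >= toDateTime('2025-02-10 00:00:00')
  AND event_time <  toDateTime('2025-02-11 00:00:00');
```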

@pheepa (Contributor) commented Feb 19, 2025

@canbekley I'm sorry, the refactoring of the incremental strategy validation confused me a little bit (though as an adapter method it makes more sense). It looks good to me now.
Let's see how delete_insert goes.

@canbekley (Contributor, Author) commented:

Hi @BentsiLeviav, I've added documentation and fixed an issue with older Python versions. Could you have a second look?

@canbekley (Contributor, Author) commented Mar 5, 2025

@BentsiLeviav Do you maybe have an idea how to circumvent this check in MicrobatchModelRunner._is_incremental(), which currently prevents distributed_incremental models from being run in microbatches?

@canbekley canbekley force-pushed the feat/microbatch-strategy branch from 69bc361 to 91fc08e Compare March 14, 2025 14:48
@maowerner commented:

Hi, is there any update on this? I believe this feature is waiting for review :)
