feat: Add configuration options for PK chunking to help with the initial sync of large/wide tables #52
Open
sunild wants to merge 5 commits into MeltanoLabs:main
Conversation
Add a `pk_chunking` boolean config option to use PK chunking from the start, instead of failing over to it when a job fails with a query timeout. This is useful when syncing large tables. The query-timeout failover behavior has been preserved, but it does not appear to work in all scenarios.
- replace the array of table names with a `tables` property
- add a `chunk_size` property to resolve issues with large tables
- add a `polling_sleep_time` property to specify how long to wait
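As a rough sketch of how the tap might consume these options (the function names and nested config shape below are illustrative assumptions based on the commit messages, not the actual implementation):

```python
# Hypothetical sketch: names, defaults, and config shape are assumptions
# based on the commit messages, not the actual tap-salesforce code.
DEFAULT_CHUNK_SIZE = 50000       # assumed default
DEFAULT_POLLING_SLEEP_TIME = 10  # assumed default, in seconds


def should_use_pk_chunking(stream_name, config):
    """Return True if this stream should start with PK chunking enabled,
    rather than waiting for a query-timeout failover."""
    pk_config = config.get("pk_chunking", {})
    return stream_name in pk_config.get("tables", [])


def pk_chunking_settings(config):
    """Pull chunk size and polling sleep time from config, with fallbacks."""
    pk_config = config.get("pk_chunking", {})
    return (
        pk_config.get("chunk_size", DEFAULT_CHUNK_SIZE),
        pk_config.get("polling_sleep_time", DEFAULT_POLLING_SLEEP_TIME),
    )
```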
sunild commented on Feb 7, 2024
```python
        return batch['batchInfo']['id']

    def _complete_batch(self, state, tap_stream_id, batch_id):
```
I added this method because, when a PK-chunked job was complete, the bookmark still had the JobID and BatchIDs properties in it. The JobID needs to be cleared from the state; otherwise, on the next sync interval, the tap would incorrectly think it was resuming a failed batch (even though the BatchIDs array was empty).
See: tap-salesforce/tap_salesforce/__init__.py, lines 453 to 455 in d802f68
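For context, here is a minimal sketch of what the method could look like, assuming the bookmark layout described above (the real implementation is at the lines referenced):

```python
def _complete_batch(self, state, tap_stream_id, batch_id):
    # Sketch only: drop the finished batch from the bookmark, and once no
    # batches remain, clear JobID so the next sync does not mistake the
    # completed PK-chunked job for a failed one to resume.
    bookmark = state["bookmarks"][tap_stream_id]
    batch_ids = bookmark.get("BatchIDs", [])
    if batch_id in batch_ids:
        batch_ids.remove(batch_id)
    if not batch_ids:
        bookmark.pop("JobID", None)
        bookmark.pop("BatchIDs", None)
    return state
```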
I ran into two problems when syncing tables with the Bulk API.
The Note table in one of our SF accounts must be associated with a large number of other objects, because we get this error:

`InvalidBatch : Failed to process query: OPERATION_TOO_LARGE: exceeded 20000 distinct ids`

Addressing the first problem
We found it helpful to enable PK chunking for specific tables from the start, rather than waiting for the tap to fail over to PK chunking after a query timeout. The tables in question are fairly large and have many columns, and we have to use PK chunking with them elsewhere.
The existing code is supposed to fail over to PK chunking when a query timeout occurs, but it didn't in our case. I believe the behavior of the API may have changed, based on what the original tap-salesforce is doing now (and that seems outdated too: 15 vs. 30 retries).
I'd like to submit a separate PR to address the failover problem.
Addressing the second one
We got the "OPERATION_TOO_LARGE" error both with and without PK chunking. One of the suggested fixes is to issue smaller queries, and using a smaller chunk size avoided the error.
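For reference, the Bulk API enables PK chunking through a job-creation header, and the chunk size rides along on that header. Here is a rough sketch of such a request (the endpoint version, helper name, and defaults are illustrative, not the tap's actual code):

```python
import requests


def create_pk_chunked_job(instance_url, session_id, sobject, chunk_size=50000):
    """Illustrative sketch: create a Bulk API 1.0 query job with PK chunking.

    A smaller chunkSize means each batch queries a narrower id range,
    which is what avoided the OPERATION_TOO_LARGE error for us.
    """
    headers = {
        "X-SFDC-Session": session_id,
        "Content-Type": "application/json",
        "Sforce-Enable-PKChunking": f"chunkSize={chunk_size}",
    }
    body = {"operation": "query", "object": sobject, "contentType": "JSON"}
    resp = requests.post(
        f"{instance_url}/services/async/52.0/job", headers=headers, json=body
    )
    resp.raise_for_status()
    return resp.json()["id"]
```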
Configuration options for PK chunking
I'll happily submit PRs to update the docs. I added 4 config options, described in the commits above: `pk_chunking`, `tables`, `chunk_size`, and `polling_sleep_time`.
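Assuming the nested shape implied by the commit messages (the exact schema and table names here are illustrative), a config enabling PK chunking for a couple of tables might look like:

```json
{
  "pk_chunking": {
    "tables": ["Note", "Account"],
    "chunk_size": 50000,
    "polling_sleep_time": 10
  }
}
```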
Intent
This is intended to be used during an initial sync for specific problematic tables, and then disabled for subsequent runs.