Batch Workbench Uploads #7578
base: main
Did some profiling to determine which parts of the upload/validate pipeline take the most time per row. Timings were measured on the first 1000 rows of the cash upload dataset.
After adding caching for apply_scoping, throughput improved by about 2x, from about 500 rows per minute to about 1000 rows per minute. Aiming for a 5x to 10x improvement if possible. Working now on speeding up the sections in the process_row function. It doesn't lend itself well to batching, so I'm exploring multiple solutions and adding more fine-grained profiling.
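To illustrate why caching a row-independent step can roughly halve per-row cost, here is a minimal sketch using `functools.lru_cache`. It is not the actual `apply_scoping` code: `scoped_plan`, its arguments, and the `calls` counter are hypothetical stand-ins, and the sketch assumes the scoping result depends only on hashable inputs that repeat across rows.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive work actually runs


@lru_cache(maxsize=None)
def scoped_plan(collection_id: int, table_name: str) -> tuple:
    """Hypothetical stand-in for an expensive scoping step that does not depend on the row."""
    calls["count"] += 1
    return (collection_id, table_name.lower())


def process_rows(rows, collection_id: int, table_name: str) -> list:
    results = []
    for row in rows:
        # Computed once for the first row, then served from the cache.
        plan = scoped_plan(collection_id, table_name)
        results.append((plan, row))
    return results


rows = [{"n": i} for i in range(1000)]
out = process_rows(rows, 4, "CollectionObject")
```

The cache only helps if the arguments are identical from row to row; anything row-specific has to stay outside the cached function.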
Added code that can use bulk_insert on applicable rows. The full validation of the cash Workbench dataset of 321,216 records took 50 minutes to complete. So, in terms of rows per minute, we've gone from 500 to 1,000, and now to about 6,000, which is roughly a 10x speed increase in the cash example. Still need to look into which types of rows can safely be used with bulk_insert and which should not, to avoid possible issues. Also looking into implementing bulk_update for other situations. There are also some possible speedups that might work for the binding and matching sections of the code.
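The idea of bulk-inserting only the applicable rows can be sketched as below. This is an assumption-laden illustration, not the PR's code: it uses the standard library's `sqlite3` and `executemany` as a stand-in for the bulk insert, and `is_bulk_eligible`, the `item` table, and the `parent` field are hypothetical. Rows that fail the eligibility check fall back to a per-row insert, and the queue is flushed first so insertion order is preserved.

```python
import sqlite3


def is_bulk_eligible(row: dict) -> bool:
    # Hypothetical rule: rows that reference a parent record need per-row handling.
    return row.get("parent") is None


def upload(conn: sqlite3.Connection, rows, batch_size: int = 500) -> int:
    pending = []  # parameter tuples waiting for the next bulk insert
    inserted = 0

    def flush() -> None:
        nonlocal inserted
        if pending:
            conn.executemany("INSERT INTO item (name) VALUES (?)", pending)
            inserted += len(pending)
            pending.clear()

    for row in rows:
        if is_bulk_eligible(row):
            pending.append((row["name"],))
            if len(pending) >= batch_size:
                flush()
        else:
            flush()  # write queued rows first so row order is preserved
            conn.execute("INSERT INTO item (name) VALUES (?)", (row["name"],))
            inserted += 1
    flush()  # write whatever is left in the final partial batch
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
rows = [{"name": f"r{i}", "parent": (1 if i % 100 == 0 else None)} for i in range(1200)]
total = upload(conn, rows)
```

Flushing before each per-row insert keeps the behavior equivalent to the unbatched loop, at the cost of smaller batches when ineligible rows are frequent.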
Fixes #7577
Edits the workbench upload code to batch rows together during workbench uploads and validation. This is intended to speed up workbench uploads and validations. The batch size still needs tuning to find the value that works best for workbench uploads. Also still working on adjusting the progress bar update for the batch upload code.
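One way to keep the progress bar moving with batched uploads is to report progress once per batch rather than once per row. The sketch below shows that pattern under stated assumptions: `batched`, `upload_in_batches`, and the `report` callback are hypothetical names, the per-batch work is a placeholder, and the batch size of 250 is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of up to `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def upload_in_batches(
    rows: Iterable[dict],
    total: int,
    batch_size: int,
    report: Callable[[int, int], None],
) -> int:
    done = 0
    for batch in batched(rows, batch_size):
        # Placeholder for the real per-batch upload/validation work.
        done += len(batch)
        report(done, total)  # one progress update per batch
    return done


progress = []
done = upload_in_batches(
    ({"n": i} for i in range(1050)),
    total=1050,
    batch_size=250,
    report=lambda d, t: progress.append((d, t)),
)
```

Larger batches mean fewer progress updates, so the batch size affects how smooth the progress bar looks as well as throughput.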
A good test case that has been causing problems with slow uploads is this one: https://drive.google.com/file/d/1Mpr_KWMkCY74_yZv_knXiNGeG6TSKCYk/view?usp=drive_link
This file has many fields, so here is a data mapping I made for testing purposes: https://drive.google.com/file/d/1eo56GKwGbMXV7luGD_SJ24b-ADxFb53X/view?usp=drive_link
Checklist
self-explanatory (or properly documented)
Testing instructions