
Conversation

@acwhite211
Member

Fixes #7577

Edits the workbench upload code to batch rows together during workbench uploads and validation. This is intended to speed up workbench uploads and validations. The batch size still needs tuning to find what works best for workbench uploads, and the progress bar updates for the batch upload code are still being adjusted. A rough sketch of the batching idea is included below the test-data links.

A good test case that has been causing slow-upload problems is this one: https://drive.google.com/file/d/1Mpr_KWMkCY74_yZv_knXiNGeG6TSKCYk/view?usp=drive_link
There are many fields in this file; here is a data mapping I made for testing purposes: https://drive.google.com/file/d/1eo56GKwGbMXV7luGD_SJ24b-ADxFb53X/view?usp=drive_link
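Conceptually, the change accumulates pending rows and writes them in groups rather than one at a time. The sketch below is only illustrative, assuming a hypothetical RowBatcher helper and a flush callback that performs the grouped write; it is not the actual workbench upload API.

```python
# Hypothetical sketch of row batching: accumulate pending rows and flush them
# in groups instead of issuing one write per row. Names are illustrative only.
from typing import Callable, Generic, List, TypeVar

Row = TypeVar("Row")

class RowBatcher(Generic[Row]):
    def __init__(self, flush: Callable[[List[Row]], None], batch_size: int = 500):
        self.flush = flush            # e.g. a function that bulk-inserts the batch
        self.batch_size = batch_size  # needs tuning, as noted above
        self.pending: List[Row] = []

    def add(self, row: Row) -> None:
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush_pending()

    def flush_pending(self) -> None:
        # Call once more at the end of the dataset to write any remainder.
        if self.pending:
            self.flush(self.pending)
            self.pending.clear()
```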

Checklist

  • Self-review the PR after opening it to make sure the changes look good and
    self-explanatory (or properly documented)
  • Add relevant issue to release milestone
  • Add PR to documentation list
  • Add automated tests
  • Add a reverse migration if a migration is present in the PR

Testing instructions

  • Run workbench validation on a large workbench dataset, see that it finishes within a few minutes.
  • Run workbench upload on a large workbench dataset, see that it finishes within a few minutes.

@acwhite211
Member Author

Did some profiling to determine which parts of the upload/validate pipeline take the most time for each row.

Here are the timing results for the first 1,000 rows of the cash upload dataset (a sketch of one way to collect such per-phase timings follows the list):

  • Total: 124.39 s
  • apply_scoping: 59.18 s (~47.6%)
  • process_row: 58.70 s (~47.2%)
  • bind result: 5.13 s (~4.1%)
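The numbers above can be collected with simple accumulator timers wrapped around each phase of the per-row loop. This is only a minimal sketch of that approach, not the actual profiling code; the call sites shown in comments are hypothetical.

```python
# Minimal per-phase timing sketch: accumulate elapsed time per named phase.
import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals: dict[str, float] = defaultdict(float)

@contextmanager
def timed(phase: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start

# Hypothetical call sites inside the per-row loop:
# with timed("apply_scoping"):
#     scoped_plan = apply_scoping(upload_plan, collection)
# with timed("process_row"):
#     result = process_row(scoped_plan, row)
# with timed("bind_result"):
#     bind_result(result)
```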

After adding caching for apply_scoping, the new timing results were:

  • Total: 64.59 s
  • process_row: 58.57 s (90.7%)
  • bind result: 4.84 s
  • apply_scoping: 0.14 s

So that gets us about a 2x improvement, from roughly 500 rows per minute to about 1,000 rows per minute. The goal is a 5x to 10x improvement if possible.
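The apply_scoping caching is essentially memoization: if the scoped upload plan does not change from row to row for a given collection, it can be computed once and reused. A hedged sketch of that idea, assuming the result is safe to key on the collection (the actual cache key and call shape in this PR may differ):

```python
# Memoization sketch for apply_scoping. Assumes the scoped plan depends only on
# the upload plan and the collection; if that does not hold, this cache is unsafe.
_scoping_cache: dict = {}

def apply_scoping_cached(upload_plan, collection):
    key = (id(upload_plan), collection.id)  # hypothetical cache key
    if key not in _scoping_cache:
        _scoping_cache[key] = upload_plan.apply_scoping(collection)  # hypothetical call
    return _scoping_cache[key]
```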

Working now on speeding up the sections in the process_row function. It does not lend itself well to batching, so I am exploring multiple solutions and adding more fine-grained profiling.

@acwhite211
Member Author

acwhite211 commented Dec 3, 2025

Added code that can use bulk_insert on applicable rows. The full validation of the cash Workbench dataset of 321,216 records took 50 minutes to complete. So, in terms of rows per minute, we've gone from 500 to 1,000, and now to about 6,000, which is roughly a 10x speed increase on the cash example. Still need to look into which types of rows can be used with bulk_insert, and which should not be, to avoid possible issues. Also looking into implementing bulk_update for other situations. There are also some possible speedups that might work for the binding and matching sections of the code.
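If the bulk_insert path sits on top of Django's ORM, the grouped write presumably looks something like bulk_create over a batch of model instances. The sketch below is an assumption-laden illustration, not this PR's actual code; the function, model, and field names are placeholders.

```python
# Sketch of grouped inserts via Django's bulk_create, which issues multi-row
# INSERTs per batch instead of one INSERT per row. Note that bulk_create skips
# per-instance save() logic and pre/post-save signals, which is one reason some
# row types may need to fall back to individual saves.
from django.db import transaction

def bulk_insert_rows(model, rows: list[dict], batch_size: int = 500):
    instances = [model(**values) for values in rows]
    with transaction.atomic():
        model.objects.bulk_create(instances, batch_size=batch_size)
    return instances
```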

