perf: TBB with memory growth control.#2356
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces TBB parallelization for HessianFactor operations to improve performance and adds a compile-time option to control TBB memory growth.
Changes:
- Adds a new
updateHessianmethod overload with column range parameters toGaussianFactorand its implementations to support parallelized block column updates - Implements TBB-based parallel updates in the
HessianFactormerge constructor for large matrices - Introduces
GTSAM_TBB_BOUNDED_MEMORY_GROWTHCMake option to disable parallel tree traversal when memory usage is a concern
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| gtsam/linear/GaussianFactor.h | Adds pure virtual updateHessian method with column range parameters |
| gtsam/linear/HessianFactor.h | Declares new updateHessian overload with column range |
| gtsam/linear/HessianFactor.cpp | Implements TBB parallelization and column-range updateHessian |
| gtsam/linear/JacobianFactor.h | Declares new updateHessian overload with column range |
| gtsam/linear/JacobianFactor.cpp | Implements column-range updateHessian |
| gtsam/slam/RegularImplicitSchurFactor.h | Implements stub throwing exception for new method |
| gtsam/base/SymmetricBlockMatrix.h | Adds setZeroColumns method for efficient column zeroing |
| gtsam/base/treeTraversal-inst.h | Conditionally disables parallel tree traversal based on memory flag |
| cmake/HandleGeneralOptions.cmake | Adds GTSAM_TBB_BOUNDED_MEMORY_GROWTH option |
| cmake/HandleTBB.cmake | Sets flag based on CMake option |
| gtsam/config.h.in | Adds configuration define for bounded memory flag |
| INSTALL.md | Documents the new CMake option and memory trade-offs |
| gtsam/linear/tests/testJacobianFactor.cpp | Adds test for column-range updateHessian |
| gtsam/linear/tests/testHessianFactor.cpp | Adds test for column-range updateHessian |
| gtsam/base/tests/testSymmetricBlockMatrix.cpp | Adds test for setZeroColumns |
Comments suppressed due to low confidence (1)
gtsam/linear/HessianFactor.h:324
- Corrected spelling of 'The' - should be 'The' instead of 'THe'.
* @param keys THe ordered vector of keys for the information matrix to be updated
dellaert
left a comment
There was a problem hiding this comment.
This is a pretty brilliant strategy to parallelize the update of the Hessian! It will be good to adopt this in the multi-frontal solver as well, where I have done this with local threads storage instead. But this avoids extra mallocs.
Many of the copilot comments are small but good. I'll merge when addressed.
c404ae5 to
1e72eb1
Compare
Thanks :) Fixed the copilot comments. I think we are ready for merging :) |
|
OK, I tool some time today this on my Linux machine (20 cores), and basically got zero difference. I also think there is no flag to enable this, right? AFTER: |
|
I do know it should make a difference, though - parallelizing the update was a big bump for the MFS as well. |
|
Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH? |
No. Does parallel update only happen if we set that ? |
At the moment, yes, it does require the flag. We can change it, but I assumed it would need extensive benchmarking, since in theory it could hurt performance if the parallel tree traversal is using all available threads. gtsam/gtsam/linear/HessianFactor.cpp Line 261 in 1e72eb1 |
|
When I set that flag, things get worse :-( I'm assuming it is because with 20 cores there is quite a bit of parallelism. So any gains you get from the parallel update are negated by the loss in multi-threading. |
|
FYI, if I do TBB and parallel update, I get: |
My main goal is to reduce memory usage and not to have to much of performance penalty https://github.com/tzvist/gtsam/blob/tzvist/benchmark/monitor_script.sh. before this commit we where compiling without tbb in order to avoid the memory exploding. Additionally i saw that i get best performance when i set tbb to use physical number of threads -1. |
|
I had some issues writing code that will do this especially when running inside a docker in K8s, in our fork of this repo i added And then i wrap 'void NonlinearOptimizer::defaultOptimize()' with this arena this is kind of a hack, but would you like me to add it to my MR? |
|
Let me discuss it with Fan on Monday and then we'll comment here. |
@dellaert please take a look at ea2efaf Improves optimizer performance on my test case when tbb enabled with Before: After: I think we should probably test it on other use cases, we can also create a flag only for this |
|
Hi @tzvist thank you for the PR, it looks amazing! I'm gonna implement a benchmark CI which will run the TBB benchmarks on different architectures tonight, and we can merge this immediately after :) |
ea2efaf to
7c8121d
Compare
|
@tzvist Could you rebase on develop? |
Introduce GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option (OFF by default) to disable parallel tree traversal when memory usage is a concern. Parallel tree traversal can cause significant memory growth (e.g., from ~4GB to ~12GB in tested scenarios). Changes: - Add GTSAM_TBB_BOUNDED_MEMORY_GROWTH option in HandleGeneralOptions.cmake - Set GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG when option is enabled - Conditionally disable parallel traversal in treeTraversal-inst.h
Parallelize the updateHessian operation when forming A'*A in the HessianFactor merge constructor. For large matrices (>50 rows), the work is split across columns using TBB parallel_for, with each thread updating a disjoint set of block columns. Key changes: - Add column-range overload of updateHessian() to GaussianFactor, HessianFactor, JacobianFactor, and RegularImplicitSchurFactor - Add setZeroColumns() method to SymmetricBlockMatrix for efficient column-wise zeroing using memset In my toy example it reducded updateHessian time from ~72s to ~41s. (with 10 threads and GTSAM_TBB_BOUNDED_MEMORY_GROWTH=OFF)
Remove conditional check for GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG to always enable TBB parallelization when GTSAM_USE_TBB is defined. Improves optimizer performance on my test case with 12 cores by ~27% (117.2s -> 85.7s) by reducing iteration times from ~5-6s to ~3s. Before: ``` Running gtsam optimizer took 117.2392 seconds ``` After: ``` Running gtsam optimizer took 85.6852 seconds ```
7c8121d to
87098f7
Compare
|
/bench |
timeSFMBAL benchmark
Worker runs
|
|
/bench |
|
The macOS results seems to be very noisy, I reran the benchmarks it seems to fluctuate in ~1s ranges. Maybe this is related to how GitHub run its macOS workers. |
|
Hmmm - I was not ready to merge yet :-/ I think the benchmark was inconclusive: we need a larger dataset and comparison of tbb/no-tbb :-) But maybe we can at least do that after the fact? |
|
By the way, @ProfFan , On this PR, the benchmark results were not even mixed, right? They were all increasing in time! Or am I reading the results wrong? |
|
I don't think that multifrontal solver is effected by this change, looks like a lot of what we see is just noise.
|
|
5% difference is normal for GitHub runners (Linux) The macOS runners are funky and I need to figure out why. Might be that GitHub scheduled the tasks to different generation Macs |
This MR introduces TBB parallelization for HessianFactor operations and adds a compile-time option to control TBB memory growth.
Changes
1. TBB Parallelization for HessianFactor
updateHessianoperation when forming A'*A in the HessianFactor merge constructorparallel_for, with each thread updating a disjoint set of block columnsupdateHessiantime from ~72s to ~41s in tested scenarios (with 10 threads)2. Compile-time Option to Limit TBB Memory Growth
GTSAM_TBB_BOUNDED_MEMORY_GROWTHCMake option (OFF by default)