perf: TBB with memory growth control. #2356

Merged
ProfFan merged 3 commits into borglab:develop from tzvist:feature/tzvist/tbb-mem-opt
Feb 5, 2026

@tzvist
Contributor

@tzvist tzvist commented Jan 15, 2026

This MR introduces TBB parallelization for HessianFactor operations and adds a compile-time option to control TBB memory growth.

Changes

1. TBB Parallelization for HessianFactor

  • Parallelizes updateHessian operation when forming A'*A in the HessianFactor merge constructor
  • For large matrices (>50 rows), work is split across columns using TBB parallel_for, with each thread updating a disjoint set of block columns
  • Performance improvement: Reduced updateHessian time from ~72s to ~41s in tested scenarios (with 10 threads)

2. Compile-time Option to Limit TBB Memory Growth

  • Introduces GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option (OFF by default)
  • Disables parallel tree traversal when memory usage is a concern
  • Addresses significant memory growth observed in some scenarios (e.g., ~4GB to ~12GB)
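
The column split described in item 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `Matrix`, `updateHessianColumns`, and `updateHessian` are hypothetical stand-ins for GTSAM's block-matrix types, plain `std::thread` is swapped in for TBB `parallel_for` so the sketch builds without TBB, and the 50-row threshold is assumed to apply to the rows of A.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Dense row-major matrix; a hypothetical stand-in for GTSAM's block matrices.
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Accumulate the upper triangle of A'A into H, restricted to the columns
// [colBegin, colEnd). Different column ranges write disjoint entries of H.
void updateHessianColumns(const Matrix& A, Matrix& H,
                          std::size_t colBegin, std::size_t colEnd) {
  for (std::size_t j = colBegin; j < colEnd; ++j) {
    for (std::size_t i = 0; i <= j; ++i) {
      double sum = 0.0;
      for (std::size_t k = 0; k < A.rows; ++k) sum += A(k, i) * A(k, j);
      H(i, j) += sum;
    }
  }
}

// Serial for small inputs; otherwise one contiguous column range per thread.
void updateHessian(const Matrix& A, Matrix& H, unsigned numThreads) {
  const std::size_t kParallelRowThreshold = 50;  // ">50 rows" in the PR text
  if (A.rows <= kParallelRowThreshold || numThreads <= 1) {
    updateHessianColumns(A, H, 0, A.cols);
    return;
  }
  const std::size_t chunk = (A.cols + numThreads - 1) / numThreads;
  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < A.cols; begin += chunk) {
    const std::size_t end = std::min(begin + chunk, A.cols);
    workers.emplace_back(updateHessianColumns, std::cref(A), std::ref(H),
                         begin, end);
  }
  for (auto& w : workers) w.join();
}
```

Because each worker only writes the columns in its own range, no locking or per-thread scratch copy of H is needed, which is presumably why this approach avoids extra allocations.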

@tzvist tzvist changed the title perf: TBB parallelization for HessianFactor with memory growth control. perf: TBB with memory growth control. Jan 15, 2026
@dellaert dellaert requested a review from Copilot January 16, 2026 04:56
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces TBB parallelization for HessianFactor operations to improve performance and adds a compile-time option to control TBB memory growth.

Changes:

  • Adds a new updateHessian method overload with column range parameters to GaussianFactor and its implementations to support parallelized block column updates
  • Implements TBB-based parallel updates in the HessianFactor merge constructor for large matrices
  • Introduces GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option to disable parallel tree traversal when memory usage is a concern

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| gtsam/linear/GaussianFactor.h | Adds pure virtual updateHessian method with column range parameters |
| gtsam/linear/HessianFactor.h | Declares new updateHessian overload with column range |
| gtsam/linear/HessianFactor.cpp | Implements TBB parallelization and column-range updateHessian |
| gtsam/linear/JacobianFactor.h | Declares new updateHessian overload with column range |
| gtsam/linear/JacobianFactor.cpp | Implements column-range updateHessian |
| gtsam/slam/RegularImplicitSchurFactor.h | Implements stub throwing exception for new method |
| gtsam/base/SymmetricBlockMatrix.h | Adds setZeroColumns method for efficient column zeroing |
| gtsam/base/treeTraversal-inst.h | Conditionally disables parallel tree traversal based on memory flag |
| cmake/HandleGeneralOptions.cmake | Adds GTSAM_TBB_BOUNDED_MEMORY_GROWTH option |
| cmake/HandleTBB.cmake | Sets flag based on CMake option |
| gtsam/config.h.in | Adds configuration define for bounded memory flag |
| INSTALL.md | Documents the new CMake option and memory trade-offs |
| gtsam/linear/tests/testJacobianFactor.cpp | Adds test for column-range updateHessian |
| gtsam/linear/tests/testHessianFactor.cpp | Adds test for column-range updateHessian |
| gtsam/base/tests/testSymmetricBlockMatrix.cpp | Adds test for setZeroColumns |

Comments suppressed due to low confidence (1)

gtsam/linear/HessianFactor.h:324

  • Corrected spelling of 'The' - should be 'The' instead of 'THe'.
     * @param keys THe ordered vector of keys for the information matrix to be updated

Member

@dellaert dellaert left a comment

This is a pretty brilliant strategy to parallelize the update of the Hessian! It will be good to adopt this in the multi-frontal solver as well, where I have done this with thread-local storage instead. But this avoids extra mallocs.

Many of the copilot comments are small but good. I'll merge when addressed.

@tzvist tzvist force-pushed the feature/tzvist/tbb-mem-opt branch 4 times, most recently from c404ae5 to 1e72eb1 Compare January 17, 2026 14:47
@tzvist
Contributor Author

tzvist commented Jan 17, 2026

> This is a pretty brilliant strategy to parallelize the update of the Hessian! It will be good to adopt this in the multi-frontal solver as well, where I have done this with thread-local storage instead. But this avoids extra mallocs.
>
> Many of the copilot comments are small but good. I'll merge when addressed.

Thanks :)

Fixed the copilot comments. I think we are ready for merging :)

@dellaert
Member

OK, I took some time today to test this on my Linux machine (20 cores), and got basically zero difference. I also think there is no flag to enable this, right?
BEFORE:

Reading values file took 0.0048 seconds
Reading graph file took 0.0985 seconds
Processing factor lines took 1.6313 seconds
Processing values lines took 0.0575 seconds
Setting up optimizer took 4.6140 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.51
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.17
   1          inf         0.00       0.00      0       0.47
   1     80163.07       258.81       0.00      1       2.17
   2          inf         0.00       0.00      0       0.45
   2     80113.89        49.18       0.00      1       2.10
   3          inf         0.00       0.00      0       0.46
   3     80096.39        17.50       0.00      1       2.09
   4          inf         0.00       0.00      0       0.45
   4     80089.79         6.60       0.00      1       2.11
   5          inf         0.00       0.00      0       0.47
   5     80086.41         3.38       0.00      1       2.09
   6          inf         0.00       0.00      0       0.44
   6     80084.44         1.98       0.00      1       2.07
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       2.00
   8          inf         0.00       0.00      0       0.44
   8     80081.84         1.21       0.00      1       2.06
   9          inf         0.00       0.00      0       0.45
   9     80080.52         1.32       0.00      1       2.09
  10          inf         0.00       0.00      0       0.46
  10     80078.92         1.60       0.00      1       2.05
  11          inf         0.00       0.00      0       0.46
  11     80076.99         1.93       0.00      1       2.06
  12          inf         0.00       0.00      0       0.46
  12     80074.94         2.05       0.00      1       1.99
  13          inf         0.00       0.00      0       0.45
  13     80073.22         1.72       0.00      1       2.06
  14          inf         0.00       0.00      0       0.44
  14     80071.87         1.34       0.00      1       2.05
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       2.03
  16          inf         0.00       0.00      0       0.45
  16     80070.76         0.28       0.00      1       2.05
Running gtsam optimizer took 45.3219 seconds

AFTER:

Reading values file took 0.0055 seconds
Reading graph file took 0.1001 seconds
Processing factor lines took 1.6415 seconds
Processing values lines took 0.0577 seconds
Setting up optimizer took 4.5054 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.47
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.13
   1          inf         0.00       0.00      0       0.46
   1     80163.07       258.81       0.00      1       2.13
   2          inf         0.00       0.00      0       0.47
   2     80113.89        49.18       0.00      1       2.18
   3          inf         0.00       0.00      0       0.44
   3     80096.39        17.50       0.00      1       2.08
   4          inf         0.00       0.00      0       0.46
   4     80089.79         6.60       0.00      1       2.05
   5          inf         0.00       0.00      0       0.44
   5     80086.41         3.38       0.00      1       2.07
   6          inf         0.00       0.00      0       0.44
   6     80084.44         1.98       0.00      1       2.15
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       2.05
   8          inf         0.00       0.00      0       0.46
   8     80081.84         1.21       0.00      1       2.10
   9          inf         0.00       0.00      0       0.45
   9     80080.52         1.32       0.00      1       2.07
  10          inf         0.00       0.00      0       0.45
  10     80078.92         1.60       0.00      1       2.08
  11          inf         0.00       0.00      0       0.45
  11     80076.99         1.93       0.00      1       2.03
  12          inf         0.00       0.00      0       0.45
  12     80074.94         2.05       0.00      1       2.09
  13          inf         0.00       0.00      0       0.46
  13     80073.22         1.72       0.00      1       2.07
  14          inf         0.00       0.00      0       0.45
  14     80071.87         1.34       0.00      1       2.03
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       2.12
  16          inf         0.00       0.00      0       0.44
  16     80070.76         0.28       0.00      1       2.04
Running gtsam optimizer took 45.4542 seconds

@dellaert
Member

I do know it should make a difference, though - parallelizing the update was a big bump for the MFS as well.

@tzvist
Contributor Author

tzvist commented Jan 21, 2026

Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?

@dellaert
Member

> Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?

No. Does the parallel update only happen if we set that?

@tzvist
Contributor Author

tzvist commented Jan 21, 2026

> > Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?
>
> No. Does the parallel update only happen if we set that?

At the moment, yes, it does require the flag. We can change it, but I assumed it would need extensive benchmarking, since in theory it could hurt performance if the parallel tree traversal is using all available threads.

#if defined(GTSAM_USE_TBB) && defined(GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG)
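
To illustrate what that preprocessor condition implies, here is a minimal sketch. The two macro names are taken from the line above; `kParallelHessianUpdate` and `hessianUpdateMode` are hypothetical names introduced only for this example. With either macro undefined at compile time, only the serial branch exists in the binary, which would be consistent with the identical BEFORE/AFTER timings reported earlier.

```cpp
#include <string>

// Mirrors the condition quoted above. Both macros would normally be supplied
// by the build system (for example through a generated config header).
#if defined(GTSAM_USE_TBB) && defined(GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG)
constexpr bool kParallelHessianUpdate = true;
#else
constexpr bool kParallelHessianUpdate = false;
#endif

// Hypothetical helper: reports which path this translation unit was built with.
inline std::string hessianUpdateMode() {
  return kParallelHessianUpdate ? "parallel" : "serial";
}
```

Since the choice is made by the preprocessor, switching paths requires reconfiguring and rebuilding; it cannot be toggled at run time.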

@dellaert
Member

When I set that flag, things get worse :-(

Reading values file took 0.0081 seconds
Reading graph file took 0.1122 seconds
Processing factor lines took 1.9739 seconds
Processing values lines took 0.0689 seconds
Setting up optimizer took 4.7204 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.57
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.63
   1          inf         0.00       0.00      0       0.54
   1     80163.07       258.81       0.00      1       3.21
   2          inf         0.00       0.00      0       0.42
   2     80113.89        49.18       0.00      1       3.23
   3          inf         0.00       0.00      0       0.42
   3     80096.39        17.50       0.00      1       3.20
   4          inf         0.00       0.00      0       0.42
   4     80089.79         6.60       0.00      1       3.15
   5          inf         0.00       0.00      0       0.42
   5     80086.41         3.38       0.00      1       3.16
   6          inf         0.00       0.00      0       0.42
   6     80084.44         1.98       0.00      1       3.19
   7          inf         0.00       0.00      0       0.41
   7     80083.05         1.39       0.00      1       3.21
   8          inf         0.00       0.00      0       0.42
   8     80081.84         1.21       0.00      1       3.26
   9          inf         0.00       0.00      0       0.41
   9     80080.52         1.32       0.00      1       3.14
  10          inf         0.00       0.00      0       0.41
  10     80078.92         1.60       0.00      1       3.18
  11          inf         0.00       0.00      0       0.41
  11     80076.99         1.93       0.00      1       3.17
  12          inf         0.00       0.00      0       0.41
  12     80074.94         2.05       0.00      1       3.15
  13          inf         0.00       0.00      0       0.41
  13     80073.22         1.72       0.00      1       3.19
  14          inf         0.00       0.00      0       0.41
  14     80071.87         1.34       0.00      1       3.18
  15          inf         0.00       0.00      0       0.41
  15     80071.04         0.83       0.00      1       3.26
  16          inf         0.00       0.00      0       0.46
  16     80070.76         0.28       0.00      1       3.21
Running gtsam optimizer took 63.6851 seconds

I'm assuming it is because with 20 cores there is quite a bit of parallelism. So any gains you get from the parallel update are negated by the loss in multi-threading.

@dellaert
Member

FYI, if I do TBB and parallel update, I get:

Reading values file took 0.0080 seconds
Reading graph file took 0.1025 seconds
Processing factor lines took 1.6343 seconds
Processing values lines took 0.0577 seconds
Setting up optimizer took 4.5915 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.46
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       1.42
   1          inf         0.00       0.00      0       0.47
   1     80163.07       258.81       0.00      1       1.60
   2          inf         0.00       0.00      0       0.45
   2     80113.89        49.18       0.00      1       1.61
   3          inf         0.00       0.00      0       0.44
   3     80096.39        17.50       0.00      1       1.61
   4          inf         0.00       0.00      0       0.44
   4     80089.79         6.60       0.00      1       1.60
   5          inf         0.00       0.00      0       0.46
   5     80086.41         3.38       0.00      1       1.60
   6          inf         0.00       0.00      0       0.46
   6     80084.44         1.98       0.00      1       1.62
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       1.62
   8          inf         0.00       0.00      0       0.46
   8     80081.84         1.21       0.00      1       1.63
   9          inf         0.00       0.00      0       0.46
   9     80080.52         1.32       0.00      1       1.62
  10          inf         0.00       0.00      0       0.45
  10     80078.92         1.60       0.00      1       1.58
  11          inf         0.00       0.00      0       0.47
  11     80076.99         1.93       0.00      1       1.61
  12          inf         0.00       0.00      0       0.46
  12     80074.94         2.05       0.00      1       1.63
  13          inf         0.00       0.00      0       0.45
  13     80073.22         1.72       0.00      1       1.60
  14          inf         0.00       0.00      0       0.45
  14     80071.87         1.34       0.00      1       1.61
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       1.61
  16          inf         0.00       0.00      0       0.44
  16     80070.76         0.28       0.00      1       1.60
Running gtsam optimizer took 37.5500 seconds

@tzvist
Contributor Author

tzvist commented Jan 21, 2026

> When I set that flag, things get worse :-(
>
> I'm assuming it is because with 20 cores there is quite a bit of parallelism. So any gains you get from the parallel update are negated by the loss in multi-threading.

My main goal is to reduce memory usage without too much of a performance penalty (see https://github.com/tzvist/gtsam/blob/tzvist/benchmark/monitor_script.sh). Before this commit we were compiling without TBB to keep memory from exploding.

Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.

@dellaert
Member

> Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.

Could you add this change in this PR?

@tzvist
Contributor Author

tzvist commented Jan 24, 2026

> > Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.
>
> Could you add this change in this PR?

I had some issues writing code that does this, especially when running inside a Docker container in K8s. In our fork of this repo I added:


#ifdef GTSAM_USE_TBB
#include <oneapi/tbb/global_control.h>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Reads TBB_NUM_THREADS from the environment; returns nullopt if it is unset
// or cannot be parsed as an int.
static std::optional<int> read_tbb_num_threads() {
  if (const char* env = std::getenv("TBB_NUM_THREADS")) {
    try {
      return std::stoi(env);
    } catch (const std::logic_error&) {
      // std::stoi throws invalid_argument or out_of_range (both logic_errors)
      return std::nullopt;  // invalid value
    }
  }
  return std::nullopt;  // variable not set
}

// Returns a task_arena object that limits TBB parallelism.
static tbb::task_arena make_tbb_task_arena() {
  // oneAPI TBB removed support for TBB_NUM_THREADS but we would like it to keep
  // working
  const std::optional<int> num_threads = read_tbb_num_threads();
  const int default_threads = tbb::info::default_concurrency();
  const int max_threads = num_threads.value_or(default_threads);

  // we do minus one to avoid contention with the main thread
  const int max_concurrency = std::max(max_threads - 1, 1);

  const int max_allowed_parallelism =
      static_cast<int>(tbb::global_control::active_value(
          tbb::global_control::max_allowed_parallelism));
  static bool warned = false;

  // Warn once if the user did not set TBB_NUM_THREADS and TBB runs at its
  // default concurrency.
  if (!warned && !num_threads.has_value() &&
      default_threads == max_allowed_parallelism &&
      max_allowed_parallelism > 1) {
    std::cout << "GTSAM warning: TBB_NUM_THREADS not set. This can cause "
                 "significant slowdown. We recommend as a rule of thumb to "
                 "limit to number of physical cores, TBB using default: "
              << max_allowed_parallelism << std::endl;
    warned = true;
  }

  // Second argument: one slot reserved for the calling (main) thread.
  return tbb::task_arena(max_concurrency, 1);
}
#endif  // GTSAM_USE_TBB

I then wrap 'void NonlinearOptimizer::defaultOptimize()' in this arena. This is kind of a hack, but would you like me to add it to my MR?

@dellaert
Member

Let me discuss it with Fan on Monday and then we'll comment here.

@tzvist
Contributor Author

tzvist commented Feb 1, 2026

> Let me discuss it with Fan on Monday and then we'll comment here.

@dellaert please take a look at ea2efaf

Improves optimizer performance on my test case with TBB enabled and
12 cores by ~27% (117.2s -> 85.7s) by reducing
iteration times from ~5-6s to ~3s.

Before:

 ./build_RelWithDebInfo/simple_gtsam_deserialize
simple_gtsam_deserialize2.cpp
Reading values file took 0.0108 seconds
Reading graph file took 0.1702 seconds
Processing factor lines took 2.7537 seconds
Processing values lines took 0.1056 seconds
Setting up optimizer took 9.1390 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       1.47
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       6.98
   1          inf         0.00       0.00      0       0.89
   1     80163.07       258.81       0.00      1       6.23
   2          inf         0.00       0.00      0       1.07
   2     80113.89        49.18       0.00      1       6.51
   3          inf         0.00       0.00      0       1.09
   3     80096.39        17.50       0.00      1       6.75
   4          inf         0.00       0.00      0       1.10
   4     80089.79         6.60       0.00      1       5.43
   5          inf         0.00       0.00      0       0.86
   5     80086.41         3.38       0.00      1       5.08
   6          inf         0.00       0.00      0       0.89
   6     80084.44         1.98       0.00      1       5.02
   7          inf         0.00       0.00      0       0.88
   7     80083.05         1.39       0.00      1       5.06
   8          inf         0.00       0.00      0       0.88
   8     80081.84         1.21       0.00      1       5.31
   9          inf         0.00       0.00      0       0.89
   9     80080.52         1.32       0.00      1       5.14
  10          inf         0.00       0.00      0       0.90
  10     80078.92         1.60       0.00      1       5.20
  11          inf         0.00       0.00      0       0.87
  11     80076.99         1.93       0.00      1       5.27
  12          inf         0.00       0.00      0       0.90
  12     80074.94         2.05       0.00      1       5.25
  13          inf         0.00       0.00      0       0.88
  13     80073.22         1.72       0.00      1       5.23
  14          inf         0.00       0.00      0       0.88
  14     80071.87         1.34       0.00      1       5.08
  15          inf         0.00       0.00      0       0.87
  15     80071.04         0.83       0.00      1       5.09
  16          inf         0.00       0.00      0       0.90
  16     80070.76         0.28       0.00      1       5.22
Running gtsam optimizer took 117.2392 seconds

After:

./build_RelWithDebInfo/simple_gtsam_deserialize
simple_gtsam_deserialize2.cpp
Reading values file took 0.0124 seconds
Reading graph file took 0.1912 seconds
Processing factor lines took 3.0148 seconds
Processing values lines took 0.1227 seconds
Setting up optimizer took 10.6574 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       1.96
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       5.44
   1          inf         0.00       0.00      0       0.97
   1     80163.07       258.81       0.00      1       3.14
   2          inf         0.00       0.00      0       0.92
   2     80113.89        49.18       0.00      1       2.98
   3          inf         0.00       0.00      0       0.92
   3     80096.39        17.50       0.00      1       3.09
   4          inf         0.00       0.00      0       0.91
   4     80089.79         6.60       0.00      1       2.98
   5          inf         0.00       0.00      0       0.91
   5     80086.41         3.38       0.00      1       2.94
   6          inf         0.00       0.00      0       0.90
   6     80084.44         1.98       0.00      1       3.00
   7          inf         0.00       0.00      0       0.90
   7     80083.05         1.39       0.00      1       2.99
   8          inf         0.00       0.00      0       0.91
   8     80081.84         1.21       0.00      1       3.09
   9          inf         0.00       0.00      0       0.90
   9     80080.52         1.32       0.00      1       3.06
  10          inf         0.00       0.00      0       1.12
  10     80078.92         1.60       0.00      1       3.69
  11          inf         0.00       0.00      0       1.27
  11     80076.99         1.93       0.00      1       3.72
  12          inf         0.00       0.00      0       1.19
  12     80074.94         2.05       0.00      1       3.94
  13          inf         0.00       0.00      0       1.19
  13     80073.22         1.72       0.00      1       3.81
  14          inf         0.00       0.00      0       1.15
  14     80071.87         1.34       0.00      1       3.60
  15          inf         0.00       0.00      0       1.15
  15     80071.04         0.83       0.00      1       3.86
  16          inf         0.00       0.00      0       1.15
  16     80070.76         0.28       0.00      1       3.85
Running gtsam optimizer took 85.6852 seconds

I think we should probably test it on other use cases. We could also add a flag just for this.

@ProfFan
Collaborator

ProfFan commented Feb 2, 2026

Hi @tzvist thank you for the PR, it looks amazing! I'm gonna implement a benchmark CI which will run the TBB benchmarks on different architectures tonight, and we can merge this immediately after :)

@tzvist tzvist force-pushed the feature/tzvist/tbb-mem-opt branch from ea2efaf to 7c8121d Compare February 3, 2026 14:49
@ProfFan
Collaborator

ProfFan commented Feb 4, 2026

@tzvist Could you rebase on develop?

Introduce GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option (OFF by default)
to disable parallel tree traversal when memory usage is a concern.
Parallel tree traversal can cause significant memory growth (e.g., from
~4GB to ~12GB in tested scenarios).

Changes:
- Add GTSAM_TBB_BOUNDED_MEMORY_GROWTH option in HandleGeneralOptions.cmake
- Set GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG when option is enabled
- Conditionally disable parallel traversal in treeTraversal-inst.h

Parallelize the updateHessian operation when forming A'*A in the
HessianFactor merge constructor. For large matrices (>50 rows), the
work is split across columns using TBB parallel_for, with each thread
updating a disjoint set of block columns.

Key changes:
- Add column-range overload of updateHessian() to GaussianFactor,
  HessianFactor, JacobianFactor, and RegularImplicitSchurFactor
- Add setZeroColumns() method to SymmetricBlockMatrix for efficient
  column-wise zeroing using memset
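
As a rough illustration of why a single memset can zero a whole range of columns, here is a minimal sketch. It assumes dense column-major storage (so adjacent columns are contiguous in memory); `ColMajorMatrix` and this free-function `setZeroColumns` are hypothetical stand-ins, not the SymmetricBlockMatrix implementation from this PR.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Column-major dense matrix: column j occupies data[j*rows, (j+1)*rows).
struct ColMajorMatrix {
  std::size_t rows, cols;
  std::vector<double> data;
  ColMajorMatrix(std::size_t r, std::size_t c, double fill)
      : rows(r), cols(c), data(r * c, fill) {}
  double operator()(std::size_t i, std::size_t j) const {
    return data[j * rows + i];
  }
};

// Zero the columns [colBegin, colEnd) with one memset call.
// An all-zero byte pattern is 0.0 for IEEE-754 doubles.
void setZeroColumns(ColMajorMatrix& M, std::size_t colBegin,
                    std::size_t colEnd) {
  if (M.rows == 0 || colBegin >= colEnd || colEnd > M.cols) return;
  std::memset(M.data.data() + colBegin * M.rows, 0,
              (colEnd - colBegin) * M.rows * sizeof(double));
}
```

In a parallel update, each worker could clear only its own column range this way before accumulating into it, so zeroing is split across threads along with the rest of the work.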

In my toy example it reduced updateHessian time from ~72s to ~41s
(with 10 threads and GTSAM_TBB_BOUNDED_MEMORY_GROWTH=OFF).

Remove conditional check for GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG to
always enable TBB parallelization when GTSAM_USE_TBB is defined.

Improves optimizer performance on my test case with
12 cores by ~27% (117.2s -> 85.7s) by reducing
iteration times from ~5-6s to ~3s.

Before:
```
Running gtsam optimizer took 117.2392 seconds
```

After:
```
Running gtsam optimizer took 85.6852 seconds
```
@tzvist tzvist force-pushed the feature/tzvist/tbb-mem-opt branch from 7c8121d to 87098f7 Compare February 4, 2026 09:19
@ProfFan
Collaborator

ProfFan commented Feb 4, 2026

/bench

@github-actions

github-actions bot commented Feb 4, 2026

timeSFMBAL benchmark

  • Head: 87098f7ed8eef94f41da219aa5d1f913c396ca73
  • Base: 57bc31052de0e1010c274956c4957cd27875b890

| Runner | Metric | Base (s) | Head (s) | Delta (s) | Change |
| --- | --- | --- | --- | --- | --- |
| linux-arm64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky | 1.967130 | 2.080884 | +0.113755 | +5.78% |
| linux-arm64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver | 1.155350 | 1.223113 | +0.067764 | +5.87% |
| linux-x64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky | 2.173623 | 2.249035 | +0.075412 | +3.47% |
| linux-x64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver | 1.570863 | 1.587257 | +0.016394 | +1.04% |
| macos-arm64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky | 1.932898 | 3.511656 | +1.578758 | +81.68% |
| macos-arm64 | timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver | 1.559810 | 1.894650 | +0.334840 | +21.47% |

Worker runs

| Role | Runner | SHA | Conclusion |
| --- | --- | --- | --- |
| head | linux-x64 | 87098f7ed8eef94f41da219aa5d1f913c396ca73 | success |
| base | linux-x64 | 57bc31052de0e1010c274956c4957cd27875b890 | success |
| head | linux-arm64 | 87098f7ed8eef94f41da219aa5d1f913c396ca73 | success |
| base | linux-arm64 | 57bc31052de0e1010c274956c4957cd27875b890 | success |
| head | macos-arm64 | 87098f7ed8eef94f41da219aa5d1f913c396ca73 | success |
| base | macos-arm64 | 57bc31052de0e1010c274956c4957cd27875b890 | success |

@ProfFan
Collaborator

ProfFan commented Feb 4, 2026

/bench

@ProfFan
Collaborator

ProfFan commented Feb 5, 2026

The macOS results seem to be very noisy. I reran the benchmarks and they fluctuate within a ~1s range. Maybe this is related to how GitHub runs its macOS workers.

@ProfFan ProfFan merged commit 4a4b06c into borglab:develop Feb 5, 2026
34 checks passed
@dellaert
Member

dellaert commented Feb 5, 2026

Hmmm - I was not ready to merge yet :-/ I think the benchmark was inconclusive: we need a larger dataset and comparison of tbb/no-tbb :-) But maybe we can at least do that after the fact?

@dellaert
Member

dellaert commented Feb 5, 2026

By the way, @ProfFan, on this PR the benchmark results were not even mixed, right? They were all increasing in time! Or am I reading the results wrong?

@tzvist
Contributor Author

tzvist commented Feb 5, 2026

I don't think the multifrontal solver is affected by this change; a lot of what we see looks like just noise.

> By the way, @ProfFan, on this PR the benchmark results were not even mixed, right? They were all increasing in time! Or am I reading the results wrong?

@ProfFan
Collaborator

ProfFan commented Feb 5, 2026

5% difference is normal for GitHub runners (Linux)

The macOS runners are funky and I need to figure out why. Might be that GitHub scheduled the tasks to different generation Macs
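
For reference, the Change column in the timeSFMBAL table above can be recomputed and compared against the ~5% noise band mentioned here. This is a small sketch only: `BenchRow`, `changePercent`, `withinNoise`, and `benchRows` are hypothetical names, and treating exactly 5.0% as a hard cutoff is an assumption made for illustration.

```cpp
#include <cmath>
#include <string>
#include <vector>

struct BenchRow {
  std::string runner, solver;
  double base_s, head_s;  // wall-clock seconds for base and head commits
};

// Relative change in percent; positive means the head commit is slower.
inline double changePercent(const BenchRow& r) {
  return 100.0 * (r.head_s - r.base_s) / r.base_s;
}

// True if the change is within +/- noisePercent of the base time.
inline bool withinNoise(const BenchRow& r, double noisePercent = 5.0) {
  return std::fabs(changePercent(r)) <= noisePercent;
}

// The six rows reported by the benchmark bot in this thread.
inline std::vector<BenchRow> benchRows() {
  return {
      {"linux-arm64", "MultifrontalCholesky", 1.967130, 2.080884},
      {"linux-arm64", "MultifrontalSolver", 1.155350, 1.223113},
      {"linux-x64", "MultifrontalCholesky", 2.173623, 2.249035},
      {"linux-x64", "MultifrontalSolver", 1.570863, 1.587257},
      {"macos-arm64", "MultifrontalCholesky", 1.932898, 3.511656},
      {"macos-arm64", "MultifrontalSolver", 1.559810, 1.894650},
  };
}
```

With a strict 5.0 cutoff only the two linux-x64 rows fall inside the band. The linux-arm64 rows sit just above it at roughly 5.8%, and the macos-arm64 rows are far outside it.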

4 participants