perf: TBB with memory growth control. by tzvist · Pull Request #2356 · borglab/gtsam

tzvist · 2026-01-15T15:41:16Z

This MR introduces TBB parallelization for HessianFactor operations and adds a compile-time option to control TBB memory growth.

Changes

1. TBB Parallelization for HessianFactor

Parallelizes updateHessian operation when forming A'*A in the HessianFactor merge constructor
For large matrices (>50 rows), work is split across columns using TBB parallel_for, with each thread updating a disjoint set of block columns
Performance improvement: Reduced updateHessian time from ~72s to ~41s in tested scenarios (with 10 threads)
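
The column-splitting idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PR's code: `Dense`, `updateColumns`, and `parallelGram` are hypothetical names, the matrix is a plain dense array rather than GTSAM's block matrix, and `std::thread` stands in for TBB's `parallel_for` so the sketch has no external dependencies. The point it shows is that each worker writes a disjoint set of columns of the upper triangle of A'*A, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Dense row-major matrix; a stand-in for GTSAM's block matrices (assumption).
struct Dense {
  std::size_t rows, cols;
  std::vector<double> data;
  Dense(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Adds the upper-triangular part of A'A into H, restricted to columns
// [colBegin, colEnd). Hypothetical analogue of a column-range update.
void updateColumns(const Dense& A, Dense& H, std::size_t colBegin,
                   std::size_t colEnd) {
  for (std::size_t j = colBegin; j < colEnd; ++j) {
    for (std::size_t i = 0; i <= j; ++i) {
      double sum = 0.0;
      for (std::size_t k = 0; k < A.rows; ++k) sum += A(k, i) * A(k, j);
      H(i, j) += sum;
    }
  }
}

// Splits the columns of H into contiguous ranges, one per worker thread.
// Workers write disjoint columns, so they never touch the same element.
void parallelGram(const Dense& A, Dense& H, unsigned numThreads) {
  const std::size_t n = A.cols;
  numThreads = std::max(1u, numThreads);
  const std::size_t chunk = (n + numThreads - 1) / numThreads;
  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    const std::size_t end = std::min(n, begin + chunk);
    workers.emplace_back(updateColumns, std::cref(A), std::ref(H), begin, end);
  }
  for (auto& w : workers) w.join();
}
```

Note that equal-width column ranges do unequal work in the upper triangle (later columns have more rows to fill), which is one reason a work-stealing scheduler such as TBB's can be a better fit than the fixed split used here.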

2. Compile-time Option to Limit TBB Memory Growth

Introduces GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option (OFF by default)
Disables parallel tree traversal when memory usage is a concern
Addresses significant memory growth observed in some scenarios (e.g., ~4GB to ~12GB)
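
A compile-time switch of this kind can be pictured as follows. This is a hedged sketch, not the code in treeTraversal-inst.h: the macro `SKETCH_BOUNDED_MEMORY_GROWTH`, the `Node` type, and `sumTree` are all hypothetical, and `std::async` stands in for TBB tasks. With the macro defined, children are visited one at a time on the calling thread; without it, one task per child is launched, so more subtrees (and their intermediate results) can be in flight at once.

```cpp
#include <future>
#include <memory>
#include <vector>

// Hypothetical macro for this sketch; uncomment to select the bounded mode.
// #define SKETCH_BOUNDED_MEMORY_GROWTH

struct Node {
  int value = 0;
  std::vector<std::unique_ptr<Node>> children;
};

// Sums all values in the tree rooted at `node`.
int sumTree(const Node& node) {
  int total = node.value;
#ifdef SKETCH_BOUNDED_MEMORY_GROWTH
  // Bounded mode: sequential traversal, one subtree alive at a time.
  for (const auto& child : node.children) total += sumTree(*child);
#else
  // Default mode: one asynchronous task per child subtree.
  std::vector<std::future<int>> tasks;
  for (const auto& child : node.children) {
    const Node* c = child.get();
    tasks.push_back(std::async(std::launch::async, [c] { return sumTree(*c); }));
  }
  for (auto& t : tasks) total += t.get();
#endif
  return total;
}
```

Both modes return the same result; only the amount of concurrent work, and therefore the peak number of live intermediates, differs.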

Copilot

Pull request overview

This PR introduces TBB parallelization for HessianFactor operations to improve performance and adds a compile-time option to control TBB memory growth.

Changes:

Adds a new updateHessian method overload with column range parameters to GaussianFactor and its implementations to support parallelized block column updates
Implements TBB-based parallel updates in the HessianFactor merge constructor for large matrices
Introduces GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option to disable parallel tree traversal when memory usage is a concern

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
gtsam/linear/GaussianFactor.h	Adds pure virtual `updateHessian` method with column range parameters
gtsam/linear/HessianFactor.h	Declares new `updateHessian` overload with column range
gtsam/linear/HessianFactor.cpp	Implements TBB parallelization and column-range `updateHessian`
gtsam/linear/JacobianFactor.h	Declares new `updateHessian` overload with column range
gtsam/linear/JacobianFactor.cpp	Implements column-range `updateHessian`
gtsam/slam/RegularImplicitSchurFactor.h	Implements stub throwing exception for new method
gtsam/base/SymmetricBlockMatrix.h	Adds `setZeroColumns` method for efficient column zeroing
gtsam/base/treeTraversal-inst.h	Conditionally disables parallel tree traversal based on memory flag
cmake/HandleGeneralOptions.cmake	Adds `GTSAM_TBB_BOUNDED_MEMORY_GROWTH` option
cmake/HandleTBB.cmake	Sets flag based on CMake option
gtsam/config.h.in	Adds configuration define for bounded memory flag
INSTALL.md	Documents the new CMake option and memory trade-offs
gtsam/linear/tests/testJacobianFactor.cpp	Adds test for column-range `updateHessian`
gtsam/linear/tests/testHessianFactor.cpp	Adds test for column-range `updateHessian`
gtsam/base/tests/testSymmetricBlockMatrix.cpp	Adds test for `setZeroColumns`

Comments suppressed due to low confidence (1)

gtsam/linear/HessianFactor.h:324

Corrected spelling of 'The' - should be 'The' instead of 'THe'.

     * @param keys THe ordered vector of keys for the information matrix to be updated

dellaert

This is a pretty brilliant strategy to parallelize the update of the Hessian! It will be good to adopt this in the multi-frontal solver as well, where I have done this with thread-local storage instead. But this avoids extra mallocs.

Many of the copilot comments are small but good. I'll merge when addressed.

tzvist · 2026-01-17T15:41:41Z

This is a pretty brilliant strategy to parallelize the update of the Hessian! It will be good to adopt this in the multi-frontal solver as well, where I have done this with thread-local storage instead. But this avoids extra mallocs.

Many of the copilot comments are small but good. I'll merge when addressed.

Thanks :)

Fixed the copilot comments. I think we are ready for merging :)

dellaert · 2026-01-21T18:58:51Z

OK, I took some time today to test this on my Linux machine (20 cores), and got basically zero difference. I also think there is no flag to enable this, right?
BEFORE:

Reading values file took 0.0048 seconds
Reading graph file took 0.0985 seconds
Processing factor lines took 1.6313 seconds
Processing values lines took 0.0575 seconds
Setting up optimizer took 4.6140 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.51
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.17
   1          inf         0.00       0.00      0       0.47
   1     80163.07       258.81       0.00      1       2.17
   2          inf         0.00       0.00      0       0.45
   2     80113.89        49.18       0.00      1       2.10
   3          inf         0.00       0.00      0       0.46
   3     80096.39        17.50       0.00      1       2.09
   4          inf         0.00       0.00      0       0.45
   4     80089.79         6.60       0.00      1       2.11
   5          inf         0.00       0.00      0       0.47
   5     80086.41         3.38       0.00      1       2.09
   6          inf         0.00       0.00      0       0.44
   6     80084.44         1.98       0.00      1       2.07
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       2.00
   8          inf         0.00       0.00      0       0.44
   8     80081.84         1.21       0.00      1       2.06
   9          inf         0.00       0.00      0       0.45
   9     80080.52         1.32       0.00      1       2.09
  10          inf         0.00       0.00      0       0.46
  10     80078.92         1.60       0.00      1       2.05
  11          inf         0.00       0.00      0       0.46
  11     80076.99         1.93       0.00      1       2.06
  12          inf         0.00       0.00      0       0.46
  12     80074.94         2.05       0.00      1       1.99
  13          inf         0.00       0.00      0       0.45
  13     80073.22         1.72       0.00      1       2.06
  14          inf         0.00       0.00      0       0.44
  14     80071.87         1.34       0.00      1       2.05
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       2.03
  16          inf         0.00       0.00      0       0.45
  16     80070.76         0.28       0.00      1       2.05
Running gtsam optimizer took 45.3219 seconds

AFTER:

Reading values file took 0.0055 seconds
Reading graph file took 0.1001 seconds
Processing factor lines took 1.6415 seconds
Processing values lines took 0.0577 seconds
Setting up optimizer took 4.5054 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.47
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.13
   1          inf         0.00       0.00      0       0.46
   1     80163.07       258.81       0.00      1       2.13
   2          inf         0.00       0.00      0       0.47
   2     80113.89        49.18       0.00      1       2.18
   3          inf         0.00       0.00      0       0.44
   3     80096.39        17.50       0.00      1       2.08
   4          inf         0.00       0.00      0       0.46
   4     80089.79         6.60       0.00      1       2.05
   5          inf         0.00       0.00      0       0.44
   5     80086.41         3.38       0.00      1       2.07
   6          inf         0.00       0.00      0       0.44
   6     80084.44         1.98       0.00      1       2.15
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       2.05
   8          inf         0.00       0.00      0       0.46
   8     80081.84         1.21       0.00      1       2.10
   9          inf         0.00       0.00      0       0.45
   9     80080.52         1.32       0.00      1       2.07
  10          inf         0.00       0.00      0       0.45
  10     80078.92         1.60       0.00      1       2.08
  11          inf         0.00       0.00      0       0.45
  11     80076.99         1.93       0.00      1       2.03
  12          inf         0.00       0.00      0       0.45
  12     80074.94         2.05       0.00      1       2.09
  13          inf         0.00       0.00      0       0.46
  13     80073.22         1.72       0.00      1       2.07
  14          inf         0.00       0.00      0       0.45
  14     80071.87         1.34       0.00      1       2.03
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       2.12
  16          inf         0.00       0.00      0       0.44
  16     80070.76         0.28       0.00      1       2.04
Running gtsam optimizer took 45.4542 seconds

dellaert · 2026-01-21T19:04:48Z

I do know it should make a difference, though - parallelizing the update was a big bump for the MFS as well.

tzvist · 2026-01-21T19:07:57Z

Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?

dellaert · 2026-01-21T19:55:28Z

Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?

No. Does the parallel update only happen if we set that?

tzvist · 2026-01-21T20:13:48Z

Did you compile with -DGTSAM_TBB_BOUNDED_MEMORY_GROWTH?

No. Does the parallel update only happen if we set that?

At the moment, yes, it does require the flag. We can change it, but I assumed it would need extensive benchmarking, since in theory it could hurt performance if the parallel tree traversal is using all available threads.

gtsam/gtsam/linear/HessianFactor.cpp

Line 261 in 1e72eb1

#if defined(GTSAM_USE_TBB) && defined(GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG)

dellaert · 2026-01-21T20:52:50Z

When I set that flag, things get worse :-(

Reading values file took 0.0081 seconds
Reading graph file took 0.1122 seconds
Processing factor lines took 1.9739 seconds
Processing values lines took 0.0689 seconds
Setting up optimizer took 4.7204 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.57
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       2.63
   1          inf         0.00       0.00      0       0.54
   1     80163.07       258.81       0.00      1       3.21
   2          inf         0.00       0.00      0       0.42
   2     80113.89        49.18       0.00      1       3.23
   3          inf         0.00       0.00      0       0.42
   3     80096.39        17.50       0.00      1       3.20
   4          inf         0.00       0.00      0       0.42
   4     80089.79         6.60       0.00      1       3.15
   5          inf         0.00       0.00      0       0.42
   5     80086.41         3.38       0.00      1       3.16
   6          inf         0.00       0.00      0       0.42
   6     80084.44         1.98       0.00      1       3.19
   7          inf         0.00       0.00      0       0.41
   7     80083.05         1.39       0.00      1       3.21
   8          inf         0.00       0.00      0       0.42
   8     80081.84         1.21       0.00      1       3.26
   9          inf         0.00       0.00      0       0.41
   9     80080.52         1.32       0.00      1       3.14
  10          inf         0.00       0.00      0       0.41
  10     80078.92         1.60       0.00      1       3.18
  11          inf         0.00       0.00      0       0.41
  11     80076.99         1.93       0.00      1       3.17
  12          inf         0.00       0.00      0       0.41
  12     80074.94         2.05       0.00      1       3.15
  13          inf         0.00       0.00      0       0.41
  13     80073.22         1.72       0.00      1       3.19
  14          inf         0.00       0.00      0       0.41
  14     80071.87         1.34       0.00      1       3.18
  15          inf         0.00       0.00      0       0.41
  15     80071.04         0.83       0.00      1       3.26
  16          inf         0.00       0.00      0       0.46
  16     80070.76         0.28       0.00      1       3.21
Running gtsam optimizer took 63.6851 seconds

I'm assuming it is because with 20 cores there is quite a bit of parallelism. So any gains you get from the parallel update are negated by the loss in multi-threading.

dellaert · 2026-01-21T21:03:28Z

FYI, if I do TBB and parallel update, I get:

Reading values file took 0.0080 seconds
Reading graph file took 0.1025 seconds
Processing factor lines took 1.6343 seconds
Processing values lines took 0.0577 seconds
Setting up optimizer took 4.5915 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       0.46
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       1.42
   1          inf         0.00       0.00      0       0.47
   1     80163.07       258.81       0.00      1       1.60
   2          inf         0.00       0.00      0       0.45
   2     80113.89        49.18       0.00      1       1.61
   3          inf         0.00       0.00      0       0.44
   3     80096.39        17.50       0.00      1       1.61
   4          inf         0.00       0.00      0       0.44
   4     80089.79         6.60       0.00      1       1.60
   5          inf         0.00       0.00      0       0.46
   5     80086.41         3.38       0.00      1       1.60
   6          inf         0.00       0.00      0       0.46
   6     80084.44         1.98       0.00      1       1.62
   7          inf         0.00       0.00      0       0.45
   7     80083.05         1.39       0.00      1       1.62
   8          inf         0.00       0.00      0       0.46
   8     80081.84         1.21       0.00      1       1.63
   9          inf         0.00       0.00      0       0.46
   9     80080.52         1.32       0.00      1       1.62
  10          inf         0.00       0.00      0       0.45
  10     80078.92         1.60       0.00      1       1.58
  11          inf         0.00       0.00      0       0.47
  11     80076.99         1.93       0.00      1       1.61
  12          inf         0.00       0.00      0       0.46
  12     80074.94         2.05       0.00      1       1.63
  13          inf         0.00       0.00      0       0.45
  13     80073.22         1.72       0.00      1       1.60
  14          inf         0.00       0.00      0       0.45
  14     80071.87         1.34       0.00      1       1.61
  15          inf         0.00       0.00      0       0.45
  15     80071.04         0.83       0.00      1       1.61
  16          inf         0.00       0.00      0       0.44
  16     80070.76         0.28       0.00      1       1.60
Running gtsam optimizer took 37.5500 seconds

tzvist · 2026-01-21T21:03:48Z

When I set that flag, things get worse :-(

I'm assuming it is because with 20 cores there is quite a bit of parallelism. So any gains you get from the parallel update are negated by the loss in multi-threading.

My main goal is to reduce memory usage without too much of a performance penalty (see https://github.com/tzvist/gtsam/blob/tzvist/benchmark/monitor_script.sh). Before this commit, we were compiling without TBB in order to avoid the memory exploding.

Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.

dellaert · 2026-01-22T19:38:57Z

Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.

Could you add this change in this PR?

tzvist · 2026-01-24T22:49:07Z

Additionally, I saw that I get the best performance when I set TBB to use the physical number of threads minus one.

Could you add this change in this PR?

I had some issues writing code that does this, especially when running inside Docker in K8s. In our fork of this repo I added:


#ifdef GTSAM_USE_TBB
#include <oneapi/tbb/global_control.h>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Reads TBB_NUM_THREADS from the environment.
// Returns nullopt if the variable is unset or cannot be parsed as an int.
static std::optional<int> read_tbb_num_threads() {
  if (const char* env = std::getenv("TBB_NUM_THREADS")) {
    try {
      return std::stoi(env);
    } catch (const std::logic_error&) {
      // std::stoi throws invalid_argument or out_of_range; both derive
      // from logic_error.
      return std::nullopt;  // invalid value
    }
  }
  return std::nullopt;  // variable not set
}

// Returns a task_arena object that limits TBB parallelism.
static tbb::task_arena make_tbb_task_arena() {
  // oneAPI TBB removed support for TBB_NUM_THREADS but we would like it to keep
  // working
  std::optional<int> num_threads = read_tbb_num_threads();

  auto max_threads = num_threads.value_or(tbb::info::default_concurrency());

  // we do minus one to avoid contention with the main thread
  auto max_concurrency = std::max(max_threads - 1, 1);

  // Second argument: one slot reserved for the calling (main) thread.
  auto arena = tbb::task_arena(max_concurrency, 1);

  int max_allowed_parallelism =
      static_cast<int>(tbb::global_control::active_value(
          tbb::global_control::max_allowed_parallelism));
  static bool warned = false;  // print the warning at most once

  if ((!warned) &&
      (!num_threads.has_value()) &&
      (tbb::info::default_concurrency() == max_allowed_parallelism) &&
      (max_allowed_parallelism > 1)) {
    std::cout << "GTSAM warning: TBB_NUM_THREADS not set. This can cause "
                 "significant slowdown. We recommend as a rule of thumb to "
                 "limit to number of physical cores, TBB using default: "
              << max_allowed_parallelism << std::endl;
    warned = true;
  }

  return arena;
}
#endif  // GTSAM_USE_TBB

I then wrap 'void NonlinearOptimizer::defaultOptimize()' with this arena. It is kind of a hack, but would you like me to add it to my MR?
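
The thread-count rule in the snippet above (use TBB_NUM_THREADS if it is set and valid, otherwise the default concurrency, then subtract one and clamp to at least 1) can be isolated from TBB so it is easy to unit-test. The following is a minimal sketch under that assumption; `parseThreadCount` and `resolveConcurrency` are hypothetical helper names, and the parsing here is deliberately stricter than the original (it also rejects trailing characters and values below 1).

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Parses a thread-count string; returns nullopt when it is missing,
// malformed, has trailing characters, or is smaller than 1.
std::optional<int> parseThreadCount(const char* text) {
  if (text == nullptr) return std::nullopt;
  try {
    const std::string s(text);
    std::size_t consumed = 0;
    const int value = std::stoi(s, &consumed);
    if (consumed != s.size() || value < 1) return std::nullopt;
    return value;
  } catch (const std::logic_error&) {  // invalid_argument or out_of_range
    return std::nullopt;
  }
}

// Worker count: requested (or default) thread count minus one for the
// main thread, but never less than 1.
int resolveConcurrency(const char* envValue, int defaultConcurrency) {
  const int maxThreads = parseThreadCount(envValue).value_or(defaultConcurrency);
  return std::max(maxThreads - 1, 1);
}
```

In real use, `envValue` would come from `std::getenv("TBB_NUM_THREADS")` and `defaultConcurrency` from `tbb::info::default_concurrency()`, as in the snippet above.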

dellaert · 2026-01-24T23:36:06Z

Let me discuss it with Fan on Monday and then we'll comment here.

tzvist · 2026-02-01T08:52:47Z

Let me discuss it with Fan on Monday and then we'll comment here.

@dellaert please take a look at ea2efaf

Improves optimizer performance on my test case, with TBB enabled and 12 cores, by ~27% (117.2s -> 85.7s) by reducing iteration times from ~5-6s to ~3s.

Before:

 ./build_RelWithDebInfo/simple_gtsam_deserialize
simple_gtsam_deserialize2.cpp
Reading values file took 0.0108 seconds
Reading graph file took 0.1702 seconds
Processing factor lines took 2.7537 seconds
Processing values lines took 0.1056 seconds
Setting up optimizer took 9.1390 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       1.47
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       6.98
   1          inf         0.00       0.00      0       0.89
   1     80163.07       258.81       0.00      1       6.23
   2          inf         0.00       0.00      0       1.07
   2     80113.89        49.18       0.00      1       6.51
   3          inf         0.00       0.00      0       1.09
   3     80096.39        17.50       0.00      1       6.75
   4          inf         0.00       0.00      0       1.10
   4     80089.79         6.60       0.00      1       5.43
   5          inf         0.00       0.00      0       0.86
   5     80086.41         3.38       0.00      1       5.08
   6          inf         0.00       0.00      0       0.89
   6     80084.44         1.98       0.00      1       5.02
   7          inf         0.00       0.00      0       0.88
   7     80083.05         1.39       0.00      1       5.06
   8          inf         0.00       0.00      0       0.88
   8     80081.84         1.21       0.00      1       5.31
   9          inf         0.00       0.00      0       0.89
   9     80080.52         1.32       0.00      1       5.14
  10          inf         0.00       0.00      0       0.90
  10     80078.92         1.60       0.00      1       5.20
  11          inf         0.00       0.00      0       0.87
  11     80076.99         1.93       0.00      1       5.27
  12          inf         0.00       0.00      0       0.90
  12     80074.94         2.05       0.00      1       5.25
  13          inf         0.00       0.00      0       0.88
  13     80073.22         1.72       0.00      1       5.23
  14          inf         0.00       0.00      0       0.88
  14     80071.87         1.34       0.00      1       5.08
  15          inf         0.00       0.00      0       0.87
  15     80071.04         0.83       0.00      1       5.09
  16          inf         0.00       0.00      0       0.90
  16     80070.76         0.28       0.00      1       5.22
Running gtsam optimizer took 117.2392 seconds

After:

./build_RelWithDebInfo/simple_gtsam_deserialize
simple_gtsam_deserialize2.cpp
Reading values file took 0.0124 seconds
Reading graph file took 0.1912 seconds
Processing factor lines took 3.0148 seconds
Processing values lines took 0.1227 seconds
Setting up optimizer took 10.6574 seconds
Initial error: 83467.7055, values: 65796
iter      cost      cost_change    lambda  success iter_time
   0          inf         0.00       0.00      0       1.96
iter      cost      cost_change    lambda  success iter_time
   0     80421.88      3045.82       0.00      1       5.44
   1          inf         0.00       0.00      0       0.97
   1     80163.07       258.81       0.00      1       3.14
   2          inf         0.00       0.00      0       0.92
   2     80113.89        49.18       0.00      1       2.98
   3          inf         0.00       0.00      0       0.92
   3     80096.39        17.50       0.00      1       3.09
   4          inf         0.00       0.00      0       0.91
   4     80089.79         6.60       0.00      1       2.98
   5          inf         0.00       0.00      0       0.91
   5     80086.41         3.38       0.00      1       2.94
   6          inf         0.00       0.00      0       0.90
   6     80084.44         1.98       0.00      1       3.00
   7          inf         0.00       0.00      0       0.90
   7     80083.05         1.39       0.00      1       2.99
   8          inf         0.00       0.00      0       0.91
   8     80081.84         1.21       0.00      1       3.09
   9          inf         0.00       0.00      0       0.90
   9     80080.52         1.32       0.00      1       3.06
  10          inf         0.00       0.00      0       1.12
  10     80078.92         1.60       0.00      1       3.69
  11          inf         0.00       0.00      0       1.27
  11     80076.99         1.93       0.00      1       3.72
  12          inf         0.00       0.00      0       1.19
  12     80074.94         2.05       0.00      1       3.94
  13          inf         0.00       0.00      0       1.19
  13     80073.22         1.72       0.00      1       3.81
  14          inf         0.00       0.00      0       1.15
  14     80071.87         1.34       0.00      1       3.60
  15          inf         0.00       0.00      0       1.15
  15     80071.04         0.83       0.00      1       3.86
  16          inf         0.00       0.00      0       1.15
  16     80070.76         0.28       0.00      1       3.85
Running gtsam optimizer took 85.6852 seconds

I think we should probably test it on other use cases. We can also create a flag only for this.

ProfFan · 2026-02-02T19:26:19Z

Hi @tzvist thank you for the PR, it looks amazing! I'm gonna implement a benchmark CI which will run the TBB benchmarks on different architectures tonight, and we can merge this immediately after :)

ProfFan · 2026-02-04T06:55:56Z

@tzvist Could you rebase on develop?

Introduce GTSAM_TBB_BOUNDED_MEMORY_GROWTH CMake option (OFF by default) to disable parallel tree traversal when memory usage is a concern. Parallel tree traversal can cause significant memory growth (e.g., from ~4GB to ~12GB in tested scenarios).

Changes:
- Add GTSAM_TBB_BOUNDED_MEMORY_GROWTH option in HandleGeneralOptions.cmake
- Set GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG when option is enabled
- Conditionally disable parallel traversal in treeTraversal-inst.h

Parallelize the updateHessian operation when forming A'*A in the HessianFactor merge constructor. For large matrices (>50 rows), the work is split across columns using TBB parallel_for, with each thread updating a disjoint set of block columns.

Key changes:
- Add column-range overload of updateHessian() to GaussianFactor, HessianFactor, JacobianFactor, and RegularImplicitSchurFactor
- Add setZeroColumns() method to SymmetricBlockMatrix for efficient column-wise zeroing using memset

In my toy example it reduced updateHessian time from ~72s to ~41s (with 10 threads and GTSAM_TBB_BOUNDED_MEMORY_GROWTH=OFF).
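
To illustrate why memset suits column-wise zeroing, here is a small sketch. It is not the PR's setZeroColumns(): `ColMajor` and `zeroColumns` are hypothetical names, and the sketch assumes a dense column-major layout in which a range of adjacent columns occupies one contiguous run of memory. It also relies on all-zero bytes representing 0.0, which holds for IEEE-754 doubles.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Dense column-major matrix (assumed layout for this sketch).
struct ColMajor {
  std::size_t rows, cols;
  std::vector<double> data;
  ColMajor(std::size_t r, std::size_t c, double fill)
      : rows(r), cols(c), data(r * c, fill) {}
  double operator()(std::size_t i, std::size_t j) const {
    return data[j * rows + i];
  }
};

// Zeros columns [colBegin, colEnd). Adjacent columns are contiguous in
// column-major storage, so one memset covers the whole range.
void zeroColumns(ColMajor& m, std::size_t colBegin, std::size_t colEnd) {
  if (m.rows == 0 || colBegin >= colEnd || colEnd > m.cols) return;
  std::memset(m.data.data() + colBegin * m.rows, 0,
              (colEnd - colBegin) * m.rows * sizeof(double));
}
```

Because each call touches only its own column range, workers that own disjoint ranges can clear their columns independently before accumulating into them.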

Remove conditional check for GTSAM_TBB_BOUNDED_MEMORY_GROWTH_FLAG to always enable TBB parallelization when GTSAM_USE_TBB is defined. Improves optimizer performance on my test case with 12 cores by ~27% (117.2s -> 85.7s) by reducing iteration times from ~5-6s to ~3s.

Before:
```
Running gtsam optimizer took 117.2392 seconds
```
After:
```
Running gtsam optimizer took 85.6852 seconds
```

ProfFan · 2026-02-04T18:14:06Z

/bench

github-actions · 2026-02-04T18:24:26Z

timeSFMBAL benchmark

Head: 87098f7ed8eef94f41da219aa5d1f913c396ca73
Base: 57bc31052de0e1010c274956c4957cd27875b890

Runner	Metric	Base (s)	Head (s)	Delta (s)	Change
linux-arm64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky`	1.967130	2.080884	+0.113755	+5.78%
linux-arm64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver`	1.155350	1.223113	+0.067764	+5.87%
linux-x64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky`	2.173623	2.249035	+0.075412	+3.47%
linux-x64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver`	1.570863	1.587257	+0.016394	+1.04%
macos-arm64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalCholesky`	1.932898	3.511656	+1.578758	+81.68%
macos-arm64	`timeSFMBAL/dubrovnik-16-22106-pre.txt/MultifrontalSolver`	1.559810	1.894650	+0.334840	+21.47%

Worker runs

Role	Runner	SHA	Conclusion
head	linux-x64	`87098f7ed8eef94f41da219aa5d1f913c396ca73`	success
base	linux-x64	`57bc31052de0e1010c274956c4957cd27875b890`	success
head	linux-arm64	`87098f7ed8eef94f41da219aa5d1f913c396ca73`	success
base	linux-arm64	`57bc31052de0e1010c274956c4957cd27875b890`	success
head	macos-arm64	`87098f7ed8eef94f41da219aa5d1f913c396ca73`	success
base	macos-arm64	`57bc31052de0e1010c274956c4957cd27875b890`	success

ProfFan · 2026-02-04T22:47:49Z

/bench

ProfFan · 2026-02-05T05:06:22Z

The macOS results seem to be very noisy. I reran the benchmarks and they fluctuate within a ~1s range. Maybe this is related to how GitHub runs its macOS workers.

dellaert · 2026-02-05T13:07:27Z

Hmmm - I was not ready to merge yet :-/ I think the benchmark was inconclusive: we need a larger dataset and comparison of tbb/no-tbb :-) But maybe we can at least do that after the fact?

dellaert · 2026-02-05T13:22:29Z

By the way, @ProfFan , On this PR, the benchmark results were not even mixed, right? They were all increasing in time! Or am I reading the results wrong?

tzvist · 2026-02-05T13:59:25Z

I don't think the multifrontal solver is affected by this change; it looks like a lot of what we see is just noise.

By the way, @ProfFan , On this PR, the benchmark results were not even mixed, right? They were all increasing in time! Or am I reading the results wrong?

ProfFan · 2026-02-05T17:22:33Z

5% difference is normal for GitHub runners (Linux)

The macOS runners are funky and I need to figure out why. Might be that GitHub scheduled the tasks to different generation Macs

tzvist changed the title ~~perf: TBB parallelization for HessianFactor with memory growth control.~~ perf: TBB with memory growth control. Jan 15, 2026

tzvist mentioned this pull request Jan 15, 2026

Optimize Levenberg Marquardt elimination by caching JunctionTree. #2340

Merged

dellaert requested a review from Copilot January 16, 2026 04:56

Copilot started reviewing on behalf of dellaert January 16, 2026 04:57

Copilot AI reviewed Jan 16, 2026

dellaert approved these changes Jan 16, 2026

tzvist force-pushed the feature/tzvist/tbb-mem-opt branch 4 times, most recently from c404ae5 to 1e72eb1 Compare January 17, 2026 14:47

tzvist force-pushed the feature/tzvist/tbb-mem-opt branch from ea2efaf to 7c8121d Compare February 3, 2026 14:49

tzvist added 3 commits February 4, 2026 11:15

tzvist force-pushed the feature/tzvist/tbb-mem-opt branch from 7c8121d to 87098f7 Compare February 4, 2026 09:19

ProfFan merged commit 4a4b06c into borglab:develop Feb 5, 2026
34 checks passed
