Improve std::vector allocations and insertions performance#37880

Draft
mateusznowakTT wants to merge 54 commits into main from mateusznowakTT/optimize_std_vector_allocations

Conversation


@mateusznowakTT mateusznowakTT commented Feb 13, 2026

Ticket

#37879

Problem description

The default std::vector allocation strategy is suboptimal when inserting multiple elements.
By default, the initial capacity is 0; each insertion that exceeds capacity grows the buffer (the g++ strategy is to double the capacity) and migrates existing items (copied or moved) to the new underlying buffer. Reserving memory ahead of insertions avoids these reallocations entirely.

On top of that, there is a performance benefit from constructing vector elements in place instead of copying or moving them. Even inline initialization (std::vector v{v1, v2, v3};) suffers from this, because the elements are copied out of the initializer list.

A microbenchmark demonstrating the possible performance improvements: https://quick-bench.com/q/4H4CNijgXcyUyAkaPMomSVDprPA
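The growth behavior described above can be observed directly by counting capacity changes. This is an illustrative sketch (the function name `count_reallocations` is invented for the example, not part of the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many times the buffer is reallocated while push_back'ing n
// elements, optionally reserving the full capacity up front.
std::size_t count_reallocations(std::size_t n, bool reserve_upfront) {
    std::vector<int> v;
    if (reserve_upfront) {
        v.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;  // the buffer grew and elements were migrated
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With a geometric (doubling) growth strategy, inserting 1000 elements without reserving triggers on the order of 10 reallocations, each migrating all existing elements; reserving up front triggers none.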

What's changed

std::vector uses are accompanied by .reserve() wherever the number of insertions can be determined ahead of time.
Further optimizations emplace inline-constructed objects and take advantage of techniques like RVO, wherever possible, to reduce allocations and data movement.
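The difference between copying pre-built objects and constructing them in place can be made visible with an instrumented type. A minimal sketch (the `Tracked` type and helper functions are invented for illustration):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Instrumented type that counts copy constructions.
struct Tracked {
    static inline int copies = 0;
    std::string name;
    explicit Tracked(std::string n) : name(std::move(n)) {}
    Tracked(const Tracked& other) : name(other.name) { ++copies; }
    Tracked(Tracked&&) noexcept = default;
};

int copies_via_init_list() {
    Tracked::copies = 0;
    Tracked a("a"), b("b");
    std::vector<Tracked> v{a, b};  // elements are copied out of the initializer list
    return Tracked::copies;
}

int copies_via_emplace() {
    Tracked::copies = 0;
    std::vector<Tracked> v;
    v.reserve(2);
    v.emplace_back("a");  // constructed directly in the vector's buffer
    v.emplace_back("b");
    return Tracked::copies;
}
```

The initializer-list form incurs at least one copy per element, while reserve + emplace_back constructs each element exactly once, in place.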

Checklist

  • All post-commit tests
  • Blackhole Post commit
  • cpp-unit-tests
  • New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests that should always be run. The latter extends it with additional ones; use your best judgement in deciding which is most appropriate for your PR.

mateusznowakTT and others added 30 commits February 12, 2026 11:14
Add reserve() calls to std::vector allocations where final size is known upfront to avoid reallocations.

Changes:
- mesh_coord.cpp: 8 reserve() additions in core mesh operations
- core_coord.cpp: 4 reserve() additions in core coordinate operations
- conv2d_op_sharded_program_factory.cpp: 2 reserve() for core vectors
- sdpa_decode_program_factory.cpp: 2 reserve() for core group vectors

All reserves use known or calculable sizes to avoid reallocations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Continue std::vector allocation optimizations in high-usage operations:

- matmul_multicore_reuse_mcast_1d_program_factory: 2 reserve() additions
  * non_idle_cores_vec: reserve(subdevice_cores.ranges().size())
  * ring_list: reserve before insert operation

- all_gather_concat_program_factory: 2 reserve() additions
  * q_cores_vector: reserve(concat_num_cores)
  * sem_cores_vector: reserve(concat_num_cores + 1)

- groupnorm_mcast_program_factory: 2 reserve() additions
  * mcast_groups: reserve(sender_ranges.size())
  * mcast_virtual_groups: reserve(sender_ranges.size())

Part of ongoing effort to eliminate vector reallocations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert 4 instances from list initialization to reserve+emplace pattern.

Changes:
- core_coord.cpp: current_remaining vector in subtract loop
  * Was: {current_range} initialization
  * Now: reserve(1) + emplace_back()

- matmul_multicore_reuse_mcast_1d: shared_cbs vector
  * Was: {cb_src0, cb_src1} initialization
  * Now: reserve(2+size) + emplace_back() for each

- all_gather_concat: input/output/temp tensor vectors (3 instances)
  * Was: {tensor} initialization for each
  * Now: reserve(1) + emplace_back() for each

- conv2d_op_sharded: activation_reuse_dummy_args
  * Was: {0, 0, 0, 0} initialization
  * Now: reserve(4) + emplace_back() loop

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Continue vector allocation optimizations in data movement and CCL:

- transpose_hc_sharded_program_factory: 5 reserve() additions
  * shard_grid_x_map: reserve(num_cores_x)
  * shard_grid_y_map: reserve(num_cores_y)
  * reader_runtime_args: convert list init to reserve+emplace (5 elements)
  * cores: reserve(num_cores)
  * stick_ids_per_core: reserve(num_sticks_per_core)

- ccl_common: 1 reserve() addition
  * combined_tensors: reserve(num_devices) in slice/concat operation

These optimizations target data movement operations (transpose) and
collective communication (CCL) hot paths.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed bug from commit 7ed4ba5ff02 where .ranges() was incorrectly called
on std::set<CoreRange>. std::set doesn't have .ranges() method - that's
only for CoreRangeSet. Changed to use .size() instead.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 8 reserve() optimizations:

1. groupnorm_no_mcast_program_factory.cpp (2 optimizations):
   - mcast_groups.reserve(sender_groups_count)
   - mcast_virtual_groups.reserve(sender_groups_count)

2. reshard_program_factory_generic.cpp (1 optimization):
   - compressed_blocks.reserve(page_strides.size())

3. permute_tiled_program_factory.cpp (1 optimization):
   - reader_runtime_args: list init → reserve+emplace
   - Total size: 3 + output_shape_view + inv_perm + input_tile_strides

4. llama_1d_mm_fusion.cpp (4 optimizations):
   - non_idle_cores_vec.reserve(subdevice_cores.ranges().size())
   - cb_outputs.reserve(out_buffers.size())
   - output_cb_indices.reserve(out_buffers.size())
   - interm_cb_indices.reserve(out_buffers.size())

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 5 reserve() optimizations:

1. rms_allgather_program_factory.cpp (4 optimizations):
   - storage_core_noc_x.reserve(storage_core_coords.size())
   - storage_core_noc_y.reserve(storage_core_coords.size())
   - stats_tensor_cores_x.reserve(num_stats_cores)
   - stats_tensor_cores_y.reserve(num_stats_cores)
     where num_stats_cores = calculated from tile range

2. groupnorm_sharded_program_factory.cpp (1 optimization):
   - mcast_groups.reserve(num_sender_cores)
     where num_sender_cores = (num_batches/num_batches_per_core) * (num_groups/num_groups_per_core)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add reserve() calls to eliminate reallocations in:
- SDPA program factory: head_work and reader_args vectors
- SDPA decode: input_tensors with reserve+emplace pattern
- Sort program factory: physical_core_lookup_table_data
- Reduce H/W program factories: cores vector
- Moreh Adam/AdamW: output spec and tensor vectors
- Sliding window: serialize_gather_config output buffer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, tensor

Add reserve() calls and convert list init to reserve+emplace in:
- CCL: command_lowering, sharding_addrgen_helper, all_gather_concat (14 vectors)
- Ring attention all-gather: reader/writer forward/backward rt_args (4 vectors)
- Moreh: helper_functions, matmul, mean_backward, norm_backward, sum_backward
- Concat: program_factory, s2i_program_factory runtime args
- Distributed: mesh_device, mesh_device_view, mesh_command_queue_base
- Infrastructure: metal_context, bfloat16, tensor_ops, tensor_spec_flatbuffer
- Pool: pool_multi_core scalars_per_core
- Llama fusion: bank_ids

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…upnorm

Add reserve() calls to 7 files with 13 optimizations to eliminate
vector reallocations where final size is known.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() for cores_with_rtargs vector populated in loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gram, ccl

Add reserve() calls before push_back/emplace_back loops with known sizes:
- device.cpp: all_worker_cores_logical (num_cores_x * num_cores_y)
- l1_banking_allocator.cpp: shuffled_bank_id, dram_bank_offsets
- kernel.cpp: file_paths (expected_num_binaries)
- device_manager.cpp: device_ids, device_ids_to_open
- profiler.cpp: virtual_dispatch_cores
- program.cpp: kernel_ids (2 locations)
- ccl_types_args_emitters.cpp: noc row/col maps, emit_rt_args

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() calls for vectors with known sizes:
- transpose_hc_sharded: 10 reserves for read_cores and non_repeat vectors in loops
- binary_backward: 18 reserves for grad_tensor vectors (consistently 2 elements)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsor vectors

Add reserve(1) to all 64 std::vector<Tensor> and 2 std::vector<ComplexTensor>
grad_tensor declarations in unary_backward.cpp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate temporary variables that are declared, assigned, and immediately
pushed to a vector without further references:
- core_coord.cpp: 4 CoreRange variables (left, right, bottom, top) → direct emplace_back
- binary_backward.cpp: ~15 grad_a/grad_b variables → direct emplace_back
- unary_backward.cpp: ~40 result/grad_result variables → direct emplace_back

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace previously unrolled reserve+emplace patterns and remaining
initializer list patterns with vector_init helper for cleaner code
and optimal performance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing reserve() calls in 8 locations with known loop sizes:
- core_descriptor, metal_soc_descriptor, kernel, program, mesh_device,
  layernorm_pre_all_gather, dram_prefetcher

Convert push_back(T{...}) to emplace_back(T{...}) in 9 files:
- reshard, loss, groupnorm_sharded, layernorm (3 files), sliding_window,
  sdpa, ring_joint_sdpa, system_mesh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- control_plane.cpp: corner_asic_positions reserve(4), corner_fabric_node_ids reserve(4 * num_meshes)
- topology_mapper.cpp: pinning_strs reserve(mesh_pinnings.size())
- layernorm_pre_all_gather_2d_program_factory.cpp: merge_core_ranges_vec reserve(cores_x)
- qa_hal.cpp: objs reserve(4), includes reserve(13)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mateusznowakTT and others added 24 commits February 15, 2026 00:17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erload

Move vector_init.hpp from ttnn/api/ttnn/common/ to tt_stl/tt_stl/ so it
can be used by both tt_metal and ttnn code. Add ttsl namespace, deprecated
tt::stl alias, and runtime reserve size overload using ttsl::vector_size
(a ttsl::StrongType<size_t>). Update all 9 consumers to use new include
path and ttsl:: qualified calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ttsl::vector_init with compile-time N in HAL includes (bh/wh/qa) and
exact-fit in control_plane and core_coord. Use runtime vector_size overload
in concat, transpose, permute, and moreh operation program factories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace reserve+push_back series with ttsl::vector_init runtime overload
in sdpa_program_factory (17 fixed args), sdpa_decode reader/writer args
(15+10 init elements with inserts), and conv2d width-sharded rt_args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use compile-time N=19 for compute_all_to_all/not_all_to_all init lists
with conditional welford args. Use runtime vector_size for reader_sender,
reader_receiver, and writer_args builders with estimated total sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate estimated additions before push_back series in:
- RingSDPAFusedOpSignaler::push_ring_sdpa_fused_op_rt_args (+6)
- AllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- StridedAllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload1 (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload2 (+9)
- MatmulFusedOpSignaler::push_llama_rs_rt_args_for_mm (+11 max)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ctories

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ender/idle args

No pattern is too small: add reserve() to push_reduce_scatter_fused_op_rt_args,
push_llama_rs_rt_args_for_rs (1 push_back each), and MinimalMatmulFusedOpSignaler
(12 push_backs, previously missing reserve). Convert mm_in0_sender_args and
mm_in0_idle_args in dram_sharded factory to vector_init.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
container_init<Container, N>(vals...) works with any container that has
reserve() and emplace_back() (std::vector, ttsl::SmallVector, etc.).
Existing vector_init overloads now delegate to container_init — zero
churn on existing call sites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
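A helper along these lines could look like the following. This is a hypothetical sketch, not the actual ttsl implementation; the real container_init/vector_init signatures in tt_stl may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a container_init-style helper: reserve once, then emplace each
// value in place. Works with any container exposing reserve() and
// emplace_back(). N may supply a compile-time capacity hint; otherwise the
// argument count is used.
template <typename Container, std::size_t N = 0, typename... Vals>
Container container_init(Vals&&... vals) {
    Container c;
    c.reserve(N > 0 ? N : sizeof...(vals));
    (c.emplace_back(std::forward<Vals>(vals)), ...);
    return c;  // NRVO avoids copying the container itself
}

// vector_init delegating to container_init, mirroring the commit description.
template <typename T, typename... Vals>
std::vector<T> vector_init(Vals&&... vals) {
    return container_init<std::vector<T>, sizeof...(vals)>(std::forward<Vals>(vals)...);
}
```

Call sites then replace brace-init (`std::vector<T> v{a, b, c};`) with `auto v = vector_init<T>(a, b, c);`, getting one allocation and in-place construction.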
small_vector_init<T>(vals...) provides the same ergonomics as vector_init
but returns ttsl::SmallVector<T>. Added #include <tt_stl/small_vector.hpp>
to support the wrappers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Convert brace-init single-element SmallVector<SubDeviceId> patterns to
small_vector_init<SubDeviceId>(id) in all_gather, reduce_scatter, broadcast,
and all_broadcast program factories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace std::vector with ttsl::SmallVector for rank-bounded dimension arrays
(max 4-8 elements) in hot-path slice and transpose program factories:
- slice_program_factory_rm.cpp: 4 dimension arrays
- padded_slice_tile_program_factory.cpp: 6 dimension arrays
- transpose_hc_sharded_program_factory.cpp: 2 grid maps
- padded_slice_rm_program_factory.cpp: 4 dimension arrays

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…L ops

Replace std::vector with ttsl::SmallVector for small bounded local vectors:
- padded_slice_tile: 4 more per-dimension arrays (lines 209-308)
- slice_write (3 files): 6 dimension arrays each (interleaved, sharded, tiled)
- broadcast/all_broadcast: fixed-size mcast args (size 2)

All converted vectors are purely local with known small bounds (≤8 elements).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
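The payoff of SmallVector for these rank-bounded arrays is that elements live in inline storage until a fixed capacity is exceeded, so small vectors never touch the heap. A toy illustration of the small-buffer idea (ttsl::SmallVector's real implementation differs; `TinySmallVector` is invented purely to show the concept):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal small-buffer vector: elements stay in an inline std::array until
// the N-th insertion, then spill to heap-backed storage once.
template <typename T, std::size_t N>
class TinySmallVector {
    std::array<T, N> inline_buf_{};
    std::vector<T> heap_buf_;
    std::size_t size_ = 0;
    bool on_heap_ = false;

public:
    void push_back(const T& v) {
        if (!on_heap_ && size_ < N) {
            inline_buf_[size_++] = v;  // no heap allocation on this path
            return;
        }
        if (!on_heap_) {  // spill inline elements to the heap exactly once
            heap_buf_.assign(inline_buf_.begin(), inline_buf_.begin() + size_);
            on_heap_ = true;
        }
        heap_buf_.push_back(v);
        ++size_;
    }
    std::size_t size() const { return size_; }
    bool uses_heap() const { return on_heap_; }
    const T& operator[](std::size_t i) const {
        return on_heap_ ? heap_buf_[i] : inline_buf_[i];
    }
};
```

For dimension arrays bounded by tensor rank (≤8 elements here), the inline path is always taken, eliminating the allocation entirely.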
Optimize vector allocations in 7 softmax program factory files by
replacing std::vector initialization with ttsl::vector_init:

- softmax_program_factory_general_h_small.cpp
- softmax_program_factory_general_h_large.cpp
- softmax_program_factory_general_w_small.cpp
- softmax_program_factory_general_w_large.cpp
- softmax_program_factory_general_c_large.cpp
- softmax_program_factory_attention_optimized.cpp
- softmax_program_factory_attention_optimized_sharded.cpp

Changes:
- Add #include <tt_stl/vector_init.hpp> where missing
- Replace compile-time args vectors with ttsl::vector_init
- Replace runtime args vectors in loops with ttsl::vector_init
- Use auto for type deduction with vector_init calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized compile-time args vectors in:
- batch_norm/device/batch_norm_program_factory.cpp
- batch_norm/device/running_statistics_program_factory.cpp
- groupnorm/device/groupnorm_no_mcast_program_factory.cpp
- groupnorm/device/groupnorm_mcast_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_2d_program_factory.cpp

Replaced std::vector<uint32_t> with ttsl::vector_init<uint32_t>() for
empty vectors and ttsl::vector_init<uint32_t>(...) for initialized
vectors to optimize memory allocations and reduce binary size.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized vector allocations in all_to_all_combine, all_to_all_dispatch,
and mesh_partition program factory files by replacing std::vector with
ttsl::vector_init. This reduces allocations and reallocations for
compile-time and runtime argument vectors.

Changes:
- Added #include <tt_stl/vector_init.hpp> to all three files
- Replaced reader/writer compile_time_args vectors with vector_init
- Replaced reader/writer runtime_args vectors with vector_init
- Replaced dest_mesh_id/dest_chip_id reserve() with vector_init(vector_size)
- Replaced Shape constructor vectors in mesh_partition with vector_init

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed empty vector_init<T>() followed by append_to() back to
std::vector<T> for better readability. Using vector_init() for
empty vectors adds no value when immediately followed by operations
that handle their own capacity management.

Also reverted incorrect vector_init with vector_size patterns back
to proper std::vector<T> with .reserve() or std::vector<T>(size, value)
for uniform initialization.

Files modified:
- 7 softmax program factories (empty compile-time args)
- 2 groupnorm program factories (empty compile-time args)
- 3 CCL program factories (reserve patterns)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>