Improve std::vector allocations and insertions performance #37880
Draft
mateusznowakTT wants to merge 54 commits into main from
Conversation
Add reserve() calls to std::vector allocations where final size is known upfront to avoid reallocations.
Changes:
- mesh_coord.cpp: 8 reserve() additions in core mesh operations
- core_coord.cpp: 4 reserve() additions in core coordinate operations
- conv2d_op_sharded_program_factory.cpp: 2 reserve() for core vectors
- sdpa_decode_program_factory.cpp: 2 reserve() for core group vectors
All reserves use known or calculable sizes to avoid reallocations.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
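A minimal sketch of the pattern this commit applies (function and variable names are illustrative, not the actual call sites):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> collect_ids(std::size_t num_cores) {
    std::vector<uint32_t> ids;
    ids.reserve(num_cores);  // one allocation up front; no regrowth inside the loop
    for (std::size_t i = 0; i < num_cores; ++i) {
        ids.push_back(static_cast<uint32_t>(i));
    }
    return ids;  // NRVO: the vector is not copied on return
}
```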
Continue std::vector allocation optimizations in high-usage operations:
- matmul_multicore_reuse_mcast_1d_program_factory: 2 reserve() additions
  * non_idle_cores_vec: reserve(subdevice_cores.ranges().size())
  * ring_list: reserve before insert operation
- all_gather_concat_program_factory: 2 reserve() additions
  * q_cores_vector: reserve(concat_num_cores)
  * sem_cores_vector: reserve(concat_num_cores + 1)
- groupnorm_mcast_program_factory: 2 reserve() additions
  * mcast_groups: reserve(sender_ranges.size())
  * mcast_virtual_groups: reserve(sender_ranges.size())
Part of ongoing effort to eliminate vector reallocations.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert 4 instances from list initialization to reserve+emplace pattern.
Changes:
- core_coord.cpp: current_remaining vector in subtract loop
* Was: {current_range} initialization
* Now: reserve(1) + emplace_back()
- matmul_multicore_reuse_mcast_1d: shared_cbs vector
* Was: {cb_src0, cb_src1} initialization
* Now: reserve(2+size) + emplace_back() for each
- all_gather_concat: input/output/temp tensor vectors (3 instances)
* Was: {tensor} initialization for each
* Now: reserve(1) + emplace_back() for each
- conv2d_op_sharded: activation_reuse_dummy_args
* Was: {0, 0, 0, 0} initialization
* Now: reserve(4) + emplace_back() loop
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
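For context, a minimal sketch of the conversion this commit describes; the element type is illustrative:

```cpp
#include <string>
#include <vector>

void example() {
    // Before: initializer-list construction copies every element, because
    // std::initializer_list exposes const elements that cannot be moved from.
    std::vector<std::string> before{std::string(64, 'a'), std::string(64, 'b')};

    // After: reserve once, then construct each element directly in place.
    std::vector<std::string> after;
    after.reserve(2);
    after.emplace_back(64, 'a');
    after.emplace_back(64, 'b');
}
```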
Continue vector allocation optimizations in data movement and CCL:
- transpose_hc_sharded_program_factory: 5 reserve() additions
  * shard_grid_x_map: reserve(num_cores_x)
  * shard_grid_y_map: reserve(num_cores_y)
  * reader_runtime_args: convert list init to reserve+emplace (5 elements)
  * cores: reserve(num_cores)
  * stick_ids_per_core: reserve(num_sticks_per_core)
- ccl_common: 1 reserve() addition
  * combined_tensors: reserve(num_devices) in slice/concat operation
These optimizations target data movement operations (transpose) and collective communication (CCL) hot paths.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed bug from commit 7ed4ba5ff02 where .ranges() was incorrectly called on std::set<CoreRange>. std::set doesn't have a .ranges() method; that exists only on CoreRangeSet. Changed to use .size() instead.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 8 reserve() optimizations:
1. groupnorm_no_mcast_program_factory.cpp (2 optimizations):
   - mcast_groups.reserve(sender_groups_count)
   - mcast_virtual_groups.reserve(sender_groups_count)
2. reshard_program_factory_generic.cpp (1 optimization):
   - compressed_blocks.reserve(page_strides.size())
3. permute_tiled_program_factory.cpp (1 optimization):
   - reader_runtime_args: list init → reserve+emplace
   - Total size: 3 + output_shape_view + inv_perm + input_tile_strides
4. llama_1d_mm_fusion.cpp (4 optimizations):
   - non_idle_cores_vec.reserve(subdevice_cores.ranges().size())
   - cb_outputs.reserve(out_buffers.size())
   - output_cb_indices.reserve(out_buffers.size())
   - interm_cb_indices.reserve(out_buffers.size())
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 5 reserve() optimizations:
1. rms_allgather_program_factory.cpp (4 optimizations):
- storage_core_noc_x.reserve(storage_core_coords.size())
- storage_core_noc_y.reserve(storage_core_coords.size())
- stats_tensor_cores_x.reserve(num_stats_cores)
- stats_tensor_cores_y.reserve(num_stats_cores)
where num_stats_cores = calculated from tile range
2. groupnorm_sharded_program_factory.cpp (1 optimization):
- mcast_groups.reserve(num_sender_cores)
where num_sender_cores = (num_batches/num_batches_per_core) * (num_groups/num_groups_per_core)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add reserve() calls to eliminate reallocations in:
- SDPA program factory: head_work and reader_args vectors
- SDPA decode: input_tensors with reserve+emplace pattern
- Sort program factory: physical_core_lookup_table_data
- Reduce H/W program factories: cores vector
- Moreh Adam/AdamW: output spec and tensor vectors
- Sliding window: serialize_gather_config output buffer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, tensor
Add reserve() calls and convert list init to reserve+emplace in:
- CCL: command_lowering, sharding_addrgen_helper, all_gather_concat (14 vectors)
- Ring attention all-gather: reader/writer forward/backward rt_args (4 vectors)
- Moreh: helper_functions, matmul, mean_backward, norm_backward, sum_backward
- Concat: program_factory, s2i_program_factory runtime args
- Distributed: mesh_device, mesh_device_view, mesh_command_queue_base
- Infrastructure: metal_context, bfloat16, tensor_ops, tensor_spec_flatbuffer
- Pool: pool_multi_core scalars_per_core
- Llama fusion: bank_ids
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…upnorm
Add reserve() calls to 7 files with 13 optimizations to eliminate vector reallocations where final size is known.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() for cores_with_rtargs vector populated in loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gram, ccl
Add reserve() calls before push_back/emplace_back loops with known sizes:
- device.cpp: all_worker_cores_logical (num_cores_x * num_cores_y)
- l1_banking_allocator.cpp: shuffled_bank_id, dram_bank_offsets
- kernel.cpp: file_paths (expected_num_binaries)
- device_manager.cpp: device_ids, device_ids_to_open
- profiler.cpp: virtual_dispatch_cores
- program.cpp: kernel_ids (2 locations)
- ccl_types_args_emitters.cpp: noc row/col maps, emit_rt_args
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() calls for vectors with known sizes:
- transpose_hc_sharded: 10 reserves for read_cores and non_repeat vectors in loops
- binary_backward: 18 reserves for grad_tensor vectors (consistently 2 elements)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsor vectors
Add reserve(1) to all 64 std::vector<Tensor> and 2 std::vector<ComplexTensor> grad_tensor declarations in unary_backward.cpp.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate temporary variables that are declared, assigned, and immediately pushed to a vector without further references:
- core_coord.cpp: 4 CoreRange variables (left, right, bottom, top) → direct emplace_back
- binary_backward.cpp: ~15 grad_a/grad_b variables → direct emplace_back
- unary_backward.cpp: ~40 result/grad_result variables → direct emplace_back
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace previously unrolled reserve+emplace patterns and remaining initializer-list patterns with the vector_init helper for cleaner code and optimal performance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
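The helper itself lands later in this PR; as a rough approximation of the shape such a helper can take (not the actual implementation):

```cpp
#include <utility>
#include <vector>

// Reserve exactly sizeof...(args), then emplace each argument with
// perfect forwarding so rvalues are moved rather than copied.
template <typename T, typename... Args>
std::vector<T> vector_init(Args&&... args) {
    std::vector<T> v;
    v.reserve(sizeof...(args));
    (v.emplace_back(std::forward<Args>(args)), ...);
    return v;  // NRVO
}

// Usage: auto args = vector_init<uint32_t>(a, b, c);
// replaces: std::vector<uint32_t> args{a, b, c};
```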
Add missing reserve() calls in 8 locations with known loop sizes:
- core_descriptor, metal_soc_descriptor, kernel, program, mesh_device,
layernorm_pre_all_gather, dram_prefetcher
Convert push_back(T{...}) to emplace_back(T{...}) in 9 files:
- reshard, loss, groupnorm_sharded, layernorm (3 files), sliding_window,
sdpa, ring_joint_sdpa, system_mesh
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ze_std_vector_allocations
- control_plane.cpp: corner_asic_positions reserve(4), corner_fabric_node_ids reserve(4 * num_meshes)
- topology_mapper.cpp: pinning_strs reserve(mesh_pinnings.size())
- layernorm_pre_all_gather_2d_program_factory.cpp: merge_core_ranges_vec reserve(cores_x)
- qa_hal.cpp: objs reserve(4), includes reserve(13)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erload
Move vector_init.hpp from ttnn/api/ttnn/common/ to tt_stl/tt_stl/ so it can be used by both tt_metal and ttnn code. Add ttsl namespace, deprecated tt::stl alias, and a runtime reserve-size overload using ttsl::vector_size (a ttsl::StrongType<size_t>). Update all 9 consumers to use the new include path and ttsl::-qualified calls.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ttsl::vector_init with compile-time N in HAL includes (bh/wh/qa) and exact-fit in control_plane and core_coord. Use the runtime vector_size overload in concat, transpose, permute, and moreh operation program factories.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace reserve+push_back series with the ttsl::vector_init runtime overload in sdpa_program_factory (17 fixed args), sdpa_decode reader/writer args (15+10 init elements with inserts), and conv2d width-sharded rt_args.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use compile-time N=19 for compute_all_to_all/not_all_to_all init lists with conditional welford args. Use runtime vector_size for reader_sender, reader_receiver, and writer_args builders with estimated total sizes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate estimated additions before push_back series in:
- RingSDPAFusedOpSignaler::push_ring_sdpa_fused_op_rt_args (+6)
- AllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- StridedAllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload1 (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload2 (+9)
- MatmulFusedOpSignaler::push_llama_rs_rt_args_for_mm (+11 max)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
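In sketch form, the pattern these signalers apply (the signature is illustrative; the +6+2*cores estimate mirrors the commit message):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

void push_fused_op_rt_args(std::vector<uint32_t>& rt_args, std::size_t num_cores) {
    // The vector already holds earlier args, so grow from its current
    // size by the number of elements about to be appended.
    rt_args.reserve(rt_args.size() + 6 + 2 * num_cores);
    // ... push_back series follows ...
}
```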
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ctories Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ender/idle args
No pattern is too small: add reserve() to push_reduce_scatter_fused_op_rt_args, push_llama_rs_rt_args_for_rs (1 push_back each), and MinimalMatmulFusedOpSignaler (12 push_backs, previously missing reserve). Convert mm_in0_sender_args and mm_in0_idle_args in the dram_sharded factory to vector_init.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
container_init<Container, N>(vals...) works with any container that has reserve() and emplace_back() (std::vector, ttsl::SmallVector, etc.). Existing vector_init overloads now delegate to container_init — zero churn on existing call sites.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
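A rough sketch of the generic form this commit describes, under the assumption that N is the compile-time reserve count (not the actual implementation):

```cpp
#include <cstddef>
#include <utility>

template <typename Container, std::size_t N, typename... Args>
Container container_init(Args&&... args) {
    Container c;
    c.reserve(N);  // works for any container exposing reserve()/emplace_back()
    (c.emplace_back(std::forward<Args>(args)), ...);
    return c;
}
```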
small_vector_init<T>(vals...) provides the same ergonomics as vector_init but returns ttsl::SmallVector<T>. Added #include <tt_stl/small_vector.hpp> to support the wrappers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
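And the wrapper, in the same sketch style (assuming the container_init sketch above, and that ttsl::SmallVector provides reserve() and emplace_back()):

```cpp
#include <utility>
#include <tt_stl/small_vector.hpp>

template <typename T, typename... Args>
ttsl::SmallVector<T> small_vector_init(Args&&... args) {
    return container_init<ttsl::SmallVector<T>, sizeof...(Args)>(std::forward<Args>(args)...);
}

// Usage, as in the next commit: auto ids = small_vector_init<SubDeviceId>(id);
```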
Convert brace-init single-element SmallVector<SubDeviceId> patterns to small_vector_init<SubDeviceId>(id) in all_gather, reduce_scatter, broadcast, and all_broadcast program factories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace std::vector with ttsl::SmallVector for rank-bounded dimension arrays (max 4-8 elements) in hot-path slice and transpose program factories:
- slice_program_factory_rm.cpp: 4 dimension arrays
- padded_slice_tile_program_factory.cpp: 6 dimension arrays
- transpose_hc_sharded_program_factory.cpp: 2 grid maps
- padded_slice_rm_program_factory.cpp: 4 dimension arrays
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…L ops
Replace std::vector with ttsl::SmallVector for small bounded local vectors:
- padded_slice_tile: 4 more per-dimension arrays (lines 209-308)
- slice_write (3 files): 6 dimension arrays each (interleaved, sharded, tiled)
- broadcast/all_broadcast: fixed-size mcast args (size 2)
All converted vectors are purely local with known small bounds (≤8 elements).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
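The rationale, as a sketch: a small-vector keeps its first elements in inline storage, so rank-bounded arrays never hit the heap. The include path and type are taken from the commits above; the exact inline capacity is an assumption:

```cpp
#include <cstdint>
#include <tt_stl/small_vector.hpp>

ttsl::SmallVector<uint32_t> make_dim_array(uint32_t rank) {
    ttsl::SmallVector<uint32_t> dims;
    // rank is bounded (<= 8 in these factories), so every element fits in
    // the inline buffer and no heap allocation occurs.
    for (uint32_t d = 0; d < rank; ++d) {
        dims.push_back(d);
    }
    return dims;
}
```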
…ze_std_vector_allocations
Optimize vector allocations in 7 softmax program factory files by replacing std::vector initialization with ttsl::vector_init:
- softmax_program_factory_general_h_small.cpp
- softmax_program_factory_general_h_large.cpp
- softmax_program_factory_general_w_small.cpp
- softmax_program_factory_general_w_large.cpp
- softmax_program_factory_general_c_large.cpp
- softmax_program_factory_attention_optimized.cpp
- softmax_program_factory_attention_optimized_sharded.cpp
Changes:
- Add #include <tt_stl/vector_init.hpp> where missing
- Replace compile-time args vectors with ttsl::vector_init
- Replace runtime args vectors in loops with ttsl::vector_init
- Use auto for type deduction with vector_init calls
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized compile-time args vectors in:
- batch_norm/device/batch_norm_program_factory.cpp
- batch_norm/device/running_statistics_program_factory.cpp
- groupnorm/device/groupnorm_no_mcast_program_factory.cpp
- groupnorm/device/groupnorm_mcast_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_2d_program_factory.cpp
Replaced std::vector<uint32_t> with ttsl::vector_init<uint32_t>() for empty vectors and ttsl::vector_init<uint32_t>(...) for initialized vectors to optimize memory allocations and reduce binary size.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized vector allocations in all_to_all_combine, all_to_all_dispatch, and mesh_partition program factory files by replacing std::vector with ttsl::vector_init. This reduces allocations and reallocations for compile-time and runtime argument vectors.
Changes:
- Added #include <tt_stl/vector_init.hpp> to all three files
- Replaced reader/writer compile_time_args vectors with vector_init
- Replaced reader/writer runtime_args vectors with vector_init
- Replaced dest_mesh_id/dest_chip_id reserve() with vector_init(vector_size)
- Replaced Shape constructor vectors in mesh_partition with vector_init
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed empty vector_init<T>() followed by append_to() back to std::vector<T> for better readability. Using vector_init() for empty vectors adds no value when immediately followed by operations that handle their own capacity management. Also reverted incorrect vector_init with vector_size patterns back to proper std::vector<T> with .reserve() or std::vector<T>(size, value) for uniform initialization.
Files modified:
- 7 softmax program factories (empty compile-time args)
- 2 groupnorm program factories (empty compile-time args)
- 3 CCL program factories (reserve patterns)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
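A sketch of one form the revert restores, where the standard constructor is already optimal:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> uniform_args(std::size_t n) {
    // Fill constructor: a single allocation of n identical values,
    // no helper needed.
    return std::vector<uint32_t>(n, 0);
}
```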
Ticket
#37879
Problem description
The default std::vector allocation strategy is suboptimal when multiple elements are inserted.
A vector starts with capacity 0; each insertion that exceeds the current capacity allocates a larger buffer (g++'s strategy is to double the capacity) and migrates the existing elements (copying or moving them) into it. Reserving memory ahead of the insertions avoids these reallocations entirely.
On top of that, constructing vector elements in place is cheaper than copying or moving them in. Even inline initialization (std::vector v{v1, v2, v3};) suffers from this, because std::initializer_list exposes its elements as const, forcing a copy of each one.
Microbenchmark presenting performance improvements possible https://quick-bench.com/q/4H4CNijgXcyUyAkaPMomSVDprPA
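A condensed sketch of the comparison (not the linked benchmark verbatim):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Growth path: roughly log2(n) reallocations, each migrating every existing element.
std::vector<std::string> grown(std::size_t n) {
    std::vector<std::string> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(std::string(64, 'x'));
    }
    return v;
}

// Reserved path: one allocation, each element constructed in place.
std::vector<std::string> reserved(std::size_t n) {
    std::vector<std::string> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        v.emplace_back(64, 'x');
    }
    return v;
}
```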
What's changed
std::vector occurrences are accompanied by .reserve() wherever the number of insertions can be determined up front.
Further optimizations emplace inline-constructed objects and take advantage of techniques like RVO, wherever possible, to reduce allocations and data movement.
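For illustration, the emplace-plus-RVO shape in a hedged sketch (the struct and all names are invented for the example):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct CoreArgs {  // hypothetical stand-in for the real argument types
    CoreArgs(std::string kernel, int core) : kernel(std::move(kernel)), core(core) {}
    std::string kernel;
    int core;
};

std::vector<CoreArgs> build_args(int num_cores) {
    std::vector<CoreArgs> args;
    args.reserve(static_cast<std::size_t>(num_cores));
    for (int i = 0; i < num_cores; ++i) {
        // Constructed directly in the vector's storage; no named temporary,
        // no copy/move into the container.
        args.emplace_back("reader", i);
    }
    return args;  // NRVO: the result is constructed in the caller's frame
}
```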
Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.
- Single card: models-mandatory preset (runs: Device perf regressions and Frequent model and ttnn tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)
- T3K: models-mandatory preset (runs: Unit tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)
- Galaxy: models-mandatory preset (runs: Quick tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)