Improve std::vector allocations and insertions performance #37880
Draft
mateusznowakTT wants to merge 54 commits into main from
Conversation
Add reserve() calls to std::vector allocations where final size is known upfront to avoid reallocations.
Changes:
- mesh_coord.cpp: 8 reserve() additions in core mesh operations
- core_coord.cpp: 4 reserve() additions in core coordinate operations
- conv2d_op_sharded_program_factory.cpp: 2 reserve() for core vectors
- sdpa_decode_program_factory.cpp: 2 reserve() for core group vectors
All reserves use known or calculable sizes to avoid reallocations.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
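A minimal sketch of the pattern this commit applies (function and variable names are illustrative, not the actual call sites):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> collect_ids(std::size_t num_cores) {
    std::vector<uint32_t> ids;
    ids.reserve(num_cores);  // one allocation up front; no regrowth inside the loop
    for (std::size_t i = 0; i < num_cores; ++i) {
        ids.push_back(static_cast<uint32_t>(i));
    }
    return ids;  // NRVO: the vector is not copied on return
}
```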
Continue std::vector allocation optimizations in high-usage operations:
- matmul_multicore_reuse_mcast_1d_program_factory: 2 reserve() additions
  * non_idle_cores_vec: reserve(subdevice_cores.ranges().size())
  * ring_list: reserve before insert operation
- all_gather_concat_program_factory: 2 reserve() additions
  * q_cores_vector: reserve(concat_num_cores)
  * sem_cores_vector: reserve(concat_num_cores + 1)
- groupnorm_mcast_program_factory: 2 reserve() additions
  * mcast_groups: reserve(sender_ranges.size())
  * mcast_virtual_groups: reserve(sender_ranges.size())
Part of ongoing effort to eliminate vector reallocations.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert 4 instances from list initialization to reserve+emplace pattern.
Changes:
- core_coord.cpp: current_remaining vector in subtract loop
* Was: {current_range} initialization
* Now: reserve(1) + emplace_back()
- matmul_multicore_reuse_mcast_1d: shared_cbs vector
* Was: {cb_src0, cb_src1} initialization
* Now: reserve(2+size) + emplace_back() for each
- all_gather_concat: input/output/temp tensor vectors (3 instances)
* Was: {tensor} initialization for each
* Now: reserve(1) + emplace_back() for each
- conv2d_op_sharded: activation_reuse_dummy_args
* Was: {0, 0, 0, 0} initialization
* Now: reserve(4) + emplace_back() loop
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
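For context, a minimal sketch of the conversion this commit describes; the element type is illustrative:

```cpp
#include <string>
#include <vector>

void example() {
    // Before: initializer-list construction copies every element, because
    // std::initializer_list exposes const elements that cannot be moved from.
    std::vector<std::string> before{std::string(64, 'a'), std::string(64, 'b')};

    // After: reserve once, then construct each element directly in place.
    std::vector<std::string> after;
    after.reserve(2);
    after.emplace_back(64, 'a');
    after.emplace_back(64, 'b');
}
```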
Continue vector allocation optimizations in data movement and CCL:
- transpose_hc_sharded_program_factory: 5 reserve() additions
  * shard_grid_x_map: reserve(num_cores_x)
  * shard_grid_y_map: reserve(num_cores_y)
  * reader_runtime_args: convert list init to reserve+emplace (5 elements)
  * cores: reserve(num_cores)
  * stick_ids_per_core: reserve(num_sticks_per_core)
- ccl_common: 1 reserve() addition
  * combined_tensors: reserve(num_devices) in slice/concat operation
These optimizations target data movement operations (transpose) and collective communication (CCL) hot paths.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed bug from commit 7ed4ba5ff02 where .ranges() was incorrectly called on std::set<CoreRange>. std::set doesn't have a .ranges() method; that exists only on CoreRangeSet. Changed to use .size() instead.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 8 reserve() optimizations:
1. groupnorm_no_mcast_program_factory.cpp (2 optimizations):
   - mcast_groups.reserve(sender_groups_count)
   - mcast_virtual_groups.reserve(sender_groups_count)
2. reshard_program_factory_generic.cpp (1 optimization):
   - compressed_blocks.reserve(page_strides.size())
3. permute_tiled_program_factory.cpp (1 optimization):
   - reader_runtime_args: list init → reserve+emplace
   - Total size: 3 + output_shape_view + inv_perm + input_tile_strides
4. llama_1d_mm_fusion.cpp (4 optimizations):
   - non_idle_cores_vec.reserve(subdevice_cores.ranges().size())
   - cb_outputs.reserve(out_buffers.size())
   - output_cb_indices.reserve(out_buffers.size())
   - interm_cb_indices.reserve(out_buffers.size())
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added 5 reserve() optimizations:
1. rms_allgather_program_factory.cpp (4 optimizations):
- storage_core_noc_x.reserve(storage_core_coords.size())
- storage_core_noc_y.reserve(storage_core_coords.size())
- stats_tensor_cores_x.reserve(num_stats_cores)
- stats_tensor_cores_y.reserve(num_stats_cores)
where num_stats_cores = calculated from tile range
2. groupnorm_sharded_program_factory.cpp (1 optimization):
- mcast_groups.reserve(num_sender_cores)
where num_sender_cores = (num_batches/num_batches_per_core) * (num_groups/num_groups_per_core)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add reserve() calls to eliminate reallocations in:
- SDPA program factory: head_work and reader_args vectors
- SDPA decode: input_tensors with reserve+emplace pattern
- Sort program factory: physical_core_lookup_table_data
- Reduce H/W program factories: cores vector
- Moreh Adam/AdamW: output spec and tensor vectors
- Sliding window: serialize_gather_config output buffer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, tensor
Add reserve() calls and convert list init to reserve+emplace in:
- CCL: command_lowering, sharding_addrgen_helper, all_gather_concat (14 vectors)
- Ring attention all-gather: reader/writer forward/backward rt_args (4 vectors)
- Moreh: helper_functions, matmul, mean_backward, norm_backward, sum_backward
- Concat: program_factory, s2i_program_factory runtime args
- Distributed: mesh_device, mesh_device_view, mesh_command_queue_base
- Infrastructure: metal_context, bfloat16, tensor_ops, tensor_spec_flatbuffer
- Pool: pool_multi_core scalars_per_core
- Llama fusion: bank_ids
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…upnorm
Add reserve() calls to 7 files with 13 optimizations to eliminate vector reallocations where final size is known.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() for cores_with_rtargs vector populated in loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gram, ccl
Add reserve() calls before push_back/emplace_back loops with known sizes:
- device.cpp: all_worker_cores_logical (num_cores_x * num_cores_y)
- l1_banking_allocator.cpp: shuffled_bank_id, dram_bank_offsets
- kernel.cpp: file_paths (expected_num_binaries)
- device_manager.cpp: device_ids, device_ids_to_open
- profiler.cpp: virtual_dispatch_cores
- program.cpp: kernel_ids (2 locations)
- ccl_types_args_emitters.cpp: noc row/col maps, emit_rt_args
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add reserve() calls for vectors with known sizes:
- transpose_hc_sharded: 10 reserves for read_cores and non_repeat vectors in loops
- binary_backward: 18 reserves for grad_tensor vectors (consistently 2 elements)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsor vectors
Add reserve(1) to all 64 std::vector<Tensor> and 2 std::vector<ComplexTensor> grad_tensor declarations in unary_backward.cpp.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate temporary variables that are declared, assigned, and immediately pushed to a vector without further references:
- core_coord.cpp: 4 CoreRange variables (left, right, bottom, top) → direct emplace_back
- binary_backward.cpp: ~15 grad_a/grad_b variables → direct emplace_back
- unary_backward.cpp: ~40 result/grad_result variables → direct emplace_back
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace previously unrolled reserve+emplace patterns and remaining initializer-list patterns with the vector_init helper for cleaner code and optimal performance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
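The helper itself lands later in this PR; as a rough approximation of the shape such a helper can take (not the actual implementation):

```cpp
#include <utility>
#include <vector>

// Reserve exactly sizeof...(args), then emplace each argument with
// perfect forwarding so rvalues are moved rather than copied.
template <typename T, typename... Args>
std::vector<T> vector_init(Args&&... args) {
    std::vector<T> v;
    v.reserve(sizeof...(args));
    (v.emplace_back(std::forward<Args>(args)), ...);
    return v;  // NRVO
}

// Usage: auto args = vector_init<uint32_t>(a, b, c);
// replaces: std::vector<uint32_t> args{a, b, c};
```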
Add missing reserve() calls in 8 locations with known loop sizes:
- core_descriptor, metal_soc_descriptor, kernel, program, mesh_device,
layernorm_pre_all_gather, dram_prefetcher
Convert push_back(T{...}) to emplace_back(T{...}) in 9 files:
- reshard, loss, groupnorm_sharded, layernorm (3 files), sliding_window,
sdpa, ring_joint_sdpa, system_mesh
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ze_std_vector_allocations
- control_plane.cpp: corner_asic_positions reserve(4), corner_fabric_node_ids reserve(4 * num_meshes)
- topology_mapper.cpp: pinning_strs reserve(mesh_pinnings.size())
- layernorm_pre_all_gather_2d_program_factory.cpp: merge_core_ranges_vec reserve(cores_x)
- qa_hal.cpp: objs reserve(4), includes reserve(13)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erload
Move vector_init.hpp from ttnn/api/ttnn/common/ to tt_stl/tt_stl/ so it can be used by both tt_metal and ttnn code. Add ttsl namespace, deprecated tt::stl alias, and a runtime reserve-size overload using ttsl::vector_size (a ttsl::StrongType<size_t>). Update all 9 consumers to use the new include path and ttsl::-qualified calls.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ttsl::vector_init with compile-time N in HAL includes (bh/wh/qa) and exact-fit in control_plane and core_coord. Use the runtime vector_size overload in concat, transpose, permute, and moreh operation program factories.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace reserve+push_back series with the ttsl::vector_init runtime overload in sdpa_program_factory (17 fixed args), sdpa_decode reader/writer args (15+10 init elements with inserts), and conv2d width-sharded rt_args.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use compile-time N=19 for compute_all_to_all/not_all_to_all init lists with conditional welford args. Use runtime vector_size for reader_sender, reader_receiver, and writer_args builders with estimated total sizes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate estimated additions before push_back series in:
- RingSDPAFusedOpSignaler::push_ring_sdpa_fused_op_rt_args (+6)
- AllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- StridedAllGatherFusedOpSignaler::push_all_gather_fused_op_rt_args (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload1 (+6+2*cores)
- MatmulFusedOpSignaler::push_matmul_fused_op_rt_args overload2 (+9)
- MatmulFusedOpSignaler::push_llama_rs_rt_args_for_mm (+11 max)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
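In sketch form, the pattern these signalers apply (the signature is illustrative; the +6+2*cores estimate mirrors the commit message):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

void push_fused_op_rt_args(std::vector<uint32_t>& rt_args, std::size_t num_cores) {
    // The vector already holds earlier args, so grow from its current
    // size by the number of elements about to be appended.
    rt_args.reserve(rt_args.size() + 6 + 2 * num_cores);
    // ... push_back series follows ...
}
```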
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ctories Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ender/idle args
No pattern is too small: add reserve() to push_reduce_scatter_fused_op_rt_args, push_llama_rs_rt_args_for_rs (1 push_back each), and MinimalMatmulFusedOpSignaler (12 push_backs, previously missing reserve). Convert mm_in0_sender_args and mm_in0_idle_args in the dram_sharded factory to vector_init.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
container_init<Container, N>(vals...) works with any container that has reserve() and emplace_back() (std::vector, ttsl::SmallVector, etc.). Existing vector_init overloads now delegate to container_init — zero churn on existing call sites.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
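A rough sketch of the generic form this commit describes, under the assumption that N is the compile-time reserve count (not the actual implementation):

```cpp
#include <cstddef>
#include <utility>

template <typename Container, std::size_t N, typename... Args>
Container container_init(Args&&... args) {
    Container c;
    c.reserve(N);  // works for any container exposing reserve()/emplace_back()
    (c.emplace_back(std::forward<Args>(args)), ...);
    return c;
}
```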
small_vector_init<T>(vals...) provides the same ergonomics as vector_init but returns ttsl::SmallVector<T>. Added #include <tt_stl/small_vector.hpp> to support the wrappers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
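And the wrapper, in the same sketch style (assuming the container_init sketch above, and that ttsl::SmallVector provides reserve() and emplace_back()):

```cpp
#include <utility>
#include <tt_stl/small_vector.hpp>

template <typename T, typename... Args>
ttsl::SmallVector<T> small_vector_init(Args&&... args) {
    return container_init<ttsl::SmallVector<T>, sizeof...(Args)>(std::forward<Args>(args)...);
}

// Usage, as in the next commit: auto ids = small_vector_init<SubDeviceId>(id);
```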
Convert brace-init single-element SmallVector<SubDeviceId> patterns to small_vector_init<SubDeviceId>(id) in all_gather, reduce_scatter, broadcast, and all_broadcast program factories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace std::vector with ttsl::SmallVector for rank-bounded dimension arrays (max 4-8 elements) in hot-path slice and transpose program factories:
- slice_program_factory_rm.cpp: 4 dimension arrays
- padded_slice_tile_program_factory.cpp: 6 dimension arrays
- transpose_hc_sharded_program_factory.cpp: 2 grid maps
- padded_slice_rm_program_factory.cpp: 4 dimension arrays
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…L ops
Replace std::vector with ttsl::SmallVector for small bounded local vectors:
- padded_slice_tile: 4 more per-dimension arrays (lines 209-308)
- slice_write (3 files): 6 dimension arrays each (interleaved, sharded, tiled)
- broadcast/all_broadcast: fixed-size mcast args (size 2)
All converted vectors are purely local with known small bounds (≤8 elements).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
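The rationale, as a sketch: a small-vector keeps its first elements in inline storage, so rank-bounded arrays never hit the heap. The include path and type are taken from the commits above; the exact inline capacity is an assumption:

```cpp
#include <cstdint>
#include <tt_stl/small_vector.hpp>

ttsl::SmallVector<uint32_t> make_dim_array(uint32_t rank) {
    ttsl::SmallVector<uint32_t> dims;
    // rank is bounded (<= 8 in these factories), so every element fits in
    // the inline buffer and no heap allocation occurs.
    for (uint32_t d = 0; d < rank; ++d) {
        dims.push_back(d);
    }
    return dims;
}
```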
…ze_std_vector_allocations
Optimize vector allocations in 7 softmax program factory files by replacing std::vector initialization with ttsl::vector_init:
- softmax_program_factory_general_h_small.cpp
- softmax_program_factory_general_h_large.cpp
- softmax_program_factory_general_w_small.cpp
- softmax_program_factory_general_w_large.cpp
- softmax_program_factory_general_c_large.cpp
- softmax_program_factory_attention_optimized.cpp
- softmax_program_factory_attention_optimized_sharded.cpp
Changes:
- Add #include <tt_stl/vector_init.hpp> where missing
- Replace compile-time args vectors with ttsl::vector_init
- Replace runtime args vectors in loops with ttsl::vector_init
- Use auto for type deduction with vector_init calls
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized compile-time args vectors in:
- batch_norm/device/batch_norm_program_factory.cpp
- batch_norm/device/running_statistics_program_factory.cpp
- groupnorm/device/groupnorm_no_mcast_program_factory.cpp
- groupnorm/device/groupnorm_mcast_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_program_factory.cpp
- layernorm_distributed/device/layernorm_post_all_gather_welford_program_factory.cpp
- layernorm_distributed/device/layernorm_pre_all_gather_2d_program_factory.cpp
Replaced std::vector<uint32_t> with ttsl::vector_init<uint32_t>() for empty vectors and ttsl::vector_init<uint32_t>(...) for initialized vectors to optimize memory allocations and reduce binary size.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Optimized vector allocations in all_to_all_combine, all_to_all_dispatch, and mesh_partition program factory files by replacing std::vector with ttsl::vector_init. This reduces allocations and reallocations for compile-time and runtime argument vectors.
Changes:
- Added #include <tt_stl/vector_init.hpp> to all three files
- Replaced reader/writer compile_time_args vectors with vector_init
- Replaced reader/writer runtime_args vectors with vector_init
- Replaced dest_mesh_id/dest_chip_id reserve() with vector_init(vector_size)
- Replaced Shape constructor vectors in mesh_partition with vector_init
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed empty vector_init<T>() followed by append_to() back to std::vector<T> for better readability. Using vector_init() for empty vectors adds no value when immediately followed by operations that handle their own capacity management. Also reverted incorrect vector_init with vector_size patterns back to proper std::vector<T> with .reserve() or std::vector<T>(size, value) for uniform initialization.
Files modified:
- 7 softmax program factories (empty compile-time args)
- 2 groupnorm program factories (empty compile-time args)
- 3 CCL program factories (reserve patterns)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
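A sketch of one form the revert restores, where the standard constructor is already optimal:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> uniform_args(std::size_t n) {
    // Fill constructor: a single allocation of n identical values,
    // no helper needed.
    return std::vector<uint32_t>(n, 0);
}
```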
Ticket
#37879
Problem description
The default std::vector allocation strategy is suboptimal when multiple elements are inserted.
A vector starts with capacity 0; each insertion that exceeds the current capacity allocates a larger buffer (g++'s strategy is to double the capacity) and migrates the existing elements (copying or moving them) into it. Reserving memory ahead of the insertions avoids these reallocations entirely.
On top of that, constructing vector elements in place is cheaper than copying or moving them in. Even inline initialization (std::vector v{v1, v2, v3};) suffers from this, because std::initializer_list exposes its elements as const, forcing a copy of each one.
Microbenchmark presenting performance improvements possible https://quick-bench.com/q/4H4CNijgXcyUyAkaPMomSVDprPA
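A condensed sketch of the comparison (not the linked benchmark verbatim):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Growth path: roughly log2(n) reallocations, each migrating every existing element.
std::vector<std::string> grown(std::size_t n) {
    std::vector<std::string> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(std::string(64, 'x'));
    }
    return v;
}

// Reserved path: one allocation, each element constructed in place.
std::vector<std::string> reserved(std::size_t n) {
    std::vector<std::string> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        v.emplace_back(64, 'x');
    }
    return v;
}
```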
What's changed
std::vector occurrences are accompanied by .reserve() wherever the number of insertions can be determined up front.
Further optimizations emplace inline-constructed objects and take advantage of techniques like RVO, wherever possible, to reduce allocations and data movement.
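For illustration, the emplace-plus-RVO shape in a hedged sketch (the struct and all names are invented for the example):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct CoreArgs {  // hypothetical stand-in for the real argument types
    CoreArgs(std::string kernel, int core) : kernel(std::move(kernel)), core(core) {}
    std::string kernel;
    int core;
};

std::vector<CoreArgs> build_args(int num_cores) {
    std::vector<CoreArgs> args;
    args.reserve(static_cast<std::size_t>(num_cores));
    for (int i = 0; i < num_cores; ++i) {
        // Constructed directly in the vector's storage; no named temporary,
        // no copy/move into the container.
        args.emplace_back("reader", i);
    }
    return args;  // NRVO: the result is constructed in the caller's frame
}
```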
Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to the affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.
- Single card: models-mandatory preset (runs: Device perf regressions and Frequent model and ttnn tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)
- T3K: models-mandatory preset (runs: Unit tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)
- Galaxy: models-mandatory preset (runs: Quick tests); models-extended preset (runs: the mandatory tests, plus Demo and Model perf tests)