[GPU] Defer allocations of inputs #35126
intbf wants to merge 15 commits into openvinotoolkit:master
Conversation
build_jenkins
Force-pushed 57b2748 to 45351f8
Lyamin-Roman left a comment:
Have there been any measurements of the impact of this change on performance? And if there is an impact, maybe add a property like "disable_input_preallocation" that you could use in your application.
I haven't observed any performance drop/changes in the phi_silica_app test (on dummy weights). @Kotomi-Du also checked the app on real weights, and both performance and conformance were good.

I confirmed the TPS for 2nd+ tokens is the same as before; the PR could potentially impact first-token latency because memory allocation for all nodes is deferred to the very first inference.
@intbf @Kotomi-Du please take a look at the dynamic tests errors:
Force-pushed e838f77 to f73832f
SIGABRT for 14 tests...

Yes, the main issue is that the memory is allocated/updated lazily in set_data, but some tests don't call it. In those cases I will try to update the tests and allocate memory so that no access violation (AV) happens.
Correction: this change already applies to Parameters, since during compilation ov::op::v0::Parameter converts to/creates a cldnn::input_layout primitive.
```cpp
std::vector<event::ptr> set_output_memory(const primitive_id& id, memory::ptr mem, bool is_remote = false);
```

```cpp
std::vector<std::shared_ptr<primitive_inst>> const& get_outputs() { return _outputs; }
```
(random spot)

- Based on the description, I think a performance impact might be observed in the warm-up phase of a static-shape model. Could you check that, especially for large-input models such as detection models?
- How does this interact with memory-pool reuse? Does it impact memory usage? Or was the input_layout memory simply never reused from the memory pool, so there is no memory-usage impact?
I updated the description with some benchmark runs for psd2, psd7, psr, yolo... Is that good enough, or should I add more tests?

As for the second question: I'll try to prepare a unit test that shows whether the memory is reused (following our discussion in chat).
Force-pushed ab418f8 to 5382f0b
Several tests are still failing on DG2 GPU: [2026-04-08T15:40:49.481Z] [2026-04-08 15:40:49,080] [124175102043712] cldnn_unit_tests_dg2-0 INFO: [7010/24717] quantize_smoke/quantize_random_test.random/22 (596 ms)
Force-pushed 43b6fc1 to d209e8b
More tests are failing: smoke/DynamicShapeStatefulModelDefault.smoke_Run_Stateful_Dynamic_Default/0. I'll work on them today.
Force-pushed 481e216 to 1ef0611
```cpp
auto& eng = get_engine();
```

```cpp
if (p_inst->output_memory_ptr())
    _in_out_shared_mem_types.push_back(p_inst->output_memory_ptr()->get_internal_params().mem_type);
```
Why does it need to store the mem_type information in this object?
_in_out_shared_mem_types is the cached vector of shared_mem_type enum values for all inputs/outputs. It is used in network.cpp execute() to check whether GPU surface locking is required before execution. The data is stored in network.cpp allocate_primitive_instance, which must be called before execution. Why does deferred allocation have an impact on this?
You're right, those two locations were wrong. The original motivation was: lazy input_layout nodes have a null output_memory_ptr() at allocate_primitive_instance() time, so the normal push_back there is skipped. If the user later provides a shared surface (VA/DX11) via set_input_data(), _in_out_shared_mem_types would never record it. I moved it to set_input_data.
build_jenkins
```cpp
for (auto const& input : _inputs) ret.push_back(input->output_memory_ptr()->get_layout());
```

```cpp
for (auto const& input : _inputs) {
    if (input->output_memory_ptr())
        _in_out_shared_mem_types.push_back(input->output_memory_ptr()->get_internal_params().mem_type);
```
Force-pushed 84c5fce to 7a6c73b
build_jenkins

In the description I added some results from benchmark runs; is that good enough? Since there's not much perf impact, maybe there's no need to introduce this extra ov flag?
Shouldn't we extend the unit test scope, verifying that the proper input memory address is indeed set to the next node's input?

- Check if it is correctly propagated through a chain of optimized ops (again verifying the address).
- Rebinding external memory (setting a new input memory multiple times).
- Possible implicit memory binding, when the output memory of the previous run is fed into the input of the second run. We can end up in a situation where the input and the output of the same primitive are the same memory address, so the primitive can overwrite its own input. This should be covered by the existing tests, but we need to double-check that those checks validate this new code path.
- Multi-output, when the lazily allocated memory is reused across several consumers.
Thanks for the comment. I addressed those scenarios in memory_test.cpp; please have a look at the latest commit/update.
Force-pushed d6d78be to 51dab1b
build_jenkins
Force-pushed 1141b51 to 8660895
build_jenkins
In typed_primitive_inst, force "allocate mem" to false so that we can avoid allocations of large inputs. Handle cases where the inputs are expected to be present (check for null, or allocate a temp buffer for simplicity).
…y allocated (like loop primitive)
Previously the tests expected that memory would be preallocated for the network's input layouts; now it isn't, so the tests were adjusted accordingly.
…roperly update dependencies and internals, ensure _reset_arguments is called on set_input_data
The primitive_inst::set_output_memory function has short-circuit logic that may do nothing when the pointers are the same, but in some cases the same pointer can be set with a different layout, and in those cases the old state would not be properly updated.
…ed, code style and small refactor for unit tests
…mem, simplify handling of _in_out_shared_mem_types
…ccessing input_memory_ptr in update_output_memory functions
…the push_back so that the new type is recorded even for mem change
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Cover the reviewer's checklist for the lazy input allocation PR:
- Buffer propagation to direct consumers after set_input_data
- Multi-consumer fan-out from a single lazy input
- Chain of optimized ops aliasing through the lazy buffer
- Rebinding external memory across multiple inferences
- Feeding previous output back as next inference input
Force-pushed 8660895 to 7be2342
In input_layout_node, try to skip early mem allocations so that we can avoid memory increase for large inputs. This optimization reduces the total memory peak of the phi silica application from 10 GB down to 6 GB.
Details:
Before this change, allocate_mem was skipped (set to false) only in some cases, for example for dynamic shapes and internal networks. The PR forces it to always be false and also handles cases where the inputs are expected to be present (check for null, or allocate a temp buffer for simplicity).
See the early version of the presentation: https://intel-my.sharepoint.com/:p:/p/bartlomiej_filipek/IQCJ4tTQG0XHQYjMm_FAUlyIAf2FeLPfsthPN6xxJt4TD-I?e=DOG6DL
Tickets:
CVS-178139
AI Assistance:
Perf/mem Comparison:
Using benchmark_app.exe, LunarLake 5 236V, 16GB, iGPU:
PR - binaries compiled with this PR
Master - OpenVINO master, as of 14th April, 2075ff4