Enable prefetching of SW kernel instructions after the first SW task #199
Kepontry wants to merge 18 commits into openvinotoolkit:develop
Conversation
@Kepontry hello! Thanks for your PR! Could you please ensure your changes include test coverage by adding tests to https://github.com/openvinotoolkit/npu_compiler/blob/develop/tests/lit/NPU/dialect/VPUIP/passes/add_sw_kernel_instruction_prefetch_40XX.mlir, and maybe some functional tests?
src/vpux_compiler/src/dialect/VPUIP/transforms/passes/add_sw_kernel_instruction_prefetch.cpp
@Kepontry could you please look into this comment? Thank you!
Apologies for the delay; I missed the email notification for this thread. I am currently working on adding the test. Could you provide some guidance or documentation on how to use the lit test framework within the NPU compiler?
Functional test added.
Hi @Kepontry! Please adhere to the clang-format guidelines. You will find the automatically fixed code style in the job logs: https://github.com/openvinotoolkit/npu_compiler/actions/runs/19637113134/job/56230572067?pr=199. I also noticed that some LIT tests failed. Could you please verify if these failures are due to your changes?
Hi @Maxim-Doronin, the LIT test failure is caused by the
Hello @Kepontry! Thanks for adding the tests and sharing your findings regarding pre-commit failures!
@Kepontry I managed to reproduce the issue: -> Will research it a bit and get back! In the meantime, could you please share with us why you set this particular value?
Hi @DariaMityagina, thanks for your assistance regarding this issue. Since this PR enables prefetching regardless of the first SHAVE task's start-time threshold, it exposed some existing bugs in certain test cases. These bugs were previously hidden because there wasn't enough time slack to trigger the prefetch logic. I adjusted the threshold to simulate a scenario that forces prefetch insertion, confirming that these test cases fail without this PR.
Hi @DariaMityagina, I have fixed the failing tests. The root cause was that I fixed this by manually adding the
Thanks a lot for the updates! |
The failure log indicates an
I suspect this issue is related to the recently inserted MLIR code. I am currently uncertain whether the root cause involves the specific values (1326182 or 1473536) or the reserved memory allocation. Since the code runs on both NPU37XX and NPU40XX, I made modifications according to the implementation found in feasible_allocation.mlir (npu_compiler/tests/lit/NPU/dialect/VPUIP/passes/feasible_allocation.mlir, lines 123 to 129 at 9f8730d). I hope this resolves the issue. Alternatively, could you provide the scripts necessary to reproduce this experiment in the CI environment?
Fixed the failing tests on NPU37XX by adjusting the offset of the reserved memory. Similar changes were applied to the NPU40XX tests. Note: I'm currently uncertain why the
Hi @DariaMityagina, the prefetching for the
Hi @DariaMityagina, since
Hello @Kepontry! Great! Thank you! |
tests/lit/NPU/dialect/VPUIP/passes/add_sw_kernel_instruction_prefetch_mid_execution_40XX.mlir
Hi @DariaMityagina, I refactored the code as requested. I moved the variables into the class and added the comments. Thanks for pointing these out.
Hello @Kepontry! Thank you! |
Hello @Kepontry! Apologies for the delay. Due to the holiday season, the next review will be postponed a little. In the meantime, our validation process has identified some issues with the changes in this PR. I'll analyze these issues and share my findings here. |
Hi @DariaMityagina, Happy New Year! I wanted to follow up on this PR. You mentioned there were some validation issues identified earlier; could you please share the details/logs when you get a chance? I'd like to address those fixes so we can proceed with the review.
Hello @Kepontry! Merry Christmas and Happy New Year 🎄 Here is the problem found in the logs: Let me find a model you can use for debugging. |
@Kepontry hi! You can try to reproduce the issue using this model: -> |
@Kepontry hello! When I tested your branch directly, the issue didn't occur. Let me dig deeper to figure out what caused it to show up in our validation environment. |
    size_t dynamicExecTile = _dynamicPrefetchTileCounter % numClusters;
    _dynamicPrefetchTileCounter++;

    auto newPrefetchKernel = insertDummyKernelOpBeforeFirstKernelTask(insertBeforeOp, mlir::ValueRange(),
newPrefetchKernel is given empty update barriers in this case, and insertDummyKernelOpBeforeFirstKernelTask will make the wait barriers empty too. 🤔
I am not sure they will be scheduled in the slots we expect. I suppose tasks without wait barriers will be executed at the very beginning? @DariaMityagina Can we have someone confirm this?
If so, the insert function will need a proper wait barrier (and maybe also an update barrier) instead of empty ones for this use case.
You are correct that the update and wait barriers are empty. However, according to my observations from hardware SHAVE profiling, the dummy inst prefetch op is executed in the inserted non-saturated position. If the dummy op were inserted in a saturated position, the original SHAVE task would be postponed or even scheduled to another SHAVE unit. But if you confirm any scheduling behavior is not as expected, adding the barriers is also fine with me.
We should add wait barriers. I am not sure why the shave task executes at the position you want, but later passes would be free to reorder that prefetch to the beginning of the schedule.
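For readers skimming the quoted diff, the round-robin tile selection it implements can be sketched roughly as follows. Only the modulo-counter logic mirrors the PR; the wrapper class and method names are invented for this sketch:

```cpp
#include <cstddef>

// Illustrative wrapper around the round-robin tile choice quoted above.
// The counter/modulo logic mirrors the PR snippet; the class is invented.
class PrefetchTilePicker {
public:
    explicit PrefetchTilePicker(std::size_t numClusters) : _numClusters(numClusters) {}

    // Each inserted dummy prefetch kernel is assigned the next SHAVE tile in
    // turn, spreading prefetch work across tiles instead of piling it on one.
    std::size_t nextTile() {
        std::size_t tile = _dynamicPrefetchTileCounter % _numClusters;
        ++_dynamicPrefetchTileCounter;
        return tile;
    }

private:
    std::size_t _numClusters;
    std::size_t _dynamicPrefetchTileCounter = 0;
};
```

With four clusters the picker cycles 0, 1, 2, 3, 0, …, which is why the barrier question above matters: the position in the schedule, not the tile choice, is what later passes may reorder.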
Hi @Kepontry ! Thank you for your great contribution! I left some comments. Please take a look when you have time 😄 |
Hi @liyihao-1ntel ! Thank you for the valuable feedback. I have addressed your comments in the latest update. Please let me know if you have any further questions. |
liyihao-1ntel left a comment:
Discussion about gap finding:
- From the PR, we are deciding the gap size by:
  gapSize = targetTaskStart/saturationTaskGroupStart - insertionTaskStart
  In this case, during the gap the insertTask is still being executed, which might leave limited cache space for the dummy kernel. Would gapSize = targetTaskStart/saturationTaskGroupStart - insertionTaskEnd make more sense?
- Question: Given multiple target shave tasks at the same time, theoretically, how will our insertion function work? Will there be multiple dummy kernel tasks inserted?
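The two gap definitions under discussion can be contrasted in a small sketch. The struct and function names here are illustrative, not the pass's actual API:

```cpp
#include <cstdint>

// Hypothetical timing record for a scheduled task; names are illustrative.
struct TaskInterval {
    uint64_t startCycle;
    uint64_t endCycle;
};

// Gap measured from the insertion task's start, as the PR does today.
uint64_t gapFromStart(const TaskInterval& insertionTask, uint64_t targetStart) {
    return targetStart > insertionTask.startCycle ? targetStart - insertionTask.startCycle : 0;
}

// Gap measured from the insertion task's end, as proposed in this review:
// cache space only frees up for the dummy kernel once the insertion task retires.
uint64_t gapFromEnd(const TaskInterval& insertionTask, uint64_t targetStart) {
    return targetStart > insertionTask.endCycle ? targetStart - insertionTask.endCycle : 0;
}
```

For an insertion task running cycles 100–400 and a target starting at cycle 1000, the first definition reports a 900-cycle gap while the second reports only the 600 cycles during which the SHAVE (and its cache pressure) is actually free.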
Could you elaborate in comments why we pick a specific tile _targetInsertTileDuringExec here?
This variable is used solely as a reference for gap finding, not for the actual insertion (which, as mentioned, follows a round-robin strategy). I agree the name is slightly misleading, so I plan to rename it.
We chose a specific tile (Tile 1) for two reasons:
- When multiple kernels with the same operator execute concurrently (e.g., across Tiles 0-3), the schedule is symmetric. We don't need to calculate the gap for every tile; checking one representative tile is sufficient.
- We selected Index 1 (instead of 0) to ensure instruction prefetching is enabled only when the kernel spans at least two tiles, which provides more insertion slots and yields better performance gains.
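The representative-tile reasoning above reduces to a simple gate. This is a sketch with invented names; only the "index 1 implies at least two tiles" relationship comes from the discussion:

```cpp
#include <cstddef>

// Illustrative gating for the representative-tile choice described above:
// mid-schedule prefetch is considered only when the kernel spans at least
// two tiles, with tile index 1 serving as the reference for gap calculation.
constexpr std::size_t kRepresentativeTileIdx = 1;

bool shouldConsiderMidSchedulePrefetch(std::size_t numTilesUsedByKernel) {
    // Using index 1 as the reference implicitly requires >= 2 tiles, which
    // also guarantees extra insertion slots exist on the other tiles.
    return numTilesUsedByKernel > kRepresentativeTileIdx;
}
```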
Hi @malbecki! This PR introduces in-the-middle prefetch-task insertion. Would you help take a look at it when you have time? Would love to know your opinion on L2$ utilization in general cases and on this new approach. 😄 Thanks a lot!
Hi @liyihao-1ntel, I suspect the GitHub notification might have slipped through for @malbecki. Would you mind pinging him via internal channels to check if he has time to take a look? Thanks!
Hi @Kepontry! I have reached out to malbecki. Meanwhile, I would like to share some thoughts:

Pro 1: This in-the-middle prefetch-task insertion does provide more chances to improve shave perf in theory. I think CI results can help us know better about its scale and impact.

Concern 1: The support for dummy-kernel profiling on LNL is still incomplete as far as I know, so the changes in this PR can bring some invisible dummy shaves in the gaps between shave tasks. Too many potential invisible tasks might compromise profiling effectiveness, and this may confuse developers who are unfamiliar with the feature.

Concern 2: Current scheduling is based on a simulator and cost model, which will be upgraded over time (please correct me if any part of this statement is wrong). That is to say, we introduce constant thresholds very cautiously. Insertion in the middle of tasks might be impactful even when we set proper barriers for these new tasks. Checking current CI perf is the first step.
Hello, really sorry for the delay in response. I do agree that prefetching in the middle of the schedule would be a great feature. However, there is a reason why this pass was constrained to only work at the beginning of the schedule, and only with intervals that have 250k free cycles on SHAVE: our cost model is not very good at estimating certain workloads. I did have a version with prefetch in the middle of the schedule, and that version introduced both regressions and improvements on the models we test in our internal CI, with models slowing down a bit on average. The version with prefetch at the beginning, gated on 250k free cycles, did eliminate some of the improvements, but it also eliminated all of the regressions, which is why we went with it. I will try to get data on how this PR affects the models we test internally, and I think we can make a decision on whether we can merge it based on that. Of course, if there is a good improvement on a specific model like Qwen2 (I didn't see such an improvement originally), then even if we see an overall regression, maybe we can talk about merging this code under a compiler option.

Regarding L2$ utilization:

Regarding profiling:
        _log.trace("insertPoint: {0}, bestReleaseCycle: {1}", *firstShaveTaskInIR, bestReleaseCycle);
        newPrefetchKernels = insertPrefetchTasks(funcOp, kernelsToPrefetch, firstShaveTaskInIR, bestUpdateBarrier);
    } else if (_useDummyKernelForInstructionPrefetch) {
        newPrefetchKernels = insertPrefetchTasksDuringExec(funcOp, kernelsToPrefetch, allTasks);
It would be great if we could make the logic common between prefetching at the start and prefetching in the middle of the schedule, since the former is just a special case of the latter (at least conceptually). Though I understand that this is probably done to limit the impact on other platforms. If results are good we can follow up on this internally.
    uint64_t prevTargetTileTaskStartTime = 0;

    // find the largest gap between a non-saturated SW task and a saturated SW task / the kernel to be prefetched
    for (size_t i = 0; i < allTasks.size(); ++i) {
I think this code assumes there is no eviction event between the start of the schedule and the kernel start time. This might be true for existing platforms but might not be true for future ones. If we get good results we can accept it as is, but for now let's at least put a comment here that we make such an assumption.
    _log.trace("Kernel '{0}': Found best gap of {1} cycles. Inserting relative to task {2}.", kernelName,
               bestGap.lookaheadGap, bestGap.insertionPointTaskIndex);

    if (bestGap.insertionPointTaskIndex < 0 ||
    }

    if (prevTargetTileTaskIndex != -1) {
        size_t simultaneousSwKernels = getSwKernelCountAtTime(prevTargetTileTaskStartTime, allTasks);
I don't think allTasks accounts for the prefetch kernels that were already inserted in the schedule, so I think it is possible that this algorithm inserts more prefetch kernels into a slot than there are shaves on the NPU. This might cause multiple prefetch kernels on a single SHAVE in a single slot, which in turn means that the threshold of 50k might be exceeded.
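A plausible reading of getSwKernelCountAtTime — counting tasks whose interval covers a given cycle — makes the concern concrete: if allTasks omits the already-inserted prefetch kernels, the count under-reports slot occupancy. The task type and intervals below are hypothetical, not the pass's real data structures:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch in the spirit of getSwKernelCountAtTime(); the real
// pass operates on scheduled task ops, not plain timing records.
struct SwTaskTiming {
    uint64_t startCycle;
    uint64_t endCycle;
};

std::size_t swKernelCountAtTime(uint64_t cycle, const std::vector<SwTaskTiming>& allTasks) {
    std::size_t count = 0;
    for (const auto& task : allTasks) {
        // A task occupies a SHAVE slot for [startCycle, endCycle).
        if (task.startCycle <= cycle && cycle < task.endCycle) {
            ++count;
        }
    }
    return count;
}
```

If a dummy prefetch kernel already occupies a slot at `cycle` but is absent from allTasks, this count stays one too low, so a subsequent insertion could oversubscribe the available SHAVEs exactly as described above.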
Update on results: I ported this change to the internal develop branch and got the results. The good news is that I see an improvement (~3%) on a few models and no major regressions. The bad news is that several models failed to compile, possibly due to changes in the scheduler that were made on the internal branch. This will require some fixing, which I have scheduled internally, but I can't provide an ETA right now since we still haven't decided on the priority. Not sure if the open-source review can go ahead and diverge from the internal branch.
Summary
This PR enhances the AddSwKernelInstructionPrefetchPass to enable prefetching of SHAVE kernel instructions after the first SHAVE task, if the initial slack is insufficient.
Currently, instruction prefetching is skipped if the slack before the first SHAVE task is insufficient. This limitation is suboptimal when initial insertion slots (tiles) are limited or L2 cache capacity is constrained.
Based on the observation that SHAVE utilization is often low, we propose this change to prefetch opportunistically later in the schedule. This approach has demonstrated a ~3% performance gain on models such as Qwen2-1.5b and Qwen3-0.6b.
Target Platform For Release Notes
Classification of this Pull Request
Implementation Details
Additional Fixes & Enhancements
We also noted that the previous 250K-cycle threshold is overly conservative for our platform (Ultra 258V). Our analysis shows that prefetching provides benefits even with a slack as low as 170K cycles.
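The gating this PR describes can be sketched as follows. All names and the exact decision structure are invented for illustration; only the 250K legacy threshold and the ~170K break-even figure come from the discussion above:

```cpp
#include <cstdint>

// Rough sketch of the slack-based gating described in this PR. The previous
// behavior prefetched only when the slack before the first SHAVE task cleared
// a 250K-cycle threshold; per the measurements on Ultra 258V, ~170K cycles
// already suffices, and a mid-schedule gap serves as a fallback.
constexpr uint64_t kPrefetchSlackCycles = 170000;

enum class PrefetchPlacement { None, BeforeFirstShaveTask, MidSchedule };

PrefetchPlacement choosePlacement(uint64_t slackBeforeFirstShaveTask, uint64_t bestMidScheduleGap) {
    if (slackBeforeFirstShaveTask >= kPrefetchSlackCycles) {
        // Enough room up front: keep the original placement.
        return PrefetchPlacement::BeforeFirstShaveTask;
    }
    if (bestMidScheduleGap >= kPrefetchSlackCycles) {
        // New in this PR: fall back to a large enough gap later in the schedule.
        return PrefetchPlacement::MidSchedule;
    }
    return PrefetchPlacement::None;
}
```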