ORT 1.24.1 release cherry pick round 1 #27229
Merged
tianleiwu merged 6 commits into rel-1.24.1 on Feb 3, 2026
…#27157) Replaces the deprecated pkg_resources library with importlib.metadata to fix ModuleNotFoundError.
Related issue: #26985
Code review shows that the logic of the interleaved NEON kernel is incorrect:
1. **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:
```c++
size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
std::vector<float> sin_data(table_len);
std::vector<float> cos_data(table_len);
```
For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.
2. **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i / 2`:
```c++
float32x8_t sin_val = _mm256_loadu_ps(sin_data + i / 2);
float32x8_t cos_val = _mm256_loadu_ps(cos_data + i / 2);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**
3. **NEON (fp16) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i`:
```c++
// Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
//...
// Inside loop, for next iteration:
sin_val = MlasLoadFloat16x8(sin + i + 16);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**
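To make the indexing argument concrete, here is a minimal scalar sketch of interleaved RoPE that uses the `i / 2` table index. It is not the MLAS kernel or the MLAS fallback; the function name `RopeInterleavedRef` and the use of `std::vector` are assumptions made for illustration, and the rotation shown is the standard pairwise form.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for interleaved RoPE (illustrative sketch, hypothetical name).
// Each pair (x[2j], x[2j+1]) shares one sin/cos entry, so the tables only
// need rotary_emb_dim / 2 elements.
std::vector<float> RopeInterleavedRef(const std::vector<float>& input,
                                      const std::vector<float>& sin_table,
                                      const std::vector<float>& cos_table) {
    const std::size_t dim = input.size();  // plays the role of rotary_emb_dim
    assert(dim % 2 == 0);
    assert(sin_table.size() >= dim / 2 && cos_table.size() >= dim / 2);

    std::vector<float> output(dim);
    for (std::size_t i = 0; i < dim; i += 2) {
        const std::size_t j = i / 2;  // table index: one entry per pair
        const float s = sin_table[j];
        const float c = cos_table[j];
        output[i] = input[i] * c - input[i + 1] * s;
        output[i + 1] = input[i] * s + input[i + 1] * c;
    }
    return output;
}
```

A kernel that indexes the tables with `i` instead of `i / 2` would read entry `dim - 2` for the last pair, which is outside a table of length `dim / 2` whenever `dim > 2`.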
### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```
Before applying the fix, the test failed:
```
[ FAILED ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, the test passed.
### Summary
The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.
The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
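The size mismatch can be checked with a little arithmetic. The sketch below counts how many table entries a scalar loop over `i` in `[0, rotary_emb_dim)` touches under each indexing scheme; the helper names are hypothetical, and the real kernels use vector loads, so the exact overrun in practice may differ from this scalar model.

```cpp
#include <cassert>
#include <cstddef>

// Entries touched when the table is indexed as table[i / divisor] for
// i in [0, rotary_emb_dim). divisor == 2 models the `i / 2` indexing,
// divisor == 1 models indexing with `i` directly. (Hypothetical helper.)
constexpr std::size_t TableEntriesTouched(std::size_t rotary_emb_dim, std::size_t divisor) {
    return rotary_emb_dim == 0 ? 0 : (rotary_emb_dim - 1) / divisor + 1;
}

// Debug-build guard a caller could place before invoking a kernel.
inline void CheckTableLength(std::size_t table_len, std::size_t rotary_emb_dim,
                             std::size_t divisor) {
    assert(table_len >= TableEntriesTouched(rotary_emb_dim, divisor));
}

// rotary_emb_dim = 24 is the size reported in the failing test above.
static_assert(TableEntriesTouched(24, 2) == 12, "i / 2 indexing fits a half-length table");
static_assert(TableEntriesTouched(24, 1) == 24, "i indexing needs a full-length table");
```

For `rotary_emb_dim = 24` the test allocates 12 entries, so indexing with `i` reaches 12 entries past the end of each table in this model.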
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

Refer to V1 of the fix here: #27214

This PR includes all fixes from the V1 PR + logic to invalidate the lhs cache pointers in case the pad buffer's underlying buffer has changed due to a resize. The ARM team will look at potentially enhancing this logic after the 1.24.0 release.

### Motivation and Context

Fix #26669
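The invalidation idea can be sketched generically as follows. This is not the code from the PR: `PadPtrCache`, its members, and the row layout are all hypothetical, and the sketch only assumes that cached pointers refer into a buffer whose storage can move when it is resized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: cached row pointers into a resizable pad buffer.
struct PadPtrCache {
    std::vector<float> pad;              // padding buffer; resize may move its storage
    std::vector<const float*> row_ptrs;  // cached pointers into `pad`
    const float* cached_base = nullptr;  // pad.data() at the time row_ptrs was built
    std::size_t cached_row_len = 0;

    // Returns the row pointers, rebuilding them if the buffer moved or the
    // requested shape differs from the cached one.
    const std::vector<const float*>& Rows(std::size_t rows, std::size_t row_len) {
        assert(rows > 0 && row_len > 0);
        if (pad.size() < rows * row_len) {
            pad.resize(rows * row_len, 0.0f);  // may reallocate
        }
        const bool stale = pad.data() != cached_base ||
                           row_ptrs.size() != rows ||
                           cached_row_len != row_len;
        if (stale) {
            row_ptrs.clear();
            for (std::size_t r = 0; r < rows; ++r) {
                row_ptrs.push_back(pad.data() + r * row_len);
            }
            cached_base = pad.data();
            cached_row_len = row_len;
        }
        return row_ptrs;
    }
};
```

Comparing the stored base pointer against `pad.data()` on every access is cheap, and it catches the case where a resize reallocated the buffer and left the previously cached pointers dangling.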
### Description

According to https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftmoe,

> swiglu_limit : float
> The limit used to clamp in SwiGLU. No clamp when limit is not provided.

However, currently, the default is set to 0, meaning we clamp to 0 if no limit is provided.

### Motivation and Context

Fixes #27220. See there for bug description and reproduction. Hoping to get this in before 1.24.0 releases.

cc @guschmue
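The effect of the default value can be illustrated with a small clamp helper. This is a sketch only: `ClampToLimit` is a hypothetical name, the symmetric range `[-limit, limit]` is an assumption rather than the operator's exact clamping rule, and the code is plain C++ rather than the WebGPU implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical helper: clamp x to [-limit, limit].
// The limit defaults to +infinity, so omitting it leaves x unchanged.
inline float ClampToLimit(float x,
                          float limit = std::numeric_limits<float>::infinity()) {
    assert(limit >= 0.0f);
    return std::max(-limit, std::min(x, limit));
}
```

With a default of infinity, `ClampToLimit(5.0f)` returns `5.0f`, matching "no clamp when limit is not provided". With a default of `0`, every input would collapse to zero, which is the behavior described in the bug report.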
adrianlizarraga approved these changes Feb 3, 2026
edgchen1 approved these changes Feb 3, 2026
hariharans29 approved these changes Feb 3, 2026
Cherry-pick round 1 for 1.24.1 release:
#27157: Fix: Replace pkg_resources with importlib.metadata in machine_info.py
#27124: Remove x86 from nuget (#27124)
#26390: [MLAS] Fix rotary interleaved NEON kernel
#27215: Fix Conv LHS packing padding/uninitialized ptrs V2
#27221: Fix WebGPU MoE swiglu_limit (default to infinity)
#26994: Fix for https://github.com/microsoft/onnxruntime/issues/25145