
ORT 1.24.1 release cherry pick round 1#27229

Merged
tianleiwu merged 6 commits into rel-1.24.1 from
tlwu/rel-1.24.0_cherry_pick_round_5
Feb 3, 2026

tianleiwu and others added 6 commits February 3, 2026 11:44
…#27157)

Replaces the deprecated pkg_resources library with importlib.metadata to
fix ModuleNotFoundError.

A code review shows that the logic of the interleaved NEON kernel is incorrect:

1.  **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:

    ```c++
    size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
    std::vector<float> sin_data(table_len);
    std::vector<float> cos_data(table_len);
    ```

For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.

2.  **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i / 2`:

    ```c++
    float32x8_t sin_val = _mm256_loadu_ps(sin_data + i / 2);
    float32x8_t cos_val = _mm256_loadu_ps(cos_data + i / 2);
    ```

This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**

3.  **NEON (fp16) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i`:

    ```c++
    // Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
    //...
    // Inside loop, for next iteration:
    sin_val = MlasLoadFloat16x8(sin + i + 16); 
    ```

    This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**
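
The indexing argument above can be illustrated with a scalar reference. This is a minimal sketch, not the MLAS kernel or its fallback: the function name `RopeInterleavedRef` and the use of `std::vector` are assumptions made for illustration. It shows why, in the interleaved layout, each pair of adjacent elements shares one table entry, so the `sin`/`cos` tables only need `rotary_emb_dim / 2` values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for interleaved RoPE (illustration only).
// Elements (2j, 2j+1) form one rotated pair and share table entry j,
// so both tables hold rotary_emb_dim / 2 values.
std::vector<float> RopeInterleavedRef(const std::vector<float>& input,
                                      const std::vector<float>& sin_table,
                                      const std::vector<float>& cos_table) {
  const std::size_t rotary_emb_dim = input.size();
  assert(rotary_emb_dim % 2 == 0);
  assert(sin_table.size() == rotary_emb_dim / 2);
  assert(cos_table.size() == rotary_emb_dim / 2);

  std::vector<float> output(rotary_emb_dim);
  for (std::size_t i = 0; i < rotary_emb_dim; i += 2) {
    const std::size_t j = i / 2;  // table index advances once per pair
    output[i] = input[i] * cos_table[j] - input[i + 1] * sin_table[j];
    output[i + 1] = input[i] * sin_table[j] + input[i + 1] * cos_table[j];
  }
  return output;
}
```

A vectorized kernel that reads the tables at offset `i` instead of `i / 2` would step through them twice as fast as this loop does, which is the mismatch described in item 3.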

### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```

Before applying the fix, the test failed:
```
[  FAILED  ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
  Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, the test passes.

### Summary

The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.

The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

Refer to V1 of the fix here:
#27214

This PR includes all fixes from the V1 PR + logic to invalidate the lhs
cache pointers in case the pad buffer's underlying buffer has changed
due to a resize. The ARM team will look at potentially enhancing this
logic after the 1.24.0 release.
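
The invalidation idea can be sketched as follows. This is an illustrative assumption, not the code in this PR: the struct `PadBufferCache` and its members are hypothetical names, and the real kernel's data layout may differ. The point is only that pointers cached into a resizable buffer must be dropped whenever the buffer's underlying storage moves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache of LHS pointers that may point into a pad buffer.
struct PadBufferCache {
  std::vector<float> pad;              // resizable pad buffer
  const float* pad_base = nullptr;     // storage address the cache was built against
  std::vector<const float*> lhs_ptrs;  // cached pointers, possibly into `pad`

  // Grow the pad buffer if needed. If the storage moved, the cached
  // pointers are stale and must be rebuilt by the caller.
  void EnsureSize(std::size_t n) {
    if (n > pad.size()) {
      pad.resize(n);
    }
    if (pad.data() != pad_base) {
      lhs_ptrs.clear();
      pad_base = pad.data();
    }
    assert(pad.size() >= n);
  }
};
```

Comparing the base address rather than the size keeps the cache alive across calls that do not reallocate, and clears it only when the stored pointers could dangle.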

### Motivation and Context
Fix #26669
### Description

According to
https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftmoe,

> swiglu_limit : float
The limit used to clamp in SwiGLU. No clamp when limit is not provided.

However, currently, the default is set to 0, meaning we clamp to 0 if no
limit is provided.
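
The difference between "no limit" and "limit of 0" can be shown with a small sketch. The helper `ClampForSwiGLU` and the use of `std::optional` (C++17) are assumptions for illustration; the operator's actual clamping rule and attribute handling may differ. The sketch only demonstrates that an absent limit should leave values untouched, whereas a default of 0 collapses every value to 0.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical helper: clamp x to [-limit, limit] only when a limit is given.
float ClampForSwiGLU(float x, std::optional<float> swiglu_limit) {
  if (!swiglu_limit.has_value()) {
    return x;  // documented behavior: no clamp when the limit is not provided
  }
  assert(*swiglu_limit >= 0.0f);
  return std::clamp(x, -*swiglu_limit, *swiglu_limit);
}
```

With this shape, `ClampForSwiGLU(5.0f, std::nullopt)` returns the input unchanged, while `ClampForSwiGLU(5.0f, 0.0f)` returns 0, which mirrors the behavior described above when the default is 0.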


### Motivation and Context

Fixes #27220. See there for bug description and reproduction.

Hoping to get this in before 1.24.0 releases. cc @guschmue
### Description
Fixes a bug in the fallback provider logic that could cause GPU
acceleration to be lost when creating an inference session.



### Motivation and Context
Fixing this for the PR here
[#25145](#25145)
@tianleiwu tianleiwu merged commit 326c98c into rel-1.24.1 Feb 3, 2026
72 of 77 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.0_cherry_pick_round_5 branch February 3, 2026 21:54