Fix Phi long context issue #1504

helena-intel · 2025-10-30T19:01:51Z

This is #1297 updated to latest main branch.

Currently inference on Phi-3-mini and Phi-4-mini returns bad outputs (random characters) when context gets larger than about 2000 tokens. This PR, contributed by @eaidova , fixes that. This is not my code. The original PR is no longer being updated; I'm making this a new PR to make it easier to discuss and add updates.

I saw no negative impact on inference speed. I see slightly different outputs with shorter contexts on SPR (on inference with the model exported with the PR vs the model exported with main). Any suggestions to fix that would be much appreciated.

Draft PR for now, awaiting some feedback and testing, but I hope we can merge this soon.

HuggingFaceDocBuilderDev · 2025-10-30T19:06:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

optimum/intel/openvino/modeling_decoder.py

optimum/exporters/openvino/model_patcher.py

nikita-savelyevv · 2025-10-31T09:40:08Z

I see slightly different outputs with shorter contexts on SPR (on inference with the model exported with the PR vs the model exported with main).

I believe minor differences are expected on SPR. But if possible, WWB similarity should be run to see if the difference is significant or not.

optimum/exporters/openvino/model_patcher.py

rkazants · 2025-11-03T14:41:39Z

optimum/intel/openvino/modeling_decoder.py

+        logits_to_keep=None,
+        **kwargs,
+    ):
+        # Overwritten -- this model may need to switch between short and long rope, invalidating the cache in the


Am I correct that we have a problem when we have short and long prompts in consecutive generate calls?
We can't re-initialize inv_freqs from long_inv_freqs to short_inv_freqs and vise-versa? How this problem is solved?

As discussed offline: this is handled in optimum-intel by resetting the kv-cache when the number of input tokens is equal to the long rope boundary (e.g. 4096). This is done the same way in transformers code. Tested that this works as expected in chat context with https://gist.github.com/helena-intel/b55522cda91d9d61a644f153e71f0f98 .

optimum/exporters/openvino/model_patcher.py

echarlaix

Thanks a lot @helena-intel !!

echarlaix · 2025-11-03T13:57:53Z

optimum/intel/openvino/modeling_decoder.py



+class OVPhi3ForCausalLM(OVModelForCausalLM):
+    def prepare_inputs_for_generation(


would you mind adding a link to the original code
https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/phi3/modeling_phi3.py#L493

echarlaix · 2025-11-03T15:41:40Z

optimum/exporters/openvino/model_patcher.py

-        super().__enter__()
+        # Call OVDecoderModelPatcher.__enter__() directly to skip Phi3ModelPatcher's longrope logic
+        # PhiMoE has a different rotary embedding structure, longrope is not yet supported


why do we need to add all this modifications to PhiMoEModelPatcher? (if longrope is not yet supported then self._model.model.rotary_emb will never be set to "longrope") If we want to make sure we can raise an error in case it's ever the case

Initially tests failed for phi_moe, see https://github.com/huggingface/optimum-intel/actions/runs/18952102871/job/54119192964 . We should have longrope support for the MoE model too but not in this PR. I would be happy with a simpler solution to not enable longrope for the MoE model (but still have it working as it is now).

I will fix this in a better way.

I added a _disable_longrope property instead of the previous code.

echarlaix · 2025-11-03T15:52:07Z

optimum/exporters/openvino/model_patcher.py

+    return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)
+
+
+def long_rope(self, x, position_ids, seq_len=None):


would you mind adding a link to original code (https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/models/phi3/modeling_phi3.py#L324 ?)

echarlaix · 2025-11-03T16:03:39Z

optimum/exporters/openvino/model_patcher.py

+        scaling_factor = 1.0
+    else:
+        scaling_factor = math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))
+    cos = emb.cos() * scaling_factor


here we can't use self.attention_scaling ? https://github.com/huggingface/transformers/blob/63fbd50fb4ff7b586ab1b59b67f7464e62f9df69/src/transformers/modeling_rope_utils.py#L519

Yes, it is a good point. @helena-intel, please use it.

echarlaix · 2025-11-03T16:11:11Z

optimum/exporters/openvino/model_patcher.py

+    # Force float32 since bfloat16 loses precision on long contexts
+    # See https://github.com/huggingface/transformers/pull/29285
+    device_type = x.device.type
+    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"


not used and should we ensure fp32 dtype also ?

Thanks! Added the autocast line with enabled=False.

rkazants · 2025-11-14T10:17:51Z

optimum/exporters/openvino/model_patcher.py

+    return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)
+
+
+def long_rope(self, x, position_ids, seq_len=None):


@helena-intel, you actually patch this function https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py#L442
but I don't see that short_factor from model config is used in the patch. Please clarify it.

@helena-intel, I think we need to re-write this patch more accurately to be aligned with https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py#L442 for longrope

short_factor is in the select_ext_factor function: return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)

I agree it would be clearer to rewrite - but it is functionally working now. We see the same outputs as transformers, for both short and long context.

@rkazants I refactored the function and added more comments. I think it is clearer now, please review.

rkazants · 2025-11-14T10:24:55Z

optimum/intel/openvino/modeling_decoder.py

+        ):
+            past_length = cache_position[0]
+            if past_length <= self.config.original_max_position_embeddings:
+                past_key_values = None


please add a link to https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi3/modeling_phi3.py#L522 and comment that it is aligned with phi3 for long context.
And add a comment that we reset KV cache and it means that the next step will be prefill for extended (computed so far) tokens.

Added the link a few lines above. The comment that was there was copied verbatim from the transformers code. I modified the second line a bit to make it clearer (transformers comment references "current failure" but it is not clear what that is).

rkazants · 2025-11-14T10:29:34Z

@helena-intel, also it is needed to create tiny-model phi3 that has small values original_max_embedding < max_embedding, for example, equal to 10 and 20. This is how we test KV cache reset and applying new scaling factors based on it. And you can easily embed this tiny model into existing tests in test_decoder and it will automatically test this scenario.

helena-intel · 2025-11-14T15:56:02Z

@helena-intel, also it is needed to create tiny-model phi3 that has small values original_max_embedding < max_embedding, for example, equal to 10 and 20. This is how we test KV cache reset and applying new scaling factors based on it. And you can easily embed this tiny model into existing tests in test_decoder and it will automatically test this scenario.

Yes, added model yesterday (https://huggingface.co/optimum-intel-internal-testing/tiny-random-phi-4-mini-instruct) and just added a test that fails in main branch and passes with this PR.

rkazants · 2025-11-14T16:01:07Z

tests/openvino/test_decoder.py

                f"values are not close for {dtype if dtype is not None else 'None'}, max diff = {torch.abs(ov_logits - ref_logits).max()}",
            )
+
+    def test_phi3_longrope_support(self):


no need in this new test. Just add your model into SUPPORTED_ARCHITECTURES above. All requited testing will be activated. You also need to add model id into util_tests.py

rkazants · 2025-11-14T16:04:21Z

Yes, added model yesterday (https://huggingface.co/optimum-intel-internal-testing/tiny-random-phi-4-mini-instruct) and just added a test that fails in main branch and passes with this PR.

Why phi4? Also, please initialize to some small original_max_position_embeddings to value I suggested. Just name it as tiny-random-phi3, make sure that rope_type is longrope.

helena-intel · 2025-11-17T09:19:43Z

Why phi4?

I wanted to use a model that is being used by people who reported this issue, and I figured it would be useful to have a phi-4 tiny model too. I can change it if needed.

Just name it as tiny-random-phi3

So it should replace the existing model? https://github.com/huggingface/optimum-intel/blob/main/tests/openvino/utils_tests.py#L152

no need in this new test. Just add your model into SUPPORTED_ARCHITECTURES above. All requited testing will be activated.

I think it's useful to test both short and long context because it is also relevant to know if short context starts failing. And long context should be tested with prompts above the threshold value so if we rely on existing tests we should always remember that the generic model input needs to exceed the long context threshold. If someone changes the existing "same output as transformers" test, or the tiny model, the test may miss issues.

Also, please initialize to some small original_max_position_embeddings to value I suggested. Just name it as tiny-random-phi3, make sure that rope_type is longrope.

I will look into that. Values probably need to be a bit higher, but can be lower than default. We can't just set the values to 10 and 20, the model is sensitive to parameters and it's easy to get collapsing outputs or differences between PyTorch and OpenVINO.

- Explicitly disable torch.autocast to ensure float32 precision - Add sources for adapted code - Use self.attention_scaling instead of manual computation - Save and restore original _orig_max_position_embeddings - Modify F32_CONFIG to use EXECUTION_MODE_HINT

Exclude longrope for phi3-moe with _disable_longrope

- Add more comments - Remove superfluous select_ext_factor function - Rename long_rope to _phi3_longrope_forward for clarity

IlyasMoutawwakil · 2025-11-20T10:40:25Z

tests/openvino/test_decoder.py

+        )
+
+        # Creating model inputs with more than original max position embeddings and enough variation for varied output tokens
+        tokens = torch.as_tensor(list(tokenizer.get_vocab().values())[: original_max_pos + 50]).unsqueeze(0)


the tokenizer is not really needed here, you can use torch.randint with model.config.vocab_size
also shouldn't we test staring with less than max position embeddings and generating enough to surpass it (to trigger cache re-computation)

Changed test to use randint, and now we test both scenarios, where input tokens exceeds original_max_pos and where generation tokens exceeds it.

IlyasMoutawwakil · 2025-11-20T10:40:58Z

tests/openvino/utils_tests.py

+# With this config, inference runs in f32 and optimizations that may influence accuracy are disabled
+F32_CONFIG = {"EXECUTION_MODE_HINT": "ACCURACY"}


do we wanna change it for all models ?

I think we should not change it globally. Let have it only for phi3 models.

rkazants · 2025-11-20T14:37:41Z

tests/openvino/test_decoder.py

+            model_id, export=True, ov_config=F32_CONFIG, device=OPENVINO_DEVICE
+        )
+
+        # Creating model inputs with more than original max position embeddings and enough variation for varied output tokens


please test two cases when input_ids length exceeds threshold and when only max_new_tokens exceeds threshold

rkazants · 2025-11-20T14:42:10Z

tests/openvino/test_decoder.py

+    def test_phi3_longrope_support(self):
+        """Test LongRoPE support for Phi3 with inputs > 4096 tokens."""
+        set_seed(SEED)
+        model_id = "optimum-intel-internal-testing/tiny-random-phi-4-mini-instruct"


please change model card id. Now it is quite confusing with phi-4 but this is not phi-4

- rename tiny model to phi3 - add test for cumulative context - revert F32_CONFIG change

- Set MIN_TRANSFORMERS_VERSION to 4.49 for Phi3 - Remove code specific for transformers<4.49 - Disable trust-remote-code for Phi3

IlyasMoutawwakil

LGTM ! Thanks for the awesome fix !

rkazants

@IlyasMoutawwakil, please let me review before merge. Thanks!

AlexanderKalistratov · 2025-12-03T16:34:49Z

@helena-intel @rkazants is there any ETA for this PR?

rkazants · 2025-12-04T06:24:39Z

tests/openvino/test_decoder.py

+        if is_transformers_version("<", "4.49"):
+            self.skipTest("Incompatible transformers version: Phi3 longrope requires transformers>=4.49")
+        set_seed(SEED)
+        model_id = "optimum-intel-internal-testing/tiny-random-phi3-longrope"


you have not-representable short factor in your tiny model:

Please regenerate model with different values in short factor. For example, real model has short factor with different values.

Tiny model is based on phi-4-mini-instruct, which has only 1.0 for short factor: https://huggingface.co/microsoft/Phi-4-mini-instruct/blob/main/config.json#L85
I can add another model with different short factors, but I think it's useful to keep phi-4-mini-instruct too.

please don't mix up tiny models for phi-3 and phi4. For phi4, it should be separate

I initially named this model tiny-phi-4 but was asked to rename. phi-4-mini-instruct is phi3 architecture, so it is not that this tiny model is not representative for the phi3 model type. I will add a tiny model that is based on a phi-3- model.

I updated the tiny model and now used a base model that does not just use phi3 architecture as before but that also specifically has "phi-3-" in the model name.

rkazants · 2025-12-04T06:34:18Z

optimum/exporters/openvino/model_patcher.py

+    Note: In transformers, the @dynamic_rope_update decorator replaces self.inv_freq before the forward pass.
+    Here we use torch.where to select between inv_freq and long_inv_freq and add the selection logic into the model graph.
+    """
+    seq_len = torch.max(position_ids) + 1


I don't see that short_factor is extracted and used anyhow in this patch.
Please check reference impl.: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py#L454C5-L454C18

We need to be aligned with HF

short_factor is not used explicitly because inv_freq is used directly. inv_freq is set here https://github.com/huggingface/transformers/blob/v4.55.1/src/transformers/models/phi3/modeling_phi3.py#L313C1-L314C69 . It calls compute_longrope_parameters in https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_rope_utils.py . During model export we will not use inputs exceeding original_max_position_embeddings, so inv_freq will be set based on short factors.

This code with explicit short and long factors is needed for transformers because they use this for inference, but for model export inv_freq will be set with short factor correctly.

I see that short_factor is used in the code

Now we are not aligned with this code and can expect additional bugs in the future

We are using this code during model export, as part of model loading code. During model loading, compute_longrope_embeddings is actually called with seq_len=None, so self.inv_freq will be set with short_factor (see your screenshot: if seq_len is None, ext_factors is set according to short_factor, which is defined earlier in the function from model config, and inv_freq is than computed based on this short_factor. So, self.inv_freq is always set with short factor, self.long_inv_freq with long factor, and then in the _phi3_longrope_forward inv_freq is set to self.long_inv_freq if seq is long and else to self.inv_freq. This is aligned with transformers.

As I understand, it should not be executed during exporting of model. This code should be executed in run-time for each input. Otherwise, it is strange that this model is exported for some concrete seq_len.

The short_factor is used to compute inv_freq. We compute long_inv_freq in the patcher, but short/default inv_freq is computed correctly during model initialization, so self.inv_freq will already be set correctly. And then during inference, we do "if seq len > max_pos: use long_inv_freq else: use default inv_freq".

I replaced self.inv_freq with self.original_inv_freq in forward:

This is initialized here https://github.com/huggingface/transformers/blob/v4.55-release/src/transformers/models/phi3/modeling_phi3.py#L315 right after initializing self.inv_freq with short factors. We never update self.inv_freq (we override forward()) so self.original_inv_freq == self.inv_freq and self.original_inv_freq is clearer.

rkazants

not aligned with HF. For example, short_factor is not extracted from config and not used for scaling frequencies

- self.original_inv_freq is the same as self.inv_freq because self.inv_freq does not get updated during model export

eaidova and others added 7 commits May 12, 2025 16:31

test longrope phi4

4440904

update prepare_inputs_for_generation

72cd3c8

change condition

8aa5978

Merge branch 'main' into ea/lonrope_exp

19feb0b

Merge branch 'main' into ea/lonrope_exp

822664a

Merge remote-tracking branch 'origin/main' into ea/lonrope_exp

4426e18

Merge remote-tracking branch 'origin/main' into ea/lonrope_exp

9f0394a

helena-intel commented Oct 30, 2025

View reviewed changes

optimum/intel/openvino/modeling_decoder.py Outdated Show resolved Hide resolved

Skip longrope for phi_moe for now

c8adca6

helena-intel added the openvino-slow Runs OpenVINO slow tests with different versions of transformers label Oct 30, 2025

rkazants reviewed Oct 31, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

helena-intel marked this pull request as ready for review October 31, 2025 10:24

rkazants reviewed Oct 31, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

helena-intel added 2 commits October 31, 2025 21:19

Remove commented out code

75af74c

Merge remote-tracking branch 'upstream/main' into ea/lonrope_exp

8185565

nikita-savelyevv requested review from IlyasMoutawwakil and echarlaix November 1, 2025 12:10

rkazants reviewed Nov 3, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

rkazants reviewed Nov 3, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

rkazants reviewed Nov 3, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

rkazants reviewed Nov 3, 2025

View reviewed changes

optimum/exporters/openvino/model_patcher.py Outdated Show resolved Hide resolved

echarlaix reviewed Nov 3, 2025

View reviewed changes

rkazants reviewed Nov 14, 2025

View reviewed changes

helena-intel added 2 commits November 14, 2025 15:02

Merge remote-tracking branch 'upstream/main' into ea/lonrope_exp

650a7ab

Add test for phi3 longrope

1d70db3

rkazants reviewed Nov 14, 2025

View reviewed changes

helena-intel added 3 commits November 17, 2025 12:05

Apply review suggestions

b018402

- Explicitly disable torch.autocast to ensure float32 precision - Add sources for adapted code - Use self.attention_scaling instead of manual computation - Save and restore original _orig_max_position_embeddings - Modify F32_CONFIG to use EXECUTION_MODE_HINT

Add _disable_longrope property

77ca45f

Exclude longrope for phi3-moe with _disable_longrope

Refactor long_rope to align more with transformers

6b74fb8

- Add more comments - Remove superfluous select_ext_factor function - Rename long_rope to _phi3_longrope_forward for clarity

IlyasMoutawwakil reviewed Nov 20, 2025

View reviewed changes

rkazants reviewed Nov 20, 2025

View reviewed changes

helena-intel added 4 commits November 20, 2025 16:16

Rename test model, add test, revert ov_config

6fafcce

- rename tiny model to phi3 - add test for cumulative context - revert F32_CONFIG change

Modify longrope test to use torch.randint

231a9a6

Limit Phi3 to transformers>=4.49, disable remote-code

fa088fe

- Set MIN_TRANSFORMERS_VERSION to 4.49 for Phi3 - Remove code specific for transformers<4.49 - Disable trust-remote-code for Phi3

Skip phi3 tests for transformers<4.49

0041491

helena-intel removed the openvino-slow Runs OpenVINO slow tests with different versions of transformers label Nov 21, 2025

IlyasMoutawwakil approved these changes Nov 21, 2025

View reviewed changes

rkazants reviewed Nov 21, 2025

View reviewed changes

AlexanderKalistratov mentioned this pull request Nov 25, 2025

[NPUW] Replacing longrope pattern with precalculated values openvinotoolkit/openvino#33011

Draft

rkazants reviewed Dec 4, 2025

View reviewed changes

rkazants suggested changes Dec 4, 2025

View reviewed changes

helena-intel added 3 commits December 4, 2025 10:23

Add phi3-longrope to decoder tests

9327d0d

Merge remote-tracking branch 'upstream/main' into ea/lonrope_exp

4049dcf

Use self.original_inv_freq in longrope forward

ae45e07

- self.original_inv_freq is the same as self.inv_freq because self.inv_freq does not get updated during model export



		class OVPhi3ForCausalLM(OVModelForCausalLM):
		def prepare_inputs_for_generation(

		return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)


		def long_rope(self, x, position_ids, seq_len=None):

		# With this config, inference runs in f32 and optimizations that may influence accuracy are disabled
		F32_CONFIG = {"EXECUTION_MODE_HINT": "ACCURACY"}

Fix Phi long context issue #1504

Are you sure you want to change the base?

Fix Phi long context issue #1504

Uh oh!

Conversation

helena-intel commented Oct 30, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Oct 30, 2025

Uh oh!

Uh oh!

Uh oh!

nikita-savelyevv commented Oct 31, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

echarlaix left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkazants Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkazants commented Nov 14, 2025

Uh oh!

helena-intel commented Nov 14, 2025

Uh oh!

rkazants Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rkazants commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

helena-intel commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

rkazants Nov 14, 2025 •

edited

Loading

rkazants Nov 14, 2025 •

edited

Loading

rkazants commented Nov 14, 2025 •

edited

Loading

helena-intel commented Nov 17, 2025 •

edited

Loading

rkazants Nov 20, 2025 •

edited

Loading