The runtime log shows that it takes about 25 minutes to run prefill on the reference model for seq_len=128.
This is observed on both dual and quad machines.
Test Command: pytest -svv models/demos/deepseek_v3/tests/test_model.py::test_forward_pass[mode_decode_seq_1_batch_32_pos_random-True-device_params0]
[1,0]<stderr>: 2026-02-11 00:25:53.604 | INFO | models.demos.deepseek_v3.utils.test_utils:run_reference_with_attention:315 - Reference attention config: seq_len=126 chunk_size=2080 use_chunked_processing=False
[1,0]<stderr>: 2026-02-11 00:50:49.658 | INFO | models.demos.deepseek_v3.tests.test_model:generate_reference_io:141 - Reference model output shape: torch.Size([256, 1, 129280])
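The elapsed time can be checked directly from the two timestamps above. The snippet below is a minimal sketch, not part of the test suite: the helper name `elapsed_seconds` is hypothetical, and the format string assumes the log timestamps always follow the `YYYY-MM-DD HH:MM:SS.mmm` layout seen here.

```python
from datetime import datetime

# Timestamps copied verbatim from the two log lines above.
START = "2026-02-11 00:25:53.604"  # "Reference attention config" line
END = "2026-02-11 00:50:49.658"    # "Reference model output shape" line

# Assumed timestamp layout; %f accepts the 3-digit millisecond field.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the wall-clock gap between two log timestamps, in seconds."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


seconds = elapsed_seconds(START, END)
minutes, remainder = divmod(seconds, 60)
print(f"{seconds:.3f} s = {int(minutes)} min {remainder:.3f} s")
```

For these two lines the gap is a little under 25 minutes, which matches the figure reported above.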