Fix SDPA TT_METAL_WATCHER issues #37928

Open
pavlejosipovic wants to merge 3 commits into main from pjosipovic/sdpa_watcher_fixes
Conversation

@pavlejosipovic (Contributor) commented Feb 16, 2026

Summary

  • generate_reduce_scaler hardcoded 2048 bytes and 4 faces, assuming full 32x32 bf16 tiles. When circular buffers use half tiles (1024B, 2 faces), this overwrites adjacent L1 memory causing watcher-detected corruption.
  • Restore the half_tile template parameter so the zero-fill size and face iteration adapt to the actual tile dimensions. Also fix idle core runtime args count mismatch in sdpa_decode_program_factory.
  • Remove watcher skips in test_sdpa_prefill.py (now passing with the fix)
  • Restore watcher skip in test_flash_mla.py for Blackhole (separate issue)
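
To illustrate the first two bullets, here is a minimal host-side sketch of how a `half_tile` template parameter can drive both the face count and the zero-fill size. It is not the code in `generate_reduce_scaler.hpp`: `ScalerTileLayout`, `zero_fill_tile`, and `neighbour_survives_fill` are hypothetical names, the 16x16 bf16 face geometry is an assumption chosen to match the 2048 B / 1024 B figures above, and a plain byte array stands in for L1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed geometry: one bf16 face is 16x16 elements of 2 bytes each (512 B).
constexpr std::size_t kFaceBytes = 16 * 16 * sizeof(std::uint16_t);

// Hypothetical helper: derives face count and tile size from half_tile
// instead of hardcoding 4 faces and 2048 bytes.
template <bool half_tile>
struct ScalerTileLayout {
    static constexpr std::size_t num_faces = half_tile ? 2 : 4;
    static constexpr std::size_t tile_bytes = num_faces * kFaceBytes;
};

static_assert(ScalerTileLayout<false>::tile_bytes == 2048, "full tile is 2048 B");
static_assert(ScalerTileLayout<true>::tile_bytes == 1024, "half tile is 1024 B");

// Zero-fills exactly one tile slot and returns the number of bytes written.
template <bool half_tile>
std::size_t zero_fill_tile(std::uint8_t* slot) {
    constexpr std::size_t n = ScalerTileLayout<half_tile>::tile_bytes;
    for (std::size_t i = 0; i < n; ++i) {
        slot[i] = 0;
    }
    return n;
}

// Simulates a 1024 B half-tile slot followed by 1024 B of unrelated data and
// reports whether that neighbouring data is intact after the fill.
template <bool half_tile>
bool neighbour_survives_fill() {
    std::array<std::uint8_t, 2048> l1{};
    l1.fill(0xAB);  // sentinel pattern marking "someone else's" memory
    zero_fill_tile<half_tile>(l1.data());
    for (std::size_t i = 1024; i < l1.size(); ++i) {
        if (l1[i] != 0xAB) {
            return false;
        }
    }
    return true;
}
```

In this model, `neighbour_survives_fill<true>()` keeps the sentinel bytes intact, while `neighbour_survives_fill<false>()` clobbers them, which is the kind of overwrite the watcher reports when a full-tile fill targets a half-tile circular buffer.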

Fixes: #37631
Fixes: #29225

Replaces #37833 (closed due to bad rebase)

Test plan

🤖 Generated with Claude Code

Pavle Josipovic and others added 3 commits February 16, 2026 10:37
`generate_reduce_scaler` hardcoded 2048 bytes and 4 faces, assuming
full 32x32 bf16 tiles. When circular buffers use half tiles (1024B,
2 faces), this overwrites adjacent L1 memory causing watcher-detected
corruption.

Restore the `half_tile` template parameter (previously removed in
cleanup) so the zero-fill size and face iteration adapt to the actual
tile dimensions. Also fix idle core runtime args count mismatch in
sdpa_decode_program_factory.

Fixes: #37631
Fixes: #29225

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The watcher skip for issue #37631 was prematurely removed. Restore it
until the underlying issue is fully resolved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 16, 2026 10:39
@pavlejosipovic pavlejosipovic requested review from a team as code owners February 16, 2026 10:39
@pavlejosipovic (Contributor, Author)

/codeowners ping

@tenstorrent-github-bot

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 2 pending groups, 0 approved groups

Group Information:


Note: At least one approval from each group is sufficient.

@tenstorrent-github-bot

Hi Evan Smal (@esmalTT), Raymond Kim (@tt-rkim), this PR Fix SDPA TT_METAL_WATCHER issues by Pavle Josipović (@pavlejosipovic) needs your approval/review to merge this.

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes TT_METAL_WATCHER-detected corruption/issues in SDPA decode by ensuring the reduce-scaler generation logic matches the actual circular buffer tile size (full vs half tiles) and by aligning idle-core runtime argument counts with what the decode reader kernel expects. It also updates SDPA prefill unit tests to remove watcher skips now that the underlying issue is addressed.

Changes:

  • Restore half-tile awareness for generate_reduce_scaler and pass the correct half/full-tile mode from the SDPA decode writer kernel.
  • Fix idle-core reader runtime-arg vector length in SdpaDecodeProgramFactory to match the reader kernel’s expected arg reads.
  • Remove TT_METAL_WATCHER skip decorators from test_sdpa_prefill.py (per PR description: now passing with the fix).
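
The idle-core fix in the second bullet can be pictured with a small host-side sketch. This is not the code in `sdpa_decode_program_factory.cpp`: `kReaderNumRuntimeArgs` and its value, the helper names, and the argument layout are all hypothetical. It only shows the invariant that idle cores should receive an argument vector of the same length the reader kernel reads, so no read lands out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical number of runtime args the reader kernel reads. The real count
// is defined by the kernel source, not by this sketch.
constexpr std::size_t kReaderNumRuntimeArgs = 12;

// Builds args for an active core; only the first two slots carry placeholder
// values here, the rest stay zero.
std::vector<std::uint32_t> make_active_reader_args(std::uint32_t q_addr,
                                                   std::uint32_t k_addr) {
    std::vector<std::uint32_t> args(kReaderNumRuntimeArgs, 0);
    args[0] = q_addr;
    args[1] = k_addr;
    return args;
}

// Idle cores get a zero-filled vector of the same length, on the assumption
// that the kernel reads every arg slot before it decides to exit early.
std::vector<std::uint32_t> make_idle_reader_args() {
    return std::vector<std::uint32_t>(kReaderNumRuntimeArgs, 0);
}

// Models an out-of-bounds check on a runtime-arg read: any index at or past
// the end of the vector is a violation.
bool read_is_in_bounds(const std::vector<std::uint32_t>& args, std::size_t index) {
    return index < args.size();
}
```

Deriving both vectors from a single constant keeps the idle and active paths from drifting apart when the kernel's argument list changes.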

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/sdpa_decode_program_factory.cpp | Fix idle-core reader runtime-arg count to prevent watcher OOB runtime-arg access. |
| ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp | Detect half-tile scalar CBs and invoke generate_reduce_scaler with the correct template mode. |
| ttnn/cpp/ttnn/kernel/dataflow/generate_reduce_scaler.hpp | Reintroduce half_tile template parameter to size the zero-fill and face-looping correctly. |
| tests/ttnn/unit_tests/operations/sdpa/test_sdpa_prefill.py | Remove watcher-enabled skips now that corruption/OOM issues should be resolved by the kernel fix. |
