Skip to content

Variant of PR6672 (Use avx512 in move rank)#6678

Open
mstembera wants to merge 1 commit into
official-stockfish:masterfrom
mstembera:PR6672_Variant
Open

Variant of PR6672 (Use avx512 in move rank)#6678
mstembera wants to merge 1 commit into
official-stockfish:masterfrom
mstembera:PR6672_Variant

Conversation

@mstembera

@mstembera mstembera commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

https://tests.stockfishchess.org/tests/view/69b822411860ed703cccef9b
STC: LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 45440 W: 11919 L: 11601 D: 21920
Ptnml(0-2): 148, 4839, 12422, 5169, 142

Precompute a combined history score for every (piece, square) pair.

Variant of #6672 trying to make the intrinsics portion simpler and less co-mingled with the rest of the code. May address some of the concerns/discussion in PR6672. I leave it to the maintainers to decide if it's a worthwhile tradeoff. All credit to @AliceRoselia.

No functional change
bench: 2559757

@mstembera mstembera changed the title Variant of PR6672 Variant of PR6672 (Use avx512 in move rank) Mar 19, 2026
@github-actions

github-actions Bot commented Mar 19, 2026

Copy link
Copy Markdown

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 24213874657 / attempt 1)

@anematode

anematode commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Honestly, it might make sense to tune it and use the scalar version when there's a small number of pieces on the board? If we're in an endgame we will be wasting a whole ton of effort

Still a bit concerned about usability for other folks... e.g. an experiment like https://tests.stockfishchess.org/tests/live_elo/69bbeee1d7d60419badf331e would require knowing how to implement the vectorized multiplication, or a custom base branch which is annoying

That said, it could also make such experiments easier to pass, depending on the worker mix...

@mstembera

mstembera commented Mar 19, 2026

Copy link
Copy Markdown
Contributor Author

Agree. I don't want to spend too much effort and resources just yet till we know what the maintainers think about this patch.
[Edit] ok https://tests.stockfishchess.org/tests/view/69bd0a548c3d089c399d1af0 scheduled

@mstembera mstembera force-pushed the PR6672_Variant branch 2 times, most recently from d9a669a to c06bb62 Compare April 9, 2026 21:19
@coderabbitai

coderabbitai Bot commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a1f3bd3f-f287-482e-bc56-9eb950e75220

📥 Commits

Reviewing files that changed from the base of the PR and between d05b916 and bf3adec.

📒 Files selected for processing (2)
  • src/misc.h
  • src/movepick.cpp
✅ Files skipped from review due to trivial changes (1)
  • src/misc.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/movepick.cpp

📝 Walkthrough

Walkthrough

MultiArray class declaration in src/misc.h now includes alignas(32). In src/movepick.cpp an AVX-512-only helper init_quiet_hist_buffer() was added to precompute combined pawn and continuation history scores into an aligned histBuffer; MovePicker::score() allocates and fills this buffer under USE_AVX512 and uses histBuffer[pt][to] when scoring quiet moves, leaving the non-AVX512 scalar path unchanged.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately references PR6672 and describes the main change—a variant approach to using AVX512 in move ranking with simplified intrinsics.
Description check ✅ Passed The description clearly explains the purpose of the PR as a variant of PR6672, mentions precomputing history scores, notes no functional change, and provides test results and benchmarks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/movepick.cpp (2)

164-179: ⚠️ Potential issue | 🟡 Minor

Fix clang-format violations flagged by CI.

The pipeline is failing due to clang-format violations in this block. The #if/#endif preprocessor directives and the indented code within the if constexpr block need to follow the project's formatting conventions.

Suggested fix for formatting

Run clang-format on this file or manually ensure:

  • Preprocessor directives (#if, #else, #endif) are not indented within function scope
  • Code within conditional blocks follows consistent indentation
     [[maybe_unused]] Bitboard threatByLesser[KING + 1];
-    `#if` defined(USE_AVX512)
-    [[maybe_unused]] alignas(64) int histBuffer[KING + 1][SQUARE_NB];
-    `#endif`
+#if defined(USE_AVX512)
+    [[maybe_unused]] alignas(64) int histBuffer[KING + 1][SQUARE_NB];
+#endif
     if constexpr (Type == QUIETS)
     {
         threatByLesser[PAWN]   = 0;
         threatByLesser[KNIGHT] = threatByLesser[BISHOP] = pos.attacks_by<PAWN>(~us);
         threatByLesser[ROOK] =
           pos.attacks_by<KNIGHT>(~us) | pos.attacks_by<BISHOP>(~us) | threatByLesser[KNIGHT];
         threatByLesser[QUEEN] = pos.attacks_by<ROOK>(~us) | threatByLesser[ROOK];
         threatByLesser[KING]  = 0;

-        `#if` defined(USE_AVX512)
-            init_quiet_hist_buffer(histBuffer, pos, continuationHistory, sharedHistory);
-        `#endif`
+#if defined(USE_AVX512)
+        init_quiet_hist_buffer(histBuffer, pos, continuationHistory, sharedHistory);
+#endif
     }

198-212: ⚠️ Potential issue | 🟡 Minor

Logic is correct; fix formatting for CI.

The AVX-512 and scalar paths are functionally equivalent:

  • Both compute 2 * mainHistory + 2 * pawnHistory + continuationHistory[0,1,2,3,5]
  • The precomputed histBuffer[pt][to] correctly maps piece types to values computed with the proper Piece index

The same clang-format issues apply here—preprocessor directives should not be indented:

Suggested formatting fix
             m.value = 2 * (*mainHistory)[us][m.raw()];

-            `#if` defined(USE_AVX512)
-                m.value += histBuffer[pt][to];
-            `#else`
-                m.value += 2 * sharedHistory->pawn_entry(pos)[pc][to];
-                m.value += (*continuationHistory[0])[pc][to];
-                m.value += (*continuationHistory[1])[pc][to];
-                m.value += (*continuationHistory[2])[pc][to];
-                m.value += (*continuationHistory[3])[pc][to];
-                m.value += (*continuationHistory[5])[pc][to];
-            `#endif`
+#if defined(USE_AVX512)
+            m.value += histBuffer[pt][to];
+#else
+            m.value += 2 * sharedHistory->pawn_entry(pos)[pc][to];
+            m.value += (*continuationHistory[0])[pc][to];
+            m.value += (*continuationHistory[1])[pc][to];
+            m.value += (*continuationHistory[2])[pc][to];
+            m.value += (*continuationHistory[3])[pc][to];
+            m.value += (*continuationHistory[5])[pc][to];
+#endif
🧹 Nitpick comments (1)
src/movepick.cpp (1)

135-150: Consider endgame fallback (echoing PR discussion).

As noted in the PR comments, the vectorized approach precomputes scores for all piece types regardless of how many pieces remain. In endgames with few pieces, this could waste work compared to the scalar path. A future optimization might conditionally bypass the vectorized path when the piece count is below a threshold.

This is not a blocking concern for this PR—just noting the architectural consideration that was already raised.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69e6a567-6721-4912-93ed-c95b73f03ce2

📥 Commits

Reviewing files that changed from the base of the PR and between bb4eb04 and c06bb62.

📒 Files selected for processing (2)
  • src/misc.h
  • src/movepick.cpp

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 25455458115 / attempt 1)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/movepick.cpp (1)

234-282: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Address the clang-format pipeline failures.

The Clang-Format CI job fails on lines 234, 236, 245, 247–248, and 271, 273–281 (the #if defined(USE_AVX512) guarded blocks and the surrounding code). Please run the project's clang-format target so the new preprocessor blocks match the existing style for indentation/spacing of #if/#else/#endif directives inside function bodies.

🧹 Nitpick comments (2)
src/movepick.cpp (1)

191-222: ⚡ Quick win

Document or refactor alignment assumptions in init_quiet_hist_buffer.

This function relies on three implicit alignment guarantees:

  • buf must be 64-byte aligned for _mm512_store_epi32 (line 218). The parameter type int buf[][SQUARE_NB] decays to a plain pointer, making this contract invisible.
  • &ph[sq] and &(*continuationHistory[j])[pc][sq] must be 32-byte aligned for _mm256_load_si256 (lines 213, 216), currently relying on alignas(32) in MultiArray (misc.h).

The code currently works correctly—histBuffer is declared alignas(64) at the call site—but the alignment requirements are implicit. If MultiArray's alignment is narrowed or removed, the loads silently become undefined behavior.

Two options:

  1. Use unaligned intrinsics (_mm256_loadu_si256 / _mm512_storeu_epi32) to make the function independent of external alignment. Modern CPUs incur zero penalty on aligned data, eliminating the cross-file coupling with misc.h and reducing fragility.
  2. Take buf as a reference-to-array (int (&buf)[KING + 1][SQUARE_NB]) so alignment information is preserved in the type system, and document the alignment requirement.
src/misc.h (1)

250-251: 🏗️ Heavy lift

Consider narrowing the scope of alignas(32) or using unaligned SIMD loads instead.

Adding alignas(32) to the MultiArray class template propagates to every instantiation in the codebase, at every nesting level. For nested arrays whose natural size isn't a multiple of 32 (e.g., inner MultiArray<int16_t, N> rows < 32 bytes), the compiler must pad each element so sizeof is a multiple of 32, which can noticeably grow memory for tables that have nothing to do with the init_quiet_hist_buffer AVX-512 optimization.

Two less invasive alternatives are worth considering:

  1. Apply alignment surgically — only the histories actually consumed by init_quiet_hist_buffer (PawnHistory and the PieceToHistory rows reached via ContinuationHistory) need 32-byte alignment. A typed alias or wrapper for those, or explicit alignas on the specific field/typedef, avoids touching all MultiArray users.
  2. Drop the alignment requirement entirely and use _mm256_loadu_si256 (and _mm512_storeu_epi32) in movepick.cpp. On Skylake and later, vmovdqu has identical performance to vmovdqa on naturally aligned addresses, so the SIMD fast path is preserved without paying a memory cost for unrelated tables.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 21d224e8-5b0e-4cde-a41b-3cd4c8174ef2

📥 Commits

Reviewing files that changed from the base of the PR and between c06bb62 and 6d07f13.

📒 Files selected for processing (2)
  • src/misc.h
  • src/movepick.cpp

@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 25520695615 / attempt 1)

@mstembera

Copy link
Copy Markdown
Contributor Author

@vondele What do u think about this especially in light of the #6768 commit. I can also retest since it's been a while for the original test.

@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 25521121574 / attempt 1)

@anematode

Copy link
Copy Markdown
Contributor

Hm. Maybe there's some stuff we can do here with templates or operator overloading to make it really easy for folks to work on this part of the code. The painful part is that the avx512 code path hoists everything out of the loop, so it'll probably require some ceremony...

Like we could have something looking like

PtToHistories pcToHists = 2 * PtToHistory(sharedEntry->pawn_entry(pos)) + PtToHistory(continuationHistory[0]) + ...;

and then in the loop we'd just do

m.value += pcToHists[pc][to];

The implementation would be a bit thorny, but manageable I think

No functional change
bench: 2559757
@github-actions

Copy link
Copy Markdown

clang-format 20 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/plucky/clang-format-20.

(execution 26001017050 / attempt 1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants