Variant of PR6672 (Use avx512 in move rank)#6678
Conversation
|
clang-format 20 needs to be run on this PR. (execution 24213874657 / attempt 1) |
|
Honestly, it might make sense to tune it and use the scalar version when there's a small number of pieces on the board? If we're in an endgame we will be wasting a whole ton of effort Still a bit concerned about usability for other folks... e.g. an experiment like https://tests.stockfishchess.org/tests/live_elo/69bbeee1d7d60419badf331e would require knowing how to implement the vectorized multiplication, or a custom base branch which is annoying That said, it could also make such experiments easier to pass, depending on the worker mix... |
|
Agree. I don't want to spend too much effort and resources just yet till we know what the maintainers think about this patch. |
d9a669a to
c06bb62
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughMultiArray class declaration in src/misc.h now includes alignas(32). In src/movepick.cpp an AVX-512-only helper init_quiet_hist_buffer() was added to precompute combined pawn and continuation history scores into an aligned histBuffer; MovePicker::score() allocates and fills this buffer under USE_AVX512 and uses histBuffer[pt][to] when scoring quiet moves, leaving the non-AVX512 scalar path unchanged. 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/movepick.cpp (2)
164-179:⚠️ Potential issue | 🟡 MinorFix clang-format violations flagged by CI.
The pipeline is failing due to clang-format violations in this block. The
#if/#endifpreprocessor directives and the indented code within theif constexprblock need to follow the project's formatting conventions.Suggested fix for formatting
Run
clang-formaton this file or manually ensure:
- Preprocessor directives (
#if,#else,#endif) are not indented within function scope- Code within conditional blocks follows consistent indentation
[[maybe_unused]] Bitboard threatByLesser[KING + 1]; - `#if` defined(USE_AVX512) - [[maybe_unused]] alignas(64) int histBuffer[KING + 1][SQUARE_NB]; - `#endif` +#if defined(USE_AVX512) + [[maybe_unused]] alignas(64) int histBuffer[KING + 1][SQUARE_NB]; +#endif if constexpr (Type == QUIETS) { threatByLesser[PAWN] = 0; threatByLesser[KNIGHT] = threatByLesser[BISHOP] = pos.attacks_by<PAWN>(~us); threatByLesser[ROOK] = pos.attacks_by<KNIGHT>(~us) | pos.attacks_by<BISHOP>(~us) | threatByLesser[KNIGHT]; threatByLesser[QUEEN] = pos.attacks_by<ROOK>(~us) | threatByLesser[ROOK]; threatByLesser[KING] = 0; - `#if` defined(USE_AVX512) - init_quiet_hist_buffer(histBuffer, pos, continuationHistory, sharedHistory); - `#endif` +#if defined(USE_AVX512) + init_quiet_hist_buffer(histBuffer, pos, continuationHistory, sharedHistory); +#endif }
198-212:⚠️ Potential issue | 🟡 MinorLogic is correct; fix formatting for CI.
The AVX-512 and scalar paths are functionally equivalent:
- Both compute
2 * mainHistory + 2 * pawnHistory + continuationHistory[0,1,2,3,5]- The precomputed
histBuffer[pt][to]correctly maps piece types to values computed with the properPieceindexThe same clang-format issues apply here—preprocessor directives should not be indented:
Suggested formatting fix
m.value = 2 * (*mainHistory)[us][m.raw()]; - `#if` defined(USE_AVX512) - m.value += histBuffer[pt][to]; - `#else` - m.value += 2 * sharedHistory->pawn_entry(pos)[pc][to]; - m.value += (*continuationHistory[0])[pc][to]; - m.value += (*continuationHistory[1])[pc][to]; - m.value += (*continuationHistory[2])[pc][to]; - m.value += (*continuationHistory[3])[pc][to]; - m.value += (*continuationHistory[5])[pc][to]; - `#endif` +#if defined(USE_AVX512) + m.value += histBuffer[pt][to]; +#else + m.value += 2 * sharedHistory->pawn_entry(pos)[pc][to]; + m.value += (*continuationHistory[0])[pc][to]; + m.value += (*continuationHistory[1])[pc][to]; + m.value += (*continuationHistory[2])[pc][to]; + m.value += (*continuationHistory[3])[pc][to]; + m.value += (*continuationHistory[5])[pc][to]; +#endif
🧹 Nitpick comments (1)
src/movepick.cpp (1)
135-150: Consider endgame fallback (echoing PR discussion).As noted in the PR comments, the vectorized approach precomputes scores for all piece types regardless of how many pieces remain. In endgames with few pieces, this could waste work compared to the scalar path. A future optimization might conditionally bypass the vectorized path when the piece count is below a threshold.
This is not a blocking concern for this PR—just noting the architectural consideration that was already raised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 69e6a567-6721-4912-93ed-c95b73f03ce2
📒 Files selected for processing (2)
src/misc.hsrc/movepick.cpp
|
clang-format 20 needs to be run on this PR. (execution 25455458115 / attempt 1) |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/movepick.cpp (1)
234-282:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAddress the clang-format pipeline failures.
The Clang-Format CI job fails on lines 234, 236, 245, 247–248, and 271, 273–281 (the
#if defined(USE_AVX512)guarded blocks and the surrounding code). Please run the project's clang-format target so the new preprocessor blocks match the existing style for indentation/spacing of#if/#else/#endifdirectives inside function bodies.
🧹 Nitpick comments (2)
src/movepick.cpp (1)
191-222: ⚡ Quick winDocument or refactor alignment assumptions in
init_quiet_hist_buffer.This function relies on three implicit alignment guarantees:
bufmust be 64-byte aligned for_mm512_store_epi32(line 218). The parameter typeint buf[][SQUARE_NB]decays to a plain pointer, making this contract invisible.&ph[sq]and&(*continuationHistory[j])[pc][sq]must be 32-byte aligned for_mm256_load_si256(lines 213, 216), currently relying onalignas(32)inMultiArray(misc.h).The code currently works correctly—
histBufferis declaredalignas(64)at the call site—but the alignment requirements are implicit. IfMultiArray's alignment is narrowed or removed, the loads silently become undefined behavior.Two options:
- Use unaligned intrinsics (
_mm256_loadu_si256/_mm512_storeu_epi32) to make the function independent of external alignment. Modern CPUs incur zero penalty on aligned data, eliminating the cross-file coupling with misc.h and reducing fragility.- Take
bufas a reference-to-array (int (&buf)[KING + 1][SQUARE_NB]) so alignment information is preserved in the type system, and document the alignment requirement.src/misc.h (1)
250-251: 🏗️ Heavy liftConsider narrowing the scope of
alignas(32)or using unaligned SIMD loads instead.Adding
alignas(32)to theMultiArrayclass template propagates to every instantiation in the codebase, at every nesting level. For nested arrays whose natural size isn't a multiple of 32 (e.g., innerMultiArray<int16_t, N>rows < 32 bytes), the compiler must pad each element sosizeofis a multiple of 32, which can noticeably grow memory for tables that have nothing to do with theinit_quiet_hist_bufferAVX-512 optimization.Two less invasive alternatives are worth considering:
- Apply alignment surgically — only the histories actually consumed by
init_quiet_hist_buffer(PawnHistoryand thePieceToHistoryrows reached viaContinuationHistory) need 32-byte alignment. A typed alias or wrapper for those, or explicitalignason the specific field/typedef, avoids touching allMultiArrayusers.- Drop the alignment requirement entirely and use
_mm256_loadu_si256(and_mm512_storeu_epi32) inmovepick.cpp. On Skylake and later,vmovdquhas identical performance tovmovdqaon naturally aligned addresses, so the SIMD fast path is preserved without paying a memory cost for unrelated tables.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 21d224e8-5b0e-4cde-a41b-3cd4c8174ef2
📒 Files selected for processing (2)
src/misc.hsrc/movepick.cpp
|
clang-format 20 needs to be run on this PR. (execution 25520695615 / attempt 1) |
|
clang-format 20 needs to be run on this PR. (execution 25521121574 / attempt 1) |
|
Hm. Maybe there's some stuff we can do here with templates or operator overloading to make it really easy for folks to work on this part of the code. The painful part is that the avx512 code path hoists everything out of the loop, so it'll probably require some ceremony... Like we could have something looking like and then in the loop we'd just do The implementation would be a bit thorny, but manageable I think |
No functional change bench: 2559757
|
clang-format 20 needs to be run on this PR. (execution 26001017050 / attempt 1) |
https://tests.stockfishchess.org/tests/view/69b822411860ed703cccef9b
STC: LLR: 2.94 (-2.94,2.94) <0.00,2.00>
Total: 45440 W: 11919 L: 11601 D: 21920
Ptnml(0-2): 148, 4839, 12422, 5169, 142
Precompute a combined history score for every (piece, square) pair.
Variant of #6672 trying to make the intrinsics portion simpler and less co-mingled with the rest of the code. May address some of the concerns/discussion in PR6672. I leave it to the maintainers to decide if it's a worthwhile tradeoff. All credit to @AliceRoselia.
No functional change
bench: 2559757