Feature/prefetch2 #1604
Open: maddyscientist wants to merge 124 commits into develop from feature/prefetch2
+2,143 −573
Changes from all commits (124 commits)
All commits by maddyscientist unless noted otherwise.

- 63b7ff4 Initial support for prefetching (over fetching) added to load instruc…
- 191105b Fix for half precision
- 5b41229 Apply some missing OMP parallelization to host functions
- a2efb44 Fix for fine-grained accessor vector loads
- c815076 Add prefetching instructions for CUDA
- 177c18b Optimization of neighbor indexing for dslash kernels: use bitwise ins…
- eae953d Add support for creating a backward gauge field
- 2540a1b Some small improvements to shift(GaugeField) function
- e686437 Gauge shift should encode shift value in aux_string
- 676c643 Add support for experimental double storage of gauge fields - disable…
- 9c2025b Fix some issues with gauge shift: fix single-GPU builds and add half/…
- 721fbd5 make doBulk and doHalo constexpr
- 02a4cb9 Add target::is_thread_zero and target::is_lane_zero helper functions …
- 33b5f2f Expose prefetching instructions
- ccf7a55 Add prefetching support to gauge and colorspinor fields
- 0642f63 Add L2 gauge-field prefetching support to both Wilson and staggered d…
- 72a001f QUDA_DSLASH_DOUBLE_STORE is now a CMake parameter
- 02e7bc3 Add TMA prefetch support for Wilson and staggered fermions (enabled w…
- 7bb5cdc Add target::uniform helper which is used to create warp-uniform varia…
- f42a507 Fix typo in last commit
- e2df25f Fix bug with non-double-store staggered dslash
- 3010aa6 Fix bug with parity setting
- acfaf5b Fix bulk prefetch of phase
- 67f8ce4 Add 3-d and 4-d TMA prefetch instructions
- 946bed0 first version of tensor descriptor TMA prefetch - almost certainly buggy
- d772d5f Fix some warnings and set Uback tensor descriptor for wilson dslash
- 60894ec Add 5-d tensor prefetch instruction to CUDA. Introduce 3-operand var…
- 9910869 colorspinor::FloatNOrder load/save functions use 3-operand vector_loa…
- b9a4d5f Continued improvements to tensor TMA prefetch variant and gauge::Floa…
- 23992e0 Guard TMA tensor descriptor creation with __COMPUTE_CAPABILITY__ >= 900
- f0f9afd Optimization for fixed point gauge field load with QUDA_RECONSTRUCT_N…
- cfaa705 Optimization of fixed-point phase rescaling
- 17d349c Small optimization to recon-8 unpack, reduces reconstruct by 4 multip…
- a5abce8 Fix backward hopping ghost boundary check in staggered dslash
- c265884 Fix UBSAN error: avoid pointer arithmetic on null pointers
- aee623d Optimize vector_load/vector_store in gauge_field_order.h to reduce 64…
- 6cfc18a Fix double-store dslash kernels when we have T partitioning - boundar…
- 168f097 Fix performance when using double-store gauge field: shifted gauge fi…
- f11bd84 Dslash prefetch should distinguish in the aux string
- a2a9b24 Added experimental optimization: replace parity * offset with bitmask…
- 7d17452 Optimization for staggered packing kernels: ensure we do division by …
- 27b725d Optimize scale_inv multiplication in gauge field reconstruction
- 2e12a2c Optimize the alternate path for i2f: with a pre-computed shift consta…
- b67b9fb Merge origin/feature/prefetch2
- abed9ac Revert "Added experimental optimization: replace parity * offset with…
- 50cc09a Optimize FFMA2 issuance
- 4c9fa83 Add experiment with L1 prefetching for staggered dslash
- 9daba3f No bank conflicts when doing L1 prefetch
- 8427323 Fix last commit
- daa5a4f Disable L1 prefetch experiment in dslash_staggered
- 4b0600a Fix 32-byte alignment when gauge field is padded
- bbd8ac6 Fix a double4 compiler conflict
- 1ed2db1 Fix conflict between block_size definitions
- 9de5021 Forbid NVSHMEM and TMA prefetching. Fix autotuner so that only valid…
- 30ae502 Fix ambiguity from multi-inheritance with fused DWF kernel
- 79934bb Cleanup of abstraction of TMA to allow for clean building on modern a…
- 573d0be Merge branch 'develop' of github.com:lattice/quda into feature/prefetch2
- 04b4fae We should only be aligning the stride with native gauge fields
- 0cf1286 Remove FMA optimized I2F, as it introduces floating point rounding tha…
- aaa629d We only ever need to resize the pad when creating a gauge field from …
- 5653947 Tweak block CG tolerance for staggered eigensolver. Laplace eigensol…
- c5cd669 Fix issue with MRHS Shamir DWF operator (pre-computed constant should…
- 20a70e4 Fix warning
- 74dd488 Fix bug in mdw_dslash5_tensor_core (was ignorant of the reworked acce…
- b2e6e88 Minor optimization mdw_dslash5_tensor_core.cuh and fix quarter precision
- 9b5545f Reduce carve-out autotuner overhead - default carve out step size is …
- d7568e6 Backwards gauge tensor descriptor copy only done if double store enabled
- c92f3cd Hopefully fix compiler warning
- 35da04f Fix HIP compilation
- 6041ec6 Always use ::cuda::maximum() now that we install our own CCCL
- 982f41b Always use ::cuda::maximum() now that we install our own CCCL
- 60a746b Update cub block interfaces
- 4918c98 Fix HIP load_store.h
- af2be33 Fix compilation warning with CUDA clang
- 02baeaa Add missing target_device.h
- 4b8352c Fix clang warning
- 13a192b Fix HIP function call
- 274cbad Fix TMA instruction exposure
- 89e8886 Fix clang warning
- 866a389 Fix clang error
- 63b97b9 Fix another clang error
- bcfaa50 Hopefully the last clang error
- b95f9b4 I2F is encoded in half precision fields
- 1b73643 Remove LEGACY_ACCESSOR_NORM path from colorspinor::FloatNOrder, and o…
- 510b0a2 Use CCCL 3.1.4 instead of latest main branch commit
- 55ee7cc Add some clarifying comments
- 9d21752 Fix compiler warning in domain_decomposition.h
- 8d04ac1 Add prefetching support for native staggered
- ca2a85a Remove stray debug asserts
- 0bc3ad3 Small clean up to tune_key
- 44b9000 tensor descriptor cache should work as expected now
- 96a3912 CMake will error out if TMA prefetch is requested but double-store is…
- 6360e16 Small cleanup to Wilson dslash
- 48e870b indexfromFaceIndexStaggered should not be constexpr
- 051dd43 Fix compilation issue tripping up some CI
- 16a787c Add 2-d TMA prefetch accessors
- 32fd0c3 Add run-time launch check when TMA is enabled to ensure parity is blo…
- 9fb3260 Cleanup of staggered dslash kernel
- cc6e837 Add FloatNOrder raw_load and raw_save functions
- 3229363 Gauge shift now operates on raw packed elements
- 35e734a Matrix::L1/L2/Linf method should be const qualified
- a5055cc Fix printing bug with LatticeField
- e38501a Add kernel_param::comms_dim_partitioned which mirrors comm_dim_partit…
- a8d4a0a Gauge shift kernel now fills in the ghost region of the shifted field…
- 73f46af When double-store is enabled, when doing the halo update always read …
- 37cfc7b Fix bug with staggered dslash test where partitioning was being reset…
- e223bfa Selecting the type of prefetching to use is now more verbose.
- ea36ced Runtime warning if dslash prefetch distance exceeds max for naive sta…
- 3b25ff5 Fix ROCm compilation
- 9b83fde Make HIP shared memory helpers match CUDA versions
- 709b7f9 Blackwell now defaults to using BULK TMA prefetching with a prefetch …
- 305884e Significant cleanup of TENSOR variant of prefetching. Descriptor not …
- 06413d0 Fix CI
- dd77fc0 Fix type with twisted mass
- a265269 Increase TuneKey::aux_n to prevent buffer overflow
- f92570e value to reference - fixes clang compilation issue
- 3ada421 Add git to docker file for CSCS
- 2125574 Fix deprecation warning with recent CUDA 13.1 regarding NVML temperat…
- 951a3ee Make the NVML temperature query more robust for the change in interface
- 3c8ed1a Fix CLI11 for modern compilers
- 8c7ba4d Temporary change of default prefetch type on sm100 while doing some b… (weinbe2)
- a510234 Fix bug in gauge shift when writing its halo. Add some sanity checks…
- d460006 Merge branch 'feature/prefetch2' of github.com:lattice/quda into feat…
- b0f2a86 Revert "Temporary change of default prefetch type on sm100 while doin…
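Several commits concern PTX-level prefetch instructions ("Add prefetching instructions for CUDA", "Expose prefetching instructions", "Add TMA prefetch support for Wilson and staggered fermions"). As an illustrative sketch only, not code taken from this PR, helpers of this kind are typically thin wrappers over the PTX `prefetch` and `cp.async.bulk.prefetch` instructions; the function names below are hypothetical:

```cuda
// Hypothetical helpers for illustration; names are not from the PR.

// Hint the memory system to bring one cache line at ptr into L2.
__device__ __forceinline__ void prefetch_l2(const void *ptr)
{
  asm volatile("prefetch.global.L2 [%0];" ::"l"(ptr));
}

// Bulk (TMA-style) variant: prefetch a contiguous byte range into L2.
// Available from compute capability 9.0 (Hopper) onward.
__device__ __forceinline__ void prefetch_l2_bulk(const void *ptr, unsigned bytes)
{
#if __CUDA_ARCH__ >= 900
  asm volatile("cp.async.bulk.prefetch.L2.global [%0], %1;" ::"l"(ptr), "r"(bytes));
#endif
}
```

Both are hints, so correctness is unaffected if the hardware ignores them; the guard on compute capability mirrors the PR's use of `__COMPUTE_CAPABILITY__ >= 900` around TMA-specific code.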