[ROCm] Enable inference quantization tests on ROCm (Float8, Int8, per-token) #4044
brucechanglongxu wants to merge 1 commit into pytorch:main
Remove blanket `@skip_if_rocm` from tests that already pass on MI300X:

- `test_workflow_e2e_numerics`: Float8WeightOnly, Int8DynActInt8Weight, and Int8WeightOnly all pass. Float8DynActFloat8Weight with the default PerTensor granularity is skipped due to a `_is_128_128_scaled` false positive when the linear is exactly 128x128 (block_size coincides with the shape); PerRow granularity works fine.
- `test_per_token_linear_cuda`: per-token int8 dynamic quantization works on ROCm across float32/float16/bfloat16.
- `test_flatten_unflatten`: Int8 quantized tensor flatten/unflatten roundtrip works on ROCm.

Remaining skips, with updated reasons:

- `test_print_quantized_module`: hipSPARSELt reports available via `torch.backends` but fails at runtime on MI300X.
- `test_int4_weight_only_quant_subclass_api_grouped`: `_weight_int4pack_mm` hits a `qScaleAndZeros` size assertion on the small N=16/N=8 shapes used in this test.
- `test_int8_weight_only_quant_with_freeze`: kept as-is (flaky).
Removes blanket `@skip_if_rocm("ROCm enablement in progress")` from inference quantization tests that already pass on MI300X (gfx942). These were skipped during the initial ROCm bringup, but the underlying quantization configs work fine now.

- `test_workflow_e2e_numerics` in `test_quant_api.py` runs end-to-end quantize + inference + SQNR checks for several configs. On ROCm, Float8WeightOnly, Int8DynActInt8Weight, and Int8WeightOnly all pass. GemliteUIntX is already gated by `has_gemlite`. Float8DynActFloat8Weight with default PerTensor granularity is skipped -- see note below.
- `test_per_token_linear_cuda` in `test_integration.py` tests `_quant_int8_dynamic_per_token_linear` on GPU across float32/float16/bfloat16. Passes on MI300X with SQNR >= 39 on all dtypes.
- `test_flatten_unflatten` in `test_affine_quantized.py` tests the `__tensor_flatten__`/`__tensor_unflatten__` roundtrip on Int8 quantized tensors. Passes on both CPU and CUDA on ROCm.

Tests that remain skipped, with updated reasons:
- `test_print_quantized_module`: `torch.backends.cusparselt.is_available()` returns True on this ROCm machine (hipSPARSELt backend detected), but `SemiSparseLayout` fails at runtime with "hipSPARSELt not supported on your machine". Updated the skip message to reflect the actual blocker.
- `test_int4_weight_only_quant_subclass_api_grouped`: see note below.
- `test_int8_weight_only_quant_with_freeze`: unchanged, marked flaky.

Two bugs discovered during this investigation (not ROCm-specific, but surfaced here because the `mslk` codepath is unavailable on AMD):

- `_is_128_128_scaled` false positive on per-tensor scaled tensors: when a weight tensor happens to be exactly 128x128, PerTensor granularity produces `block_size=[128, 128]`. `_is_128_128_scaled` checks `b[0] == 128 and b[1] == 128` and returns True, even though `_is_tensorwise_scaled` also returns True. The `torch` kernel path in `float8_tensor.py` then hits `elif _is_128_128_scaled(weight_tensor)` before checking tensorwise, and asserts that the input must be `_is_1_128_scaled`, which fails. On CUDA with SM90+ the `mslk` path is taken instead, so this never triggers. The fix would be to check `_is_tensorwise_scaled` before `_is_128_128_scaled` in the dispatch chain, or to have `_is_128_128_scaled` exclude the tensorwise case. Float8DynActFloat8Weight with PerRow granularity works fine on ROCm (SQNR=28.9).
- `_weight_int4pack_mm` assertion on small output dimensions: `test_int4_weight_only_quant_subclass_api_grouped` uses test shapes with N=16 and N=8. The tinygemm kernel (`_weight_int4pack_mm`) asserts `qScaleAndZeros.size(1) == n` and fails on these shapes. Int4WeightOnly with `tile_packed_to_4d` works on normal-sized shapes (tested 128x128 with group_size=32, SQNR=24.4). This is likely a pre-existing constraint of the tinygemm packing format on ROCm rather than something introduced by these changes.

Tested on MI300X (gfx942) in a ROCm Docker container.
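The proposed fix for the first bug, reordering the dispatch so the tensorwise check runs before the 128x128 blockwise check, can be sketched as follows. The predicate names follow `float8_tensor.py`, but the bodies and the `pick_kernel_path` helper are simplified illustrations, not the actual torchao code:

```python
# Sketch of the proposed dispatch reordering, assuming simplified
# stand-ins for the real torchao predicates. With tensorwise checked
# first, a 128x128 PerTensor weight no longer falls into the blockwise
# branch (which would assert the activation is 1x128 scaled and fail).

def _is_tensorwise_scaled(block_size, shape):
    # One scale for the whole tensor: block_size equals the shape.
    return list(block_size) == list(shape)

def _is_128_128_scaled(block_size):
    return block_size[0] == 128 and block_size[1] == 128

def pick_kernel_path(block_size, shape):
    # Proposed order: tensorwise before 128x128 blockwise.
    if _is_tensorwise_scaled(block_size, shape):
        return "tensorwise"
    if _is_128_128_scaled(block_size):
        return "blockwise_128_128"
    return "other"

# A 128x128 PerTensor weight now takes the tensorwise path, while a
# genuinely 128x128-blockwise-scaled larger weight is still detected.
print(pick_kernel_path([128, 128], (128, 128)))  # tensorwise
print(pick_kernel_path([128, 128], (256, 256)))  # blockwise_128_128
```

The alternative fix mentioned above, making `_is_128_128_scaled` itself exclude the tensorwise case, would amount to adding a `not _is_tensorwise_scaled(...)` conjunct inside that predicate instead of reordering the branches.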