Conversation

@lantudou lantudou commented Dec 17, 2025

Motivation

Floating-point atomic add operations (atomicAdd / red.add.f32) on GPUs are non-deterministic: floating-point addition is not associative, so the unpredictable order in which threads perform the atomic adds causes slight run-to-run variations in the result. This is problematic for scenarios requiring strict reproducibility, such as debugging, testing, and scientific computing.

This PR enables deterministic LoRA activation reduction by replacing floating-point reduction with fixed-point integer arithmetic, ensuring identical outputs for the same inputs across runs.

Several existing issues relate to this feature: #546 #229 #294
spooknik/nunchaku-chroma#6

Modifications

  • gemm_utils.cuh: Added an int overload of reduce_add_pred using red.relaxed.gpu.global.add.s32 for deterministic integer atomic addition
  • lora.cuh:
    • Changed lora_act data type from float* to int*
    • Implemented a 16-bit fractional-precision (FRAC_BITS = 16) fixed-point representation in reduce_lora_act, scaling float values to integers before the reduction (see the sketch after this list)
    • In apply_lora_up, dequantized the fixed-point integers back to floats, with the inverse scale factor fused into scales to avoid extra computation overhead
  • gemm_w4a4.cuh / gemm_w4a4_launch_impl.cuh: Updated relevant type declarations and pointer casts
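Below is a minimal sketch of the fixed-point round trip described above, assuming nvcc with -arch=sm_70 or newer (red.relaxed requires sm_70+). FRAC_BITS = 16 and the red.relaxed.gpu.global.add.s32 instruction come from the PR description; the reduce_add_pred signature, the accumulate_fixed / dequantize_fixed helpers, the demo_reduce kernel, and the host driver are illustrative assumptions and do not mirror the actual reduce_lora_act / apply_lora_up kernels in lora.cuh.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Fixed-point format: 16 fractional bits, as in the PR. A 32-bit accumulator
// then leaves about 15 bits of integer headroom before overflow.
constexpr int FRAC_BITS = 16;

// Assumed int overload of reduce_add_pred: a predicated, relaxed atomic add on
// a global int via the PTX instruction named in the PR. Integer addition is
// associative, so the accumulation order cannot change the result.
__device__ __forceinline__ void reduce_add_pred(int *addr, int val, bool pred) {
    if (pred) {
        asm volatile("red.relaxed.gpu.global.add.s32 [%0], %1;"
                     :
                     : "l"(addr), "r"(val)
                     : "memory");
    }
}

// reduce_lora_act side: quantize a float partial sum to fixed point
// (round to nearest) and accumulate deterministically.
__device__ __forceinline__ void accumulate_fixed(int *acc, float partial, bool pred) {
    reduce_add_pred(acc, __float2int_rn(partial * float(1 << FRAC_BITS)), pred);
}

// apply_lora_up side: dequantize back to float. In the PR the 1 / 2^16 factor
// is fused into the LoRA-up scales; here it is applied explicitly for clarity.
__device__ __forceinline__ float dequantize_fixed(int acc) {
    return float(acc) * (1.0f / float(1 << FRAC_BITS));
}

// Toy kernel: every thread contributes one element to a single accumulator.
__global__ void demo_reduce(const float *in, int *acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    accumulate_fixed(acc, i < n ? in[i] : 0.0f, i < n);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = 0.001f * float(i % 97) - 0.05f;

    float *d_in = nullptr;
    int *d_acc = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_acc, sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Run the same reduction twice: the fixed-point accumulator is
    // bit-identical across runs, unlike a float atomicAdd reduction.
    for (int run = 0; run < 2; ++run) {
        cudaMemset(d_acc, 0, sizeof(int));
        demo_reduce<<<(n + 255) / 256, 256>>>(d_in, d_acc, n);
        int acc = 0;
        cudaMemcpy(&acc, d_acc, sizeof(int), cudaMemcpyDeviceToHost);
        printf("run %d: fixed = %d  float = %f\n",
               run, acc, double(acc) / double(1 << FRAC_BITS));
    }
    cudaFree(d_in);
    cudaFree(d_acc);
    return 0;
}
```

With 16 fractional bits the quantization step is 2^-16 ≈ 1.5e-5 and a 32-bit accumulator retains about 15 bits of integer range, which presumably gives enough headroom for the LoRA activation sums; folding 1 / 2^16 into the existing scales keeps the dequantization free of extra arithmetic.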

@lantudou lantudou closed this Dec 17, 2025
@lantudou lantudou changed the title from "Ensure deterministic LoRA activation reduction with fixed-point arithmetic" to "feat: Enable deterministic LoRA activation reduction using fixed-point arithmetic" Dec 17, 2025
@lantudou lantudou reopened this Dec 17, 2025