
Conversation

@Bluear7878
Contributor

This PR adds float16 support for the v2 Flux model.

Key changes

  • Fixed CUDA kernel: Updated the AWQ CUDA kernel to dispatch on the input tensor dtype instead of the scaling_factors dtype
  • Added dtype utilities: Created nunchaku/dtype_utils.py for centralized dtype management and automatic GPU-based dtype selection (see the sketch after this list)
  • Test coverage: Added float16/bfloat16 parametrized tests in test_flux_dev.py
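
A minimal sketch of the dtype-selection idea, assuming a helper lives in nunchaku/dtype_utils.py; the function name and exact policy here are illustrative, not the PR's verbatim API:

import torch

def select_torch_dtype(device_index: int = 0) -> torch.dtype:
    # Prefer bfloat16 where it is natively supported (Ampere and newer, SM >= 8.0);
    # fall back to float16 on older GPUs such as Turing (SM 7.5).
    if not torch.cuda.is_available():
        return torch.float32
    major, _ = torch.cuda.get_device_capability(device_index)
    return torch.bfloat16 if major >= 8 else torch.float16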

@lmxyy lmxyy changed the title v2 float16 support. feat: Python Backend Float16 Support for Turing GPUs Sep 12, 2025
@Bluear7878 Bluear7878 closed this Sep 12, 2025
@Bluear7878 Bluear7878 reopened this Sep 12, 2025
@Juste-Leo2

Would it be possible to also add support for the Pascal architecture? Even if the computation speed is slower (since int4 computation isn’t supported), could an on-the-fly conversion to Float16 be performed?


precision = get_precision() # auto-detect your precision is 'int4' or 'fp4' based on your GPU

torch_dtype = torch.float16 # Auto-selects bfloat16 on Ampere+ GPUs, float16 otherwise
Collaborator


this variable seems to have no use

@lmxyy
Collaborator

lmxyy commented Sep 13, 2025

I think you can choose the torch_dtype by passing torch_dtype in

f"nunchaku-tech/nunchaku-flux.1-dev/svdq-{precision}_r32-flux.1-dev.safetensors"
.
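
A hedged example of what that could look like, assuming the quantized Flux transformer is loaded through nunchaku's from_pretrained and that it accepts torch_dtype (the class name below is an assumption, not confirmed by this thread):

import torch
from nunchaku import NunchakuFluxTransformer2dModel  # exact class name assumed
from nunchaku.utils import get_precision

precision = get_precision()  # 'int4' or 'fp4' depending on the GPU
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    f"nunchaku-tech/nunchaku-flux.1-dev/svdq-{precision}_r32-flux.1-dev.safetensors",
    torch_dtype=torch.float16,  # float16 for Turing, bfloat16 on Ampere and newer
)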

@Zhenyi-Wang

Hi, is this for Flux only, or for others like Qwen-Image as well?

@Bluear7878
Contributor Author

@lmxyy Thank you for the helpful suggestion. It allowed me to fix the torch_dtype related code.
@Zhenyi-Wang Not directly at the moment, but it should be applicable to qwen-image after some minor tweaks. I'll look into it soon.

@Zhenyi-Wang

@lmxyy Thank you for the helpful suggestion. It allowed me to fix the torch_dtype related code. @Zhenyi-Wang Not directly at the moment, but it should be applicable to qwen-image after some minor tweaks. I'll look into it soon.

That would be fantastic! You dalaos (experts) are so great!

Collaborator

@lmxyy lmxyy left a comment


I think you also need to support FP16 for Qwen-Image.

Tensor gemv_awq(
    Tensor _in_feats, Tensor _kernel, Tensor _scaling_factors, Tensor _zeros, int m, int n, int k, int group_size) {
-   return dispatchFloat16(_scaling_factors.scalar_type(), [&]<typename half_t>() {
+   return dispatchFloat16(_in_feats.scalar_type(), [&]<typename half_t>() {
Collaborator


Why do you need to change this?

transformer.load_state_dict(converted_state_dict)

# Convert all quantization buffers to model dtype
convert_awq_buffers_to_dtype(transformer, transformer._dtype)
Collaborator


Why do we need to convert the dtype? We can initialize the modules with the correct dtype by passing torch_dtype in _patch_model.

# Load the wtscale from the converted state dict.
# Match floating-point tensors to model dtype
v = converted_state_dict[k]
if isinstance(v, torch.Tensor) and v.is_floating_point():
Collaborator


Have you tested this with the FP4 models? This doesn't look correct.

@Bluear7878 Bluear7878 closed this Sep 15, 2025
@Zhenyi-Wang

Closed?

@Bluear7878 Bluear7878 reopened this Sep 25, 2025
@Bluear7878
Contributor Author

Bluear7878 commented Sep 25, 2025

[Screenshot: logged tensor values exceeding the float16 maximum]

I have identified critical overflow errors and image quality degradation when using low-precision quantization schemes (FP4, INT4) with torch.float16. The issues seem to originate from intermediate tensor values exceeding the maximum representable range of float16, leading to either crashes or corrupted outputs.

  1. FP4 + Float16 Causes Overflow:

    • When employing FP4 quantization, operations consistently result in an overflow error when float16 is used for computations. This suggests that the dynamic range of the model's activations or weights is too wide for the limited range of float16.
  2. Qwen-Image Model Overflow with INT4 + Float16:

    • A similar overflow issue was observed specifically with the Qwen-Image model using INT4 quantization and float16.
    • As shown in the attached screenshot, the values significantly exceed the maximum limit of float16 (~65,504), confirming that an overflow is occurring (a minimal illustration of this limit follows this list).
  3. Image Degradation in Flux Model with INT4 + Float16:

    • While the combination of INT4 quantization, float16, and the Flux model is operational and does not crash, it produces severe image degradation.
    • This is likely due to aggressive clipping, where out-of-range values are forced to the maximum/minimum float16 value instead of throwing an error. This process leads to a significant loss of information and corrupts the final generated image.
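
For reference, a tiny standalone snippet (not from the PR) showing the range difference that drives the overflows described above:

import torch

x = torch.tensor([30000.0, 70000.0, 1.0e6])
print(x.to(torch.float16))   # anything past ~65,504 overflows to inf in float16
print(x.to(torch.bfloat16))  # bfloat16 keeps these values finite, trading away mantissa precision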

@Zhenyi-Wang

As a 2080ti user, I really appreciate your work on this float16 support—could this PR also address issue #492, or would it be possible to tackle that together if not?

@Zhenyi-Wang

As a 2080ti user, I really appreciate your work on this float16 support—could this PR also address issue #492, or would it be possible to tackle that together if not?

Sorry I mean #420

@Bluear7878
Contributor Author

As a 2080ti user, I really appreciate your work on this float16 support—could this PR also address issue #492, or would it be possible to tackle that together if not?

Sorry I mean #420

With this PR, using float16 no longer causes any code-level errors.
However, if the model's required numerical range significantly exceeds what float16 can represent, it will likely result in a black screen due to overflow issues. The Flux model, on the other hand, does generate images successfully with int4+float16, albeit with somewhat degraded quality.
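
A small generic check (an illustration only, not part of this PR) for confirming whether a black output really is float16 overflow, by scanning an intermediate tensor for non-finite values:

import torch

def report_non_finite(name: str, t: torch.Tensor) -> None:
    # Count inf/NaN entries and report the largest magnitude seen.
    bad = (~torch.isfinite(t)).sum().item()
    if bad:
        print(f"{name}: {bad} non-finite values, max |x| = {t.abs().max().item()}")

For example, calling report_non_finite("latents", latents) after a denoising step would flag the overflow before the decoded image turns black.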

@senb-ent

senb-ent commented Oct 14, 2025

I am trying to use this branch.
I compiled the branch with Docker based on Dockerfile.torch28.

Then I tried running qwen-image-lightning.py
but got this error:
TypeError: Qwen2_5_VLForConditionalGeneration.__init__() got an unexpected keyword argument 'offload_state_dict'

My GPU: 2080Ti

@Bluear7878 Do you know what's not working?

Thanks in advance

full log:

  warnings.warn("[Nunchaku] Device does not support bfloat16; falling back to float16.")
The config attributes {'pooled_projection_dim': 768} were passed to NunchakuQwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/nunchaku/qwen-image-lightning.py", line 43, in <module>
    pipe = QwenImagePipeline.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/nunchaku/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/nunchaku/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 1025, in from_pretrained
    loaded_sub_model = load_sub_model(
                       ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/nunchaku/lib/python3.11/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 849, in load_sub_model
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/nunchaku/lib/python3.11/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/nunchaku/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4974, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Qwen2_5_VLForConditionalGeneration.__init__() got an unexpected keyword argument 'offload_state_dict'

ERROR conda.cli.main_run:execute(127): `conda run /bin/bash -c python qwen-image-lightning.py` failed. (See above for error)
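
The TypeError above most likely comes from a diffusers/transformers version mismatch (diffusers forwards offload_state_dict, which newer transformers no longer accepts) rather than from this branch itself. One possible way to sidestep it, shown as an untested sketch with the repo id and subfolder assumed and the nunchaku transformer setup omitted, is to load the text encoder manually so diffusers skips its own loading path for that component:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from diffusers import QwenImagePipeline

# Repo id and subfolder are assumptions; newer transformers prefers dtype= over
# torch_dtype= (see the deprecation warning in the log above).
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen-Image", subfolder="text_encoder", torch_dtype=torch.float16
)
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", text_encoder=text_encoder, torch_dtype=torch.float16
)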

@otherV

otherV commented Nov 8, 2025

Any update on this? nunchaku-t5 does not work on Turing GPUs yet.

@silent-rain

Looking forward to support for the 20 series.

@Ph0rk0z

Ph0rk0z commented Nov 29, 2025

I am using models on a 2080 in ComfyUI. Does this only fix the examples? They don't get NaNs and work alright.

@otherV

otherV commented Nov 29, 2025

@Ph0rk0z This is supposed to bring AWQ kernel support to Turing GPUs, so that we can use models like nunchaku-t5 (awq-int4-flux.1-t5xxl.safetensors). I tried this PR; it didn't work. If you are using models that don't depend on AWQ, everything works.

@Ph0rk0z

Ph0rk0z commented Nov 29, 2025

I thought all of nunchaku used AWQ kernels, and that what was really missing was the MMQ workarounds in them. I never tried to run T5 with this though, only full models like Flux, etc.

@sarim

sarim commented Dec 17, 2025

As a 2080ti user, I really appreciate your work on this float16 support—could this PR also address issue #492, or would it be possible to tackle that together if not?

Sorry I mean #420

With this PR, using float16 no longer causes any code-level errors. However, if the model's required numerical range significantly exceeds what float16 can represent, it will likely result in a black screen due to overflow issues. The Flux model, on the other hand, does generate images successfully with int4+float16, albeit with somewhat degraded quality.

@Bluear7878 So is there any clever workaround to make Turing work with existing model files? I know almost nothing about the underlying science here, but the only thing I can think of is to quantize the models again with custom logic for Turing, where all weights would be scaled down to fit in the float16 range. Is that the only way forward?

@Bluear7878
Contributor Author

@senb-ent @otherV

Sorry for the late answer. I will take care of this issue.

@sarim
I believe the approach you mentioned could be used to solve this problem, but the goal of nunchaku is to support acceleration without modifying the existing model weights.
I've mentioned this issue to lmxyy and we are discussing it, so please wait patiently.
