Replies: 32 comments 20 replies
-
No quantization, just a simple cast.
Condition: Simple cast to
The results seem surprisingly decent. I don't know why the model initialization time is shorter with LoRA.
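For context, "simple cast" here just means converting the weights to fp8 with no scaling. A minimal sketch of what that alone costs numerically (illustrative only, not the repository's implementation):

```python
import torch

# A stand-in weight tensor; real DiT linear weights are much larger.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# "Simple cast": straight conversion to fp8, no per-tensor/per-channel scale.
w_fp8 = w.to(torch.float8_e4m3fn)

# Round-trip back to bfloat16 to measure what the cast alone loses.
err = (w_fp8.to(torch.bfloat16) - w).abs()
print(f"mean abs error: {err.mean().item():.5f}, max abs error: {err.max().item():.5f}")
```

Since typical weight values sit well inside the e4m3fn range, the error here comes from the reduced mantissa rather than clipping, which is what the scaling approaches below try to mitigate.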
-
Condition: Simple cast to
The blob's face is clearly closer to what it would look like without LoRA.
-
Summary of Initial Results
Model initialization time doesn't change significantly with quantization. Inference time is slower with simple casting, likely because simple casting performs all calculations in fp8. With the mixed quantization approach (some modules remain in bfloat16), some parts are calculated in the faster bfloat16, and this speedup likely outweighs the dequantization overhead. Given the quality degradation observed with quantization, I'd like to explore per-channel quantization before investigating the impact of reducing the excluded modules.
EDIT: The code includes a step that reduces the influence of outliers by computing the maximum value with a percentile instead of max. However, the simpler max seems to work better than the percentile, both per-tensor and per-channel. You can confirm this by commenting out that code.
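For readers following along, here is a rough sketch of the two ways of picking the maximum value mentioned above (plain max vs. a percentile), for both per-tensor and per-channel scaling. This is illustrative only, not the repository's actual code, and the 99.9 percentile is an arbitrary choice.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def fp8_scale(w, per_channel, percentile=None):
    """Compute a scale so that w / scale fits into float8_e4m3fn.

    per_channel=True computes one scale per output channel (dim 0 of a Linear
    weight); percentile, if given, replaces the plain max to damp outliers.
    """
    absw = w.abs().float()
    if per_channel:
        flat = absw.reshape(absw.shape[0], -1)
        if percentile is None:
            max_val = flat.amax(dim=1)
        else:
            max_val = torch.quantile(flat, percentile / 100.0, dim=1)
        scale = (max_val / FP8_MAX).clamp(min=1e-12).reshape(-1, *([1] * (w.dim() - 1)))
    else:
        if percentile is None:
            max_val = absw.amax()
        else:
            max_val = torch.quantile(absw.flatten(), percentile / 100.0)
        scale = (max_val / FP8_MAX).clamp(min=1e-12)
    return scale

w = torch.randn(3072, 3072)
scale = fp8_scale(w, per_channel=True)        # try percentile=99.9 to compare
w_q = (w / scale).to(torch.float8_e4m3fn)     # quantize
w_dq = w_q.to(torch.float32) * scale          # dequantize
print("mean abs error:", (w - w_dq).abs().mean().item())
```

Swapping in percentile=99.9 and comparing the reconstruction error is a quick way to reproduce the max-vs-percentile observation above.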
-
Condition: float8_e4m3fn + per-channel scaling + attention+MLP only
In the first image, the table is gone, but the pi value is fine.
-
Condition: float8_e5m2 + per-channel scaling + attention+MLP only
In the first image, the neon sign is simplified and the pi value is incorrect. Next, we will investigate the impact of the target modules using float8_e4m3fn per-channel scaling.
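The difference between the two fp8 formats is easy to see from their finfo: e5m2 trades mantissa bits for exponent range, which is consistent with the coarser results here. A quick check:

```python
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(dtype, "max:", fi.max, "smallest normal:", fi.tiny, "eps:", fi.eps)
# float8_e4m3fn has a smaller range but finer precision (3 mantissa bits)
# than float8_e5m2 (2 mantissa bits), so with proper scaling e4m3fn
# usually preserves weights better.
```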
-
Amazing research and work, thank you so much.
-
Quantizing the timestep embedding does not reduce memory usage, so it is best to always exclude it.
Condition: float8_e4m3fn + per-channel scaling + attention+MLP + modulation
In the first image, the table is there, but the pi values are messed up.
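That matches what you would expect from parameter counts: the timestep embedding is tiny compared with the attention, MLP, and modulation Linears, so quantizing it saves almost nothing. A rough way to check this; the module names and sizes below are assumptions, not the real Qwen-Image definitions:

```python
import torch.nn as nn

# Tiny stand-in with roughly DiT-like proportions; real names/sizes are assumptions.
model = nn.ModuleDict({
    "attn_qkv": nn.Linear(3072, 9216),
    "mlp_fc1": nn.Linear(3072, 12288),
    "modulation": nn.Linear(3072, 18432),
    "time_embed": nn.Linear(256, 3072),
})

def fp8_savings_mb(key: str) -> float:
    """MB saved by storing matching Linear weights in fp8 (1 byte) instead of bf16 (2 bytes)."""
    params = sum(
        m.weight.numel()
        for name, m in model.named_modules()
        if isinstance(m, nn.Linear) and key in name
    )
    return params / 1e6  # one byte saved per weight element

for key in ("attn", "mlp", "modulation", "time_embed"):
    print(f"{key}: ~{fp8_savings_mb(key):.1f} MB saved")
```

In a real DiT the attention/MLP/modulation numbers repeat for every block, while the timestep embedding appears once, so its contribution to total memory is negligible.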
-
Amazing!
-
Summary of Results So Far
Proposal for Implementation
-
Excellent news, thank you so much.
-
I think it would be possible to add a test that compares fp8 scales against PyTorch's ao (torchao) to confirm that the max-value implementation in this repository is consistent.
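Something along these lines might work as that consistency test: it compares the output of a torchao-quantized Linear against a manual per-channel max-abs fake-quantization. This is only a sketch, assuming a CUDA device, the torchao Float8WeightOnlyConfig mentioned below (which I believe defaults to float8_e4m3fn), and a tolerance that would still need tuning.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

torch.manual_seed(0)
device = "cuda"  # assumption: a CUDA device is available
linear = nn.Linear(256, 256, bias=False, device=device, dtype=torch.bfloat16)
x = torch.randn(8, 256, device=device, dtype=torch.bfloat16)

# Reference: manual per-channel (per output row) max-abs scaling.
w = linear.weight.detach().float()
scale = w.abs().amax(dim=1, keepdim=True) / torch.finfo(torch.float8_e4m3fn).max
w_dq = (w / scale).to(torch.float8_e4m3fn).float() * scale
ref = x.float() @ w_dq.t()

# torchao weight-only fp8 quantization of the same layer.
quantize_(linear, Float8WeightOnlyConfig())
out = linear(x).float()

# A real test would assert a tolerance here; the right threshold needs tuning.
print("max abs difference:", (out - ref).abs().max().item())
```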
-
Condition: TorchAO Float8WeightOnlyConfig with float8_e4m3fn (per-channel quantization) + attention+MLP + modulation
In the first image, the table is there, but the pi value is incorrect.
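For anyone who wants to reproduce this kind of condition with torchao, the call might look roughly like the following. The module name fragments are hypothetical, the config is assumed to default to float8_e4m3fn, and the filter_fn behavior should be checked against your torchao version.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# Tiny stand-in for a DiT block; the real module names are assumptions.
model = nn.ModuleDict({
    "attn_qkv": nn.Linear(64, 192),
    "mlp_fc1": nn.Linear(64, 256),
    "modulation": nn.Linear(64, 384),
    "time_embed": nn.Linear(64, 64),   # excluded from quantization
}).to(torch.bfloat16)

TARGET_KEYS = ("attn", "mlp", "modulation")

def is_target(module, fqn):
    # filter_fn receives the module and its fully qualified name.
    return isinstance(module, nn.Linear) and any(k in fqn for k in TARGET_KEYS)

quantize_(model, Float8WeightOnlyConfig(), filter_fn=is_target)
```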
-
Condition: TorchAO Int8WeightOnlyConfig (per-channel quantization) + attention+MLP + modulation
The trend is similar to TorchAO fp8, but in the first image the cup shape and the woman's pose are close to the baseline. In the second image, the hand shape is slightly different from the baseline. Inference speed is improved. Based on the results so far, Musubi Tuner's per-channel scaling quantization seems slightly better than TorchAO's.
-
Also, full INT8 training (full fine-tune) is possible if you are interested: https://github.com/Disty0/sdnq/
The RTX 5090 and RTX 4090 have 2x faster INT8 throughput than FP8.
-
I added block-wise quantization. I'll add a sample later, but even when applying quantization to the modulation layers, the results seem to be pretty good. Memory usage is about 20 GB (almost the same as per-channel).
-
Condition: float8_e4m3 + block-wise scaling (block size=64) for all layers + attention+MLP+modulation
In the first image, the pi value is good and the table is there. Inference slowed to 104.79 seconds compared to 102.53 seconds with per-channel quantization, a little over 2% slower. VRAM usage increased to 19.96 GB compared to 19.34 GB with per-channel quantization.
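For reference, a minimal sketch of block-wise fp8 quantization with one scale per 64-wide block along the input dimension; this is illustrative and not the exact implementation in the PR:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
BLOCK = 64

def quantize_blockwise(w):
    """Quantize a (out_features, in_features) weight with one scale per 64-wide block."""
    out_f, in_f = w.shape
    assert in_f % BLOCK == 0
    blocks = w.float().reshape(out_f, in_f // BLOCK, BLOCK)
    scale = blocks.abs().amax(dim=-1, keepdim=True) / FP8_MAX   # (out_f, n_blocks, 1)
    scale = scale.clamp(min=1e-12)
    w_q = (blocks / scale).to(torch.float8_e4m3fn)
    return w_q, scale

def dequantize_blockwise(w_q, scale, dtype=torch.bfloat16):
    blocks = w_q.to(torch.float32) * scale
    return blocks.reshape(blocks.shape[0], -1).to(dtype)

w = torch.randn(3072, 3072, dtype=torch.bfloat16)
w_q, scale = quantize_blockwise(w)
err = (dequantize_blockwise(w_q, scale) - w).abs().mean()
print("mean abs error:", err.item())
# One fp32 scale per 64 fp8 weights adds roughly 6% to the quantized layers'
# footprint, which is in line with the modest VRAM increase over per-channel
# scaling reported above.
```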
-
We've opened #575 to make block-wise scaling the default, and we welcome your feedback.
-
For reference, here are the results of Wan2.2 inference with per-tensor and block-wise quantization. The top row is block-wise, the middle row is bfloat16, and the bottom row is per-tensor. You can see that the difference between block-wise and bfloat16 is smaller than the difference between per-tensor and bfloat16.
-
A large company recently released a pruned version of Qwen-Image, which is only 13.3B parameters, and it looks like the quality is quite good.
-
PR #575 has been merged. Thank you for your cooperation!
-
(Sorry, I accidentally posted this into the PR; I meant to post it here.) Hi @kohya-ss, thank you for this, very interesting! I've been working on quants as well, but more from a training-speed perspective than quality. Not all of the choices you have outlined, such as block-wise/tile-wise and channel-wise model weights, can be done efficiently in training, but some can. Here are some LoRA training tests with tensor-wise model weights and channel-wise activations. I still need to clean up my code somewhat before making it public.
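To make the granularity concrete, here is a simulated (quantize-dequantize) sketch of tensor-wise weight scaling combined with channel-wise (per-row) activation scaling. The real training code would use an fp8 matmul kernel instead of the fake-quant shown here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def fake_quant_fp8(t, scale):
    # Round-trip through fp8 and back, keeping the computation in float32.
    return ((t / scale).to(torch.float8_e4m3fn).to(torch.float32)) * scale

x = torch.randn(16, 1024)          # activations: (tokens, features)
w = torch.randn(4096, 1024)        # Linear weight: (out_features, in_features)

# Tensor-wise scale for the weight: a single scalar.
w_scale = w.abs().amax() / FP8_MAX
# Channel-wise (per-row) scale for the activations.
x_scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX

y_ref = x @ w.t()
y_q = fake_quant_fp8(x, x_scale) @ fake_quant_fp8(w, w_scale).t()
print("relative error:", ((y_q - y_ref).norm() / y_ref.norm()).item())
```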
-
Here are some tests with SDNQ on an Nvidia RTX 5090, RTX 4090, and RTX 3090, an AMD RX 7900 XTX, and an Intel ARC A770, using this script to benchmark: https://github.com/Disty0/sdnq/blob/main/scripts/benchmark_sdnq.py
Some notes: Using row-wise quantization for quality. This gives very good quality but also makes it more challenging to run, as the weights have to be re-quantized for the backward pass.
Naming: SDNQ Float X: using BF16 matmul with X weights (with group_size=32 on quantized weights).
Results:
Nvidia RTX 5090:
Nvidia RTX 4090:
Nvidia RTX 3090:
AMD RX 7900 XTX:
Intel ARC A770:
-
Normally the speed difference between the 4090 and the 5090 is about 25% when I tested in SwarmUI, but that is some insane TFLOPS difference there @Disty0
-
Also, FP16 vs BF16 vs INT8 (quantized) vs UINT8 (quantized) vs FP8 E4 (quantized) after 1,000,000 steps of (x and y are from torch.randn):
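The exact operation benchmarked above isn't shown here, but a comparison along these lines can be reproduced with something like the following, which repeatedly matmuls quantize-dequantized random tensors and averages the error against an fp32 reference (far fewer steps for brevity; UINT8 would additionally need a zero point):

```python
import torch

def fake_quant(t, dtype):
    """Round-trip t through the given dtype and back to float32."""
    if dtype in (torch.float16, torch.bfloat16):
        return t.to(dtype).float()
    if dtype is torch.float8_e4m3fn:
        scale = t.abs().amax() / torch.finfo(dtype).max
        return (t / scale).to(dtype).float() * scale
    if dtype is torch.int8:  # symmetric quantization
        scale = t.abs().amax() / 127
        return torch.clamp((t / scale).round(), -127, 127).to(dtype).float() * scale
    raise ValueError(dtype)

torch.manual_seed(0)
errors = {d: 0.0 for d in (torch.float16, torch.bfloat16, torch.int8, torch.float8_e4m3fn)}
steps = 1000  # the post above used 1,000,000 steps
for _ in range(steps):
    x, y = torch.randn(64, 64), torch.randn(64, 64)
    ref = x @ y
    for d in errors:
        errors[d] += ((fake_quant(x, d) @ fake_quant(y, d)) - ref).norm().item()
for d, e in errors.items():
    print(d, e / steps)
```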
-
PyTorch int_mm is broken on Nvidia. Switching the mm function from torch._int_mm to Triton took the RTX 5090 from 340 TFLOPS to 450 TFLOPS and the RTX 3090 from 110 TFLOPS to 170 TFLOPS. Only Nvidia is affected by this, so the Triton int mm is only used on Nvidia in the benchmarks. Here are the updated benchmarks:
Notes: Using row-wise quantization for quality. This gives very good quality but also makes it more challenging to run, as the weights have to be re-quantized for the backward pass.
Naming: SDNQ Float X: using BF16 matmul with X weights (with group_size=32 on quantized weights). CKPT: quantize the backward inputs before saving them in the forward pass. FP8 TW: use a software row-wise mm, because the PyTorch implementation of row-wise fp8 mm is broken on the RTX 4000 series.
Results:
Nvidia RTX 3090:
Nvidia RTX 4090:
Nvidia RTX 5090:
Nvidia RTX PRO 6000:
Intel ARC A770:
AMD RX 7900 XTX:
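For anyone who wants to check the torch._int_mm numbers on their own GPU, a rough micro-benchmark might look like this (torch._int_mm is a private API, so its behavior and constraints can vary between PyTorch versions; integer throughput is reported as TFLOPS to match the numbers above):

```python
import time
import torch

M = K = N = 4096
a = torch.randint(-128, 127, (M, K), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (K, N), dtype=torch.int8, device="cuda")

# torch._int_mm: int8 x int8 matmul with int32 accumulation/output.
for _ in range(10):            # warmup
    torch._int_mm(a, b)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    torch._int_mm(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

tflops = 2 * M * K * N * iters / elapsed / 1e12
print(f"{tflops:.1f} TFLOPS")
```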
-
Sampling quality tests with fp8 (W8 tensor-wise, A8 channel-wise):
-
Final training benchmarks; they turned out better than I expected:
-
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/fp8_primer.ipynb
-
Thanks for sharing this information. I haven't checked it yet, but Huawei's SINQ might also be interesting: https://github.com/huawei-csl/SINQ/
-
Overview
I'm conducting a systematic study on FP8 quantization methods for DiT models (specifically Qwen-Image) to evaluate the impact on inference quality when applying different quantization approaches and target modules. This research aims to optimize memory usage while maintaining generation quality for both inference and LoRA training.
Experimental Approach
Research Variables
Quantization Methods:
Target Modules:
The code for the experiment is implemented in this branch.
Testing is performed on an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (power limit 250W), Windows 11, PyTorch 2.8.0 + CUDA 12.9.
Test Prompts
Yellow blob LoRA is here.
Baseline Results (bfloat16, current implementation on main branch)
Performance Metrics:
Model initialization times seem to vary quite a bit.
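Since initialization times vary, a small helper like the following (not part of the repository, just an illustration, and the generate() call is hypothetical) can make the timing numbers more comparable by synchronizing the GPU around the measured region:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def cuda_timer(label: str):
    """Measure wall time of a GPU region with explicit synchronization."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

# Usage (hypothetical generate() call):
# with cuda_timer("inference"):
#     images = generate(prompt, steps=20)
```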
Next Steps
Phase 1 (this topic) will focus on rapid screening across all quantization configurations using visual assessment. Promising configurations may then undergo detailed analysis in Phase 2.
Community Input Welcome
I'd appreciate any feedback on the experimental design or suggestions for additional evaluation criteria. Results will be shared progressively as the study advances.
Generation commands for baseline
Please write each command on a single line.