Replies: 32 comments 20 replies
-
No quantization, just a simple cast.
Condition: Simple cast to
The results seem surprisingly decent. I don't know why the model initialization time is shorter with LoRA.
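For context, "simple cast" here just means converting the weights to fp8 with no scaling. A minimal sketch of what that alone costs numerically (illustrative only, not the repository's implementation):

```python
import torch

# A stand-in weight tensor; real DiT linear weights are much larger.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# "Simple cast": straight conversion to fp8, no per-tensor/per-channel scale.
w_fp8 = w.to(torch.float8_e4m3fn)

# Round-trip back to bfloat16 to measure what the cast alone loses.
err = (w_fp8.to(torch.bfloat16) - w).abs()
print(f"mean abs error: {err.mean().item():.5f}, max abs error: {err.max().item():.5f}")
```

Since typical weight values sit well inside the e4m3fn range, the error here comes from the reduced mantissa rather than clipping, which is what the scaling approaches below try to mitigate.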
-
Condition: Simple cast to
The blob's face is clearly closer to what it would look like without LoRA.
-
Summary of Initial Results
Model initialization time doesn't change significantly with quantization. Inference time is slower with simple casting, likely because simple casting performs all calculations in fp8. With the mixed quantization approach (some modules remain in bfloat16), some parts are calculated in the faster bfloat16, and this speedup likely outweighs the dequantization overhead. Given the quality degradation observed with quantization, I'd like to explore per-channel quantization before investigating the impact of reducing the excluded modules.
EDIT: The code includes a step that reduces the influence of outliers by computing the maximum value with a percentile instead of max. However, the simpler max seems to work better than the percentile, both per-tensor and per-channel. You can confirm this by commenting out that code.
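For readers following along, here is a rough sketch of the two ways of picking the maximum value mentioned above (plain max vs. a percentile), for both per-tensor and per-channel scaling. This is illustrative only, not the repository's actual code, and the 99.9 percentile is an arbitrary choice.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def fp8_scale(w, per_channel, percentile=None):
    """Compute a scale so that w / scale fits into float8_e4m3fn.

    per_channel=True computes one scale per output channel (dim 0 of a Linear
    weight); percentile, if given, replaces the plain max to damp outliers.
    """
    absw = w.abs().float()
    if per_channel:
        flat = absw.reshape(absw.shape[0], -1)
        if percentile is None:
            max_val = flat.amax(dim=1)
        else:
            max_val = torch.quantile(flat, percentile / 100.0, dim=1)
        scale = (max_val / FP8_MAX).clamp(min=1e-12).reshape(-1, *([1] * (w.dim() - 1)))
    else:
        if percentile is None:
            max_val = absw.amax()
        else:
            max_val = torch.quantile(absw.flatten(), percentile / 100.0)
        scale = (max_val / FP8_MAX).clamp(min=1e-12)
    return scale

w = torch.randn(3072, 3072)
scale = fp8_scale(w, per_channel=True)        # try percentile=99.9 to compare
w_q = (w / scale).to(torch.float8_e4m3fn)     # quantize
w_dq = w_q.to(torch.float32) * scale          # dequantize
print("mean abs error:", (w - w_dq).abs().mean().item())
```

Swapping in percentile=99.9 and comparing the reconstruction error is a quick way to reproduce the max-vs-percentile observation above.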
-
Condition: float8_e4m3fn + per-channel scaling + attention+MLP only
In the first image, the table is gone, but the pi value is fine.
-
Condition: float8_e5m2 + per-channel scaling + attention+MLP only
In the first image, the neon sign is simplified and the pi value is incorrect. Next, we will investigate the impact of the target modules using float8_e4m3fn per-channel scaling.
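The difference between the two fp8 formats is easy to see from their finfo: e5m2 trades mantissa bits for exponent range, which is consistent with the coarser results here. A quick check:

```python
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(dtype, "max:", fi.max, "smallest normal:", fi.tiny, "eps:", fi.eps)
# float8_e4m3fn has a smaller range but finer precision (3 mantissa bits)
# than float8_e5m2 (2 mantissa bits), so with proper scaling e4m3fn
# usually preserves weights better.
```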
-
Amazing research and work, thank you so much.
-
Quantizing the timestep embedding does not reduce memory usage, so it is best to always exclude it.
Condition: float8_e4m3fn + per-channel scaling + attention+MLP + modulation
In the first image, the table is there, but the pi values are messed up.
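That matches what you would expect from parameter counts: the timestep embedding is tiny compared with the attention, MLP, and modulation Linears, so quantizing it saves almost nothing. A rough way to check this; the module names and sizes below are assumptions, not the real Qwen-Image definitions:

```python
import torch.nn as nn

# Tiny stand-in with roughly DiT-like proportions; real names/sizes are assumptions.
model = nn.ModuleDict({
    "attn_qkv": nn.Linear(3072, 9216),
    "mlp_fc1": nn.Linear(3072, 12288),
    "modulation": nn.Linear(3072, 18432),
    "time_embed": nn.Linear(256, 3072),
})

def fp8_savings_mb(key: str) -> float:
    """MB saved by storing matching Linear weights in fp8 (1 byte) instead of bf16 (2 bytes)."""
    params = sum(
        m.weight.numel()
        for name, m in model.named_modules()
        if isinstance(m, nn.Linear) and key in name
    )
    return params / 1e6  # one byte saved per weight element

for key in ("attn", "mlp", "modulation", "time_embed"):
    print(f"{key}: ~{fp8_savings_mb(key):.1f} MB saved")
```

In a real DiT the attention/MLP/modulation numbers repeat for every block, while the timestep embedding appears once, so its contribution to total memory is negligible.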
-
Amazing!
-
Summary of Results So Far
Proposal for Implementation
-
Excellent news, thank you so much.
-
I think it would be possible to add a test that compares fp8 scales against PyTorch's ao (torchao) to confirm that the max-value implementation in this repository is consistent.
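Something along these lines might work as that consistency test: it compares the output of a torchao-quantized Linear against a manual per-channel max-abs fake-quantization. This is only a sketch, assuming a CUDA device, the torchao Float8WeightOnlyConfig mentioned below (which I believe defaults to float8_e4m3fn), and a tolerance that would still need tuning.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

torch.manual_seed(0)
device = "cuda"  # assumption: a CUDA device is available
linear = nn.Linear(256, 256, bias=False, device=device, dtype=torch.bfloat16)
x = torch.randn(8, 256, device=device, dtype=torch.bfloat16)

# Reference: manual per-channel (per output row) max-abs scaling.
w = linear.weight.detach().float()
scale = w.abs().amax(dim=1, keepdim=True) / torch.finfo(torch.float8_e4m3fn).max
w_dq = (w / scale).to(torch.float8_e4m3fn).float() * scale
ref = x.float() @ w_dq.t()

# torchao weight-only fp8 quantization of the same layer.
quantize_(linear, Float8WeightOnlyConfig())
out = linear(x).float()

# A real test would assert a tolerance here; the right threshold needs tuning.
print("max abs difference:", (out - ref).abs().max().item())
```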
-
Condition: TorchAO Float8WeightOnlyConfig with float8_e4m3fn (per-channel quantization) + attention+MLP + modulation
In the first image, the table is there, but the pi value is incorrect.
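For anyone who wants to reproduce this kind of condition with torchao, the call might look roughly like the following. The module name fragments are hypothetical, the config is assumed to default to float8_e4m3fn, and the filter_fn behavior should be checked against your torchao version.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# Tiny stand-in for a DiT block; the real module names are assumptions.
model = nn.ModuleDict({
    "attn_qkv": nn.Linear(64, 192),
    "mlp_fc1": nn.Linear(64, 256),
    "modulation": nn.Linear(64, 384),
    "time_embed": nn.Linear(64, 64),   # excluded from quantization
}).to(torch.bfloat16)

TARGET_KEYS = ("attn", "mlp", "modulation")

def is_target(module, fqn):
    # filter_fn receives the module and its fully qualified name.
    return isinstance(module, nn.Linear) and any(k in fqn for k in TARGET_KEYS)

quantize_(model, Float8WeightOnlyConfig(), filter_fn=is_target)
```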
-
Condition: TorchAO Int8WeightOnlyConfig (per-channel quantization) + attention+MLP + modulation
The trend is similar to TorchAO fp8, but in the first image the cup shape and the woman's pose are close to the baseline. In the second image, the hand shape is slightly different from the baseline. Inference speed is improved. Based on the results so far, Musubi Tuner's per-channel scaling quantization seems slightly better than TorchAO's.
-
Also, full INT8 training (full fine-tune) is possible if you are interested: https://github.com/Disty0/sdnq/
The RTX 5090 and RTX 4090 have 2x faster INT8 throughput than FP8.
-
I added block-wise quantization. I'll add a sample later, but even when applying quantization to the modulation layers, the results seem to be pretty good. Memory usage is about 20 GB (almost the same as per-channel).
-
Condition: float8_e4m3 + block-wise scaling (block size=64) for all layers + attention+MLP+modulation
In the first image, the pi value is good and the table is there. Inference slowed to 104.79 seconds compared to 102.53 seconds with per-channel quantization, a little over 2% slower. VRAM usage increased to 19.96 GB compared to 19.34 GB with per-channel quantization.
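For reference, a minimal sketch of block-wise fp8 quantization with one scale per 64-wide block along the input dimension; this is illustrative and not the exact implementation in the PR:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
BLOCK = 64

def quantize_blockwise(w):
    """Quantize a (out_features, in_features) weight with one scale per 64-wide block."""
    out_f, in_f = w.shape
    assert in_f % BLOCK == 0
    blocks = w.float().reshape(out_f, in_f // BLOCK, BLOCK)
    scale = blocks.abs().amax(dim=-1, keepdim=True) / FP8_MAX   # (out_f, n_blocks, 1)
    scale = scale.clamp(min=1e-12)
    w_q = (blocks / scale).to(torch.float8_e4m3fn)
    return w_q, scale

def dequantize_blockwise(w_q, scale, dtype=torch.bfloat16):
    blocks = w_q.to(torch.float32) * scale
    return blocks.reshape(blocks.shape[0], -1).to(dtype)

w = torch.randn(3072, 3072, dtype=torch.bfloat16)
w_q, scale = quantize_blockwise(w)
err = (dequantize_blockwise(w_q, scale) - w).abs().mean()
print("mean abs error:", err.item())
# One fp32 scale per 64 fp8 weights adds roughly 6% to the quantized layers'
# footprint, which is in line with the modest VRAM increase over per-channel
# scaling reported above.
```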
-
We've opened #575 to make block-wise scaling the default, and we welcome your feedback.
-
For reference, here are the results of Wan2.2 inference with per-tensor and block-wise quantization. The top row is block-wise, the middle row is bfloat16, and the bottom row is per-tensor. You can see that the difference between block-wise and bfloat16 is smaller than the difference between per-tensor and bfloat16.
-
A large company recently released a pruned version of Qwen-Image, which is only 13.3B parameters, and it looks like the quality is quite good.
-
PR #575 has been merged. Thank you for your cooperation!
-
(Sorry, I accidentally posted this into the PR; I meant to post it here.) Hi @kohya-ss, thank you for this, very interesting! I've been working on quants as well, but more from a training-speed perspective than quality. Not all of the choices you have outlined, such as block-wise/tile-wise and channel-wise model weights, can be done efficiently in training, but some can. Here are some LoRA training tests with tensor-wise model weights and channel-wise activations. I still need to clean up my code somewhat before making it public.
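To make the granularity concrete, here is a simulated (quantize-dequantize) sketch of tensor-wise weight scaling combined with channel-wise (per-row) activation scaling. The real training code would use an fp8 matmul kernel instead of the fake-quant shown here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def fake_quant_fp8(t, scale):
    # Round-trip through fp8 and back, keeping the computation in float32.
    return ((t / scale).to(torch.float8_e4m3fn).to(torch.float32)) * scale

x = torch.randn(16, 1024)          # activations: (tokens, features)
w = torch.randn(4096, 1024)        # Linear weight: (out_features, in_features)

# Tensor-wise scale for the weight: a single scalar.
w_scale = w.abs().amax() / FP8_MAX
# Channel-wise (per-row) scale for the activations.
x_scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX

y_ref = x @ w.t()
y_q = fake_quant_fp8(x, x_scale) @ fake_quant_fp8(w, w_scale).t()
print("relative error:", ((y_q - y_ref).norm() / y_ref.norm()).item())
```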
-
Here are some tests with SDNQ on an Nvidia RTX 5090, RTX 4090, and RTX 3090, an AMD RX 7900 XTX, and an Intel ARC A770, using this script to benchmark: https://github.com/Disty0/sdnq/blob/main/scripts/benchmark_sdnq.py
Some notes: Using row-wise quantization for quality. This gives very good quality but also makes it more challenging to run, as the weights have to be re-quantized for the backward pass.
Naming: SDNQ Float X: using BF16 matmul with X weights (with group_size=32 on quantized weights).
Results:
Nvidia RTX 5090:
Nvidia RTX 4090:
Nvidia RTX 3090:
AMD RX 7900 XTX:
Intel ARC A770:
-
Normally the speed difference between the 4090 and the 5090 is about 25% when I tested in SwarmUI, but that is some insane TFLOPS difference there @Disty0
-
Also, FP16 vs BF16 vs INT8 (quantized) vs UINT8 (quantized) vs FP8 E4 (quantized) after 1,000,000 steps of (x and y are from torch.randn):
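The exact operation benchmarked above isn't shown here, but a comparison along these lines can be reproduced with something like the following, which repeatedly matmuls quantize-dequantized random tensors and averages the error against an fp32 reference (far fewer steps for brevity; UINT8 would additionally need a zero point):

```python
import torch

def fake_quant(t, dtype):
    """Round-trip t through the given dtype and back to float32."""
    if dtype in (torch.float16, torch.bfloat16):
        return t.to(dtype).float()
    if dtype is torch.float8_e4m3fn:
        scale = t.abs().amax() / torch.finfo(dtype).max
        return (t / scale).to(dtype).float() * scale
    if dtype is torch.int8:  # symmetric quantization
        scale = t.abs().amax() / 127
        return torch.clamp((t / scale).round(), -127, 127).to(dtype).float() * scale
    raise ValueError(dtype)

torch.manual_seed(0)
errors = {d: 0.0 for d in (torch.float16, torch.bfloat16, torch.int8, torch.float8_e4m3fn)}
steps = 1000  # the post above used 1,000,000 steps
for _ in range(steps):
    x, y = torch.randn(64, 64), torch.randn(64, 64)
    ref = x @ y
    for d in errors:
        errors[d] += ((fake_quant(x, d) @ fake_quant(y, d)) - ref).norm().item()
for d, e in errors.items():
    print(d, e / steps)
```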
-
PyTorch int_mm is broken on Nvidia. Switching the mm function from torch._int_mm to Triton took the RTX 5090 from 340 TFLOPS to 450 TFLOPS and the RTX 3090 from 110 TFLOPS to 170 TFLOPS. Only Nvidia is affected by this, so the Triton int mm is only used on Nvidia in the benchmarks. Here are the updated benchmarks:
Notes: Using row-wise quantization for quality. This gives very good quality but also makes it more challenging to run, as the weights have to be re-quantized for the backward pass.
Naming: SDNQ Float X: using BF16 matmul with X weights (with group_size=32 on quantized weights). CKPT: quantize the backward inputs before saving them in the forward pass. FP8 TW: use a software row-wise mm, because the PyTorch implementation of row-wise fp8 mm is broken on the RTX 4000 series.
Results:
Nvidia RTX 3090:
Nvidia RTX 4090:
Nvidia RTX 5090:
Nvidia RTX PRO 6000:
Intel ARC A770:
AMD RX 7900 XTX:
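For anyone who wants to check the torch._int_mm numbers on their own GPU, a rough micro-benchmark might look like this (torch._int_mm is a private API, so its behavior and constraints can vary between PyTorch versions; integer throughput is reported as TFLOPS to match the numbers above):

```python
import time
import torch

M = K = N = 4096
a = torch.randint(-128, 127, (M, K), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (K, N), dtype=torch.int8, device="cuda")

# torch._int_mm: int8 x int8 matmul with int32 accumulation/output.
for _ in range(10):            # warmup
    torch._int_mm(a, b)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    torch._int_mm(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

tflops = 2 * M * K * N * iters / elapsed / 1e12
print(f"{tflops:.1f} TFLOPS")
```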
-
Sampling quality tests with fp8 (W8 tensor-wise, A8 channel-wise):
-
Final training benchmarks; they turned out better than I expected:
-
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/fp8_primer.ipynb
-
Thanks for sharing this information. I haven't checked it yet, but Huawei's SINQ might also be interesting: https://github.com/huawei-csl/SINQ/
-
Overview
I'm conducting a systematic study on FP8 quantization methods for DiT models (specifically Qwen-Image) to evaluate the impact on inference quality when applying different quantization approaches and target modules. This research aims to optimize memory usage while maintaining generation quality for both inference and LoRA training.
Experimental Approach
Research Variables
Quantization Methods:
Target Modules:
The code for the experiment is implemented in this branch.
Testing is performed on an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (power limit 250W), Windows 11, PyTorch 2.8.0 + CUDA 12.9.
Test Prompts
Yellow blob LoRA is here.
Baseline Results (bfloat16, current implementation on main branch)
Performance Metrics:
Model initialization times seem to vary quite a bit.
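Since initialization times vary, a small helper like the following (not part of the repository, just an illustration, and the generate() call is hypothetical) can make the timing numbers more comparable by synchronizing the GPU around the measured region:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def cuda_timer(label: str):
    """Measure wall time of a GPU region with explicit synchronization."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

# Usage (hypothetical generate() call):
# with cuda_timer("inference"):
#     images = generate(prompt, steps=20)
```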
Next Steps
Phase 1 (this topic) will focus on rapid screening across all quantization configurations using visual assessment. Promising configurations may then undergo detailed analysis in Phase 2.
Community Input Welcome
I'd appreciate any feedback on the experimental design or suggestions for additional evaluation criteria. Results will be shared progressively as the study advances.
Generation commands for baseline
Please write each command on a single line.