print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Running with Low Precision (FP8/FP4)

The TE-optimized Qwen models support per-layer quantization via two mechanisms: a **config-level** `layer_precision` list that declares which layers use which precision, and **constructor-level** recipe objects (`fp8_recipe`, `fp4_recipe`) that control the quantization behaviour.

### Configuration: `layer_precision`

`NVQwen2Config.layer_precision` (and `NVQwen3Config.layer_precision`) is a list of length `num_hidden_layers` in which each element is `"fp8"`, `"fp4"`, or `None` (BF16 fallback). When set, it controls the `te.autocast` context used for each transformer layer during both initialization and the forward pass.

```python
from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

# All layers in FP8
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
```

If you pass an `fp8_recipe` to the model constructor **without** setting `layer_precision`, it defaults to `["fp8"] * num_hidden_layers` (all layers FP8). You can also mix precisions, for example running most layers in FP8 but keeping the first and last layers in BF16:

```python
layer_precision = [None] + ["fp8"] * 26 + [None]
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=layer_precision,
)
```

### Constructor arguments: `fp8_recipe` and `fp4_recipe`

The model classes (`NVQwen2Model`, `NVQwen2ForCausalLM`, `NVQwen3Model`, `NVQwen3ForCausalLM`) accept `fp8_recipe` and `fp4_recipe` keyword arguments. These are `transformer_engine.common.recipe.Recipe` objects that configure the quantization algorithm (e.g., delayed scaling, block scaling, MXFP8).

```python
import transformer_engine.common.recipe as te_recipe

from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

fp8_recipe = te_recipe.DelayedScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
model = NVQwen3ForCausalLM(config, fp8_recipe=fp8_recipe)
```

For FP4 (NVFP4) quantization, pass an `fp4_recipe` instead and set the corresponding layers to `"fp4"` in `layer_precision`:

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=fp4_recipe)
```

You can also mix FP8 and FP4 layers by providing both recipes and a mixed `layer_precision` list.

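As an illustrative sketch of such a mixed schedule (the layer counts and split point here are arbitrary assumptions, not a recommendation from this repo), the list can be built with plain list arithmetic:

```python
# Hypothetical mixed schedule for a 28-layer model: keep the first and last
# layers in BF16 (None), run the early layers in FP8 and the later layers
# in FP4. Where to place the FP8/FP4 split is an assumption -- tune per model.
num_hidden_layers = 28
layer_precision = (
    [None]          # layer 0 stays in BF16
    + ["fp8"] * 13  # layers 1-13 in FP8
    + ["fp4"] * 13  # layers 14-26 in FP4
    + [None]        # layer 27 stays in BF16
)
assert len(layer_precision) == num_hidden_layers
```

A list like this would then be passed as `layer_precision` in the config, with both `fp8_recipe` and `fp4_recipe` supplied to the model constructor.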
The same pattern applies to Qwen2.5 models using `NVQwen2Config` and `NVQwen2ForCausalLM`.

### Quantized model initialization: `use_quantized_model_init`

When `use_quantized_model_init=True` is set in the config, layers are created inside a `te.quantized_model_init` context. This tells TransformerEngine to initialize weights directly in the target quantized format, avoiding a separate quantization step after initialization. This is primarily useful when loading pre-quantized checkpoints.

```python
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
    use_quantized_model_init=True,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=te_recipe.NVFP4BlockScaling())
```

### Notes

- The `lm_head` always runs in higher precision (`te.autocast(enabled=False)`) regardless of `layer_precision`, to avoid numerical instability in the output logits.
- FP8 requires compute capability 9.0+ (Hopper). MXFP8 requires compute capability 10.0+ (Blackwell).
- If an `fp8_recipe` is provided without `layer_precision`, all layers default to FP8. Providing both `fp8_recipe` and `fp4_recipe` without `layer_precision` raises a `RuntimeError`.
- An FP4 layer **requires** an `fp4_recipe`; omitting it raises a `RuntimeError`.

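The defaulting and validation rules above can be paraphrased as a small pure-Python helper. This is a sketch of the documented behaviour, not the library's actual code; `resolve_layer_precision` is a hypothetical name, and the behaviour when only an `fp4_recipe` is given without `layer_precision` is an assumption (BF16 fallback), since the notes do not specify it:

```python
def resolve_layer_precision(num_hidden_layers, layer_precision=None,
                            fp8_recipe=None, fp4_recipe=None):
    """Sketch of the documented defaulting/validation rules (hypothetical helper)."""
    if layer_precision is None:
        # Both recipes with no explicit per-layer list is ambiguous.
        if fp8_recipe is not None and fp4_recipe is not None:
            raise RuntimeError(
                "layer_precision must be set when both recipes are provided")
        if fp8_recipe is not None:
            # An fp8_recipe alone defaults every layer to FP8.
            layer_precision = ["fp8"] * num_hidden_layers
        else:
            # Assumed: no recipe (or fp4_recipe alone) falls back to BF16.
            layer_precision = [None] * num_hidden_layers
    # Every FP4 layer needs an fp4_recipe.
    if "fp4" in layer_precision and fp4_recipe is None:
        raise RuntimeError("an fp4 layer requires an fp4_recipe")
    return layer_precision
```

For example, `resolve_layer_precision(4, fp8_recipe=some_recipe)` returns `["fp8", "fp8", "fp8", "fp8"]`, while passing `layer_precision=["fp4"] * 4` without an `fp4_recipe` raises.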
## Converting Between Model Formats

This section explains how to convert between Hugging Face Transformers and Transformer Engine (TE) Qwen model formats.