
Commit 5909160

committed: update readmes for new models
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 4986482

File tree

2 files changed: +196 -0 lines changed


bionemo-recipes/models/mixtral/README.md

Lines changed: 97 additions & 0 deletions
@@ -52,6 +52,103 @@ with torch.no_grad():
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Running with Low Precision (FP8/FP4)

The TE-optimized Mixtral model supports per-layer quantization via two mechanisms: a **config-level** `layer_precision` list that declares which layers use which precision, and **constructor-level** recipe objects (`fp8_recipe`, `fp4_recipe`) that control the quantization behaviour.

### Configuration: `layer_precision`

`NVMixtralConfig.layer_precision` is a list of length `num_hidden_layers` in which each element is `"fp8"`, `"fp4"`, or `None` (BF16 fallback). When set, it controls the `te.autocast` context used for each transformer layer during both initialization and the forward pass.

```python
from modeling_mixtral_te import NVMixtralConfig, NVMixtralForCausalLM

# All layers in FP8
config = NVMixtralConfig(
    layer_precision=["fp8"] * 32,
    num_hidden_layers=32,
)
```

If you pass an `fp8_recipe` to the model constructor **without** setting `layer_precision`, it defaults to `["fp8"] * num_hidden_layers` (all layers FP8). You can also mix precisions, for example running most layers in FP8 but keeping the first and last layers in BF16:

```python
layer_precision = [None] + ["fp8"] * 30 + [None]
config = NVMixtralConfig(
    layer_precision=layer_precision,
    num_hidden_layers=32,
)
```

### Constructor arguments: `fp8_recipe` and `fp4_recipe`

The model classes (`NVMixtralModel`, `NVMixtralForCausalLM`) accept `fp8_recipe` and `fp4_recipe` keyword arguments. These are `transformer_engine.common.recipe.Recipe` objects that configure the quantization algorithm (e.g., delayed scaling, block scaling, MXFP8).

```python
import transformer_engine.common.recipe as te_recipe

from modeling_mixtral_te import NVMixtralConfig, NVMixtralForCausalLM

fp8_recipe = te_recipe.DelayedScaling()

config = NVMixtralConfig(
    layer_precision=["fp8"] * 32,
    num_hidden_layers=32,
)
model = NVMixtralForCausalLM(config, fp8_recipe=fp8_recipe)
```

For FP4 (NVFP4) quantization, pass an `fp4_recipe` instead and set the corresponding layers to `"fp4"` in `layer_precision`:

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVMixtralConfig(
    layer_precision=["fp4"] * 32,
    num_hidden_layers=32,
)
model = NVMixtralForCausalLM(config, fp4_recipe=fp4_recipe)
```

You can also mix FP8 and FP4 layers by providing both recipes and a mixed `layer_precision` list.
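As a concrete sketch, one mixed layout splits the 32 layers half and half; the split below is hypothetical, for illustration only, not a tuned recommendation:

```python
# Hypothetical mixed layout: first half of the 32 layers in FP8,
# second half in FP4 (illustrative split, not a tuned configuration).
num_hidden_layers = 32
layer_precision = (
    ["fp8"] * (num_hidden_layers // 2) + ["fp4"] * (num_hidden_layers // 2)
)
assert len(layer_precision) == num_hidden_layers

# With a mixed list, both recipes must be supplied:
# config = NVMixtralConfig(layer_precision=layer_precision, num_hidden_layers=32)
# model = NVMixtralForCausalLM(config, fp8_recipe=..., fp4_recipe=...)
```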
### Quantized model initialization: `use_quantized_model_init`

When `use_quantized_model_init=True` is set in the config, layers are created inside a `te.quantized_model_init` context. This tells TransformerEngine to initialize weights directly in the target quantized format, avoiding a separate quantization step after initialization. This is primarily useful when loading pre-quantized checkpoints.

```python
config = NVMixtralConfig(
    layer_precision=["fp4"] * 32,
    num_hidden_layers=32,
    use_quantized_model_init=True,
)
model = NVMixtralForCausalLM(config, fp4_recipe=te_recipe.NVFP4BlockScaling())
```
### Notes

- The `lm_head` always runs in higher precision (`te.autocast(enabled=False)`) regardless of `layer_precision`, to avoid numerical instability in the output logits.
- The MoE router gate (`model.layers.*.mlp.gate`) always runs in BF16 regardless of `layer_precision`, to maintain stable routing decisions.
- FP8 requires compute capability 9.0+ (Hopper). MXFP8 requires compute capability 10.0+ (Blackwell).
- If an `fp8_recipe` is provided without `layer_precision`, all layers default to FP8. Providing both `fp8_recipe` and `fp4_recipe` without `layer_precision` raises a `RuntimeError`.
- An FP4 layer **requires** an `fp4_recipe`; omitting it raises a `RuntimeError`.
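The recipe rules in the last two notes can be summarized with a small helper. This is a hypothetical sketch of the documented behaviour, not code from the library; the real checks live inside the model classes:

```python
# Hypothetical helper mirroring the documented recipe/precision rules;
# the function name and signature are illustrative only.
def resolve_layer_precision(num_hidden_layers, layer_precision=None,
                            fp8_recipe=None, fp4_recipe=None):
    if layer_precision is None:
        if fp8_recipe is not None and fp4_recipe is not None:
            # Ambiguous: which recipe applies to which layer?
            raise RuntimeError(
                "Set layer_precision explicitly when passing both recipes."
            )
        if fp8_recipe is not None:
            # An fp8_recipe alone defaults every layer to FP8.
            return ["fp8"] * num_hidden_layers
        return [None] * num_hidden_layers  # BF16 everywhere
    if "fp4" in layer_precision and fp4_recipe is None:
        # An FP4 layer requires an fp4_recipe.
        raise RuntimeError("FP4 layers require an fp4_recipe.")
    return layer_precision
```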
## Converting Between Model Formats

This section explains how to convert between Hugging Face Transformers and Transformer Engine (TE) Mixtral model formats.

bionemo-recipes/models/qwen/README.md

Lines changed: 99 additions & 0 deletions
@@ -81,6 +81,105 @@ with torch.no_grad():
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Running with Low Precision (FP8/FP4)

The TE-optimized Qwen models support per-layer quantization via two mechanisms: a **config-level** `layer_precision` list that declares which layers use which precision, and **constructor-level** recipe objects (`fp8_recipe`, `fp4_recipe`) that control the quantization behaviour.

### Configuration: `layer_precision`

`NVQwen2Config.layer_precision` (and `NVQwen3Config.layer_precision`) is a list of length `num_hidden_layers` in which each element is `"fp8"`, `"fp4"`, or `None` (BF16 fallback). When set, it controls the `te.autocast` context used for each transformer layer during both initialization and the forward pass.

```python
from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

# All layers in FP8
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
```

If you pass an `fp8_recipe` to the model constructor **without** setting `layer_precision`, it defaults to `["fp8"] * num_hidden_layers` (all layers FP8). You can also mix precisions, for example running most layers in FP8 but keeping the first and last layers in BF16:

```python
layer_precision = [None] + ["fp8"] * 26 + [None]
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=layer_precision,
)
```

### Constructor arguments: `fp8_recipe` and `fp4_recipe`

The model classes (`NVQwen2Model`, `NVQwen2ForCausalLM`, `NVQwen3Model`, `NVQwen3ForCausalLM`) accept `fp8_recipe` and `fp4_recipe` keyword arguments. These are `transformer_engine.common.recipe.Recipe` objects that configure the quantization algorithm (e.g., delayed scaling, block scaling, MXFP8).

```python
import transformer_engine.common.recipe as te_recipe

from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

fp8_recipe = te_recipe.DelayedScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
model = NVQwen3ForCausalLM(config, fp8_recipe=fp8_recipe)
```

For FP4 (NVFP4) quantization, pass an `fp4_recipe` instead and set the corresponding layers to `"fp4"` in `layer_precision`:

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=fp4_recipe)
```

You can also mix FP8 and FP4 layers by providing both recipes and a mixed `layer_precision` list.

The same pattern applies to Qwen2.5 models using `NVQwen2Config` and `NVQwen2ForCausalLM`.
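For example, a mixed list for the 28-layer Qwen3-0.6B could keep the outer layers in BF16, run the adjacent layers in FP8, and put the middle of the stack in FP4. The split below is hypothetical, for illustration only, not a tuned recommendation:

```python
# Hypothetical mixed layout for a 28-layer model: BF16 outer layers,
# FP8 shoulders, FP4 middle (illustrative split, not a tuned choice).
layer_precision = [None] + ["fp8"] * 4 + ["fp4"] * 18 + ["fp8"] * 4 + [None]
assert len(layer_precision) == 28

# With a mixed list, both recipes must be supplied:
# config = NVQwen3Config.from_pretrained("Qwen/Qwen3-0.6B", layer_precision=layer_precision)
# model = NVQwen3ForCausalLM(config, fp8_recipe=..., fp4_recipe=...)
```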
### Quantized model initialization: `use_quantized_model_init`

When `use_quantized_model_init=True` is set in the config, layers are created inside a `te.quantized_model_init` context. This tells TransformerEngine to initialize weights directly in the target quantized format, avoiding a separate quantization step after initialization. This is primarily useful when loading pre-quantized checkpoints.

```python
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
    use_quantized_model_init=True,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=te_recipe.NVFP4BlockScaling())
```
### Notes

- The `lm_head` always runs in higher precision (`te.autocast(enabled=False)`) regardless of `layer_precision`, to avoid numerical instability in the output logits.
- FP8 requires compute capability 9.0+ (Hopper). MXFP8 requires compute capability 10.0+ (Blackwell).
- If an `fp8_recipe` is provided without `layer_precision`, all layers default to FP8. Providing both `fp8_recipe` and `fp4_recipe` without `layer_precision` raises a `RuntimeError`.
- An FP4 layer **requires** an `fp4_recipe`; omitting it raises a `RuntimeError`.
## Converting Between Model Formats

This section explains how to convert between Hugging Face Transformers and Transformer Engine (TE) Qwen model formats.
