
Commit 5909160

committed: update readmes for new models
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 4986482

File tree

2 files changed: +196 -0 lines changed


bionemo-recipes/models/mixtral/README.md

Lines changed: 97 additions & 0 deletions
@@ -52,6 +52,103 @@ with torch.no_grad():
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Running with Low Precision (FP8/FP4)

The TE-optimized Mixtral model supports per-layer quantization via two mechanisms: a **config-level** `layer_precision` list that declares which layers use which precision, and **constructor-level** recipe objects (`fp8_recipe`, `fp4_recipe`) that control the quantization behaviour.

### Configuration: `layer_precision`

`NVMixtralConfig.layer_precision` is a list of length `num_hidden_layers` in which each element is `"fp8"`, `"fp4"`, or `None` (BF16 fallback). When set, it controls the `te.autocast` context used for each transformer layer during both initialization and the forward pass.

```python
from modeling_mixtral_te import NVMixtralConfig, NVMixtralForCausalLM

# All layers in FP8
config = NVMixtralConfig(
    layer_precision=["fp8"] * 32,
    num_hidden_layers=32,
)
```

If you pass an `fp8_recipe` to the model constructor **without** setting `layer_precision`, it defaults to `["fp8"] * num_hidden_layers` (all layers FP8). You can also mix precisions, for example running most layers in FP8 but keeping the first and last layers in BF16:

```python
layer_precision = [None] + ["fp8"] * 30 + [None]
config = NVMixtralConfig(
    layer_precision=layer_precision,
    num_hidden_layers=32,
)
```

### Constructor arguments: `fp8_recipe` and `fp4_recipe`

The model classes (`NVMixtralModel`, `NVMixtralForCausalLM`) accept `fp8_recipe` and `fp4_recipe` keyword arguments. These are `transformer_engine.common.recipe.Recipe` objects that configure the quantization algorithm (e.g., delayed scaling, block scaling, MXFP8).

```python
import transformer_engine.common.recipe as te_recipe

from modeling_mixtral_te import NVMixtralConfig, NVMixtralForCausalLM

fp8_recipe = te_recipe.DelayedScaling()

config = NVMixtralConfig(
    layer_precision=["fp8"] * 32,
    num_hidden_layers=32,
)
model = NVMixtralForCausalLM(config, fp8_recipe=fp8_recipe)
```

For FP4 (NVFP4) quantization, pass an `fp4_recipe` instead and set the corresponding layers to `"fp4"` in `layer_precision`:

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVMixtralConfig(
    layer_precision=["fp4"] * 32,
    num_hidden_layers=32,
)
model = NVMixtralForCausalLM(config, fp4_recipe=fp4_recipe)
```

You can also mix FP8 and FP4 layers by providing both recipes and a mixed `layer_precision` list.
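As a concrete sketch, one mixed layout splits the 32 layers half and half; the split below is hypothetical, for illustration only, not a tuned recommendation:

```python
# Hypothetical mixed layout: first half of the 32 layers in FP8,
# second half in FP4 (illustrative split, not a tuned configuration).
num_hidden_layers = 32
layer_precision = (
    ["fp8"] * (num_hidden_layers // 2) + ["fp4"] * (num_hidden_layers // 2)
)
assert len(layer_precision) == num_hidden_layers

# With a mixed list, both recipes must be supplied:
# config = NVMixtralConfig(layer_precision=layer_precision, num_hidden_layers=32)
# model = NVMixtralForCausalLM(config, fp8_recipe=..., fp4_recipe=...)
```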
### Quantized model initialization: `use_quantized_model_init`

When `use_quantized_model_init=True` is set in the config, layers are created inside a `te.quantized_model_init` context. This tells TransformerEngine to initialize weights directly in the target quantized format, avoiding a separate quantization step after initialization. This is primarily useful when loading pre-quantized checkpoints.

```python
config = NVMixtralConfig(
    layer_precision=["fp4"] * 32,
    num_hidden_layers=32,
    use_quantized_model_init=True,
)
model = NVMixtralForCausalLM(config, fp4_recipe=te_recipe.NVFP4BlockScaling())
```
### Notes

- The `lm_head` always runs in higher precision (`te.autocast(enabled=False)`) regardless of `layer_precision`, to avoid numerical instability in the output logits.
- The MoE router gate (`model.layers.*.mlp.gate`) always runs in BF16 regardless of `layer_precision`, to maintain stable routing decisions.
- FP8 requires compute capability 9.0+ (Hopper). MXFP8 requires compute capability 10.0+ (Blackwell).
- If an `fp8_recipe` is provided without `layer_precision`, all layers default to FP8. Providing both `fp8_recipe` and `fp4_recipe` without `layer_precision` raises a `RuntimeError`.
- An FP4 layer **requires** an `fp4_recipe`; omitting it raises a `RuntimeError`.
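The recipe rules in the last two notes can be summarized with a small helper. This is a hypothetical sketch of the documented behaviour, not code from the library; the real checks live inside the model classes:

```python
# Hypothetical helper mirroring the documented recipe/precision rules;
# the function name and signature are illustrative only.
def resolve_layer_precision(num_hidden_layers, layer_precision=None,
                            fp8_recipe=None, fp4_recipe=None):
    if layer_precision is None:
        if fp8_recipe is not None and fp4_recipe is not None:
            # Ambiguous: which recipe applies to which layer?
            raise RuntimeError(
                "Set layer_precision explicitly when passing both recipes."
            )
        if fp8_recipe is not None:
            # An fp8_recipe alone defaults every layer to FP8.
            return ["fp8"] * num_hidden_layers
        return [None] * num_hidden_layers  # BF16 everywhere
    if "fp4" in layer_precision and fp4_recipe is None:
        # An FP4 layer requires an fp4_recipe.
        raise RuntimeError("FP4 layers require an fp4_recipe.")
    return layer_precision
```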
## Converting Between Model Formats

This section explains how to convert between Hugging Face Transformers and Transformer Engine (TE) Mixtral model formats.

bionemo-recipes/models/qwen/README.md

Lines changed: 99 additions & 0 deletions
@@ -81,6 +81,105 @@ with torch.no_grad():
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Running with Low Precision (FP8/FP4)

The TE-optimized Qwen models support per-layer quantization via two mechanisms: a **config-level** `layer_precision` list that declares which layers use which precision, and **constructor-level** recipe objects (`fp8_recipe`, `fp4_recipe`) that control the quantization behaviour.

### Configuration: `layer_precision`

`NVQwen2Config.layer_precision` (and `NVQwen3Config.layer_precision`) is a list of length `num_hidden_layers` in which each element is `"fp8"`, `"fp4"`, or `None` (BF16 fallback). When set, it controls the `te.autocast` context used for each transformer layer during both initialization and the forward pass.

```python
from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

# All layers in FP8
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
```

If you pass an `fp8_recipe` to the model constructor **without** setting `layer_precision`, it defaults to `["fp8"] * num_hidden_layers` (all layers FP8). You can also mix precisions, for example running most layers in FP8 but keeping the first and last layers in BF16:

```python
layer_precision = [None] + ["fp8"] * 26 + [None]
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=layer_precision,
)
```

### Constructor arguments: `fp8_recipe` and `fp4_recipe`

The model classes (`NVQwen2Model`, `NVQwen2ForCausalLM`, `NVQwen3Model`, `NVQwen3ForCausalLM`) accept `fp8_recipe` and `fp4_recipe` keyword arguments. These are `transformer_engine.common.recipe.Recipe` objects that configure the quantization algorithm (e.g., delayed scaling, block scaling, MXFP8).

```python
import transformer_engine.common.recipe as te_recipe

from modeling_qwen3_te import NVQwen3Config, NVQwen3ForCausalLM

fp8_recipe = te_recipe.DelayedScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp8"] * 28,
)
model = NVQwen3ForCausalLM(config, fp8_recipe=fp8_recipe)
```

For FP4 (NVFP4) quantization, pass an `fp4_recipe` instead and set the corresponding layers to `"fp4"` in `layer_precision`:

```python
fp4_recipe = te_recipe.NVFP4BlockScaling()

config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=fp4_recipe)
```

You can also mix FP8 and FP4 layers by providing both recipes and a mixed `layer_precision` list.

The same pattern applies to Qwen2.5 models using `NVQwen2Config` and `NVQwen2ForCausalLM`.
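For example, a mixed list for the 28-layer Qwen3-0.6B could keep the outer layers in BF16, run the adjacent layers in FP8, and put the middle of the stack in FP4. The split below is hypothetical, for illustration only, not a tuned recommendation:

```python
# Hypothetical mixed layout for a 28-layer model: BF16 outer layers,
# FP8 shoulders, FP4 middle (illustrative split, not a tuned choice).
layer_precision = [None] + ["fp8"] * 4 + ["fp4"] * 18 + ["fp8"] * 4 + [None]
assert len(layer_precision) == 28

# With a mixed list, both recipes must be supplied:
# config = NVQwen3Config.from_pretrained("Qwen/Qwen3-0.6B", layer_precision=layer_precision)
# model = NVQwen3ForCausalLM(config, fp8_recipe=..., fp4_recipe=...)
```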
### Quantized model initialization: `use_quantized_model_init`

When `use_quantized_model_init=True` is set in the config, layers are created inside a `te.quantized_model_init` context. This tells TransformerEngine to initialize weights directly in the target quantized format, avoiding a separate quantization step after initialization. This is primarily useful when loading pre-quantized checkpoints.

```python
config = NVQwen3Config.from_pretrained(
    "Qwen/Qwen3-0.6B",
    layer_precision=["fp4"] * 28,
    use_quantized_model_init=True,
)
model = NVQwen3ForCausalLM(config, fp4_recipe=te_recipe.NVFP4BlockScaling())
```
### Notes

- The `lm_head` always runs in higher precision (`te.autocast(enabled=False)`) regardless of `layer_precision`, to avoid numerical instability in the output logits.
- FP8 requires compute capability 9.0+ (Hopper). MXFP8 requires compute capability 10.0+ (Blackwell).
- If an `fp8_recipe` is provided without `layer_precision`, all layers default to FP8. Providing both `fp8_recipe` and `fp4_recipe` without `layer_precision` raises a `RuntimeError`.
- An FP4 layer **requires** an `fp4_recipe`; omitting it raises a `RuntimeError`.
## Converting Between Model Formats

This section explains how to convert between Hugging Face Transformers and Transformer Engine (TE) Qwen model formats.
