<h2>TODO<a class="headerlink" href="#todo" title="Link to this heading">#</a></h2>
<ul class="simple">
<li><p>FP8 weights (<code class="docutils literal notranslate"><span class="pre">--fp8-param-gather</span></code>) can provide memory savings, but FP8 weights currently must be used with TransformerEngine’s FusedAdam, which conflicts with the Adam CPU-offload technique commonly used in Megatron-LM.</p></li>
</ul>
<p>The slime team will continue to collaborate with the NVIDIA team to contribute more complete FP8 training infrastructure to the community.</p>
</section>
<hr class="docutils" />
<section id="int4-training-examples">
<h2>INT4 Training Examples<a class="headerlink" href="#int4-training-examples" title="Link to this heading">#</a></h2>
<p>This guide provides examples of INT4 STE (Straight-Through Estimator) training and INT4 inference. Using INT4 inference significantly improves throughput, accelerating the training pipeline during the rollout generation phase.</p>
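<p>The fake-quantization step behind STE training can be sketched in plain Python. This is a minimal, framework-free illustration assuming symmetric per-group scales over the INT4 range; the helper name is hypothetical and this is not the implementation slime uses:</p>

```python
def fake_quant_int4(weights, group_size=128):
    """Group-wise symmetric INT4 fake quantization (illustrative).

    Splits the weights into groups of `group_size`, scales each group so
    its max magnitude maps to the INT4 range, rounds to integers, then
    dequantizes back to floats. In STE training the forward pass sees
    these fake-quantized weights while gradients flow through unchanged.
    """
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group) or 1.0  # avoid div-by-zero on all-zero groups
        scale = max_abs / 7.0                        # symmetric 4-bit range: [-8, 7]
        for w in group:
            q = max(-8, min(7, round(w / scale)))    # quantize + clamp to INT4
            out.append(q * scale)                    # dequantize
    return out
```

A larger <code class="docutils literal notranslate"><span class="pre">group_size</span></code> shares one scale across more weights (less metadata, coarser quantization), which is why it must match between the converted checkpoint and training.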
<section id="id1">
<h3>Files<a class="headerlink" href="#id1" title="Link to this heading">#</a></h3>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">run-moonlight-16B-A3B-int4.sh</span></code>: Launch script for <strong>Moonlight-16B-A3B</strong> (INT4) on 4x H200 GPUs.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">run-qwen3-30B-A3B-int4.sh</span></code>: Launch script for <strong>Qwen3-30B-A3B</strong> (INT4) on 8x H200 GPUs.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">run-qwen3-235B-A22B-int4.sh</span></code>: Launch script for <strong>Qwen3-235B-A22B</strong> (INT4) on 64x H200 GPUs.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">run-kimi-k2-Thinking-int4.sh</span></code>: Launch script for <strong>Kimi-k2-Thinking</strong> (INT4) on 256x H200 GPUs.</p></li>
</ul>
</section>
<section id="id2">
<h3>Quick Start<a class="headerlink" href="#id2" title="Link to this heading">#</a></h3>
<section id="configure-training-arguments">
<h4>1. Configure Training Arguments<a class="headerlink" href="#configure-training-arguments" title="Link to this heading">#</a></h4>
<p>Ensure your training script is properly configured. For training tasks, you must add the following flag to your launch arguments:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>--int4-params-rollout
</pre></div></div>
<h4>2. Convert HuggingFace Weights to INT4<a class="headerlink" href="#convert-huggingface-weights-to-int4" title="Link to this heading">#</a></h4>
<p>First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace. Next, use the <code class="docutils literal notranslate"><span class="pre">tools/convert_hf_to_hf_int4.py</span></code> script to convert BF16 weights to INT4 format. Ensure that the <code class="docutils literal notranslate"><span class="pre">--hf-checkpoint</span></code> parameter points to a directory whose <code class="docutils literal notranslate"><span class="pre">config.json</span></code> contains the correct <code class="docutils literal notranslate"><span class="pre">quantization_config</span></code>; slime will then automatically apply INT4 quantization during weight updates.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python tools/convert_hf_to_hf_int4.py \
    --model_id /path/to/your/original/models \
    --output_dir /path/to/your/save/models \
    --local_data_path /path/to/your/wikitext
</pre></div></div>
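<p>Before launching training, it is worth a quick sanity check that the converted checkpoint’s <code class="docutils literal notranslate"><span class="pre">config.json</span></code> really declares the quantization you expect. A minimal sketch; the key names (<code class="docutils literal notranslate"><span class="pre">quant_method</span></code>, <code class="docutils literal notranslate"><span class="pre">group_size</span></code>) are illustrative assumptions and may differ from what the converter actually writes:</p>

```python
def check_quantization_config(config, expected_group_size=128):
    """Sanity-check that a converted checkpoint's config declares INT4
    quantization with the expected group size (illustrative key names)."""
    qcfg = config.get("quantization_config")
    if qcfg is None:
        raise ValueError("no quantization_config found; INT4 weight "
                         "updates would not be applied")
    if qcfg.get("group_size") != expected_group_size:
        raise ValueError("group_size %r does not match expected %r"
                         % (qcfg.get("group_size"), expected_group_size))
    return qcfg

# In practice the config would come from the converter's output directory,
# e.g. json.load(open("/path/to/your/save/models/config.json"))
cfg = {"quantization_config": {"quant_method": "int4", "group_size": 128}}
assert check_quantization_config(cfg)["quant_method"] == "int4"
```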
<h4>3. Start INT4 Training<a class="headerlink" href="#start-int4-training" title="Link to this heading">#</a></h4>
<p>Configure the following environment variables for your quantization settings.</p>
<p><strong>Environment Variables:</strong></p>
<ul class="simple">
<li><p><strong><code class="docutils literal notranslate"><span class="pre">OPEN_TRAINING_INT4_FAKE_QAT_FLAG</span></code></strong>: Enables fake-quantization operations for INT4 training.</p></li>
<li><p><strong><code class="docutils literal notranslate"><span class="pre">OPEN_TRAINING_INT4_GROUP_SIZE</span></code></strong>: Specifies the block size (group size) for model quantization.</p>
<ul>
<li><p>Set to <strong>128</strong> for <code class="docutils literal notranslate"><span class="pre">moonlight-16B-A3B</span></code>, <code class="docutils literal notranslate"><span class="pre">qwen3-30B-A3B</span></code>, and <code class="docutils literal notranslate"><span class="pre">qwen3-235B-A22B-int4</span></code>.</p></li>
<li><p>Set to <strong>32</strong> for <code class="docutils literal notranslate"><span class="pre">kimi-k2-Thinking-int4</span></code>.</p></li>
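<li><p>For example, a launch script for <code class="docutils literal notranslate"><span class="pre">qwen3-235B-A22B-int4</span></code> might export the variables as shown below. This is an illustrative sketch: the value <code class="docutils literal notranslate"><span class="pre">1</span></code> used to enable the flag is an assumption, so check your launch scripts for the exact convention:</p>

```shell
# Enable fake-quantization (QAT) ops in the INT4 training pass
# (assumed: setting the flag to 1 enables it)
export OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1
# Group size must match the converted checkpoint: 128 for qwen3-235B-A22B-int4
export OPEN_TRAINING_INT4_GROUP_SIZE=128
echo "fake QAT=${OPEN_TRAINING_INT4_FAKE_QAT_FLAG} group=${OPEN_TRAINING_INT4_GROUP_SIZE}"
```
</li>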