Update README and index.html: add logo, RL take-home, table wrappers, footer

EvergreenTree · EvergreenTree · commit 564895482aae · 2026-05-12T08:57:18.000Z
diff --git a/README.md b/README.md
@@ -1,5 +1,9 @@
 ***Please refer to the [HTML version](https://evergreentree.github.io/speech2text/) for a better reading experience.***
 
+<p align="center">
+  <img src="asr_bench/figures/owl.png" width="140" alt="project logo" />
+</p>
+
 # Fine-Tuning Efficient Chinese Speech Models beyond the Pareto Frontier
 
 This repo asks a narrow question under a strict compute budget: **when does
@@ -19,6 +23,10 @@ Take-homes before details:
 4. **Qwen3-ASR and Granite were added as counterpoints, not just extra rows.**
    They show how much the conclusion depends on backbone quality, pre-training
    mix, and the evaluation slice.
+5. **RL (MWER/GSPO) fixes the SFT regression.** On Qwen3-ASR-0.6B, GSPO brings
+   French WER *below* baseline (6.13 % vs 6.35 %) and MWER achieves the best
+   Chinese CER at that scale (7.62 % vs 10.41 % baseline). A half-epoch RL stage
+   recovers what SFT lost and then improves on the baseline.
 
 This repo bundles four tracks under one roof: the original Whisper
 fine-tuning work, the `asr_bench` baseline benchmark, the Qwen3-ASR pilot, and
diff --git a/asr_bench/figures/owl.png b/asr_bench/figures/owl.png
diff --git a/index.html b/index.html
@@ -145,6 +145,10 @@
     .delta-pos { color: #c0392b; font-weight: 600; }
     .delta-neg { color: #1a7a4a; font-weight: 600; }
 
+    /* RL results table: best cell = green bg, worst = red bg */
+    td.good { background: #d1fae5; color: #065f46; font-weight: 600; }
+    td.bad  { background: #fee2e2; color: #991b1b; }
+
     /* ── Callout / take-home boxes ────────────────────────────── */
     .callout {
       display: flex;
@@ -244,7 +248,15 @@
 
 <!-- ── Side nav ──────────────────────────────────────────────── -->
 <nav>
+  <div style="text-align:center; padding: 0 1.4rem 1rem;">
+    <img src="asr_bench/figures/owl.png" alt="project logo" style="width:88px; border-radius:8px; opacity:0.92;" />
+  </div>
   <div class="nav-title">Contents</div>
+  <div style="padding: 0 1.4rem 0.8rem; font-size:0.78rem;">
+    <a href="https://github.com/EvergreenTree/speech2text" style="color:#6c8cff; text-decoration:none;">
+      &#128279; EvergreenTree/speech2text
+    </a>
+  </div>
   <ul>
     <li><a href="#overview">Overview</a></li>
     <li><a href="#tldr">TL;DR</a></li>
@@ -296,6 +308,7 @@ <h3>Take-homes</h3>
       <li><strong>Model size and data overlap matter more than adapter choice alone.</strong> Tiny still has headroom on French; small/medium/turbo mostly do not.</li>
       <li><strong>The right first step under a small budget is a baseline sweep, not blind SFT.</strong> Gap-to-ceiling is the main diagnostic signal in this repo.</li>
       <li><strong>Qwen3-ASR and Granite were added as counterpoints.</strong> They show how much the conclusion depends on backbone quality, pre-training mix, and the evaluation slice.</li>
+      <li><strong>RL (MWER/GSPO) fixes the SFT regression.</strong> On Qwen3-ASR-0.6B, GSPO brings French WER <em>below</em> baseline (6.13 % vs 6.35 %) and MWER achieves the best Chinese CER at that scale (7.62 % vs 10.41 % baseline). A half-epoch RL stage recovers what SFT lost and then improves on the baseline.</li>
     </ol>
 
     <p>This repo bundles four tracks under one roof: the original Whisper fine-tuning work, the <code>asr_bench</code> baseline benchmark, the Qwen3-ASR pilot, and the Granite Speech pilot. The earlier zh-CN Whisper run lives intact under <code>archive_zh/</code>; the present fr-FR Whisper run is in <code>outputs/</code>. Tiny was added later to control for the <em>gap-to-ceiling</em> effect discussed in §3.3.</p>
@@ -788,7 +801,7 @@ <h3>Fit / runtime findings on 1× NVIDIA L4 (24 GB)</h3>
 
   <!-- RL fine-tuning -->
   <section id="qwen-rl">
-    <h3>RL fine-tuning: MWER &amp; GSPO</h3>
+    <h2>RL fine-tuning: MWER &amp; GSPO</h2>
     <p>Two reinforcement-learning algorithms were added on top of the SFT checkpoint to push WER/CER further without extra labelled data.</p>
 
     <h4>MWER (Minimum Word Error Rate)</h4>
@@ -818,6 +831,7 @@ <h4>GSPO (Group Sequence Policy Optimisation)</h4>
     <code>&lt;lang&gt;…&lt;/lang&gt;</code> wrapper.</p>
 
     <h4>Hyperparameters</h4>
+    <div class="table-wrap">
     <table>
       <thead><tr><th></th><th>MWER</th><th>GSPO</th></tr></thead>
       <tbody>
@@ -835,9 +849,11 @@ <h4>Hyperparameters</h4>
         <tr><td>Clip range</td><td>—</td><td>ε_lo=3e-4, ε_hi=4e-4</td></tr>
       </tbody>
     </table>
+    </div>
 
     <h4>Results (first 100 dev examples)</h4>
     <p>WER for French, CER for Chinese. ↓ is better. Best result per row is <strong>bold</strong>.</p>
+    <div class="table-wrap">
     <table>
       <thead>
         <tr><th>Model</th><th>Language</th><th>Metric</th><th>Baseline</th><th>SFT full-FT</th><th>MWER (RL)</th><th>GSPO (RL)</th></tr>
@@ -869,6 +885,7 @@ <h4>Results (first 100 dev examples)</h4>
         </tr>
       </tbody>
     </table>
+    </div>
     <p>The 0.6B RL rows are completed 0.5-epoch runs from 2026-05-11. Dashes mean the 1.7B RL variants have not been run yet.</p>
 
     <h4>RL implementation note</h4>
@@ -958,6 +975,11 @@ <h2>9. Limitations and why the French plot is last</h2>
     </div>
   </section>
 
+  <footer style="margin-top:4rem; padding-top:1.5rem; border-top:1px solid #e5e7ef; font-size:0.78rem; color:#9ba3b8;">
+    Last updated 2026-05-12 &mdash;
+    <a href="https://github.com/EvergreenTree/speech2text" style="color:#3b6cff;">&#128279; GitHub — EvergreenTree/speech2text</a>
+  </footer>
+
 </main>
 </div>