
Commit 5648954

fix README
1 parent c8305fc commit 5648954

3 files changed

Lines changed: 31 additions & 1 deletion

README.md

Lines changed: 8 additions & 0 deletions
@@ -1,5 +1,9 @@
 ***Please refer to the [HTML version](https://evergreentree.github.io/speech2text/) for a better reading experience.***

+<p align="center">
+<img src="asr_bench/figures/owl.png" width="140" alt="project logo" />
+</p>
+
 # Fine-Tuning Efficient Chinese Speech Models beyond the Pareto Frontier

 This repo asks a narrow question under a strict compute budget: **when does
@@ -19,6 +23,10 @@ Take-homes before details:
 4. **Qwen3-ASR and Granite were added as counterpoints, not just extra rows.**
 They show how much the conclusion depends on backbone quality, pre-training
 mix, and the evaluation slice.
+5. **RL (MWER/GSPO) fixes the SFT regression.** On Qwen3-ASR-0.6B, GSPO brings
+French WER *below* baseline (6.13 % vs 6.35 %) and MWER achieves the best
+Chinese CER at that scale (7.62 % vs 10.41 % baseline). An RL stage at half
+an epoch recovers what SFT lost — and then some.

 This repo bundles four tracks under one roof: the original Whisper
 fine-tuning work, the `asr_bench` baseline benchmark, the Qwen3-ASR pilot, and

asr_bench/figures/owl.png

1 MB

index.html

Lines changed: 23 additions & 1 deletion
@@ -145,6 +145,10 @@
 .delta-pos { color: #c0392b; font-weight: 600; }
 .delta-neg { color: #1a7a4a; font-weight: 600; }

+/* RL results table: best cell = green bg, worst = red bg */
+td.good { background: #d1fae5; color: #065f46; font-weight: 600; }
+td.bad { background: #fee2e2; color: #991b1b; }
+
 /* ── Callout / take-home boxes ────────────────────────────── */
 .callout {
 display: flex;
@@ -244,7 +248,15 @@

 <!-- ── Side nav ──────────────────────────────────────────────── -->
 <nav>
+<div style="text-align:center; padding: 0 1.4rem 1rem;">
+<img src="asr_bench/figures/owl.png" alt="logo" style="width:88px; border-radius:8px; opacity:0.92;" />
+</div>
 <div class="nav-title">Contents</div>
+<div style="padding: 0 1.4rem 0.8rem; font-size:0.78rem;">
+<a href="https://github.com/EvergreenTree/speech2text" style="color:#6c8cff; text-decoration:none;">
+&#128279; EvergreenTree/speech2text
+</a>
+</div>
 <ul>
 <li><a href="#overview">Overview</a></li>
 <li><a href="#tldr">TL;DR</a></li>
@@ -296,6 +308,7 @@ <h3>Take-homes</h3>
 <li><strong>Model size and data overlap matter more than adapter choice alone.</strong> Tiny still has headroom on French; small/medium/turbo mostly do not.</li>
 <li><strong>The right first step under a small budget is a baseline sweep, not blind SFT.</strong> Gap-to-ceiling is the main diagnostic signal in this repo.</li>
 <li><strong>Qwen3-ASR and Granite were added as counterpoints.</strong> They show how much the conclusion depends on backbone quality, pre-training mix, and the evaluation slice.</li>
+<li><strong>RL (MWER/GSPO) fixes the SFT regression.</strong> On Qwen3-ASR-0.6B, GSPO brings French WER <em>below</em> baseline (6.13 % vs 6.35 %) and MWER achieves the best Chinese CER at that scale (7.62 % vs 10.41 % baseline). An RL stage at half an epoch recovers what SFT lost — and then some.</li>
 </ol>

 <p>This repo bundles four tracks under one roof: the original Whisper fine-tuning work, the <code>asr_bench</code> baseline benchmark, the Qwen3-ASR pilot, and the Granite Speech pilot. The earlier zh-CN Whisper run lives intact under <code>archive_zh/</code>; the present fr-FR Whisper run is in <code>outputs/</code>. Tiny was added later to control for the <em>gap-to-ceiling</em> effect discussed in §3.3.</p>
@@ -788,7 +801,7 @@ <h3>Fit / runtime findings on 1× NVIDIA L4 (24 GB)</h3>

 <!-- RL fine-tuning -->
 <section id="qwen-rl">
-<h3>RL fine-tuning: MWER &amp; GSPO</h3>
+<h2>RL fine-tuning: MWER &amp; GSPO</h2>
 <p>Two reinforcement-learning algorithms were added on top of the SFT checkpoint to push WER/CER further without extra labelled data.</p>

 <h4>MWER (Minimum Word Error Rate)</h4>
@@ -818,6 +831,7 @@ <h4>GSPO (Group Sequence Policy Optimisation)</h4>
 <code>&lt;lang&gt;…&lt;/lang&gt;</code> wrapper.</p>

 <h4>Hyperparameters</h4>
+<div class="table-wrap">
 <table>
 <thead><tr><th></th><th>MWER</th><th>GSPO</th></tr></thead>
 <tbody>
@@ -835,9 +849,11 @@ <h4>Hyperparameters</h4>
 <tr><td>Clip range</td><td></td><td>ε_lo=3e-4, ε_hi=4e-4</td></tr>
 </tbody>
 </table>
+</div>

 <h4>Results (first 100 dev examples)</h4>
 <p>WER for French, CER for Chinese. ↓ is better. Best result per row is <strong>bold</strong>.</p>
+<div class="table-wrap">
 <table>
 <thead>
 <tr><th>Model</th><th>Language</th><th>Metric</th><th>Baseline</th><th>SFT full-FT</th><th>MWER (RL)</th><th>GSPO (RL)</th></tr>
@@ -869,6 +885,7 @@ <h4>Results (first 100 dev examples)</h4>
 </tr>
 </tbody>
 </table>
+</div>
 <p>The 0.6B RL rows are completed 0.5-epoch runs from 2026-05-11. Dashes mean the 1.7B RL variants have not been run yet.</p>

 <h4>RL implementation note</h4>
<h4>RL implementation note</h4>
@@ -958,6 +975,11 @@ <h2>9. Limitations and why the French plot is last</h2>
958975
</div>
959976
</section>
960977

978+
<footer style="margin-top:4rem; padding-top:1.5rem; border-top:1px solid #e5e7ef; font-size:0.78rem; color:#9ba3b8;">
979+
Last updated 2026-05-12 &mdash;
980+
<a href="https://github.com/EvergreenTree/speech2text" style="color:#3b6cff;">&#128279; GitHub — EvergreenTree/speech2text</a>
981+
</footer>
982+
961983
</main>
962984
</div>
963985

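Note on the metrics quoted in the new take-home: the commit reports WER for French and CER for Chinese but does not show how they are computed. The sketch below is one common way to do it, not the repo's evaluation code. It assumes plain Levenshtein distance, whitespace tokenisation for WER, whitespace-stripped characters for CER, and no text normalisation. The names `edit_distance`, `wer`, `cer`, and `relative_change` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance (assumes whitespace tokenisation)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance (assumes spaces are ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def relative_change(baseline, new):
    """Relative change in percent; negative means the error rate went down."""
    return (new - baseline) / baseline * 100


print(wer("le chat dort", "le chien dort"))
print(cer("你好世界", "你好世间"))
print(round(relative_change(6.35, 6.13), 1))    # French WER, GSPO vs baseline
print(round(relative_change(10.41, 7.62), 1))   # Chinese CER, MWER vs baseline
```

By this arithmetic, the quoted figures correspond to roughly a 3.5 % relative WER reduction for French and a 26.8 % relative CER reduction for Chinese. These functions score a single utterance. A corpus-level figure such as the one over the first 100 dev examples is usually the total edit count divided by the total reference length, which may differ from the mean of per-utterance rates.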