Commit f4a016f

Merge branch 'main' into finetune-csm
2 parents dca6129 + 57b779c commit f4a016f

4 files changed: +35 -3 lines changed

chapters/en/chapter1/preprocessing.mdx

Lines changed: 3 additions & 1 deletion
@@ -95,7 +95,9 @@ dataset. However, we can create one, filter based on the values in that column,
 
 ```py
 # use librosa to get example's duration from the audio file
-new_column = [librosa.get_duration(path=x) for x in minds["path"]]
+new_column = [
+    librosa.get_duration(y=x["array"], sr=x["sampling_rate"]) for x in minds["audio"]
+]
 minds = minds.add_column("duration", new_column)
 
 # use 🤗 Datasets' `filter` method to apply the filtering function

chapters/en/chapter4/fine-tuning.mdx

Lines changed: 1 addition & 1 deletion
@@ -440,7 +440,7 @@ num_train_epochs = 10
 
 training_args = TrainingArguments(
     f"{model_name}-finetuned-gtzan",
-    evaluation_strategy="epoch",
+    eval_strategy="epoch",
     save_strategy="epoch",
     learning_rate=5e-5,
     per_device_train_batch_size=batch_size,

chapters/en/chapter5/evaluation.mdx

Lines changed: 30 additions & 0 deletions
@@ -91,6 +91,36 @@ Let's take an example were we predict 10 words and the target only has 2 words.
 we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
 100%. Although if you're seeing this, something has likely gone wrong... 😅
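
To make the 10 / 2 = 5 arithmetic concrete, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions and are not part of the course material; a real evaluation would normally use an established metric implementation rather than this hand-rolled one.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: 2 reference words, 10 predicted words, none matching.
reference = "hello world"
prediction = "a b c d e f g h i j"
wer = word_error_rate(reference, prediction)
print(f"WER = {wer:.1f} ({wer:.0%})")  # WER = 5.0 (500%)
```

Because the denominator is the number of reference words only, a long enough prediction can push the error count above it, which is how the WER exceeds 100%.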
 
+## Inverse Real-Time Factor (RTFx)
+
+While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system.
+RTFx is the inverse ratio of processing time to audio duration:
+
+$$
+\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
+$$
+
+For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
+means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
+video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
+below 1.0 indicate slower-than-real-time processing.
+
+Key points about RTFx:
+* **Higher is better**: Higher RTFx means faster processing
+* **RTFx > 1.0**: Faster than real-time (good for streaming applications)
+* **RTFx = 1.0**: Processes at exactly real-time speed
+* **RTFx < 1.0**: Slower than real-time (may be acceptable for batch processing)
+
+RTFx is hardware-dependent and varies based on factors like:
+- Model size (larger models typically have lower RTFx)
+- Hardware acceleration (GPU vs CPU)
+- Batch size
+- Audio characteristics (sampling rate, number of channels)
+
+When evaluating ASR systems, it's important to consider both WER and RTFx together. A model with excellent WER but very
+low RTFx may not be practical for real-time applications, while a model with slightly higher WER but high RTFx might be
+more suitable for latency-sensitive use cases.
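
The RTFx definition added in this hunk can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `rtfx` and `measure_rtfx`, and the `transcribe_fn` callable, are hypothetical stand-ins and do not come from the course or from any library.

```python
import time


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: seconds of audio per second of processing."""
    if audio_duration_s < 0:
        raise ValueError("audio duration must be non-negative")
    if processing_time_s <= 0:
        raise ValueError("processing time must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(transcribe_fn, audio, audio_duration_s: float) -> float:
    """Time one call to a (hypothetical) transcription callable and return its RTFx."""
    start = time.perf_counter()
    transcribe_fn(audio)
    elapsed = time.perf_counter() - start
    return rtfx(audio_duration_s, elapsed)


# The example from the text: 100 seconds of audio transcribed in 10 seconds.
print(rtfx(100.0, 10.0))  # 10.0, i.e. ten times faster than real time
```

Since the measured time depends on hardware, batch size, and warm-up effects, a single timing like this is only indicative; averaging over several runs on the target hardware would likely give a more stable figure.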
+
 ## Word Accuracy
 
 We can flip the WER around to give us a metric where *higher is better*. Rather than measuring the word error rate,

chapters/en/chapter7/voice-assistant.mdx

Lines changed: 1 addition & 1 deletion
@@ -261,7 +261,7 @@ def transcribe(chunk_length_s=5.0, stream_chunk_s=1.0):
         if not item["partial"][0]:
             break
 
-    return item["text"]
+    return item["text"]
 ```
 
 Let's give this a go and see how we get on! Once the microphone is live, start speaking and watch your transcription
