Merge branch 'main' into finetune-csm

Deep-unlearning · web-flow · commit f4a016fcb381 · 2025-11-26T17:09:10.000+01:00
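
Note on the RTFx hunk in chapters/en/chapter5/evaluation.mdx: the definition it adds (audio duration divided by processing time) can be checked with a few lines of Python. This is a minimal sketch and not part of the commit; `rtfx`, `measure_rtfx`, and `transcribe_fn` are hypothetical names introduced here for illustration only.

```python
import time


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_time_s <= 0:
        raise ValueError("processing_time_s must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(transcribe_fn, audio, audio_duration_s: float) -> float:
    """Time one call to a hypothetical `transcribe_fn(audio)` and return its RTFx."""
    start = time.perf_counter()
    transcribe_fn(audio)  # assumption: any callable that transcribes `audio`
    elapsed = time.perf_counter() - start
    return rtfx(audio_duration_s, elapsed)


# Worked example from the hunk: 100 s of audio transcribed in 10 s.
print(rtfx(100.0, 10.0))
```

A single timed call is noisy, so in practice one would likely average over several clips and report the hardware and batch size alongside the number.
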
diff --git a/chapters/en/chapter1/preprocessing.mdx b/chapters/en/chapter1/preprocessing.mdx
@@ -95,7 +95,9 @@ dataset. However, we can create one, filter based on the values in that column,
 
 ```py
 # use librosa to get example's duration from the audio file
-new_column = [librosa.get_duration(path=x) for x in minds["path"]]
+new_column = [
+    librosa.get_duration(y=x["array"], sr=x["sampling_rate"]) for x in minds["audio"]
+]
 minds = minds.add_column("duration", new_column)
 
 # use 🤗 Datasets' `filter` method to apply the filtering function
diff --git a/chapters/en/chapter4/fine-tuning.mdx b/chapters/en/chapter4/fine-tuning.mdx
@@ -440,7 +440,7 @@ num_train_epochs = 10
 
 training_args = TrainingArguments(
     f"{model_name}-finetuned-gtzan",
-    evaluation_strategy="epoch",
+    eval_strategy="epoch",
     save_strategy="epoch",
     learning_rate=5e-5,
     per_device_train_batch_size=batch_size,
diff --git a/chapters/en/chapter5/evaluation.mdx b/chapters/en/chapter5/evaluation.mdx
@@ -91,6 +91,36 @@ Let's take an example were we predict 10 words and the target only has 2 words.
 we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
 100%. Although if you're seeing this, something has likely gone wrong... 😅
 
+## Inverse Real-Time Factor (RTFx)
+
+While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system.
+RTFx is the ratio of audio duration to processing time, i.e. the inverse of the real-time factor (RTF):
+
+$$
+\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
+$$
+
+For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
+means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
+video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
+below 1.0 indicate slower-than-real-time processing.
+
+Key points about RTFx:
+* **Higher is better**: Higher RTFx means faster processing
+* **RTFx > 1.0**: Faster than real-time (good for streaming applications)
+* **RTFx = 1.0**: Processes at exactly real-time speed
+* **RTFx < 1.0**: Slower than real-time (may be acceptable for batch processing)
+
+RTFx is hardware-dependent and varies based on factors like:
+- Model size (larger models typically have lower RTFx)
+- Hardware acceleration (GPU vs CPU)
+- Batch size
+- Audio characteristics (sampling rate, number of channels)
+
+When evaluating ASR systems, it's important to consider both WER and RTFx together. A model with excellent WER but very
+low RTFx may not be practical for real-time applications, while a model with slightly higher WER but high RTFx might be
+more suitable for latency-sensitive use cases.
+
 ## Word Accuracy
 
 We can flip the WER around to give us a metric where *higher is better*. Rather than measuring the word error rate,
diff --git a/chapters/en/chapter7/voice-assistant.mdx b/chapters/en/chapter7/voice-assistant.mdx
@@ -261,7 +261,7 @@ def transcribe(chunk_length_s=5.0, stream_chunk_s=1.0):
         if not item["partial"][0]:
             break
 
-    return item["text"]
+        return item["text"]
 ```
 
 Let's give this a go and see how we get on! Once the microphone is live, start speaking and watch your transcription