Commit f4d27e5

Merge pull request #225 from huggingface/add-rtfx-metrics
add rtfx
2 parents 57137ab + 378ea5b commit f4d27e5

1 file changed: +30 -0 lines changed

chapters/en/chapter5/evaluation.mdx

Lines changed: 30 additions & 0 deletions
@@ -91,6 +91,36 @@ Let's take an example were we predict 10 words and the target only has 2 words.
we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
100%. Although if you're seeing this, something has likely gone wrong... 😅

## Inverse Real-Time Factor (RTFx)

While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system.
RTFx is the inverse of the real-time factor (RTF), that is, the ratio of audio duration to processing time:

$$
\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
$$

For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
below 1.0 indicate slower-than-real-time processing.

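The definition can be sketched in a few lines of Python. The helper names `rtfx` and `describe` are illustrative assumptions, not part of any library; the example call reproduces the 100 seconds of audio in 10 seconds case.

```python
def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_time_s <= 0:
        raise ValueError("processing_time_s must be positive")
    return audio_duration_s / processing_time_s


def describe(value: float) -> str:
    """Map an RTFx value onto faster than, exactly at, or slower than real-time."""
    if value > 1.0:
        return "faster than real-time"
    if value == 1.0:
        return "real-time"
    return "slower than real-time"


# 100 seconds of audio transcribed in 10 seconds
example = rtfx(audio_duration_s=100.0, processing_time_s=10.0)
print(example, describe(example))  # 10.0 faster than real-time
```
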
Key points about RTFx:
* **Higher is better**: Higher RTFx means faster processing
* **RTFx > 1.0**: Faster than real-time (good for streaming applications)
* **RTFx = 1.0**: Processes at exactly real-time speed
* **RTFx < 1.0**: Slower than real-time (may be acceptable for batch processing)

RTFx is hardware-dependent and varies based on factors like:
- Model size (larger models typically have lower RTFx)
- Hardware acceleration (GPU vs CPU)
- Batch size
- Audio characteristics (sampling rate, number of channels)

When evaluating ASR systems, it's important to consider both WER and RTFx together. A model with excellent WER but very
low RTFx may not be practical for real-time applications, while a model with slightly higher WER but high RTFx might be
more suitable for latency-sensitive use cases.

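One way to make this trade-off concrete is to treat RTFx as a constraint and WER as the objective: keep only the models that are fast enough, then pick the most accurate of those. The sketch below assumes a hypothetical `Candidate` record and `pick_model` helper, and the numbers are made up purely for illustration.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    wer: float   # word error rate, lower is better
    rtfx: float  # inverse real-time factor, higher is better


def pick_model(candidates: list[Candidate], min_rtfx: float) -> Optional[Candidate]:
    """Return the lowest-WER candidate whose RTFx is at least `min_rtfx`, or None."""
    fast_enough = [c for c in candidates if c.rtfx >= min_rtfx]
    return min(fast_enough, key=lambda c: c.wer, default=None)


# Made-up numbers for illustration only, not measured results
candidates = [
    Candidate("model-a", wer=0.05, rtfx=0.8),
    Candidate("model-b", wer=0.07, rtfx=12.0),
    Candidate("model-c", wer=0.10, rtfx=40.0),
]

print(pick_model(candidates, min_rtfx=1.0).name)  # model-b
print(pick_model(candidates, min_rtfx=0.0).name)  # model-a
```

With a real-time requirement (`min_rtfx=1.0`) the most accurate model is excluded for being too slow; without a speed requirement it wins.
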
## Word Accuracy

We can flip the WER around to give us a metric where *higher is better*. Rather than measuring the word error rate,
