@@ -91,6 +91,36 @@ Let's take an example where we predict 10 words and the target only has 2 words.
 we'd have a WER of 10 / 2 = 5, or 500%! This is something to bear in mind if you train an ASR system and see a WER of over
 100%. Although if you're seeing this, something has likely gone wrong... 😅

+## Inverse Real-Time Factor (RTFx)
+
+While WER measures the accuracy of transcriptions, the *inverse real-time factor (RTFx)* measures the speed of an ASR system.
+RTFx is the ratio of audio duration to processing time, i.e. the inverse of the real-time factor (RTF):
+
+$$
+\text{RTFx} = \frac{\text{Audio Duration}}{\text{Processing Time}}
+$$
+
+For example, if it takes 10 seconds to transcribe 100 seconds of audio, the RTFx is 100/10 = 10. An RTFx greater than 1.0
+means the system can transcribe audio faster than real-time, which is essential for live transcription applications like
+video conferencing or live captioning. An RTFx of 1.0 means the system processes at exactly real-time speed, while values
+below 1.0 indicate slower-than-real-time processing.
+
+Key points about RTFx:
+* **Higher is better**: Higher RTFx means faster processing
+* **RTFx > 1.0**: Faster than real-time (good for streaming applications)
+* **RTFx = 1.0**: Processes at exactly real-time speed
+* **RTFx < 1.0**: Slower than real-time (may be acceptable for batch processing)
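
The arithmetic and thresholds above can be sketched in a few lines of Python. This is only an illustration: the helper names `compute_rtfx` and `describe_rtfx` are hypothetical and do not come from any library.

```python
def compute_rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if audio_duration_s < 0:
        raise ValueError("audio_duration_s must be non-negative")
    if processing_time_s <= 0:
        raise ValueError("processing_time_s must be positive")
    return audio_duration_s / processing_time_s


def describe_rtfx(rtfx: float) -> str:
    """Map an RTFx value onto the three regimes listed above."""
    if rtfx > 1.0:
        return "faster than real-time"
    if rtfx == 1.0:
        return "exactly real-time"
    return "slower than real-time"


# The example from the text: 100 seconds of audio transcribed in 10 seconds.
rtfx = compute_rtfx(audio_duration_s=100.0, processing_time_s=10.0)
print(f"RTFx = {rtfx:.1f} ({describe_rtfx(rtfx)})")
# prints "RTFx = 10.0 (faster than real-time)"
```

Both durations must be expressed in the same unit (seconds here) for the ratio to be meaningful.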
+
+RTFx is hardware-dependent and varies based on factors like:
+- Model size (larger models typically have lower RTFx)
+- Hardware acceleration (GPU vs CPU)
+- Batch size
+- Audio characteristics (sampling rate, number of channels)
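
Because of this hardware dependence, RTFx is usually something you measure on your own setup rather than look up. Below is a rough sketch of how that could be done with the standard library timer. `measure_rtfx` and `dummy_transcribe` are hypothetical names, and `transcribe_fn` stands in for whatever ASR callable you actually use.

```python
import time
from typing import Any, Callable


def measure_rtfx(
    transcribe_fn: Callable[[Any], Any],
    audio: Any,
    audio_duration_s: float,
    n_runs: int = 3,
) -> float:
    """Estimate RTFx by timing `transcribe_fn` on a single audio input."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")

    # Warm-up run, excluded from timing (caches, lazy initialisation, etc.).
    transcribe_fn(audio)

    start = time.perf_counter()
    for _ in range(n_runs):
        transcribe_fn(audio)
    elapsed = time.perf_counter() - start

    # Total audio processed divided by total wall-clock time.
    return (audio_duration_s * n_runs) / elapsed


# Stand-in for a real model (assumption): "transcribes" by sleeping for 10 ms.
def dummy_transcribe(audio: Any) -> str:
    time.sleep(0.01)
    return "hello world"


rtfx = measure_rtfx(dummy_transcribe, audio=None, audio_duration_s=1.0)
print(f"Measured RTFx: {rtfx:.1f}")
```

When the model runs on a GPU, work is often queued asynchronously, so you would typically need to synchronize the device before reading the clock. Otherwise the measured processing time can come out too short and the RTFx too high.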
+
+When evaluating ASR systems, it's important to consider both WER and RTFx together. A model with excellent WER but very
+low RTFx may not be practical for real-time applications, while a model with slightly higher WER but high RTFx might be
+more suitable for latency-sensitive use cases.
+
 ## Word Accuracy

 We can flip the WER around to give us a metric where *higher is better*. Rather than measuring the word error rate,