Question to review my understanding of the operation process for RealTimeSTT text function #214

sangheonEN · 2025-03-19T00:27:10Z

The code for the text function below is based on an earlier version of audio_recorder.py. It creates a callback function, creates a thread, and applies in/out parameters. Please refer to the attached audio_recorder.py for my code. (It may differ a little from the latest version because I started from your old version, but the voice streaming and STT transcription code is probably similar.) The text function is called in the main code, Thomas_audio_control_src.py.
realtimestt.zip

text, thomas_event_state = recorder.text(utils.main_process, start_time, communicator, similarity_cal, params.similarity_config)

In other words, I implemented a structure that uses the text function to perform voice streaming and STT transcription.

However, since I do not have professional knowledge of voice streaming, I find it too difficult to customize the process. I'm writing this because I want to check whether I understand the voice streaming and STT transcription process correctly. Please read my analysis below, and if any part of it is wrong, or if you have advice that would help me understand it better, I would appreciate it. I want to gain more professional knowledge of voice streaming.

It would be especially helpful if you could tell me the sequence: how does audio come in through the microphone, which thread processes the data, and how do voice streaming and STT transcription take place?

Running environment: Windows 11, Python 3.11.5 (venv), NVIDIA GPU 4060 laptop

[Analyzed content]

Multiprocessing: processes and threads created when the AudioToTextRecorder object is created

subprocess: _audio_data_worker
subprocess: _transcription_worker
main process: recording_thread (_recording_worker), VAD (silero_speech) thread, multiprocessing parameters (self.audio_queue, self.start_recording_event, self.stop_recording_event, self.shutdown_event, self.interrupt_stop_event, self.was_interrupted, self.main_transcription_ready_event, self.parent_transcription_pipe, child_transcription_pipe)

_audio_data_worker: opens the audio data stream

Open PyAudio's interface stream

audio_queue.put(audio_data)
Mic Reconnection → Exception handling

_recording_worker

data=audio_queue.get()
handle_buffer_overflow
use_wake_words trigger
start_recording_on_voice_activity → Call start() function when True
Check if VAD ACTIVATION has been performed.
self.frames.extend(list(self.audio_buffer))

The voice data stored in self.audio_buffer is the audio that exists just before/after recording starts. It seems to provide a margin for flexible voice data processing.

Save all voice data: self.frames.append(data)

_transcription_worker

model = faster_whisper.WhisperModel
main_transcription_ready_event.set()
audio, language = child_transcription_pipe.recv()
segments = model.transcribe
transcription = " ".join(seg.text for seg in segments)
child_transcription_pipe.send(('success', transcription))
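The request/response pattern listed above can be sketched roughly as follows. This is only an illustration, not the library's code: a thread stands in for the separate process so the example stays self-contained, and fake_transcribe and transcribe_once are hypothetical names that replace the real faster-whisper model call.

```python
import multiprocessing
import threading


def fake_transcribe(audio, language):
    """Hypothetical stand-in for the speech-to-text model call."""
    return f"{len(audio)} samples [{language}]"


def transcription_worker(child_pipe, ready_event, shutdown_event):
    """Answer (audio, language) requests with a (status, text) tuple."""
    ready_event.set()  # in the real worker this follows model loading
    while not shutdown_event.is_set():
        if not child_pipe.poll(0.1):  # wait up to 100 ms for a request
            continue
        audio, language = child_pipe.recv()
        try:
            text = fake_transcribe(audio, language)
            child_pipe.send(("success", text))
        except Exception as exc:
            child_pipe.send(("error", str(exc)))


def transcribe_once(audio, language):
    """Send one request through the pipe and return the worker's reply."""
    parent_pipe, child_pipe = multiprocessing.Pipe()
    ready_event = threading.Event()
    shutdown_event = threading.Event()
    worker = threading.Thread(
        target=transcription_worker,
        args=(child_pipe, ready_event, shutdown_event),
    )
    worker.start()
    ready_event.wait()                     # do not send before the worker is ready
    parent_pipe.send((audio, language))
    status, result = parent_pipe.recv()    # blocks until the worker answers
    shutdown_event.set()
    worker.join()
    return status, result
```

Sending a status string together with the result lets the caller tell a failed transcription apart from an empty one.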

multi process parameters

start_recording_event: When set in the start() function, wait_audio() stops waiting and moves on to the next line of code.
stop_recording_event: When set in stop() or shutdown(), wait_audio() stops waiting and moves on to the next line of code.
use_microphone: Flag to use the local mic.
audio_queue:
Save audio data: data = stream.read(buffer_size) is put into the queue in _audio_data_worker().
Transmit audio data: _recording_worker gets the data from the queue.
shutdown_event: If set in the shutdown() function, all processes are terminated.
interrupt_stop_event: If a KeyboardInterrupt occurs, both _audio_data_worker and _transcription_worker call interrupt_stop_event.set() to stop recording.
was_interrupted: set() when transcription ends, and an empty string is output.
main_transcription_ready_event: set() in _transcription_worker immediately after the faster-whisper model is loaded.
parent_transcription_pipe: Sends the recorded audio data and language information from transcribe() to child_transcription_pipe, and receives the text inference result from child_transcription_pipe.
child_transcription_pipe: _transcription_worker() receives the audio data and language parameters from parent_transcription_pipe, and sends the text result inferred by the STT model back to parent_transcription_pipe.
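To illustrate how the two recording events might gate wait_audio(), here is a minimal sketch using threading.Event. The polling interval and the returned log are assumptions made for this example; the real function does more work, such as assembling the recorded frames.

```python
import threading


def wait_audio(start_recording_event, stop_recording_event, poll_seconds=0.02):
    """Block until recording has started and then stopped."""
    log = []
    # Poll with a timeout rather than blocking forever, so that a real
    # implementation could also check for shutdown or interrupts here.
    while not start_recording_event.wait(timeout=poll_seconds):
        pass
    log.append("recording started")
    while not stop_recording_event.wait(timeout=poll_seconds):
        pass
    log.append("recording stopped")
    return log


start_event = threading.Event()
stop_event = threading.Event()
# Timers stand in for start() and stop() being called from elsewhere.
threading.Timer(0.05, start_event.set).start()
threading.Timer(0.10, stop_event.set).start()
events_seen = wait_audio(start_event, stop_event)
```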

text function

wait_audio

Check if recording is in progress
Save final audio data.
self.audio = audio_array.astype(np.float32) / INT16_MAX_ABS_VALUE
reduce noise
reduce db
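The normalization step in the list above scales 16-bit integer samples into the range [-1.0, 1.0). A rough standard-library equivalent is sketched below; the value 32768.0 for INT16_MAX_ABS_VALUE is an assumption based on the int16 range, and the original code uses NumPy rather than the array module.

```python
from array import array

INT16_MAX_ABS_VALUE = 32768.0  # assumed value: magnitude of the smallest int16


def int16_bytes_to_float(raw_bytes):
    """Convert raw 16-bit PCM bytes (native byte order) to floats in [-1.0, 1.0)."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(raw_bytes)
    return [sample / INT16_MAX_ABS_VALUE for sample in samples]


raw = array("h", [0, 16384, -32768]).tobytes()
normalized = int16_bytes_to_float(raw)  # [0.0, 0.5, -1.0]
```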

transcribe

self.parent_transcription_pipe.send((self.audio, self.language)) → Send() final audio data (self.audio) to transcription_worker using parent_transcription_pipe.
status, result = self.parent_transcription_pipe.recv() → Recv() the text inferred by transcription_worker.
Perform _preprocess_output: return self._preprocess_output(result)

KoljaB · 2025-03-19T00:55:47Z

Maintainer

Audio enters in two ways:

From the microphone: The _audio_data_worker process reads data from your mic and writes audio chunks into the audio_queue.
From external sources: The feed_audio method lets you add audio from another source and also places chunks into the audio_queue.

Once in the queue, the _recording_worker thread processes these chunks. It handles wake word detection, VAD analysis, and more. This thread is the core of RealtimeSTT. The flow is:
_audio_data_worker or feed_audio → audio_queue → _recording_worker.
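A stripped-down sketch of that producer → queue → consumer flow is shown below. It is an approximation only: both workers run as threads and the chunks are fake bytes, whereas the real _audio_data_worker is a separate process reading from the microphone. CHUNK_COUNT, SENTINEL and run_pipeline are hypothetical names used for the demo.

```python
import queue
import threading

CHUNK_COUNT = 5   # hypothetical: number of fake chunks to produce
SENTINEL = None   # hypothetical end-of-stream marker for this demo


def audio_data_worker(audio_queue, shutdown_event):
    """Producer: stands in for the worker that reads from the microphone."""
    for i in range(CHUNK_COUNT):
        if shutdown_event.is_set():
            break
        audio_queue.put(bytes([i]) * 4)  # fake 4-byte "audio" chunk
    audio_queue.put(SENTINEL)


def recording_worker(audio_queue, frames):
    """Consumer: where VAD and wake word checks would run on each chunk."""
    while True:
        data = audio_queue.get()  # blocks until a chunk is available
        if data is SENTINEL:
            break
        frames.append(data)


def run_pipeline():
    audio_queue = queue.Queue()
    shutdown_event = threading.Event()
    frames = []
    consumer = threading.Thread(target=recording_worker, args=(audio_queue, frames))
    producer = threading.Thread(target=audio_data_worker, args=(audio_queue, shutdown_event))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return frames
```

Because the queue is the only shared object, a second producer such as feed_audio could put chunks into it without the consumer changing at all.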

Then, the _transcription_worker process handles the final transcription while the _realtime_worker thread manages everything related to live transcription.

self.audio_buffer compensates for lag in wake word detection. If you say "Jarvis turn the lights on" and the system detects "Jarvis" too late, parts of the command might be lost. The audio buffer (together with the pre_recording_buffer_duration parameter) makes sure no words get missed in the transcription.
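The idea can be sketched with a bounded deque. This is an illustration under simplifying assumptions: record_with_prebuffer, trigger_index and buffer_chunks are hypothetical names, and text fragments stand in for audio chunks.

```python
import collections


def record_with_prebuffer(chunks, trigger_index, buffer_chunks):
    """Return the frames kept when recording is triggered at chunks[trigger_index]."""
    audio_buffer = collections.deque(maxlen=buffer_chunks)  # drops the oldest chunk when full
    frames = []
    recording = False
    for i, chunk in enumerate(chunks):
        if recording:
            frames.append(chunk)
        elif i == trigger_index:
            recording = True
            frames.extend(list(audio_buffer))  # prepend what was heard just before
            frames.append(chunk)
        else:
            audio_buffer.append(chunk)
    return frames


chunks = ["Jar", "vis", "turn", "the", "lights", "on"]
# Detection fires late, at "turn"; a two-chunk buffer recovers "Jar" and "vis".
with_buffer = record_with_prebuffer(chunks, trigger_index=2, buffer_chunks=2)
without_buffer = record_with_prebuffer(chunks, trigger_index=2, buffer_chunks=0)
```

With the buffer, all six chunks are kept; without it, the recording starts at "turn" and the first two chunks are lost.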

12 replies

sangheonEN Mar 26, 2025
Author

Wow, that's really cool. Is chaining up webrtcvad and silerovad a great technique for speech recognition?

KoljaB Mar 27, 2025
Maintainer

It's not absolutely crucial for the library, but it makes VAD fast and reliable. Webrtcvad is lightweight and very sensitive, so it often triggers false positives. Silero is more resource-intensive. By chaining them, you use webrtcvad as a prefilter and only activate Silero when voice activity has already been detected, so Silero provides a second layer of validation. If you use Silero only, you put extra load on your GPU, and if you use webrtcvad only, you deal with more false alarms. Combining both gives a solid balance.
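A minimal sketch of that chaining pattern follows. The two detector functions are hypothetical stand-ins with made-up thresholds, so the example runs without webrtcvad or Silero installed; only the control flow is the point.

```python
def make_chained_vad(fast_vad, accurate_vad):
    """Build a detector that only runs accurate_vad when fast_vad fires."""
    def is_speech(chunk):
        if not fast_vad(chunk):
            return False  # cheap rejection, the expensive model never runs
        return accurate_vad(chunk)
    return is_speech


accurate_calls = []  # records how often the expensive check is invoked


def fake_fast_vad(chunk):
    # Sensitive check: fires on anything moderately loud.
    return max(abs(sample) for sample in chunk) > 0.1


def fake_accurate_vad(chunk):
    # Stricter check, standing in for the resource-intensive model.
    accurate_calls.append(chunk)
    return sum(abs(sample) for sample in chunk) / len(chunk) > 0.3


is_speech = make_chained_vad(fake_fast_vad, fake_accurate_vad)
results = [is_speech(chunk) for chunk in ([0.0, 0.01], [0.0, 0.5], [0.6, 0.7])]
```

Here the first chunk is rejected by the prefilter alone, the second is a false positive that the stricter check filters out, and only the third passes both, so the expensive check runs twice for three chunks.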

sangheonEN Mar 27, 2025
Author

Oh right! And isn't the code for loading wav files and getting the transcription inference text results included separately in that library?

KoljaB Mar 27, 2025
Maintainer

Unsure if I got that question right. The faster_whisper library includes the code to load wav files and output transcription text results, is that what you were asking?
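For reference, a minimal sketch of transcribing a wav file with faster_whisper might look like the following. The model size, device and compute type are assumptions chosen for the example, join_segments and transcribe_wav are hypothetical helper names, and the import sits inside the function so the helper can be tried without the package installed.

```python
def join_segments(segments):
    """Join the text of transcription segments into one string."""
    return " ".join(segment.text.strip() for segment in segments)


def transcribe_wav(path, model_size="base"):
    """Transcribe an audio file and return (text, detected_language)."""
    # Third-party dependency: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus an info object;
    # the audio is actually decoded while the generator is consumed.
    segments, info = model.transcribe(path)
    return join_segments(segments), info.language
```

Calling transcribe_wav("audio.wav") would download the chosen model on first use, so the first call can take a while.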

sangheonEN Mar 28, 2025
Author

Yes, that's right. Thank you. I found the source at https://github.com/SYSTRAN/faster-whisper, from the usage section.
