This is the repository reference for short-audio Whisper transcription on OAK4 with LED and stream feedback. Use it when you need the repo’s speech-recognition example rather than a vision pipeline.
This example fits when:
- You need Whisper Tiny EN on OAK4.
- You want audio-driven text output integrated with a live camera stream.
- You need the LED/color feedback workflow described in the README.
This example does not fit when:
- You need RVC2 support.
- You need continuous streaming ASR.
- You need a fully host-side speech model rather than device-assisted inference.
Category: neural-networks/speech-recognition/whisper-tiny-en
Shape: script + standalone
Primary task: short audio capture or audio-file transcription with LED/color feedback
Entrypoint: main.py
Standalone path: oakapp.toml
Frontend: none
Runs on: RVC4 only
Requires: OAK4, Whisper encoder/decoder model assets, and host audio prerequisites for recording mode
Input: host-recorded audio triggered by keypress or --audio_file, plus a live camera stream for visual feedback
Output: Camera and Decoded Audio Message
Models: Whisper encoder/decoder YAMLs in depthai_models/
Visualizer / UI: DepthAI Visualizer via dai.RemoteConnection
- README.md
- main.py
- utils/audio_encoder.py
- utils/whisper_encoder.py
- utils/whisper_decoder.py
- utils/led_changer_script.py
- utils/arguments.py
- Host audio is converted into a spectrogram by utils/audio_encoder.py.
- A device NeuralNetwork runs the Whisper encoder stage.
- Host/device helper nodes prepare the recursive decoder loop and feed token sequences back into the decoder network.
- utils/annotation_node.py maps decoded tokens to text and color commands, and utils/led_changer_script.py updates the device LED.
- The live camera stream is tinted/annotated to match the decoded color word.
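The recursive decoder loop and the token-to-color step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `greedy_decode`, `find_color_word`, the token IDs, and the color list are all hypothetical names and values, and the decoder is represented by a caller-supplied function rather than a device NeuralNetwork.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical special-token IDs for illustration only; the real values
# come from the Whisper tokenizer used by the repository.
START_TOKEN = 1
END_TOKEN = 0

# Assumed color vocabulary; the repository may recognize a different set.
COLOR_WORDS = ("red", "green", "blue", "yellow")


def greedy_decode(
    decoder_step: Callable[[Sequence[float], List[int]], int],
    encoder_output: Sequence[float],
    max_tokens: int = 16,
) -> List[int]:
    """Feed the growing token sequence back into the decoder until it
    emits END_TOKEN or max_tokens tokens have been generated."""
    tokens = [START_TOKEN]
    for _ in range(max_tokens):
        next_token = decoder_step(encoder_output, tokens)
        if next_token == END_TOKEN:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def find_color_word(text: str) -> Optional[str]:
    """Return the first known color word in the decoded text, if any."""
    for word in text.lower().split():
        cleaned = word.strip(".,!?")
        if cleaned in COLOR_WORDS:
            return cleaned
    return None


# Stub standing in for the decoder network: it replays a fixed script.
SCRIPT = [7, 8, END_TOKEN]
VOCAB = {7: "turn", 8: "green."}


def fake_decoder_step(encoder_output: Sequence[float], tokens: List[int]) -> int:
    return SCRIPT[len(tokens) - 1]


token_ids = greedy_decode(fake_decoder_step, encoder_output=[0.0])
text = " ".join(VOCAB[t] for t in token_ids)
color = find_color_word(text)
```

The `max_tokens` cap keeps the loop bounded when the decoder never emits the end token, which matters for short clips where the model may keep producing filler tokens.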
- main.py explicitly raises on non-RVC4 platforms.
- The README notes that host recording crashes in standalone mode; standalone is effectively for pre-recorded audio only.
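Both constraints above can be checked before any pipeline is built. The sketch below is one way to fail fast, assuming the platform name is already available as a string (in a DepthAI v3 script it would typically be read from the connected device). `validate_run_mode` is a hypothetical helper, and the exception type and messages are assumptions; only the RVC4-only behavior and the standalone recording limitation come from the text above.

```python
from typing import Optional


def validate_run_mode(
    platform_name: str, standalone: bool, audio_file: Optional[str]
) -> str:
    """Return "file" or "record", or raise if the setup is unsupported.

    Hypothetical helper: the repository's own checks may look different.
    """
    if platform_name != "RVC4":
        raise ValueError(
            f"This example only runs on RVC4, got platform {platform_name!r}."
        )
    if audio_file is not None:
        return "file"
    if standalone:
        # Host recording is reported to crash in standalone mode,
        # so require a pre-recorded file there.
        raise ValueError("Standalone mode needs a pre-recorded --audio_file.")
    return "record"
```

Called with `("RVC4", False, None)` this selects the keypress recording path, while any non-RVC4 platform name raises before a recording or file read is attempted.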
- neural-networks/generic-example: use this when you need the simplest single-model scaffold instead of speech-specific recursion
- apps/default-app: use this when you need a packaged baseline app without speech logic
- integrations/rerun: use this when your goal is external visualization rather than speech-driven UI feedback
Run: python3 main.py
File mode: python3 main.py --audio_file <FILE>
Success looks like: the Visualizer shows Camera and Decoded Audio Message, pressing r records a short clip in peripheral mode, and recognized color words drive the LED/tint output
Common failure meaning: the device is not RVC4, host audio dependencies are missing, or standalone recording was attempted
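The two invocations differ only in whether `--audio_file` is present. A minimal argparse sketch of that switch follows; it is an assumption about how such a flag could be parsed, not a copy of utils/arguments.py, which may define additional options. `build_parser` and `parse_args` are hypothetical names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the --audio_file flag named in this reference."""
    parser = argparse.ArgumentParser(
        description="Short-audio Whisper transcription (illustrative sketch)."
    )
    parser.add_argument(
        "--audio_file",
        type=str,
        default=None,
        help="Path to a pre-recorded clip; omit to record on keypress.",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as in a normal script run.
    return build_parser().parse_args(argv)


record_mode = parse_args([])
file_mode = parse_args(["--audio_file", "clip.wav"])
```

Here `record_mode.audio_file` is `None`, which corresponds to the keypress recording path, and `file_mode.audio_file` holds the path to transcribe.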