Model Only Recognizes a Single Word from Audio Input #57

@dsnsabari

Description

When running speech recognition with ReazonSpeech, the model outputs only a single word, regardless of the length or content of the input audio. This happens even with clear audio files that contain multiple words or full sentences.

audio (3).zip

Code:

import librosa
import soundfile as sf
import io
import tempfile
import numpy as np

# from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

# === Load ReazonSpeech model from Hugging Face ===
# model = load_model("reazon-research/reazonspeech-k2-v2-ja-en")
model = load_model(device="cpu", precision="fp32", language="ja") # or language="ja-en" for bilingual model

# === Step 1: Load and resample audio to 16,000 Hz ===
audio_path = r'D:\Image_Based_searchengine\product_images\audio (3).wav'
y, sr = librosa.load(audio_path, sr=16000, mono=True)

# === Step 2: Amplify the audio by 1.5x and clip to avoid distortion ===
amplified_y = np.clip(y * 1.5, -1.0, 1.0)

# === Step 3: Write amplified audio to an in-memory buffer ===
buffer = io.BytesIO()
sf.write(buffer, amplified_y, 16000, format='WAV', subtype='PCM_16')
buffer.seek(0)

# === Step 4: Save buffer to a temp WAV file for ASR model ===
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
    tmp.write(buffer.read())
    temp_wav_path = tmp.name

# === Step 5: Transcribe ===
audio = audio_from_path(temp_wav_path)
print("audio.samplerate:", audio.samplerate)

ret = transcribe(model, audio)
print("Transcribed Text:", ret.text)
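
One check that may help narrow this down is confirming that the temporary WAV written in Step 4 has the expected sample rate, channel count, and duration before it reaches `transcribe`. The sketch below is not part of the original script and uses only the standard-library `wave` module. `describe_wav` and `write_test_tone` are hypothetical helper names, and the generated tone is only a stand-in for the real recording.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return basic facts about a PCM WAV file (standard library only)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "samplerate": rate,
            "duration_s": frames / rate,
        }


def write_test_tone(path, seconds=2.0, rate=16000, freq=440.0):
    """Write a mono 16-bit sine tone (hypothetical stand-in for real audio)."""
    n = int(seconds * rate)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


# Demo on a synthetic file so the check can be run on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_test_tone(demo_path)
info = describe_wav(demo_path)
print(info)
```

In the script above, the equivalent call would be `describe_wav(temp_wav_path)`. A duration much shorter than the original recording would suggest the preprocessing steps are truncating the audio, rather than the model.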

