Description
Hi!
I've been encountering issues with the multilingual Whisper-small model running on ONNX Runtime in my Android application. The app launches successfully on both the Android Studio emulator (Pixel 9) and a physical Samsung A52s device, but it fails to transcribe Arabic speech input. This happens both with live speech recognition and with pre-recorded audio files, which I recorded as WAV and converted to raw PCM (16000 Hz).
The model works perfectly fine in Python using the following commands:
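For reference, here is a minimal sketch of how the raw PCM bytes could be turned into float samples before they are passed to the model. It assumes the file holds 16-bit signed little-endian mono samples and that the model's audio input expects floats in the range [-1, 1]; neither is stated above, and `pcm16ToFloat` is a hypothetical helper name, not part of the sample app.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper: converts raw 16-bit signed little-endian PCM bytes
// into float samples scaled to [-1, 1). The sample format is an assumption.
fun pcm16ToFloat(pcm: ByteArray): FloatArray {
    require(pcm.size % 2 == 0) { "16-bit PCM needs an even number of bytes" }
    val shorts = ByteBuffer.wrap(pcm)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer()
    // Divide by 32768 so that Short.MIN_VALUE maps to exactly -1.0
    return FloatArray(shorts.remaining()) { i -> shorts.get(i) / 32768.0f }
}

fun main() {
    // 0x4000 = 16384, which is half of full scale
    val samples = pcm16ToFloat(byteArrayOf(0x00, 0x40, 0x00, 0x80.toByte()))
    println(samples.joinToString()) // prints "0.5, -1.0"
}
```

If the byte order or bit depth were read incorrectly, the model would receive noise instead of speech, which could be one possible source of repeated or unrelated output.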
python prepare_whisper_configs.py --model_name openai/whisper-small --no_audio_decoder
run olive run --config whisper_cpu_int8.json
python3 test_transcription.py --config whisper_cpu_int8.json --task transcribe --audio_path data/arabic.mp3 --language ar
I should also note that the `tiny.en` model produced accurate results for English. The multilingual `tiny` model did display some Arabic characters, but it kept repeating a single word about a dozen times, and that word was not even part of the sentence I had spoken.
I modified the `SpeechRecognizer.kt` file by implementing Arabic `decoder input IDs`, adding an `attention mask` with values derived from Python model execution, and incorporating a `logits processor`.
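To make the decoder input IDs easier to check, here is a small sketch of how the four-token prompt is composed for the multilingual Whisper vocabulary: start-of-transcript, language, task, and no-timestamps. The constant names and the `decoderPrompt` helper are hypothetical; the IDs assume the multilingual tiny/base/small/medium vocabulary, not the `.en` or large-v3 ones, and only the first few languages are listed.

```kotlin
// Special-token IDs, assuming the multilingual Whisper vocabulary
// (not the English-only or large-v3 vocabularies).
const val START_OF_TRANSCRIPT = 50258   // <|startoftranscript|>
const val FIRST_LANGUAGE_TOKEN = 50259  // <|en|>, first language token
const val TRANSCRIBE = 50359            // <|transcribe|>
const val NO_TIMESTAMPS = 50363         // <|notimestamps|>

// First entries of Whisper's language list, in vocabulary order.
val LANGUAGES = listOf(
    "en", "zh", "de", "es", "ru", "ko", "fr",
    "ja", "pt", "tr", "pl", "ca", "nl", "ar"
)

// Hypothetical helper: builds the decoder prompt for a language code.
fun decoderPrompt(language: String): IntArray {
    val index = LANGUAGES.indexOf(language)
    require(index >= 0) { "Language not in this shortened list: $language" }
    return intArrayOf(
        START_OF_TRANSCRIPT,
        FIRST_LANGUAGE_TOKEN + index,
        TRANSCRIBE,
        NO_TIMESTAMPS
    )
}

fun main() {
    println(decoderPrompt("ar").joinToString()) // prints "50258, 50272, 50359, 50363"
}
```

Under these assumptions the Arabic prompt matches the `decoder_input_ids` values I pass in, so the prompt itself looks correct for this vocabulary.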
val nMels: Long = 80
val nFrames: Long = 3000
// attention_mask: all zeros, one entry per (mel bin, frame) pair
val attentionMask = IntArray((1 * nMels * nFrames).toInt()) { 0 }
baseInputs = mapOf(
    "min_length" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
    "max_length" to createIntTensor(env, intArrayOf(200), tensorShape(1)),
    "num_beams" to createIntTensor(env, intArrayOf(2), tensorShape(1)),
    "num_return_sequences" to createIntTensor(env, intArrayOf(1), tensorShape(1)),
    "length_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
    "repetition_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
    "attention_mask" to createIntTensor(env, attentionMask, tensorShape(1, nMels, nFrames)),
    // logits_processor
    "logits_processor" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
    // <|startoftranscript|>, <|ar|>, <|transcribe|>, <|notimestamps|>
    "decoder_input_ids" to createIntTensor(env, intArrayOf(50258, 50272, 50359, 50363), tensorShape(1, 4)),
)
Could you please help with this, or let me know if I am missing something?