Unable to get transcriptions with the Multilingual Whisper-Small Model via Android Studio

Hi! 

I've been running into problems with the multilingual Whisper-small model on ONNX Runtime in my Android application. The app launches successfully on both the Android Studio emulator (Pixel 9) and a physical Samsung A52s, but it fails to transcribe Arabic speech. This happens both with live speech recognition and with pre-recorded audio files, which I recorded as WAV and converted to raw PCM (16000 Hz).
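
For reference, here is a minimal sketch of a WAV-to-PCM conversion in Python. It assumes the Android sample reads raw mono 16 kHz audio as 32-bit little-endian floats; that is an assumption about the sample app, not something confirmed in this report. The function name `wav_to_f32le` is hypothetical. If the app receives 16-bit integer PCM while expecting floats (or the reverse), the model will see garbage samples, so the format is worth checking.

```python
import array
import sys
import wave


def wav_to_f32le(wav_path: str, pcm_path: str, expected_rate: int = 16000) -> int:
    """Convert a 16-bit mono WAV file to raw float32 little-endian PCM.

    Returns the number of samples written. The float32 target format is an
    assumption about what the Android app expects.
    """
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz audio")
        frames = wav.readframes(wav.getnframes())

    # WAV sample data is little-endian; swap only on big-endian hosts.
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()

    # Scale 16-bit integers to floats in [-1.0, 1.0).
    floats = array.array("f", (s / 32768.0 for s in samples))
    if sys.byteorder == "big":
        floats.byteswap()

    with open(pcm_path, "wb") as out:
        out.write(floats.tobytes())
    return len(floats)
```

The resulting file should be exactly four bytes per sample, which is a quick way to confirm which format a given `.pcm` file is in.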

The model works fine in Python with the following commands:

```
python prepare_whisper_configs.py --model_name openai/whisper-small --no_audio_decoder
olive run --config whisper_cpu_int8.json
python3 test_transcription.py --config whisper_cpu_int8.json --task transcribe --audio_path data/arabic.mp3 --language ar
```

I should also note that the `tiny.en` model produced accurate results for English. The multilingual `tiny` model did output some Arabic characters, but it repeated a single word about a dozen times, and that word was not part of the sentence I had spoken.

I modified `SpeechRecognizer.kt` by setting the Arabic `decoder_input_ids`, adding an `attention_mask` input with values taken from the Python model run, and adding a `logits_processor` input.

```
// Log-mel input dimensions: 80 mel bins x 3000 frames.
val nMels: Long = 80
val nFrames: Long = 3000
// attention_mask: all zeros, with the same shape as the mel features.
val attentionMask = IntArray((1 * nMels * nFrames).toInt()) { 0 }
baseInputs = mapOf(
    "min_length" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
    "max_length" to createIntTensor(env, intArrayOf(200), tensorShape(1)),
    "num_beams" to createIntTensor(env, intArrayOf(2), tensorShape(1)),
    "num_return_sequences" to createIntTensor(env, intArrayOf(1), tensorShape(1)),
    "length_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
    "repetition_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
    "attention_mask" to createIntTensor(env, attentionMask, tensorShape(1, nMels, nFrames)),
    // logits_processor input, set to 0.
    "logits_processor" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
    // <|startoftranscript|>, <|ar|>, <|transcribe|>, <|notimestamps|>
    "decoder_input_ids" to createIntTensor(env, intArrayOf(50258, 50272, 50359, 50363), tensorShape(1, 4)),
)
```
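
To show where the constants in the snippet come from, here is a small Python sketch that rebuilds the decoder prompt and the frame count. It assumes the original multilingual Whisper vocabulary, in which language tokens follow `<|en|>` in a fixed order, and Whisper's usual 30-second window with a hop of 160 samples. The helper names `decoder_prompt` and `mel_frames` are hypothetical, and the language list is cut off at Arabic.

```python
# Token IDs for the original multilingual Whisper vocabulary (assumed to
# apply to whisper-small; the English-only ".en" models use different IDs).
START_OF_TRANSCRIPT = 50258
FIRST_LANGUAGE_ID = 50259  # <|en|>; other language tokens follow in order
TRANSCRIBE = 50359
NO_TIMESTAMPS = 50363

# First entries of Whisper's language list in vocabulary order,
# truncated after Arabic for brevity.
LANGUAGES = ["en", "zh", "de", "es", "ru", "ko", "fr",
             "ja", "pt", "tr", "pl", "ca", "nl", "ar"]

SAMPLE_RATE = 16000   # Hz
HOP_LENGTH = 160      # samples between consecutive mel frames
CHUNK_SECONDS = 30    # Whisper pads or trims audio to this length


def decoder_prompt(language: str) -> list[int]:
    """Build the forced prompt: start, language, task, no-timestamps."""
    language_id = FIRST_LANGUAGE_ID + LANGUAGES.index(language)
    return [START_OF_TRANSCRIPT, language_id, TRANSCRIBE, NO_TIMESTAMPS]


def mel_frames() -> int:
    """Number of mel frames in one padded 30-second chunk."""
    return CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH


if __name__ == "__main__":
    print(decoder_prompt("ar"))  # [50258, 50272, 50359, 50363]
    print(mel_frames())          # 3000
```

Both values match what the Kotlin snippet passes in, so the prompt tokens and the `(1, 80, 3000)` shape look consistent with the Python run.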

Could you please help with this, or let me know if I am missing something?

Unable to get transcriptions with the Multilingual Whisper-Small Model via Android Studio #513
