Unable to get transcriptions with the Multilingual Whisper-Small Model via Android Studio #513

@Muhsabrys

Hi!

I've been encountering issues with the multilingual Whisper-small model running on ONNX Runtime in my Android application. The app launches successfully on both the Android Studio emulator (Pixel 9) and a physical Samsung A52s device, but it fails to transcribe Arabic speech input. This happens with both live speech recognition and pre-recorded audio files, which I recorded as WAV and converted to raw PCM (16000 Hz).
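
One thing worth ruling out with converted files is the sample format. The following is a minimal sketch, assuming the PCM file holds 16-bit little-endian mono samples and the model input expects 32-bit floats in [-1, 1]; neither assumption is confirmed above, and `pcm16ToFloat` is a hypothetical helper, not part of the sample app.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper: converts raw 16-bit little-endian mono PCM bytes
// into float samples in the range [-1.0, 1.0).
// Assumption: the model's audio input expects normalized 32-bit floats.
fun pcm16ToFloat(pcmBytes: ByteArray): FloatArray {
    require(pcmBytes.size % 2 == 0) { "16-bit PCM needs an even number of bytes" }
    // View the byte array as a sequence of little-endian 16-bit samples.
    val samples = ByteBuffer.wrap(pcmBytes)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer()
    // Scale each sample by 2^15 so that full-scale negative maps to -1.0.
    return FloatArray(samples.remaining()) { i -> samples.get(i) / 32768.0f }
}
```

If the file was written as 16-bit integers but is read as 32-bit floats (or the other way round), the model receives noise, which could produce empty or repetitive output even when the model itself is fine.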

The model works fine in Python when I run the following commands:

      python prepare_whisper_configs.py --model_name openai/whisper-small --no_audio_decoder
      olive run --config whisper_cpu_int8.json
      python3 test_transcription.py --config whisper_cpu_int8.json --task transcribe --audio_path data/arabic.mp3 --language ar

The tiny.en model produced accurate results for English. The multilingual tiny model did output some Arabic characters, but it repeated a single word about a dozen times, and that word was not part of the sentence I had spoken.

I modified SpeechRecognizer.kt to pass Arabic decoder input IDs, an attention mask with the same values used in the Python run, and a logits processor input:

        val nMels: Long = 80
        val nFrames: Long = 3000
        // attention_mask: all zeros, shape (1, nMels, nFrames)
        val attentionMask = IntArray((1 * nMels * nFrames).toInt()) { 0 }
        baseInputs = mapOf(
            "min_length" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
            "max_length" to createIntTensor(env, intArrayOf(200), tensorShape(1)),
            "num_beams" to createIntTensor(env, intArrayOf(2), tensorShape(1)),
            "num_return_sequences" to createIntTensor(env, intArrayOf(1), tensorShape(1)),
            "length_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
            "repetition_penalty" to createFloatTensor(env, floatArrayOf(1.0f), tensorShape(1)),
            "attention_mask" to createIntTensor(env, attentionMask, tensorShape(1, nMels, nFrames)),
            "logits_processor" to createIntTensor(env, intArrayOf(0), tensorShape(1)),
            // decoder_input_ids: <|startoftranscript|>, <|ar|>, <|transcribe|>, <|notimestamps|>
            "decoder_input_ids" to createIntTensor(env, intArrayOf(50258, 50272, 50359, 50363), tensorShape(1, 4)),
        )
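
The four decoder input IDs above follow the usual Whisper prompt layout: start of transcript, language, task, no timestamps. As a rough sketch, the prompt can be built from a language code instead of hard-coding the numbers. The token IDs below are assumed from the multilingual Whisper tokenizer (sizes tiny through large-v2) and should be checked against the tokenizer of the exported model; `whisperPromptIds` and `LANGUAGE_TOKENS` are hypothetical names, and only two languages are listed.

```kotlin
// Special-token IDs, assumed from the multilingual Whisper tokenizer
// (tiny through large-v2). Verify against the exported model's tokenizer.
const val START_OF_TRANSCRIPT = 50258
const val TRANSLATE = 50358
const val TRANSCRIBE = 50359
const val NO_TIMESTAMPS = 50363

// Partial language table for illustration only.
val LANGUAGE_TOKENS = mapOf(
    "en" to 50259,
    "ar" to 50272,
)

// Hypothetical helper: builds the 4-token decoder prompt for a language and task.
fun whisperPromptIds(language: String, translate: Boolean = false): IntArray {
    val languageToken = LANGUAGE_TOKENS[language]
        ?: throw IllegalArgumentException("No token ID listed for language '$language'")
    val taskToken = if (translate) TRANSLATE else TRANSCRIBE
    return intArrayOf(START_OF_TRANSCRIPT, languageToken, taskToken, NO_TIMESTAMPS)
}
```

With this table, `whisperPromptIds("ar")` yields the same four IDs passed as `decoder_input_ids` above, so the values in the snippet match the Arabic transcription prompt under these assumptions.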

Could you please help with that or let me know if I am missing something?
