Weird compressed audio while reproducing Chat Completion's AudioOutput generated AudioClip. Am I missing something? #358

HunterProduction · 2025-04-09T16:12:59Z

Hi!

Thank you for this super dense package. Today I was testing the Chat Completion audio features with GPT 4o mini audio. When I play the returned AudioClip stored in the Message, the playback sounds kind of compressed or "crunched", as if the clip was not created (or played back) correctly.

Here are the main snippets that handle the chat request.

public async void Respond()
{
    var message = inputTextField.text;
    onRequestReceived?.Invoke(message);
    var (answer, audioClip) = await _languageProcessor.RespondTo("User", message);
    onRequestCompleted?.Invoke(answer);

    answerTextField.text = answer;
    audioSource.clip = audioClip;
    audioSource.Play();
}

public async Task<(string, AudioClip)> RespondTo(string speakerName, string message, AudioFormat audioFormat = AudioFormat.Pcm16)
{
    Conversation.AppendMessage(new Message(Role.User, message, speakerName));
    var chatRequest = new ChatRequest(Conversation.Messages, model: Model, audioConfig: new AudioConfig(Voice, audioFormat));
    var response = await PerformChatCompletion(chatRequest);
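    // AudioOutput may be null when no audio came back (e.g. the empty fallback message), so guard the access.
    return (response, response.AudioOutput?.AudioClip);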
}

protected virtual async Task<Message> PerformChatCompletion(ChatRequest chatRequest)
{
    if (!Authenticated)
    {
        Log($"[{GetType().Name}] LanguageProcessor not ready");
        return new Message(Role.Assistant, "");
    }

    Log($"[{GetType().Name}] Responding...");

    var start = DateTime.Now;
    var response = await Api.ChatEndpoint.GetCompletionAsync(chatRequest);
    var latency = DateTime.Now.Subtract(start).TotalMilliseconds;

    var choice = response.FirstChoice.Message;

    Log($"[{GetType().Name}] Request latency: {latency:0.0}ms | Finish reason: {response.FirstChoice.FinishReason}\nResponse: {choice}");

    // Add new response to the history
    Conversation.AppendMessage(choice);

    return choice;
}

As you can see, I left the audio format at the default, PCM (I tried MP3, but the result was only noise, with warnings in the console). On startup, my AudioSettings.outputSampleRate is 48000.

Am I missing something? I took a look at your two samples, but one is for Realtime models, and the Chat one uses the SpeechGeneration endpoint to generate audio.

Bonus question: is it possible to handle the audio response from the ChatCompletion endpoint when using StreamCompletionAsync? I wonder whether you can stream not only the text response but also the audio using ChatCompletion, without using the Realtime websocket.

Answered by StephenHodgson

Apr 13, 2025

@HunterProduction should be fixed in


StephenHodgson · 2025-04-09T17:03:41Z

StephenHodgson
Apr 9, 2025
Maintainer

Thanks for the kind words!

> As you can see I left the audio format to default PCM (I tried using MP3 but the result was only noise with warnings in the console). I can tell you that on startup my AudioSettings.outputSampleRate is 48000.

Yes, I made a deliberate decision to only support PCM, for a number of reasons. Mainly, Unity doesn't support MP3 streaming on all build platform targets. Working with the audio system in Unity can be quite challenging. I ended up writing my own StreamAudioSource, which should be used in most instances for playback.

> is it possible to manage audio response in ChatCompletion endpoint even using StreamCompletionAsync?

Yes, but I don't believe that is what I'm doing by default in my Chat demo scene.
If I get some free time, maybe I'll update this sample scene to use the audio modality.
But I do have examples of how it is achieved in the chat unit tests.

> I wonder if you can stream not only the text response but also the audio using ChatCompletion

Yes, but currently only with models that support speech.

com.openai.unity/OpenAI/Packages/com.openai.unity/Tests/TestFixture_04_Chat.cs

Lines 137 to 144 in f98b0c0

    
var chatRequest = new ChatRequest(messages, Model.GPT4oAudio, audioConfig: Voice.Alloy);
Assert.IsNotNull(chatRequest);
Assert.IsNotNull(chatRequest.AudioConfig);
Assert.AreEqual(Model.GPT4oAudio.Id, chatRequest.Model);
Assert.AreEqual(Voice.Alloy.Id, chatRequest.AudioConfig.Voice);
Assert.AreEqual(AudioFormat.Pcm16, chatRequest.AudioConfig.Format);
Assert.AreEqual(Modality.Text | Modality.Audio, chatRequest.Modalities);
var response = await OpenAIClient.ChatEndpoint.StreamCompletionAsync(chatRequest, Assert.IsNotNull, true);

5 replies

StephenHodgson Apr 10, 2025
Maintainer

@HunterProduction unfortunately, that is because OpenAI returns the voice at 16 kHz or 24 kHz, which is why you are hearing this poor audio quality. I simply upsample it to the playback sample rate that Unity expects.
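
To put numbers on the sample-rate gap, here is a minimal sketch of the arithmetic, assuming mono 16-bit PCM at 24 kHz and the 48000 Hz output rate mentioned earlier in the thread. `PcmMath` and its method names are hypothetical helpers for illustration only; they are not part of the package.

```csharp
using System;

// Hypothetical helper for illustration; not part of com.openai.unity.
public static class PcmMath
{
    // Number of sample frames (one sample per channel) in a raw PCM buffer.
    public static int FrameCount(int byteCount, int bitDepth, int channels)
    {
        int bytesPerFrame = (bitDepth / 8) * channels;
        return byteCount / bytesPerFrame;
    }

    // Playback duration in seconds when the frames are played at sampleRate.
    public static double DurationSeconds(int frameCount, int sampleRate)
    {
        return (double)frameCount / sampleRate;
    }

    // Frames needed to keep the same duration at a different sample rate.
    public static int ResampledFrameCount(int frameCount, int sourceRate, int targetRate)
    {
        return (int)((long)frameCount * targetRate / sourceRate);
    }
}
```

Under these assumptions, 96,000 bytes of audio is 48,000 frames, or 2 seconds at 24 kHz; resampled to 48 kHz it becomes 96,000 frames with the same 2-second duration. Upsampling adds samples but no detail above the 12 kHz Nyquist limit of the 24 kHz source, so the clip can only sound as good as the source allows.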

It also heavily depends on the headphones/audio output device you're using. For example, I'm on Windows, and my Bluetooth headphones switch from stereo quality to call quality when the microphone is enabled. One way to improve this is to turn off/disable the microphone during audio playback.

more info

HunterProduction Apr 11, 2025
Author

Thanks for the added details.

This morning I did a bit more testing, and I tried manually converting the AudioData retrieved from the ChatCompletion into an AudioClip. I also found that resampling the audio data may not be needed at all, since a Unity AudioSource should also be able to play a 24 kHz clip.

Anyway, here is the snippet of my implementation, which seems to produce a cleaner playback compared to retrieving AudioOutput.AudioClip:

public AudioClip CreateAudioClipFromRawPCM(byte[] audioData, int sampleRate, int channels, int bitDepth)
{
    float[] samples = ConvertByteArrayToSamples(audioData, bitDepth);
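    // 24000 is the assumed source rate of the returned audio; adjust it if the source differs.
    float[] resampledData = ResampleAudioData(samples, 24000, sampleRate);

    // AudioClip.Create expects the length in sample frames per channel, not the total number of floats.
    AudioClip audioClip = AudioClip.Create("RawPCMClip", resampledData.Length / channels, channels, sampleRate, false);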
    audioClip.SetData(resampledData, 0);

    return audioClip;
}

private float[] ConvertByteArrayToSamples(byte[] audioData, int bitDepth)
{
    int sampleCount = audioData.Length / (bitDepth / 8);
    float[] samples = new float[sampleCount];

    for (int i = 0; i < sampleCount; i++)
    {
        samples[i] = ConvertByteToSample(audioData, i, bitDepth);
    }

    return samples;
}

private float ConvertByteToSample(byte[] audioData, int index, int bitDepth)
{
    switch (bitDepth)
    {
        case 8:
            return audioData[index] / 128f - 1f;  // 8-bit PCM normalized to [-1, 1]
        case 16:
            short sample16 = BitConverter.ToInt16(audioData, index * 2);
            return sample16 / 32768f;  // 16-bit PCM normalized to [-1, 1]
        case 24:
            // Pack the three little-endian bytes into the top of an int, then shift right
            // arithmetically so that negative samples are sign-extended.
            int offset24 = index * 3;
            int sample24 = ((audioData[offset24] << 8) | (audioData[offset24 + 1] << 16) | (audioData[offset24 + 2] << 24)) >> 8;
            return sample24 / 8388608f;  // 24-bit PCM normalized to [-1, 1]
        case 32:
            int sample32 = BitConverter.ToInt32(audioData, index * 4);
            return sample32 / 2147483648f;  // 32-bit PCM normalized to [-1, 1]
        default:
            throw new ArgumentException("Unsupported bit depth.");
    }
}

private float[] ResampleAudioData(float[] audioData, int originalSampleRate, int targetSampleRate)
{
    float resampleRatio = (float)targetSampleRate / originalSampleRate;
    int newSampleCount = (int)(audioData.Length * resampleRatio);
    float[] resampledData = new float[newSampleCount];

    for (int i = 0; i < newSampleCount; i++)
    {
        float originalSampleIndex = i / resampleRatio;
        int originalIndexFloor = Mathf.FloorToInt(originalSampleIndex);
        int originalIndexCeil = Mathf.CeilToInt(originalSampleIndex);

        originalIndexFloor = Mathf.Clamp(originalIndexFloor, 0, audioData.Length - 1);
        originalIndexCeil = Mathf.Clamp(originalIndexCeil, 0, audioData.Length - 1);

        float sampleFloor = audioData[originalIndexFloor];
        float sampleCeil = audioData[originalIndexCeil];

        float interpolationFactor = originalSampleIndex - originalIndexFloor;
        resampledData[i] = Mathf.Lerp(sampleFloor, sampleCeil, interpolationFactor);
    }

    return resampledData;
}

I use these methods by simply calling CreateAudioClipFromRawPCM and passing it the array response.AudioOutput.AudioData.ToArray(). I left the resampling method in just for completeness, but I also tried generating the audio clip at 24 kHz and the result sounds the same.

I am not an expert in raw audio handling, so I can't really tell why this seems to produce slightly better audio playback.
I'd like to hear your feedback on this!
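
As a point of comparison for the conversion code above, here is a minimal, Unity-independent sketch of decoding signed PCM bytes into floats, including sign extension for 24-bit samples. `PcmDecoder` is a hypothetical name, and the sketch assumes little-endian signed PCM; it is not the package's implementation.

```csharp
using System;

// Hypothetical helper for illustration; assumes little-endian signed PCM.
public static class PcmDecoder
{
    // Decodes 16-bit PCM into floats in the range [-1, 1).
    public static float[] Decode16(byte[] data)
    {
        var samples = new float[data.Length / 2];
        for (int i = 0; i < samples.Length; i++)
        {
            // Assemble the sample byte by byte so the result does not depend on host endianness.
            short value = unchecked((short)(data[2 * i] | (data[2 * i + 1] << 8)));
            samples[i] = value / 32768f;
        }
        return samples;
    }

    // Decodes 24-bit PCM into floats in the range [-1, 1).
    public static float[] Decode24(byte[] data)
    {
        var samples = new float[data.Length / 3];
        for (int i = 0; i < samples.Length; i++)
        {
            // Place the three bytes in the top 24 bits of an int, then shift right
            // arithmetically so the sign bit is extended into the upper byte.
            int value = ((data[3 * i] << 8) | (data[3 * i + 1] << 16) | (data[3 * i + 2] << 24)) >> 8;
            samples[i] = value / 8388608f;
        }
        return samples;
    }
}
```

In Unity, the resulting array could then be handed to AudioClip.Create and SetData at the source rate (24000 in this thread), as the snippet above does after resampling.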

StephenHodgson Apr 11, 2025
Maintainer

Yes, we're already resampling the audio using a base Audio Utilities library. I'll cross-reference your implementation and tweak accordingly if I notice any improvements. Thanks!

StephenHodgson Apr 13, 2025
Maintainer

@HunterProduction should be fixed in

Answer selected by StephenHodgson
