MMAudio from Sony AI Full Tutorial - Open Source AI Audio Generator for Videos, Images and Text #111

FurkanGozukara · 2025-10-16T22:20:01Z

Full tutorial: https://www.youtube.com/watch?v=504f8S4MLTw

MMAudio is currently the state-of-the-art (SOTA) open-source, free-to-use AI model for generating sound for videos, images, and text prompts. Its output is high quality and very useful for creating sound effects for your AI videos, game assets, or any project that needs specific or free sound effects. In this step-by-step tutorial I show you how to install and use the model on your Windows computer with a 1-click installer and an easy-to-use Gradio app. My app and installer support RTX 5000 series GPUs as well as older GPUs. I am also sharing scripts for 1-click installation on cloud services such as RunPod and Massed Compute, plus a free Kaggle account notebook. Enjoy.

🔗 Full Instructions, Configs, Installers, Information and Links Shared Post (the one used in the tutorial) ⤵️

▶️ https://www.patreon.com/posts/click-to-open-post-used-in-tutorial-117990364

🔗 Mandatory Requirements Tutorial⤵️

▶️ https://youtu.be/DrhUHnYfwC0

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

MMAudio generates synchronized audio given video and/or text inputs. Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets. Moreover, a synchronization module aligns the generated audio with the video frames.

00:00:00 Introduction to MMAudio: State-of-the-Art AI Audio Generation Model

00:00:06 Exploring MMAudio's Versatility: Generating Audio from Video, Text, and Images

00:00:23 Demonstrating Video to Audio Functionality and Initial Prompting Concepts

00:00:45 Showcasing AI Generated Video Examples with Impressive Audio Quality Matching

00:01:01 Highlighting Perfect Audio Synchronization with Input Video Content: Mind-blowing Results

00:01:17 Illustrating Realistic Video Audio Generation Capabilities with MMAudio for Enhanced Immersion

00:01:31 Example of Image Upload and Automatic Audio Generation Based on Visual Input

00:01:42 Text Prompt to Audio Generation Demonstration: Creating Soundscapes from Written Descriptions

00:02:06 Tutorial Roadmap: Step-by-Step Guide for Local Windows and Cloud Installation Options

00:02:47 Accessing Instruction Post & Downloading the Latest MMAudio Installer Zip File - Quick Guide

00:03:10 Understanding System Requirements and Performing One-Time Mandatory Setup for AI Applications

00:03:28 Detailed Installation Process: Extracting Zip & Running Windows Install.bat Script Locally

00:04:00 Clarifying Gradio Application Compatibility and Supported GPU Series (RTX 5000, 4000, 3000, etc.)

00:04:24 Verifying Installation Completion, Checking for Errors, and Troubleshooting with Log Files

00:04:41 Launching MMAudio: Running Start App.bat and Selecting GPU Option (Above/Below 8GB VRAM)

00:05:03 Observing Initial Model Download Process and First Look at the MMAudio User Interface

00:05:19 Navigating the Interface: Configuration Settings and Exploring Video to Audio Features

00:05:30 Video to Audio Demonstration: Generating Ambient Sound Directly from Video Content Without Prompts

00:06:21 Leveraging Google AI Studio for Advanced Prompt Engineering and Enhanced Audio Generation

00:07:04 Generating Multiple Audio Variations and Adjusting Key Parameters like Steps & Guidance Strength

00:08:18 In-depth Explanation and Demonstration of Batch Processing for Efficient Video to Audio Conversion

00:09:18 Understanding Batch Processing Logic: Defining Prompts Per Video and Output Folder Configuration

00:10:41 Text to Audio Functionality Deep Dive: Generating Diverse Audio Files Solely from Text Prompts

00:11:52 Streamlining Workflow with Batch Processing for Text to Audio: Generating Multiple Prompts at Once

00:12:50 Image to Audio Functionality Showcase: Generating Contextual Audio Based on Uploaded Images

00:13:31 Optimizing Image to Audio Results with Effective Prompting Techniques for Targeted Sound Design

00:14:02 Step-by-Step Guide to Batch Processing for Image to Audio: Automating Audio Generation for Multiple Images

00:14:48 Mastering Configuration Settings: Saving, Loading, and Resetting Custom Parameter Presets

00:15:27 Live Speed Comparison: Analyzing Performance Differences Between RTX 5090 and 3090 Ti GPUs

00:17:50 Cloud Service Installation Tutorial: Massed Compute, Runpod, and Free Kaggle Account Setup

00:19:29 Kaggle Setup Walkthrough: Importing Notebook, Running the App, and Downloading Generated Files as Zip

00:20:18 Exploring Patreon Exclusive Content, Discord Community, GitHub Repository, Reddit, and LinkedIn Links

Song: Robin Hustin x TobiMorrow - Light It Up (feat. Jex) [NCS Release]

Music provided by NoCopyrightSounds

Free Download/Stream: http://ncs.io/LightItUp

Watch: http://youtu.be/bdE_SyHad90

Song: Dirty Palm - Freakshow (feat. LexBlaze) [NCS Release]

Song: TULE - Lost [NCS Release]

Song: NIVIRO - The Ghost [NCS Release]

Song: Unknown Brain - Superhero (feat. Chris Linton) [NCS]

Song: Cartoon, Jéja - Why We Lose (feat. Coleman Trapp) [NCS]

Song: Egzod, Maestro Chives, Neoni - Royalty [NCS]

Video Transcription

00:00:00 Greetings everyone. Today, I am going to introduce you to the most advanced state-of-the-art audio
00:00:06 generation model, MMAudio. What this model does is generate audio
00:00:13 according to the input video, and it is also able to generate audio from text
00:00:17 and from an input image. So let me show you some of the examples and let's
00:00:23 begin this tutorial. As you have seen, I have uploaded this video and I have entered a prompt,
00:00:34 and according to the video content and the prompt, it has generated this video. I will
00:00:39 show you how to write good prompts. Let's see several other examples. This is the second
00:00:45 example. This is another amazing AI-generated video, and let's see the audio it generated.
00:01:01 As you can see, the audio perfectly matches the input video and it is just mind-blowing.
00:01:07 It is even better with more realistic videos. Let's see how it generates. So the audio of this
00:01:17 video was generated with MMAudio. Let me show you another example. Just purely amazing.
00:01:31 And let's see an example of image upload and audio generation for that image. Just purely amazing.
00:01:42 Finally, it is also able to generate audio from just text prompts. Let's see an example. Just
00:01:54 amazing. You see this is the prompt and this was the generation. Let's see another example.
00:02:06 Just purely amazing. With a simple prompt and amazing output. So in this tutorial,
00:02:11 I will show you how to install and run this application locally on your Windows computer
00:02:17 with full speed mode and also VRAM optimized mode. Moreover, I will show it on Massed Compute,
00:02:24 my favorite and most affordable cloud platform, on Runpod, which is a favorite of a lot of people,
00:02:31 and on a free Kaggle account for anyone who is GPU poor and doesn't want to pay any money to any cloud
00:02:36 services. The application I have developed is extremely advanced. I will show every feature of
00:02:42 it. So watch the Windows tutorial part, then you can run it either locally or on a cloud service
00:02:47 that you would like. So as usual, I have prepared an amazing instruction post. The link to this post
00:02:53 will be in the description of the video below. So check it out. Once you open this post,
00:02:59 you will see the latest installer zip file. I recommend that you read this post from top to
00:03:04 bottom. So click the latest zip file to download it. Before starting the installation, make sure that
00:03:10 you meet the requirements. These requirements are a one-time setup. Once you follow them, you will
00:03:15 be able to use all of the AI applications that I show you, that I teach you, that I share. So this
00:03:22 is a one-time, mandatory step. Once you have followed the requirements, move the downloaded
00:03:28 zip file onto any disk where you want to install it. Let's install it on my E drive and extract. Do
00:03:34 not install it into a shared drive like OneDrive or Google Drive. So install it on one of
00:03:41 your internal drives. Then double click the Windows install.bat file, more info, run anyway. Do not
00:03:47 run anything as administrator. Always use double click or select and hit enter. This will download
00:03:54 and install everything automatically for you. Just wait for the installation to complete. This
00:04:00 Gradio application that I have developed works on the RTX 5000 series as well as with
00:04:06 older GPUs like the RTX 4000 series, 3000 series, and probably the 2000 and 1000 series. I haven't tested those,
00:04:13 but they should work. It is working from the 3000 series and above, and it should work with the
00:04:18 2000 series and older GPUs as well. When you see that the virtual environment was made and installed properly,
00:04:24 the installation has been completed. Scroll up and see if there are any errors or not. If you
00:04:29 see any errors, select all of the logs, copy into a text file and just email me,
00:04:35 message me from Patreon or from Discord. So I can see your error. So just click anything and it will
00:04:41 close. Once installed, you are ready to run it. So double click the Windows start App.bat file, more info,
00:04:48 run anyway and it will start the application. If your GPU has more than 8 GB of VRAM, select option
00:04:53 one. If your GPU has less than 8 GB of VRAM, select option two. Select it according to your GPU. Wait for
00:04:59 the application to start. Initially, it will download the models, then it will start the application.
00:05:03 You see it is downloading the missing models right now. They will be all downloaded into this folder.
00:05:09 This is one time only download, so it will not download again once they are downloaded. Okay,
00:05:14 so the application has been started and this is our interface. First of all,
00:05:19 it will generate the default configuration and load it, but you can always save your configs
00:05:24 and load them. It has 3 options, video to audio, text to audio and image to audio. So
00:05:30 let's begin with video to audio. Click here and select your input video. You can use any video;
00:05:37 it doesn't matter if the video already has sound. So let's try this video for example.
00:05:42 Let me play it. There is no sound. This is downloaded from CivitAI. So how should you
00:05:47 prompt this to generate? First of all, you don't need to prompt. It can already generate according
00:05:53 to the video. Let's try. Submit. You can see the status on the started CMD window always. This is
00:05:59 running on RTX 5090 right now. It is already very fast. As you see, it is showing the speed and
00:06:05 everything and it is generated. So you can click here to play it and it will start playing. Okay,
00:06:15 so this is the audio it generated for this video without any prompt. So how should you prompt it? I
00:06:21 really recommend using Google AI Studio. This is an amazing free-to-use AI right now. So select the
00:06:29 Flash thinking experimental model. I like this one. Click the plus icon, upload file. You can use this for
00:06:36 free and upload your video. Then you can use this prompt that I have prepared for you. Paste it,
00:06:42 wait for the upload to complete. You see, it is currently extracting the video. If your video
00:06:47 is too big, it may fail. And it is uploaded. It is only using 1,300 tokens. Click run. Then this
00:06:54 will give us a very good prompt to generate audio with our model, and it has given us this prompt.
00:07:01 So let's copy this prompt. Let's copy and paste it here. Now you can generate multiple videos at
00:07:07 once. How can you do that? You see we have number of generations. Let's generate 10 videos. So it
00:07:13 will use different seeds and generate 10 audio tracks for this video. You can also change the number of
00:07:19 steps from here. The recommended value is 25, but since it is really fast, I use 50. It will set the duration according
00:07:26 to your input video duration, and you can also change the guidance strength. It controls how
00:07:32 closely the prompt is followed. Moreover, when you enable save generation parameters,
00:07:38 it will save all the used parameters. I will show you. So let's hit submit and the generation has
00:07:44 been started. You see it is truncating video to generate accordingly and you will see the progress
00:07:51 like this. I have spent a lot of time to make this application perfect. It shows the step speed,
00:07:56 it shows the generation status, the generation speed, the time it is going to take. So currently
00:08:03 it is generating 10 videos and it is taking around one minute to generate 10 different audios for
00:08:08 this input video. By the way, this is not the maximum speed because currently I am recording,
00:08:13 but it is really, really fast and it is also fast on the RTX 3090 Ti. I will make a
00:08:18 comparison to show you. And the 10 videos have been generated. You can see all of them are displayed
00:08:24 here. So you can click like this, then you can select videos to play them. Let me show you.
00:08:43 So you can generate more audios until you find whatever you like most. This is so simple. Every
00:08:50 generation will be saved inside outputs folder. Click this open outputs folder and you see this is
00:08:56 where I have installed and it is inside outputs folder like this. So let's see the videos. So
00:09:02 these are the videos, the audios we have generated and when you open the params, it will show you all
00:09:08 of the used parameters like this, prompt, negative prompt, used seed, number of steps and whatever
00:09:13 there are. We also have a batch processing feature. Batch processing is so simple,
00:09:18 let me demonstrate. So put the videos into a folder like this. You can define a separate prompt
00:09:25 for every video or you can just use the prompt written here. It is up to you. So how can you
00:09:31 define a prompt for each one? Create a text file like this, 1. This will match
00:09:37 the first video, and type a prompt like test 1. Then create another text file for the second
00:09:43 video, like 2, and this will match the second video. Let's not create any prompt for
00:09:49 the third one, and let's make this the third video like this, and it should use it. Then enter
00:09:54 your input folder like this and enter your output folder. Let's say B audio. And then click start
00:10:01 processing. You can also skip if existing and you can save generation parameters. Then follow what
00:10:07 is happening in the CMD window, but you need to be careful with something. When you have number
00:10:12 of generations here, it will still apply to your batch processing. So it is currently going
00:10:19 to generate 10 videos for every video. So let's open our target desired folder, which is inside
00:10:26 here and we can see that it is batch processing every video. Let's open the parameters. So we can
00:10:31 see it is using test 1. This is amazing. And the second video will be also test 1. However,
00:10:36 it is using different seeds. So this is how batch processing works. This is the logic of it. Let's
00:10:41 cancel the batch processing. The batch processing should be canceled. Then let's make the number of
00:10:46 generations 1 and I will start batch processing again. Now it should start generating one time
00:10:52 for every video and you can watch the progress here. By the way, the batch processing will be
00:10:58 canceled once these 10 videos are done. So always pay attention to the number of generations. And
00:11:04 currently it is really slow because I am using a lot of VRAM with the other applications. So let
00:11:08 me close everything and restart. So what about text to audio? For text to audio, click here. It
00:11:13 is just taking prompts to generate audio files. I already have two examples here. For example,
00:11:20 this one, just type it. How many do you want to generate with the same settings? Let's generate 3
00:11:25 and click submit. It will generate 3 audio files with this prompt. You can always monitor
00:11:31 what is happening on CMD, remember that. You see I did restart and closed some of the other
00:11:35 applications and now it is faster. Let's play some of them. Yeah, it is really great. You can click
00:11:46 here and download or you can open outputs folder and you will see they are generated like this,
00:11:52 12 MP3, 13 MP3 or 14 MP3. This is also supporting batch processing. For batch processing, you need
00:11:59 to type prompts here. So let's copy these two and type here. You can also set an output folder for
00:12:06 the batch processing of the text to audio. Let's generate them inside here and let's say 3 and
00:12:12 start batch processing. It will skip a line if it is empty or shorter than 2 characters. So don't
00:12:18 worry about that and now it is generating. We will see the results. Yes, it is completed.
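
The skip rule mentioned for text-to-audio batch processing (empty lines and lines shorter than 2 characters are ignored) can be sketched in Python as below. This is a minimal illustration, not the app's actual code: the function name `usable_prompts`, the `min_chars` parameter, and the whitespace stripping are all assumptions.

```python
def usable_prompts(text: str, min_chars: int = 2) -> list[str]:
    """Return one prompt per usable line of a multi-line prompt box.

    Hypothetical helper: whether the real app strips whitespace before
    measuring the length is an assumption made for this sketch.
    """
    prompts = []
    for line in text.splitlines():
        prompt = line.strip()
        # Skip empty lines and lines shorter than min_chars characters.
        if len(prompt) < min_chars:
            continue
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    box = "rain on a tin roof\n\nx\n   \ndog barking in the distance"
    for p in usable_prompts(box):
        print(p)
```

With this rule, stray blank lines or single characters left in the prompt box do not trigger a generation; only the remaining lines are processed, each as its own prompt.
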
00:12:24 Let's open the folder inside here. Oh, by the way, it is still using number of generations 3. So
00:12:30 I always forget that, so don't forget it. I can see that it is generating. You can also
00:12:35 see the used parameters, the prompt, the negative prompt, used seed and everything and it is already
00:12:41 done. It is super fast. This application is just super fast. Just double click and play. Okay,
00:12:50 it is working as expected. So what about image to audio? Image to audio generates audio according to
00:12:57 the image. Let's upload an image for testing. For example, this image. Now you don't need to type
00:13:02 any prompt. Let's first try that way. Submit. The speed doesn't really change and it is generated.
00:13:08 Let's play it. Okay, not very relevant to the image. So we need to support it with a prompt,
00:13:19 but sometimes it works. For example, let's try another one before I show you. For example,
00:13:23 let's try this one. Submit. Okay, let's play. Yes, this one definitely matches better. So how
00:13:31 can you decide your prompt? Again, you can use this prompt here. So let's clear this chat
00:13:37 and let's upload our image and see what it is giving to us. Okay, let's try this image. Wait
00:13:43 for upload to be completed and run. Okay, it gave us a prompt. So let's try copy it and paste it and
00:13:52 let's see. Let's play it. Okay, it is decent. You can always generate multiple and decide which one
00:14:02 is best. Image to audio also supports batch processing and the logic is exactly the same as in the video
00:14:09 batch. So let me demonstrate. Put the images into a folder and type prompts if you wish, like
00:14:15 this one, let's say a dragon. You don't necessarily need to define one, and the same logic applies,
00:14:22 let's say like this and let's see. Then start batch processing. It will process all the images
00:14:28 and generate audio and generate a video. The video will be static of course. Let's return to
00:14:33 the folder and the files are generated like this. Let's play. As you can see, it is pretty cool. So
00:14:45 this is how you can do batch processing with image to audio. What about configuration? Let's say you
00:14:50 want to change configurations, for example, the duration for text to audio. For example,
00:14:56 like this, 10. Let's say you want to change the image to audio number of steps to 100. Then you need
00:15:01 to go here and set a config name like test 1, save config, and now it is selected and it is
00:15:08 set. So if you want to return to the default, select the default and load config and it will be
00:15:14 default. If you want to regenerate the default, it is so easy. Go to the MMAudio folder
00:15:21 and you will see that there is a configs folder. Just delete it and restart your application and it
00:15:25 will regenerate the default configuration. So now I will show a speed comparison between the RTX 5090 and
00:15:33 RTX 3090 Ti. I have both of the GPUs, so let me demonstrate the speed comparison. Since
00:15:39 recording a video takes a lot of GPU power, I will stop recording and make a comparison,
00:15:46 but before stopping and making that comparison, let's make a live comparison. So I will copy paste
00:15:52 this to set GPU IDs. The first one will start on my first GPU. The second one, I will right
00:15:58 click and edit in Notepad++. And I will set CUDA_VISIBLE_DEVICES to 1. So it will use my second GPU.
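
`CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable. GPU indices start at 0, so a value of `1` exposes only the second GPU to the process. The sketch below shows one way to build such an environment from Python before launching a second instance. The helper name `env_for_gpu` and the commented launch command are hypothetical; the tutorial itself edits the start script in Notepad++ instead.

```python
import os


def env_for_gpu(gpu_index: int, base_env=None) -> dict:
    """Return a copy of the environment that exposes a single GPU.

    Hypothetical helper for illustration. The copy leaves the caller's
    environment untouched, so each launched instance can get its own GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable at startup; "1" means the second GPU.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    second_gpu_env = env_for_gpu(1)
    print(second_gpu_env["CUDA_VISIBLE_DEVICES"])
    # Assumed launch command, shown only as a comment because the real
    # entry point of the app is not given in the tutorial:
    # subprocess.Popen(["python", "app.py"], env=second_gpu_env)
```

Inside a process started this way, the selected card is the only visible CUDA device, so frameworks see it as device 0 even though it is physically the second GPU.
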
00:16:05 Let's start it and again the first option. We can see that it will load the models here. Okay, both
00:16:11 of the applications started, one on each GPU. So let's start a comparison. So let's try this
00:16:17 video on both of them. Upload. Let's get to our prompt. I will get the prompt from here since it
00:16:24 is saved. For example, this one. Type the prompt in both of the audio panels and let's generate 10
00:16:31 audios to see the speed difference. Okay, 10. Now we can see both of the applications will start
00:16:38 generating and we can see the speed comparison. So currently the average of the 5090 is 7 seconds.
00:16:47 The 3090 Ti is like 9 seconds. This application is not able to fully utilize the GPU because it has other
00:16:54 parts that use the CPU, but it is really, really fast when we consider the amount of VRAM it
00:17:02 uses and the time it takes to generate such audio with such amount of steps. I am using 50 steps,
00:17:09 but you can use 25 steps as well. So this is the live comparison. Now I will close the video
00:17:15 recording, make a more fair comparison and show you the results. So I have made the generations
00:17:21 and calculated the average. The average speed of the RTX 5090 is 16.5 it/s and the average
00:17:30 of the 3090 Ti is 13.61 it/s. If we calculate the difference, 16.5 over 13.61, it is around
00:17:40 20% faster for this application, but in some other applications we get 300% speed difference. On this
00:17:48 application, there is not much difference. Now I will show you how you can use this application
00:17:53 on cloud services. Using it on cloud services is so easy. First of all, I will show it on Massed
00:18:00 Compute. For Massed Compute, please register an account with this link. I appreciate that. After
00:18:06 registration, set your billing, deploy a GPU. You can use any GPU simply because this application
00:18:13 is very lightweight, but if RTX A6000 is not available, use L40. Select our image, select
00:18:21 SECourses, use our coupon, verify and deploy. Then you will be able to use it. If you don't know how to
00:18:27 use Massed Compute, you can follow this video for about 5 to 10 minutes and you will learn. Moreover,
00:18:33 after you have downloaded the zip file, you will see there is a Massed Compute instructions txt file.
00:18:39 Always read this file, follow the instructions here and it will work. It follows exactly the same
00:18:45 principle as this video. You see, the link is here. For Runpod, please use this link to
00:18:52 register. After registration, go to billing, add some balance, then go to pods, click
00:18:59 deploy, select your machine. You can use simply any machine. This application is very lightweight.
00:19:05 If you don't know how to use Runpod, please follow this video starting from minute 22. This
00:19:11 one starts from minute 13; watch about 5 to 10 minutes and you will understand. Also you
00:19:16 can always follow the Runpod instructions txt file. This has everything that you need to follow.
00:19:23 Just follow this txt file and you will learn. And what about Kaggle? Kaggle is also extremely simple
00:19:29 to use. Go to kaggle.com, register your account and after registering your account, verify your
00:19:35 phone number. This is mandatory. Then go to create new notebook, click file, import notebook, select
00:19:42 the notebook file we have in our folder, this one and then follow the instructions. The rest is
00:19:48 same. If you don't know how to use Kaggle, we also have a tutorial for that. It is here. Just follow
00:19:53 from 27 minutes 34 seconds. And there is good news: we don't need to use ngrok anymore. It
00:20:01 just works with Gradio live share, so it is much easier than before. Also in the Kaggle notebook,
00:20:06 you will see a script like this, which allows you to download all of the generated videos
00:20:12 and audio files as a zip file. So just follow the tutorial video and you will learn how to use
00:20:18 Kaggle as well. I hope you have enjoyed. Please go to our Patreon exclusive post index. You will see
00:20:24 that we have over 100 AI applications, amazing tutorials. Just read here and you will see our
00:20:31 applications. You will get access to every one of them. Moreover, please join our Discord channel.
00:20:38 We have over 10,000 users. Just click this link to join our Discord channel. Moreover, we have
00:20:44 a Stable Diffusion Generative AI GitHub repository. Please star it, fork it and watch it. If you also
00:20:51 sponsor it, I appreciate that. We have the list of our tutorials here, from the very early
00:20:56 ones to the very latest ones. Moreover, we have a subreddit. Please click here, go to our
00:21:02 subreddit and follow it. I am sharing a lot of useful information there. And finally,
00:21:09 you can follow me on my LinkedIn account. It's a real account. Go there and you will
00:21:13 see everything about me. I am a PhD computer engineer. So you can follow me on LinkedIn as
00:21:19 well. I hope you have enjoyed it. Hopefully, see you in another amazing tutorial video.
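
The per-video prompt logic described at 00:09:18 (a text file per video, with the prompt typed in the interface used when no text file exists) can be sketched as follows. This is an illustration under stated assumptions, not the app's real implementation: matching a video to a `.txt` file with the same file name stem, the extension list in `VIDEO_EXTS`, and the function name `prompts_for_folder` are all hypothetical.

```python
from pathlib import Path

# Assumed set of video extensions; the app's real list is not stated.
VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".webm"}


def prompts_for_folder(folder, default_prompt: str = "") -> dict:
    """Map each video file name in a folder to the prompt it would use.

    Assumption: "1.mp4" is paired with "1.txt" in the same folder.
    Videos without a text file fall back to default_prompt.
    """
    result = {}
    for video in sorted(Path(folder).iterdir()):
        if video.suffix.lower() not in VIDEO_EXTS:
            continue
        sidecar = video.with_suffix(".txt")
        if sidecar.is_file():
            result[video.name] = sidecar.read_text(encoding="utf-8").strip()
        else:
            result[video.name] = default_prompt
    return result


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("1.mp4", "2.mp4", "3.mp4"):
            (root / name).write_bytes(b"")
        (root / "1.txt").write_text("test 1", encoding="utf-8")
        (root / "2.txt").write_text("test 2", encoding="utf-8")
        print(prompts_for_folder(root, default_prompt="prompt from the UI"))
```

As noted in the transcript, the number of generations setting still applies during batch processing, so each video in the mapping would be processed that many times with different seeds.
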
