Best model for isolating dialogues in a film? #605

kaswardy · 2023-06-08T14:07:01Z

kaswardy
Jun 8, 2023

I am using VR-Architecture 1_HP_UVR and it does a good job, but I was wondering if there is another model that is better suited for isolating the vocal dialogues from a film which may have music and sound effects.

I have an RTX 4080, so I can run a heavier model if needed.

Answered by codebespawler

codebespawler · 2023-07-22T18:17:21Z

codebespawler
Jul 22, 2023

I'm not sure if it's the best, but I've been using this ensemble mode recommended by @sabaasa:

MDX-Net: Kim Vocal 1, UVR-MDX-NET inst 3 & UVR-MDX-NET inst main
Demucs: v4: htdemucs_ft

It's done a pretty good job when I've wanted to isolate dialogue from films.

Full thread with more information here: https://github.com/Anjok07/ultimatevocalremovergui/discussions/444#discussioncomment-5313230
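To illustrate the general idea of an ensemble: each model produces its own estimate of the vocal stem, and the estimates are then combined into one output. The sketch below assumes the simplest possible combination, a sample-wise average of equally long mono signals. UVR's own ensemble algorithms may well work differently, and the function name and the toy sample values are hypothetical.

```python
from typing import Sequence


def average_ensemble(stems: Sequence[Sequence[float]]) -> list[float]:
    """Return the sample-wise mean of several equally long mono stems.

    Assumption: a plain average stands in for whatever combination
    rule the real tool applies.
    """
    if not stems:
        raise ValueError("need at least one stem")
    length = len(stems[0])
    if any(len(stem) != length for stem in stems):
        raise ValueError("all stems must have the same number of samples")
    # zip(*stems) walks the stems in lockstep, one sample index at a time.
    return [sum(samples) / len(stems) for samples in zip(*stems)]


# Hypothetical 4-sample vocal estimates from three different models.
estimate_a = [0.5, 0.25, -0.5, 0.0]
estimate_b = [0.25, 0.25, -0.25, 0.0]
estimate_c = [0.75, 0.25, -0.75, 0.0]

combined = average_ensemble([estimate_a, estimate_b, estimate_c])
print(combined)
```

Averaging tends to keep what the models agree on and dampen artifacts that only one model produces, which is one plausible reason an ensemble can sound cleaner than any single model.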

marshalleq · 2023-08-01T04:45:01Z

marshalleq
Aug 1, 2023

I'm finding / hearing that MDX-Net models are not GPU accelerated / very slow. Out of interest, is this your experience too? This is such a fantastic tool; I only just discovered it. I would never have dreamed something could do what this can do. It's like magic.

2 replies

codebespawler Aug 1, 2023

For me a 4:07 song takes around 2:30 to process. It appears to be using my NVIDIA 3050 Ti, but I use Demucs in my ensemble mode, so not sure if it's using the GPU just for that model and not the MDX-Net ones.

RyLeo154 Sep 7, 2023

> I'm finding / hearing that MDX-Net models are not GPU accelerated / very slow. Out of interest, is this your experience too? This is such a fantastic tool; I only just discovered it. I would never have dreamed something could do what this can do. It's like magic.

@marshalleq GPU acceleration on MDX-Net models seems to work just fine on my home desktop's RTX 3060 Ti. All the MDX-Net models I've tested so far (Kim Vocal 2, Inst HQ 1, Inst HQ 2 and Inst HQ 3) seem to properly utilize the GPU, although utilization was sometimes misreported in Task Manager. I tried inferencing a sample file (3m 17s, 128kbps mp3) with MDX-Net Inst HQ 2 on my laptop's GTX 1060 Max-Q, and here are my results:

^ results of GPU-accelerated inferencing, 1m 15s to process the sample file on a GTX 1060 Max-Q
(additional footnote: my 3060 Ti processed 3-minute 320kbps mp3 files in under a minute, I'll update this post later with results when I'm able to use my desktop to process the same sample file later)

^ results of CPU-only inferencing, 9m 26s to process the same sample file on a 4-core i7-7700HQ (took this screenshot after a small delay; the last 50% of the CPU histogram represented CPU utilization during inference)

1m 15s (GPU-accelerated) vs 9m 26s (CPU-only) is a pretty stark difference in processing time, so GPU acceleration is working fine on my end right out of the box, with no additional tweaks needed.
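For anyone who wants to put numbers on that difference, here is a small sketch that turns the timings quoted above into a speedup and a real-time factor. The durations are the ones reported in this post; the helper name is made up for illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds


audio_len = to_seconds(3, 17)  # sample file length
gpu_time = to_seconds(1, 15)   # GTX 1060 Max-Q, GPU-accelerated
cpu_time = to_seconds(9, 26)   # i7-7700HQ, CPU-only

# How many times faster the GPU run finished compared to the CPU run.
speedup = cpu_time / gpu_time

# Real-time factor: seconds of audio processed per second of wall time.
gpu_rtf = audio_len / gpu_time
cpu_rtf = audio_len / cpu_time

print(f"GPU speedup over CPU: {speedup:.1f}x")
print(f"GPU real-time factor: {gpu_rtf:.1f}x")
print(f"CPU real-time factor: {cpu_rtf:.2f}x")
```

With these figures the GPU run is roughly 7.5 times faster and processes audio at about 2.6 times real time, while the CPU run is well below real time.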

kdcyberdude · 2023-08-03T19:26:56Z

kdcyberdude
Aug 3, 2023

@codebespawler @marshalleq MDX-Net models aren't utilising the GPU? Do you have any idea why that might be happening?

PodRED · 2023-09-28T13:17:41Z

PodRED
Sep 28, 2023

As a complete aside, if you have access to the original 5.1 / 7.1 audio, you can basically always grab just the voice by extracting the centre channel track on its own, as nothing else generally gets put in the centre channel.
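One way to try this is with ffmpeg's `pan` audio filter, which can map the front-centre (`FC`) input channel to a mono output. The sketch below only builds and runs the command from Python; it assumes ffmpeg is installed and on the PATH and that the source really has a surround track. The file names and function names are hypothetical.

```python
import subprocess


def center_channel_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that writes only the front-centre channel."""
    return [
        "ffmpeg",
        "-i", src,                 # input with a 5.1 / 7.1 audio track
        "-vn",                     # ignore any video streams
        "-af", "pan=mono|c0=FC",   # mono output fed from the FC channel
        dst,
    ]


def extract_center(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(center_channel_cmd(src, dst), check=True)


# Example (hypothetical file names), not executed here:
# extract_center("film.mkv", "centre.wav")
print(" ".join(center_channel_cmd("film.mkv", "centre.wav")))
```

The resulting mono file can then be fed to a separation model if the centre channel turns out to carry more than just dialogue.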

COOLak · 2023-10-27T22:53:30Z

COOLak
Oct 27, 2023

> As a complete aside, if you have access to the original 5.1 / 7.1 audio, you can basically always grab just the voice by extracting the centre channel track on its own, as nothing else generally gets put in the centre channel.

There's a LOT of other stuff in the center channel. Believe me.

GUUser91 · 2025-07-26T19:56:39Z

GUUser91
Jul 26, 2025

I use the BandIt Plus model via https://github.com/ZFTurbo/Music-Source-Separation-Training to separate dialogue / vocals from background music and sound effects / SFX. If you're willing to pay, there's also the Moises Pro Plan, which does an even better job at separating dialogue. I bought it during a Black Friday sale for $150; the regular price is $300. You can then feed the BandIt Plus and/or Moises Pro Plan output files to https://github.com/resemble-ai/resemble-enhance
Input audio
https://vocaroo.com/1f36TF9tUCmZ
Bandit plus output file
https://vocaroo.com/14vZgiXh308o
Bandit plus output file fed to resemble enhance
https://vocaroo.com/1nh3mGlSl5ez
Moises Pro Plan output file
https://vocaroo.com/1hKFmgaTuXH3

Input audio
https://vocaroo.com/11DPVoQboEJI
Bandit plus output file
https://vocaroo.com/12asQ27WSRct
Moises Pro Plan output file
https://vocaroo.com/13bp0cNetPaA
