This is interesting work, and the task it aims at is as exciting to me as SAM.
However, I am not familiar with audio research, and I have some questions about this work.
First, I checked the dataset, and it does not seem very complete for "sound separation" or "separate anything in audio".
In practice, I tried some samples for separating vocals from songs: no matter whether I used the prompt "Human Sounds" or "Vocal", the model could not separate the vocals, even from a very slow and simple sample of guitar playing with singing. Conversely, when I tried "acoustic guitar", the separated output still contained clearly audible vocals.
Am I misunderstanding the scope, i.e., do "songs" not belong to the kind of music this work is intended to cover?
Second, I would like to ask why this is called a foundation model. It seems that "multimodal or multiple types of inputs = foundation model", because I do not see what it provides for downstream tasks. Can someone share some insights?