I ran this project on the English test audio, and the output looks exactly as expected:
english.txt
But with Chinese and Japanese speech, the punctuation marks were completely removed, a space was inserted between every character, and the speech of multiple speakers was all merged together:
chinese.txt
english.txt
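For reference, the spacing symptom on its own could probably be worked around after the fact. The sketch below is only a hypothetical post-processing helper (`collapse_cjk_spaces` and the Unicode ranges are my own assumptions, not part of this project or of NeMo); it removes whitespace that sits between two CJK characters and leaves spaces between Latin words alone. It does not restore the missing punctuation or fix the speaker assignment.

```python
import re

# Assumed character ranges: CJK punctuation, hiragana/katakana,
# CJK unified ideographs, and fullwidth forms. Not exhaustive.
_CJK = r"\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Whitespace that has a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def collapse_cjk_spaces(text: str) -> str:
    """Remove whitespace between adjacent CJK characters (hypothetical helper)."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(collapse_cjk_spaces("你 好 世 界"))
    print(collapse_cjk_spaces("こ ん に ち は"))
    print(collapse_cjk_spaces("hello world"))
```

This only hides the symptom in the final transcript; the underlying alignment and diarization problems would still need a real fix.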
The ctc-forced-aligner and romanize steps in the project seem to be causing issues, and the NeMo framework is also unable to distinguish the speakers correctly
Do titanet_large and the other pre-trained models used by the NeMo framework only support English?
I want to know how to solve this tricky problem
I am using machine translation, so please bear with any misunderstandings