One or more of the lyric-alignment functions, e.g. _map_words_to_lines in analyzer/align.py, rely on split() to create word entries. This doesn't work for lyrics that contain no spaces, e.g. a typical Japanese lyrics file like this:
{
"lines": [
"何度でも",
"何度でも叫ぶ",
"この暗い夜の怪獣になっても"
]
}
This fails to align because .split() on e.g. 何度でも叫ぶ returns just ["何度でも叫ぶ"], when it should return ["何度", "でも", "叫ぶ"] or ["何度でも", "叫ぶ"], depending on what you consider a word in Japanese.
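To make the failure concrete, here is a minimal reproduction using only plain Python. It does not call anything from analyzer/align.py; it only shows what str.split() does with a spaced line versus an unspaced one.

```python
# str.split() with no arguments splits on runs of whitespace only.
spaced_line = "again and again I shout"
unspaced_line = "何度でも叫ぶ"

spaced_words = spaced_line.split()
unspaced_words = unspaced_line.split()

# The spaced line yields one entry per word.
print(len(spaced_words), spaced_words)

# The unspaced line has no whitespace, so the whole line comes back
# as a single "word" and there is nothing to align word by word.
print(len(unspaced_words), unspaced_words)
```

With only one entry per line, any word-level matching against the transcription has nothing to work with.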
It works (more or less) as expected when the lyrics are spaced: as a test, I overwrote the lyrics file with a spaced version before the app read it, and the lyrics appeared.
Perhaps the simplest solution would be to use something like https://github.com/polm/fugashi to tokenize the lyrics file and add spaces where necessary before aligning? There might be a nicer approach that works well with Whisper, though: tokenizing instead of splitting whenever the language does not use spaces.
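A rough sketch of the "add spaces before aligning" idea, in case it helps. The helper names (space_lines, make_fugashi_tokenizer) are hypothetical and not part of the project. Treating "the line has no whitespace" as the trigger for tokenizing is my own assumption; a real fix would probably key off the detected language instead. The tokenizer is passed in as a callable so the spacing logic can be tried without fugashi installed.

```python
from typing import Callable, Iterable, List


def space_lines(lines: Iterable[str],
                tokenize: Callable[[str], List[str]]) -> List[str]:
    """Insert spaces between tokens in lines that contain no whitespace.

    Lines that already contain whitespace are returned unchanged.
    (Heuristic assumption: an unspaced line needs tokenizing.)
    """
    spaced = []
    for line in lines:
        stripped = line.strip()
        if any(ch.isspace() for ch in stripped):
            spaced.append(line)
        else:
            spaced.append(" ".join(tokenize(stripped)))
    return spaced


def make_fugashi_tokenizer() -> Callable[[str], List[str]]:
    """Build a tokenizer backed by fugashi (third-party, not stdlib).

    Needs `pip install fugashi` plus a dictionary such as unidic-lite.
    Imported lazily so the rest of this sketch runs without it.
    """
    from fugashi import Tagger

    tagger = Tagger()
    return lambda text: [word.surface for word in tagger(text)]


# Demo with a per-character fallback tokenizer (list) so it runs anywhere.
# With fugashi installed, pass make_fugashi_tokenizer() instead.
demo = space_lines(["何度でも叫ぶ", "hello world"], tokenize=list)
print(demo)
```

After this step the existing split()-based code could stay as it is, since every line would then contain spaces. The exact token boundaries with fugashi depend on the dictionary, so I have not listed an expected segmentation here.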