Lyrics alignment broken for scripts that don't use spaces between words #15

@crnhrv

One or more lyric-alignment functions, e.g. _map_words_to_lines in analyzer/align.py, rely on split() to create word entries. This doesn't work for lyrics written without spaces, e.g. a typical Japanese lyrics file like this:

{
  "lines": [
    "何度でも",
    "何度でも叫ぶ",
    "この暗い夜の怪獣になっても"
  ]
}

This fails to align because .split() on e.g. 何度でも叫ぶ returns just ["何度でも叫ぶ"], when it should return ["何度", "でも", "叫ぶ"] or ["何度でも", "叫ぶ"], depending on what you consider a word in Japanese.
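A minimal reproduction of the behaviour in plain Python, independent of the app. The variable names here are only for illustration and are not taken from analyzer/align.py:

```python
# str.split() with no arguments splits on whitespace only, so a line
# written without spaces comes back as a single "word".
line = "何度でも叫ぶ"
words = line.split()
print(words)  # ['何度でも叫ぶ']

# The same line with spaces inserted by hand splits into three entries.
spaced_line = "何度 でも 叫ぶ"
spaced_words = spaced_line.split()
print(len(spaced_words))  # 3
```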

It works (kind of) as expected when the lyrics are spaced: as a test, I overwrote the lyrics file with a spaced version before the app read it, and the lyrics appeared.

Perhaps the simplest solution would be to use something like https://github.com/polm/fugashi to tokenize lyric files and add spaces where necessary before aligning? There might be a nicer way that works well with whisper, though: tokenizing instead of splitting when the language is non-spaced.
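A rough sketch of the pre-spacing idea, to show the shape of it rather than a finished patch. The helper names (looks_unspaced, pre_space_lines, fugashi_tokenize) are hypothetical and do not exist in the repo, and the script check is a crude heuristic that only covers kana and common CJK ideographs. The fugashi part assumes fugashi is installed together with a dictionary such as unidic-lite; the tokenizer is passed in as a plain callable so something else could be swapped in:

```python
import re
from functools import lru_cache
from typing import Callable, List

# Hiragana, Katakana, and CJK Unified Ideographs. A rough heuristic only.
_UNSPACED_SCRIPT = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def looks_unspaced(line: str) -> bool:
    """True if the line has no whitespace-separated words but contains kana/kanji."""
    return len(line.split()) == 1 and bool(_UNSPACED_SCRIPT.search(line))


@lru_cache(maxsize=1)
def _tagger():
    # Imported lazily so the rest of the module works without fugashi installed.
    from fugashi import Tagger
    return Tagger()


def fugashi_tokenize(line: str) -> List[str]:
    """Split a line into surface forms using fugashi (MeCab wrapper)."""
    return [word.surface for word in _tagger()(line)]


def pre_space_lines(lines: List[str], tokenize: Callable[[str], List[str]]) -> List[str]:
    """Insert spaces between tokens on unspaced lines; leave other lines alone."""
    return [
        " ".join(tokenize(line)) if looks_unspaced(line) else line
        for line in lines
    ]
```

With this, something like pre_space_lines(lyrics["lines"], fugashi_tokenize) could run before the existing split()-based alignment, so the alignment code itself would not need to change. Where fugashi puts the word boundaries depends on the dictionary it uses.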

Labels: bug, enhancement
