Lyrics alignment broken for scripts that don't use spaces between words

One or more of the lyric alignment functions, e.g. `_map_words_to_lines` in `analyzer/align.py`, rely on `split()` to create word entries. This doesn't work for lyrics written without spaces, e.g. a typical Japanese lyrics file like this:

```json
{
  "lines": [
    "何度でも",
    "何度でも叫ぶ",
    "この暗い夜の怪獣になっても"
  ]
}
```

This fails to align because `.split()` on e.g. `何度でも叫ぶ` returns just `["何度でも叫ぶ"]`, when it should be `["何度", "でも", "叫ぶ"]` or `["何度でも", "叫ぶ"]`, depending on what you consider a word in Japanese.
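For reference, the behaviour can be reproduced in a plain Python shell. This only exercises `str.split()` on the example line and a hand-spaced version of it, not the actual code in `analyzer/align.py`:

```python
line = "何度でも叫ぶ"
spaced = "何度 でも 叫ぶ"  # same line with spaces inserted by hand

# str.split() with no arguments splits on runs of whitespace only,
# so a line without spaces comes back as a single "word".
unspaced_words = line.split()
spaced_words = spaced.split()

print(unspaced_words)  # ['何度でも叫ぶ']
print(spaced_words)    # ['何度', 'でも', '叫ぶ']
```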


It works (more or less) as expected when the lyrics are spaced: as a test, I overwrote the lyrics file with a spaced version before the app read it, and the lyrics appeared.

Perhaps the simplest solution would be to use something like https://github.com/polm/fugashi to tokenize lyric files and add spaces where necessary before aligning? There might be a nicer way that works well with whisper, though: tokenizing instead of splitting when the language is non-spaced.
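A rough sketch of what I mean, not tested against the real alignment code. `split_words` is a hypothetical helper that would stand in for the bare `.split()` call, the Unicode ranges are only a crude check for Japanese text (other non-spaced scripts would need their own handling), and fugashi is treated as an optional dependency that also needs a dictionary such as `unidic-lite` installed:

```python
from __future__ import annotations

import re

# Hiragana, Katakana, and CJK Unified Ideographs. This is a rough heuristic
# for "looks like Japanese", not an exhaustive test for non-spaced scripts.
_UNSPACED = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

try:
    # Optional dependency. Tagger() also fails if no dictionary is installed,
    # so any error here just disables tokenization.
    from fugashi import Tagger
    _tagger = Tagger()
except Exception:
    _tagger = None


def split_words(line: str) -> list[str]:
    """Hypothetical drop-in for line.split() that tokenizes Japanese text."""
    if not _UNSPACED.search(line):
        # Spaced scripts keep the current behaviour.
        return line.split()
    if _tagger is None:
        # No tokenizer available: fall back to the current behaviour
        # rather than failing outright.
        return line.split()
    # fugashi yields one node per token; .surface is the token text.
    return [word.surface for word in _tagger(line)]
```

With this shape the aligner would not need to rewrite the lyrics file at all, and spaced lyrics would go through exactly the same path as today. Where the tokenizer splits a line (e.g. `何度でも` as one word or two) depends on the dictionary, so the result may still need tuning against the whisper output.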
