Umlaute are removed

I think umlaut characters (äüößÄÜÖ) are currently just removed from the input texts instead of getting their own symbol id or being replaced by similar ASCII encodings ('ae', 'ue', 'oe', 'ss', ...). Even though I guess the neural network learns to pronounce 'fnf' as 'fünf', I think the performance could be improved by fixing this.
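To make the ASCII replacement option concrete, here is a minimal sketch. The names `UMLAUT_TO_ASCII` and `replace_umlauts` are made up for illustration and are not part of this repository or of german_transliterate; the capitalisation of the replacements ('Ae' rather than 'AE') is also just an assumption.

```python
import unicodedata

# Hypothetical replacement table (an assumption, not the project's code).
UMLAUT_TO_ASCII = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}
_TABLE = str.maketrans(UMLAUT_TO_ASCII)


def replace_umlauts(text):
    # Normalise to NFC first so that "u" followed by a combining diaeresis
    # becomes the single precomposed character "ü" and matches the table.
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TABLE)


print(replace_umlauts("fünf Äpfel, weiß"))
```

Whether a model trained on the current input would read 'ue' the way it should read 'ü' is a separate question that this sketch does not answer.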

The background is that german_transliterate doesn't actually change the umlaut characters, even though it states that it 'replaces Unicode symbols with ASCII characters'. They are still in the string afterwards, and since there is no entry for them in `symbol_to_id`, they are simply left out of the resulting sequence.
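The effect can be reproduced in isolation with a small stand-in. This is only a sketch: the symbol list below is a made-up ASCII-only table, and `text_to_sequence` / `sequence_to_text` are simplified placeholders, not the project's actual functions. It assumes the real lookup skips unknown characters silently, which is what I believe is happening.

```python
import string

# Hypothetical symbol table containing ASCII letters and a little punctuation.
ALL_SYMBOLS = list(string.ascii_letters + " .,!?'-")
symbol_to_id = {s: i for i, s in enumerate(ALL_SYMBOLS)}
id_to_symbol = {i: s for s, i in symbol_to_id.items()}


def text_to_sequence(text):
    # Characters without an id are skipped without any warning.
    return [symbol_to_id[ch] for ch in text if ch in symbol_to_id]


def sequence_to_text(sequence):
    return "".join(id_to_symbol[i] for i in sequence)


# The "ü" has no id, so only "f", "n", "f" survive the round trip.
print(sequence_to_text(text_to_sequence("fünf")))
```

Logging or raising on unknown characters instead of skipping them would at least make this kind of loss visible.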

A solution could be to append those characters to `ALL_SYMBOLS` to give them their own ids. Unfortunately, the network would probably have to be retrained after this change.
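Roughly what I have in mind, as a sketch only. `BASE_SYMBOLS` and `UMLAUTS` are hypothetical names, and the base list is a placeholder rather than the project's real symbol set.

```python
import string

# Placeholder for the existing symbols (assumption: the real list differs).
BASE_SYMBOLS = list(string.ascii_letters + " .,!?'-")
UMLAUTS = list("äöüÄÖÜß")

# Appending at the end keeps the ids of all existing symbols unchanged.
ALL_SYMBOLS = BASE_SYMBOLS + UMLAUTS
symbol_to_id = {s: i for i, s in enumerate(ALL_SYMBOLS)}


def text_to_sequence(text):
    # Same simplified lookup as before; umlauts now have ids and are kept.
    return [symbol_to_id[ch] for ch in text if ch in symbol_to_id]


print(len(text_to_sequence("fünf")))
```

Appending rather than inserting means existing ids stay stable, but the new ids would still have no trained embeddings, which is why I expect retraining to be necessary.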

Please don't hesitate to tell me if I got something wrong and umlaut characters are being handled correctly.

[Edit: Thank you Monatis and Thorsten for this really great effort regardless of this issue anyway!]
