fun-langid: The code does lowercase, but that is a locale sensitive operation. #9

This method:
```python
def _normalize(self, line:str):
    return line.lower().replace('"', "'").replace("/", " ")
```
invokes `.lower()` on the input text, but lowercasing is a locale-sensitive operation, and Python's `str.lower()` always applies the default, locale-independent Unicode mapping.
The uppercase I (`U+0049`) converts to i (`U+0069`) in all languages except Turkish and Azeri, where it should convert to the dotless lowercase ı (`U+0131`).

So with the current code, lowercasing "LARI" yields "lari", which is not among the Turkish n-grams, instead of "ları", which is.
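A quick way to see the mismatch, using only the standard `str.lower()` (the variable names here are just for illustration):

```python
# str.lower() applies the default Unicode case mapping; it has no
# Turkish/Azeri tailoring and ignores the process locale.
word = "LARI"

default_lower = word.lower()        # what _normalize() produces today
turkish_expected = "lar\u0131"      # "ları", ending in dotless ı (U+0131)

print(default_lower)                      # lari
print(default_lower == turkish_expected)  # False

# The reverse case is also affected: dotted capital İ (U+0130) lowercases
# to "i" followed by a combining dot above (U+0307), not to a plain "i".
dotted_upper_lowered = "\u0130".lower()
print(len(dotted_upper_lowered))          # 2
```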

As a result, uppercase Turkish and Azeri text is likely to be recognized poorly.
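One possible direction, as a minimal sketch only: a Turkic-aware lowercase helper that handles the two special letters before falling back to `str.lower()`. The name `turkic_lower` is hypothetical and not part of fun-langid. Since the language is not known before identification, where such a helper would be applied (for example, only when scoring against the Turkish and Azeri models) is left open here.

```python
def turkic_lower(text: str) -> str:
    """Lowercase text using the Turkish/Azeri rules for I and İ.

    Hypothetical helper for illustration; not part of fun-langid.
    """
    return (
        text
        .replace("\u0130", "i")   # İ (dotted capital I) -> i
        .replace("I", "\u0131")   # I -> ı (dotless lowercase i)
        .lower()                  # default mapping for everything else
    )


print(turkic_lower("LARI"))       # ları
print(turkic_lower("\u0130STANBUL"))  # istanbul
```

The two replacements must run before `.lower()`, otherwise the default mapping has already turned `I` into `i` and the distinction is lost.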

