Conversation

@Marny30 Marny30 commented Feb 8, 2019

In the current version, the anntoconll tool splits a word containing accented characters into separate tokens, isolating the accents as if they were words on their own. When working with European languages such as French, Spanish, etc., or even German with its ß, this is a problem.

For example, the text "déjà fait" would be split into the tokens "d", "é", "j", "à", "fait" instead of "déjà", "fait".

By adding the accented-character range À-ÿ to the tokenization regex, this tokenization issue no longer occurs.
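A minimal sketch of the effect of the change, assuming the tool's tokenization regex has the usual shape of an ASCII word class alternated with single non-word characters (the exact pattern in anntoconll may differ):

```python
import re

# Before: only ASCII alphanumerics can form multi-character tokens,
# so every accented letter is matched as a standalone character.
OLD_TOKEN_RE = re.compile(r'([0-9a-zA-Z]+|[^0-9a-zA-Z])')

# After: the Latin-1 range À-ÿ is included in the word class,
# so accented letters stay inside their token. (Note: this range
# also happens to include × and ÷.)
NEW_TOKEN_RE = re.compile(r'([0-9a-zA-ZÀ-ÿ]+|[^0-9a-zA-ZÀ-ÿ])')

def tokenize(regex, text):
    # Keep every match, dropping whitespace-only tokens.
    return [t for t in regex.findall(text) if t.strip()]

print(tokenize(OLD_TOKEN_RE, "déjà fait"))  # ['d', 'é', 'j', 'à', 'fait']
print(tokenize(NEW_TOKEN_RE, "déjà fait"))  # ['déjà', 'fait']
```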

@Marny30 Marny30 changed the title fix word with accents being split by tokenization anntoconnl : fix word with accents being split by tokenization Feb 8, 2019
@Marny30 Marny30 changed the title anntoconnl : fix word with accents being split by tokenization anntoconll : fix word with accents being split by tokenization Feb 8, 2019