Open
Description
I wonder whether we should normalize Unicode as part of our Atlas data prep. I was looking online for how to do it and found this code from some guy named Tauber ....
@jtauber @lcerrato @AlisonBabeu
from unicodedata import normalize
curword = normalize("NFC",m[1])
My thinking:
- Anything in our repos should probably be normalized (e.g., the Greek from the Greco-Arabic corpus).
- Anything we import into Atlas should be normalized as well. That would imply adding a normalization step to the Atlas data prep pipeline (I think).
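Here is a minimal sketch of what such a step might look like, assuming we settle on NFC as in the snippet above. The helper names `normalize_text` and `needs_normalization` are hypothetical and are not existing Atlas code; only `unicodedata.normalize` comes from the standard library.

```python
from unicodedata import normalize


def normalize_text(text: str, form: str = "NFC") -> str:
    """Return text in the given Unicode normalization form (NFC by default)."""
    return normalize(form, text)


def needs_normalization(text: str, form: str = "NFC") -> bool:
    """Report whether normalizing would change the text at all."""
    return text != normalize(form, text)


# Three encodings of the same accented Greek letter:
decomposed = "\u03b1\u0301"  # alpha followed by a combining acute accent
oxia = "\u1f71"              # precomposed alpha with oxia
tonos = "\u03ac"             # precomposed alpha with tonos

# NFC maps all three to the same code point, so string comparison
# and lookups behave consistently after normalization.
assert normalize_text(decomposed) == normalize_text(oxia) == tonos
```

A check like `needs_normalization` could also be run over the repos to find files that are not yet normalized before we decide whether to rewrite them.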
Thoughts?