Hi there,
First off: great solution that has helped me a lot in the past. I am currently preparing to do topic modeling with Mallet and have finished pulling the raw datasets. Before I import them and start modeling, I need to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concepts themselves, but what the best approach would be.
To be specific, here is what I need to do:
- standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
- remove extra whitespace, e.g. two spaces in a row
- stem and lemmatize
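To make the steps above concrete, here is a minimal Python sketch of the kind of pre-processing I have in mind, run before the Mallet import. All names in it (`SPELLING_FIXES`, `normalize_spelling`, `collapse_whitespace`, `preprocess`) are hypothetical, and the stemming/lemmatizing step is deliberately left as a pluggable function, since choosing it is the open question. Something like NLTK's `PorterStemmer().stem` could presumably be passed in, but that is an assumption and not something I have tested.

```python
import re

# Hypothetical mapping of spelling variants to a canonical form;
# extend with whatever inconsistencies show up in the corpus.
SPELLING_FIXES = {
    "topicmodeling": "topic modeling",
}


def normalize_spelling(text, fixes=SPELLING_FIXES):
    """Replace known variants (whole words only, case-insensitive)."""
    if not fixes:
        return text
    # Longest keys first so overlapping variants match greedily.
    keys = sorted(fixes, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: fixes[m.group(0).lower()], text)


def collapse_whitespace(text):
    """Collapse runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, stem=None):
    """Apply the cleaning steps in order; `stem` is an optional
    per-token callable (stemmer or lemmatizer) supplied by the caller."""
    text = normalize_spelling(text)
    text = collapse_whitespace(text)
    if stem is not None:
        text = " ".join(stem(token) for token in text.split())
    return text


if __name__ == "__main__":
    print(preprocess("I like  topicmodeling   a lot "))
```

Spelling is normalized before whitespace is collapsed so that any replacement text goes through the same cleanup, and the stemmer runs last so it only ever sees clean tokens.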
I realize that this is not exactly an issue with Mallet, but I was hoping that someone could, based on experience, recommend how best to tackle this.
Many thanks in advance!