Lemmatizing with Mallet #203

Open
@Glorifier85

Description

Hi there,

First off: great tool that has helped me a lot in the past. I am preparing to do topic modeling with Mallet and have finished pulling the raw datasets. Before importing them and starting to model, I need to clean and standardize the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concept itself, but what the best approach would be.

To be specific, here is what I need to do:

  • standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
  • remove extra whitespace, e.g. two spaces in a row
  • stem and lemmatize
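
To make this concrete, here is a rough sketch of the kind of cleaning pass I have in mind, using only the Python standard library. The names `SPELLING_FIXES`, `LEMMAS`, and `clean_text` are hypothetical and made up for illustration, the lowercasing step is my own assumption, and the tiny lemma table only stands in for a real lemmatizer (for example one from NLTK or spaCy), so please read it as a sketch rather than a working solution:

```python
import re

# Hypothetical lookup tables, for illustration only. A real pipeline would
# replace LEMMAS with a proper lemmatizer (e.g. from NLTK or spaCy).
SPELLING_FIXES = {"topicmodeling": "topic modeling"}
LEMMAS = {"models": "model", "topics": "topic", "texts": "text"}


def clean_text(text: str) -> str:
    """Normalize spelling, collapse whitespace, and map tokens to lemmas."""
    # Assumption: case does not matter for the topic model, so lowercase first.
    text = text.lower()
    # 1. Standardize known spelling variants (whole words only).
    for variant, standard in SPELLING_FIXES.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", standard, text)
    # 2. Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Replace each token with its lemma when one is known.
    return " ".join(LEMMAS.get(token, token) for token in text.split(" "))


print(clean_text("Topicmodeling  finds   topics in texts"))
```

The cleaned strings would then be written out and imported into Mallet as usual.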

I realize that this is not exactly a Mallet issue, but I was hoping that someone could recommend, based on experience, how best to tackle this.

Many thanks in advance!
