A Python package for cleaning strings and making them suitable for NLP.
cleantxty is an open-source Python package for cleaning raw text. Source code for the library can be found here.
cleantxty has two main methods:
- clean: to clean raw text and return the cleaned text
- clean_words: to clean raw text and return a list of clean words
Other methods that can be used alongside them are:
- remove_link: to remove links from the text
- remove_extra_white_space: to remove extra white space from the text
- lower_text: to convert the text to lower case
- upper_text: to convert the text to upper case
- remove_stopwords: to remove stopwords from the text
- remove_digits: to remove digits from the text
- remove_punctuations: to remove punctuation from the text
- custom_regex: to apply a custom regex to the text
- stem_text: to stem the provided text
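To make the list above concrete, here is a minimal sketch of how a few of these operations could be written with the Python standard library alone. It is not cleantxty's source code: the function bodies, the `replacement` parameter, and the `sketch_clean` and `sketch_clean_words` helpers are assumptions made for illustration, and stopword removal and stemming (which rely on NLTK) are left out.

```python
import re
import string


# Hypothetical stand-ins for illustration only; not cleantxty's implementation.
def remove_link(text: str) -> str:
    """Drop anything that looks like an http(s):// or www. link."""
    return re.sub(r"(?:https?://|www\.)\S+", "", text)


def remove_digits(text: str) -> str:
    """Drop runs of decimal digits."""
    return re.sub(r"\d+", "", text)


def remove_punctuations(text: str) -> str:
    """Drop ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_extra_white_space(text: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends."""
    return " ".join(text.split())


def custom_regex(text: str, pattern: str, replacement: str = "") -> str:
    """Replace every match of `pattern` with `replacement` (assumed parameter)."""
    return re.sub(pattern, replacement, text)


def sketch_clean(text: str) -> str:
    """Chain the steps above and return one cleaned string."""
    steps = (
        remove_link,
        str.lower,
        remove_digits,
        remove_punctuations,
        remove_extra_white_space,
    )
    for step in steps:
        text = step(text)
    return text


def sketch_clean_words(text: str) -> list:
    """Same pipeline, but return the cleaned text as a list of words."""
    return sketch_clean(text).split()
```

In this sketch the order of the steps matters: links are removed before punctuation, since stripping punctuation first would break the URL pattern, and whitespace is collapsed last to tidy up the gaps left by the earlier removals.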
cleantxty requires Python 3 and NLTK to run.
To install using pip, run:

    pip install cleantxty

- Import the library:

    import cleantxty

- Choose a method:

To return the text in a string format,

    cleantxty.clean("raw_text_here")

To return a list of words from the text,

    cleantxty.clean_words("raw_text_here")

To choose a specific set of cleaning operations,

    cleantxty.clean("raw_text_here",
        default_case="lower",  # "lower" by default; use "upper" for an upper-case result
        regex=None  # custom regex to apply, if any
    )

    cleantxty.clean_words("raw_text_here",
        default_case="lower",  # "lower" by default; use "upper" for an upper-case result
        regex=None  # custom regex to apply, if any
    )

    import cleantxty
    cleantxty.clean('This is A s$ple ? tExt3% to cleaN566556+wow8 ')

returns,

    'this is a sample text to clean'

    import cleantxty
    cleantxty.clean_words('This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133')

returns,

    ['sampl', 'text', 'clean']

    from cleantxty import clean
    text = "my id, [email protected] and your, [email protected]"
    clean(text, regex=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")

returns,

    "my id, email and your, email"

For any questions, issues, bugs, and suggestions, please visit here.