aditya0072001/cleantxty

cleantxty

Python package to clean strings and make them suitable for NLP.

cleantxty is an open-source Python package for cleaning raw text. Source code for the library can be found here.

Features

cleantxty has two main methods:

  • clean: to clean raw text and return the cleaned text
  • clean_words: to clean raw text and return a list of clean words

Other methods that can be used alongside them are:

  • remove_link: to remove links from the text
  • remove_extra_white_space: to remove extra white space from the text
  • lower_text: to convert the text to lower case
  • upper_text: to convert the text to upper case
  • remove_stopwords: to remove stopwords from the text
  • remove_digits: to remove digits from the text
  • remove_punctuations: to remove punctuation from the text
  • custom_regex: to apply a custom regex to the text
  • stem_text: to stem the provided text
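
To give a feel for what several of these operations do, here is a minimal sketch using only the Python standard library. It is not cleantxty's implementation. The `sketch_*` names, the URL pattern, and the order of the steps are all assumptions made for illustration.

```python
import re
import string


def sketch_remove_link(text: str) -> str:
    # Drop anything that looks like an http(s) or www URL (assumed pattern).
    return re.sub(r"https?://\S+|www\.\S+", "", text)


def sketch_remove_digits(text: str) -> str:
    # Remove every run of decimal digits.
    return re.sub(r"\d+", "", text)


def sketch_remove_punctuations(text: str) -> str:
    # Delete all ASCII punctuation characters.
    return text.translate(str.maketrans("", "", string.punctuation))


def sketch_remove_extra_white_space(text: str) -> str:
    # Collapse whitespace runs to single spaces and trim both ends.
    return " ".join(text.split())


def sketch_clean(text: str) -> str:
    # Hypothetical pipeline: apply each step in turn, then lower-case.
    for step in (
        sketch_remove_link,
        sketch_remove_digits,
        sketch_remove_punctuations,
        sketch_remove_extra_white_space,
    ):
        text = step(text)
    return text.lower()


print(sketch_clean("Visit https://example.com NOW!!  It has 42 pages."))
```

Links are removed first in this sketch so that stripping punctuation does not break a URL into fragments that would survive as words.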

Installation

cleantxty requires Python 3 and NLTK to run.

To install using pip, use

pip install cleantxty

Usage

  • Import the library:
import cleantxty
  • Choose a method:

To return the text in a string format,

cleantxty.clean("raw_text_here") 

To return a list of words from the text,

cleantxty.clean_words("raw_text_here") 

To choose a specific set of cleaning operations,

cleantxty.clean("raw_text_here",
    default_case="lower",  # "lower" by default; set to "upper" for an upper-case result
    regex=None  # provide a custom regex to apply
)

cleantxty.clean_words("raw_text_here",
    default_case="lower",  # "lower" by default; set to "upper" for an upper-case result
    regex=None  # provide a custom regex to apply
)

Examples

import cleantxty
cleantxty.clean('This is A s$ample ? tExt3% to   cleaN566556+wow8 ')

returns,

'this is a sample text to clean'

import cleantxty
cleantxty.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133')

returns,

['sampl', 'text', 'clean']
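
The word-list result can be approximated with the standard library alone. The sketch below is not the library's code. It assumes a tiny hard-coded stopword set (`SKETCH_STOPWORDS`, hypothetical) in place of NLTK's list, and it skips stemming, so it yields 'sample' where the output above shows the stem 'sampl'.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set; the real package relies on NLTK.
SKETCH_STOPWORDS = {"this", "is", "a", "to"}


def sketch_clean_words(text: str) -> list:
    # Lower-case, then strip digits and ASCII punctuation.
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and drop stopwords; no stemming is applied here.
    return [word for word in text.split() if word not in SKETCH_STOPWORDS]


print(sketch_clean_words("This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133"))
```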

from cleantxty import clean
text = "my id, [email protected] and your, [email protected]"
clean(text, regex=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")

returns,

"my id, email and your, email"
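
The output above suggests that each regex match is replaced by a placeholder word. A minimal sketch of that behaviour with the standard `re` module follows. The `sketch_replace_matches` helper, its `placeholder` parameter, and the sample addresses are all assumptions for illustration, not part of cleantxty.

```python
import re

# Same pattern as in the example above.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def sketch_replace_matches(text: str, pattern: str, placeholder: str = "email") -> str:
    # Substitute every match of the pattern with the placeholder word.
    return re.sub(pattern, placeholder, text)


# Made-up addresses on reserved example domains.
sample = "my id, alice@example.com and your, bob@example.org"
print(sketch_replace_matches(sample, EMAIL_PATTERN))
```

The pattern only covers lower-case characters, so mixed-case input would need to be lower-cased first or matched with `re.IGNORECASE`.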

License

MIT

For any questions, issues, bugs, and suggestions please visit here
