Data cleaner for language (English) data

This uses the clean-text package (imported as cleantext):

pip install clean-text

All options to clean text:

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuation
    replace_with_punct="",          # instead of removing punctuation, you may replace it with this string
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)
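
The options above can be bundled once and reused for every line of a dataset. The following is only a sketch: the names CLEAN_OPTIONS, make_cleaner and clean_lines are hypothetical and are not taken from language_data_cleaner/cleaner.py, and the choice of which options to enable is an assumption. The cleaning function can be swapped out, which makes the wrapper easy to try without clean-text installed.

```python
from functools import partial

# A subset of the options listed above. Which ones to enable is an
# assumption for illustration, not the repository's actual configuration.
CLEAN_OPTIONS = dict(
    fix_unicode=True,
    to_ascii=True,
    lower=True,
    no_urls=True,
    no_emails=True,
    no_phone_numbers=True,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    lang="en",
)


def make_cleaner(clean_fn=None, **overrides):
    """Return a one-argument cleaner with CLEAN_OPTIONS pre-applied.

    clean_fn defaults to cleantext.clean; passing another callable
    (hypothetical hook) lets you test the pipeline with a stand-in.
    """
    if clean_fn is None:
        # Imported lazily so this module loads even without clean-text.
        from cleantext import clean as clean_fn
    options = {**CLEAN_OPTIONS, **overrides}
    return partial(clean_fn, **options)


def clean_lines(lines, cleaner):
    """Clean every line and drop the ones that end up empty."""
    cleaned = (cleaner(line) for line in lines)
    return [line for line in cleaned if line]
```

Calling make_cleaner() with no arguments uses cleantext.clean with the options above. Keyword overrides such as make_cleaner(lower=False) change a single option without repeating the rest.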

Example usage:

python language_data_cleaner/cleaner.py

Note: for data de-identification, we may also want to look into a package such as deidentify: https://github.com/nedap/deidentify

Repository: gokite-ai/language-data-cleaner
