
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this repository is a series of data cleaning scripts focused on processing CommonCrawl data. It also includes Chinese data processing and the cleaning methods from MassiveText.

shjwudp/c4-dataset-script

