corpora_build

This Python code is a non-AI parser of language in arXiv papers. We downloaded all physics papers from arXiv and tried to filter out the noise in the .tex files in order to build a corpus for training. It was a neat exercise, but not that successful. If we do this project again, we will use modern NLP tools such as LLMs to process all of this data. Still, you might find it useful, so we have kept it here.
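
To give a rough idea of what rule-based filtering of .tex noise can look like, here is a minimal sketch. It is not taken from parse_arxiv_tex_v5.py: the function name strip_tex_noise, the choice of regular expressions, and the sample text are all assumptions made for illustration, and the real script may work quite differently.

```python
import re


def strip_tex_noise(tex: str) -> str:
    """Hypothetical helper (not from this repo): reduce LaTeX source to rough prose."""
    # Drop comments: an unescaped % through the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", tex)
    # Keep only the document body when both markers are present.
    body = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", text, re.DOTALL)
    if body:
        text = body.group(1)
    # Remove environments that rarely contain running prose.
    text = re.sub(
        r"\\begin\{(equation|align|figure|table)\*?\}.*?\\end\{\1\*?\}",
        " ",
        text,
        flags=re.DOTALL,
    )
    # Remove display math ($$...$$) first, then inline math ($...$).
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$.*?\$", " ", text, flags=re.DOTALL)
    # Drop citation-like commands together with their arguments.
    text = re.sub(r"\\(cite|ref|label|eqref)\w*\*?(\[[^\]]*\])?\{[^}]*\}", " ", text)
    # Unwrap remaining one-argument commands, keeping the argument: \emph{x} -> x
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?\{([^{}]*)\}", r"\2", text)
    # Drop leftover bare commands and stray braces.
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)
    text = re.sub(r"[{}]", " ", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Made-up sample input, for illustration only.
    sample = r"""
\documentclass{article}
\begin{document}
\section{Intro} % a comment
We study \emph{dark matter} halos with mass $M > 10^{12}$ \cite{smith2020}.
\begin{equation}
E = mc^2
\end{equation}
Results follow.
\end{document}
"""
    print(strip_tex_noise(sample))
```

A regex pass like this is brittle: nested braces, custom macros, and multi-file projects all slip through, which is one reason a purely rule-based approach struggles on real arXiv sources.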

Repository files: LICENSE, README.md, parse_arxiv_tex_v5.py
