This Python code is a non-AI parser of language in arXiv papers. We downloaded all physics papers from arXiv and tried to filter out the noise in the .tex files in order to build a corpus for training. It was a neat exercise, but not that successful. If we do this project again, we will use modern NLP tools such as LLMs to process all of this data. Still, you might find this useful, so we have kept it here.
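
To give a rough idea of what rule-based filtering of .tex noise can look like, here is a minimal sketch. It is not taken from this repository: the function name `strip_latex` and the specific regular expressions are assumptions made for illustration, and real arXiv sources contain many cases (custom macros, nested environments, escaped characters) that this does not handle.

```python
import re

# Hypothetical patterns for illustration only; not the repository's actual rules.
COMMENT = re.compile(r"(?<!\\)%.*")  # unescaped % to end of line
BLOCK_ENV = re.compile(
    r"\\begin\{(equation|align|figure|table)\*?\}.*?\\end\{\1\*?\}",
    re.DOTALL,
)  # whole non-prose environments
INLINE_MATH = re.compile(r"\$[^$]*\$")  # $...$ inline math
COMMAND = re.compile(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?")  # \name, \name*, \name[opt]
BRACES = re.compile(r"[{}]")  # leftover grouping braces


def strip_latex(source: str) -> str:
    """Return an approximation of the plain prose contained in LaTeX source."""
    text = COMMENT.sub("", source)
    text = BLOCK_ENV.sub(" ", text)
    text = INLINE_MATH.sub(" ", text)
    text = COMMAND.sub(" ", text)  # drops the command name, keeps its {argument} text
    text = BRACES.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to a line such as `We measure $x^2$ in \textbf{bold} text. % note`, this sketch drops the comment and the inline math and keeps the argument of `\textbf`. Keeping command arguments is a deliberate simplification: it preserves emphasized prose but also leaks things like citation keys, which is one example of the noise that is hard to remove with rules alone.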
cwbartlett/corpora_build