Preprocessing language+mathematics corpora for pretraining #59
A good 2020 use of llamapun would be to use it as a unified preprocessing step for a variety of HTML corpora which also include math syntax by one trick or another. The goal would be to do the legwork on a variety of HTML dialects so that we get clean and maximally denoised data as a plaintext target, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.
What are we looking for?
- primary textual sources (rather than remixed/curated train sets from other experiments)
- openly available for download & research
- processable math syntax that we can reliably normalize and lexematize (a rough sketch follows this list)
- interesting exceptions: e.g. synthetic datasets that offer diverse examples of math syntax use and/or pose problems
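To make "normalize and lexematize" a little more concrete, here is a minimal Python sketch (llamapun itself works in Rust with a much richer tokenizer; the function name `lexematize_math` and the `mathformula` fallback token are illustrative choices, not library API). It assumes HTML input where formulas appear as MathML `<math>` elements carrying an `alttext` attribute, as in our arXiv HTML5 and the Wikipedia dumps below:

```python
from lxml import html

def lexematize_math(doc_html, fallback="mathformula"):
    """Replace each <math> element by its alttext (the original TeX),
    or by a generic placeholder token when no alttext is present,
    then collapse the document to plain text."""
    tree = html.fromstring(doc_html)
    for math in list(tree.iter("math")):
        token = math.get("alttext") or fallback
        parent = math.getparent()
        # Re-attach the formula's trailing text after the substituted token.
        text = " " + token + " " + (math.tail or "")
        prev = math.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or "") + text
        else:
            parent.text = (parent.text or "") + text
        parent.remove(math)
    # Whitespace-normalized plaintext target for pretraining.
    return " ".join(tree.text_content().split())

# Example:
# lexematize_math('<p>Consider <math alttext="a+b=c"></math> here.</p>')
# -> 'Consider a+b=c here.'
```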
I will re-edit this description to include corpora I think I can include for the current pass.
Decided to include (data has been obtained locally; items get checked off when preprocessing is completed):
- our own arXiv as HTML5, as usual
- PubMed Central text-mining resources
  - oa subset, 3.2 million docs, 260k of which with marked math
  - manuscript subset
- Kiwix packaged (fantastic archival effort)
  - Wikipedia subset is the union of
    - the subset of "all articles" with math syntax (`alttext` attr; see the filtering sketch after this list)
    - kiwix-selected subjects of {astronomy, chemistry, climate change, computer science, geography, history, mathematics, medicine, molcell, physics, sociology}
  - wikipedia simple articles (all, or only those with math syntax?)
  - wiktionary and wiktionary simple
  - wikiversity
  - wikibooks
  - wikiquote
  - StackExchange subsets of {ai, astronomy, bioinformatics, codereview, cs, cseducators, cstheory, datascience, earthscience, engineering, math, matheducators, mathoverflow, physics, robotics, space, stats} ?
    - non-Math sets also included: {academia, chess, ebooks, english, history, law, linguistics, literature, money, patents, philosophy, writers}
  - rational wiki
  - proofwiki - the kiwix distribution is easier to preprocess than the official `latest.xml`, as it is already standard HTML
- art of problem solving wiki - downloaded separately, as kiwix was missing the `alt=` attributes in the math images, needed to lemmatize
- PlanetMath wiki entries
- WikiHow
- UBC course materials
- Stanford encyclopedia of philosophy
- etymonline
- Ancient EU
- deepmind mathematics problems
- deepmind AQuA word problems
- caltech neural PDE datasets
- Stacks project by directly normalizing their HTML dialect, typeset via plastex
- AIMath approved textbooks, typeset via PreTeXt
- ncatlab
- encyclopedia of math
- encyclopedia Britannica
- math.libretexts
- math history
- math programming glossary
- mathworld
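For the Kiwix-derived dumps above, the "with math syntax" selection can be approximated by a cheap scan for math markers before any heavier preprocessing. The sketch below is a heuristic under an assumed layout (articles already extracted to HTML files on disk); the marker patterns and the `wikipedia_all_nopic` directory name are illustrative, not the actual selection criteria:

```python
import pathlib
import re

# Heuristic markers: a MathML alttext, a Wikipedia math-image class, or a raw <math> tag.
MATH_MARKERS = re.compile(r'alttext="|class="mwe-math|<math[\s>]')

def articles_with_math(dump_dir):
    """Yield paths of extracted HTML articles that appear to contain math syntax."""
    for path in pathlib.Path(dump_dir).rglob("*.html"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if MATH_MARKERS.search(text):
            yield path

# Example: len(list(articles_with_math("wikipedia_all_nopic"))) gives the size
# of the math-bearing subset before any further preprocessing.
```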
To vet:
- LearningQ
- Berkeley MATH + AMPS dataset
- openstax
- wikia math & physics problems
- science blogs: sciencealert, sciencebuddies, symmetry magazine, ...
- educational texts - Introduction to Proofs
Vetted, but currently excluded:
- the pile preliminary
  - see the twitter thread for details on why books1 and books3 are too broken for math syntax
  - their arXiv conversion is via pandoc and has significantly more breakage than our own, which I've reported back to bmk of EleutherAI
- s2orc - very tempting to just use their curated set of 12.7M full-text papers directly. However, they are all uniformly obtained via PDF scraping (with grobid), so the mathematical markup is badly broken. I'll attach a data sample in the comments below, but PDFs will PDF...
- Open Library Data dumps - just metadata, no content
- PubMedCentral historical OCR has very rocky quality, and no math syntax. So probably better excluded.
- dictionaries such as wordnet are a bit too artificial to fit in
- Project Gutenberg OR wikisource - adequate preprocessing is rather expensive, and while they are at least partially relevant, I will likely defer them to a later date.
- wikispecies - a bit too synthetic; great taxonomic language, but few actual sentences fleshing it out.
- vikidia - surprisingly, the data quality is a bit poor here, and since STEM is only a minor subset, skipping.
- Elsevier OA CC-BY Corpus
  - 40,000 entries, but no traces of entire equations. Small pieces of syntax are traceable though, e.g. 23,250 documents have an equal sign `=` in the texts, and 16,000 have a `+`. So it may be worth including as an extra source for light inline math (a rough counting sketch follows this list).
  - Sadly, a closer look revealed intentional breakage of documents with formulas when producing the JSON - the math syntax is completely missing from the provided data. Since I also find the format rather unpleasant to piece together, I've outright given up on this corpus for now.
- 800+ textbooks from Open Textbooks - mostly available as PDFs, but also some online variants with MathML and some LaTeX variants. Would take a while even to download, let alone preprocess, so postponing until the next pass.
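The light-syntax census mentioned for the Elsevier corpus (how many documents contain an `=` or a `+`) is straightforward to reproduce on any JSON-per-document dump. A small sketch; the `body_text` / `sentence` field names are assumptions about the corpus layout rather than a verified schema:

```python
import json
import pathlib

def light_math_counts(corpus_dir):
    """Count documents whose body text contains an '=' or a '+' sign."""
    total = with_eq = with_plus = 0
    for path in pathlib.Path(corpus_dir).glob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # Assumed layout: a list of segments, each with a "sentence" field.
        body = " ".join(seg.get("sentence", "") for seg in doc.get("body_text", []))
        total += 1
        with_eq += "=" in body
        with_plus += "+" in body
    return total, with_eq, with_plus
```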