This repository was archived by the owner on Mar 25, 2024. It is now read-only.

Preprocessing language+mathematics corpora for pretraining #59

@dginev

Description

A good 2020 use of llamapun would be as a unified preprocessing step for a variety of HTML corpora that also include math syntax by one trick or another. The goal is to do the legwork on a variety of HTML dialects so that we end up with clean and maximally denoised plaintext, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.
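
To make the plaintext target concrete, here is a minimal sketch of the intended kind of normalization: strip structural noise from an HTML document and collapse each MathML island into a single placeholder lexeme. This is only an illustration in Python with lxml, not llamapun's own API; the noise-tag list and the `mathformula` placeholder are assumptions chosen for the example.

```python
# Minimal sketch (not llamapun): denoise an HTML+MathML document into
# plaintext, replacing each formula with a single placeholder lexeme.
# NOISE_TAGS and the "mathformula" token are illustrative choices.
from lxml import etree, html as lxml_html

NOISE_TAGS = ("script", "style", "nav", "header", "footer", "figure")

def denoise(html_string: str) -> str:
    root = lxml_html.fromstring(html_string)
    # Drop structural noise entirely (element plus its text content).
    etree.strip_elements(root, *NOISE_TAGS, with_tail=False)
    # Replace each <math> island with a stable placeholder token, so the
    # language model never sees half-broken markup.
    for math_el in list(root.iter("math")):
        placeholder = etree.Element("span")
        placeholder.text = " mathformula "
        placeholder.tail = math_el.tail
        math_el.getparent().replace(math_el, placeholder)
    # Serialize to text and collapse whitespace for a clean plaintext line.
    return " ".join(root.text_content().split())

print(denoise("<p>We set <math><mi>x</mi><mo>=</mo><mn>1</mn></math> and proceed.</p>"))
# expected: "We set mathformula and proceed."
```

The real pipeline works over the parsed DOM with proper word tokenization and per-dialect handling; the sketch above is only meant to pin down the shape of the output.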

What are we looking for?

  • primary textual sources (rather than remixed/curated train sets from other experiments)
  • openly available for download & research
  • processable math syntax that we can reliably normalize and lexematize
  • interesting exceptions: e.g. synthetic datasets that offer diverse examples of math syntax use and/or pose problems to solve

I will keep re-editing this description with the corpora I think can make it into the current pass.

Decided to include (data has been obtained locally; items are checked off once preprocessing is completed):


To vet:


Vetted, but currently excluded:

  • the Pile (preliminary release)
    • see the Twitter thread for details on why books1 and books3 are too broken for math syntax
    • their arXiv conversion is via pandoc and has significantly more breakage than our own, which I've reported back to bmk of EleutherAI
  • S2ORC - very tempting to just use their curated set of 12.7M full-text papers directly. However, they are all uniformly obtained via PDF scraping (with GROBID), so the mathematical markup is badly broken. I'll attach a data sample in the comments below, but PDFs will PDF...
  • Open Library Data dumps - just metadata, no content
  • PubMed Central's historical OCR is of very rocky quality and has no math syntax, so it is probably better excluded.
  • dictionaries such as WordNet are a bit too artificial to fit in
  • Project Gutenberg or Wikisource -- adequate preprocessing is rather expensive, and while they are at least partially relevant, I will likely defer them to a later date.
  • Wikispecies - a bit too synthetic; great taxonomic language, but few actual sentences fleshing it out.
  • Vikidia - surprisingly, the data quality is a bit poor here, and since STEM is a minor subset, I am skipping it.
  • Elsevier OA CC-BY Corpus
    • 40,000 entries, but no traces of entire equations. Small pieces of syntax are traceable though, e.g. 23,250 documents have an equals sign = in the texts, and 16,000 have a +. So it may be worth including as an extra source of light inline math (the quick presence scan behind these counts is sketched after this list).
    • Sadly, a closer look revealed intentional breakage of documents with formulas when producing the JSON - the math syntax is completely missing from the provided data. Since I also find the format rather unpleasant to piece together, I've outright given up on this corpus for now.
  • 800+ textbooks from Open Textbooks - mostly available as PDFs, but also some online variants with MathML and some LaTeX variants. They would take a while even to download, let alone preprocess, so I am postponing them for the next pass.
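
For reference, the Elsevier counts mentioned above came from a simple character-presence scan over the document texts. Below is a rough sketch of such a scan; the per-document JSON layout (a `body_text` list with `sentence` strings) and the local path are assumptions and may not match the corpus' actual schema.

```python
# Rough presence scan over a directory of per-document JSON files, counting
# how many documents contain "=" or "+" anywhere in their body text.
# The "body_text"/"sentence" field names and the path are assumptions.
import json
from pathlib import Path

def body_text(doc: dict) -> str:
    parts = []
    for entry in doc.get("body_text", []):
        if isinstance(entry, dict):
            parts.append(entry.get("sentence", ""))
        elif isinstance(entry, str):
            parts.append(entry)
    return " ".join(parts)

def scan(corpus_dir: str) -> dict:
    counts = {"total": 0, "has_equals": 0, "has_plus": 0}
    for path in Path(corpus_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        text = body_text(doc)
        counts["total"] += 1
        counts["has_equals"] += "=" in text
        counts["has_plus"] += "+" in text
    return counts

if __name__ == "__main__":
    print(scan("elsevier-oa-cc-by/json"))  # hypothetical local path
```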
