This repository was archived by the owner on Mar 25, 2024. It is now read-only.

Preprocessing language+mathematics corpora for pretraining #59

@dginev

Description

A good 2020 use of llamapun would be as a unified preprocessing step for a variety of HTML corpora that also include math syntax by one trick or another. The goal is to do the legwork on a variety of HTML dialects so that we end up with clean and maximally denoised plaintext, with a primary focus on using that textual form for pretraining a neural language model. I will be using this issue as a documentation placeholder for the various targets I have in mind.
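
To make the plaintext target concrete, here is a minimal sketch of the intended kind of normalization: strip structural noise from an HTML document and collapse each MathML island into a single placeholder lexeme. This is only an illustration in Python with lxml, not llamapun's own API; the noise-tag list and the `mathformula` placeholder are assumptions chosen for the example.

```python
# Minimal sketch (not llamapun): denoise an HTML+MathML document into
# plaintext, replacing each formula with a single placeholder lexeme.
# NOISE_TAGS and the "mathformula" token are illustrative choices.
from lxml import etree, html as lxml_html

NOISE_TAGS = ("script", "style", "nav", "header", "footer", "figure")

def denoise(html_string: str) -> str:
    root = lxml_html.fromstring(html_string)
    # Drop structural noise entirely (element plus its text content).
    etree.strip_elements(root, *NOISE_TAGS, with_tail=False)
    # Replace each <math> island with a stable placeholder token, so the
    # language model never sees half-broken markup.
    for math_el in list(root.iter("math")):
        placeholder = etree.Element("span")
        placeholder.text = " mathformula "
        placeholder.tail = math_el.tail
        math_el.getparent().replace(math_el, placeholder)
    # Serialize to text and collapse whitespace for a clean plaintext line.
    return " ".join(root.text_content().split())

print(denoise("<p>We set <math><mi>x</mi><mo>=</mo><mn>1</mn></math> and proceed.</p>"))
# expected: "We set mathformula and proceed."
```

The real pipeline works over the parsed DOM with proper word tokenization and per-dialect handling; the sketch above is only meant to pin down the shape of the output.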

What are we looking for?

  • primary textual sources (rather than remixed/curated train sets from other experiments)
  • openly available for download & research
  • processable math syntax that we can reliably normalize and lexematize
  • interesting exceptions: e.g. synthetic datasets that offer diverse examples of math syntax use and/or pose problems to solve

I will keep re-editing this description with the corpora I think can make it into the current pass.

Decided to include (data has been obtained locally; items are checked off once preprocessing is completed):


To vet:


Vetted, but currently excluded:

  • the Pile (preliminary release)
    • see the Twitter thread for details on why books1 and books3 are too broken for math syntax
    • their arXiv conversion is via pandoc and has significantly more breakage than our own, which I've reported back to bmk of EleutherAI
  • S2ORC - very tempting to just use their curated set of 12.7M full-text papers directly. However, they are all uniformly obtained via PDF scraping (with GROBID), so the mathematical markup is badly broken. I'll attach a data sample in the comments below, but PDFs will PDF...
  • Open Library Data dumps - just metadata, no content
  • PubMed Central's historical OCR is of very rocky quality and has no math syntax, so it is probably better excluded.
  • dictionaries such as WordNet are a bit too artificial to fit in
  • Project Gutenberg or Wikisource -- adequate preprocessing is rather expensive, and while they are at least partially relevant, I will likely defer them to a later date.
  • Wikispecies - a bit too synthetic; great taxonomic language, but few actual sentences fleshing it out.
  • Vikidia - surprisingly, the data quality is a bit poor here, and since STEM is a minor subset, I am skipping it.
  • Elsevier OA CC-BY Corpus
    • 40,000 entries, but no traces of entire equations. Small pieces of syntax are traceable though, e.g. 23,250 documents have an equals sign = in the texts, and 16,000 have a +. So it may be worth including as an extra source of light inline math (the quick presence scan behind these counts is sketched after this list).
    • Sadly, a closer look revealed intentional breakage of documents with formulas when producing the JSON - the math syntax is completely missing from the provided data. Since I also find the format rather unpleasant to piece together, I've outright given up on this corpus for now.
  • 800+ textbooks from Open Textbooks - mostly available as PDFs, but also some online variants with MathML and some LaTeX variants. They would take a while even to download, let alone preprocess, so I am postponing them for the next pass.
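
For reference, the Elsevier counts mentioned above came from a simple character-presence scan over the document texts. Below is a rough sketch of such a scan; the per-document JSON layout (a `body_text` list with `sentence` strings) and the local path are assumptions and may not match the corpus' actual schema.

```python
# Rough presence scan over a directory of per-document JSON files, counting
# how many documents contain "=" or "+" anywhere in their body text.
# The "body_text"/"sentence" field names and the path are assumptions.
import json
from pathlib import Path

def body_text(doc: dict) -> str:
    parts = []
    for entry in doc.get("body_text", []):
        if isinstance(entry, dict):
            parts.append(entry.get("sentence", ""))
        elif isinstance(entry, str):
            parts.append(entry)
    return " ".join(parts)

def scan(corpus_dir: str) -> dict:
    counts = {"total": 0, "has_equals": 0, "has_plus": 0}
    for path in Path(corpus_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        text = body_text(doc)
        counts["total"] += 1
        counts["has_equals"] += "=" in text
        counts["has_plus"] += "+" in text
    return counts

if __name__ == "__main__":
    print(scan("elsevier-oa-cc-by/json"))  # hypothetical local path
```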
