This directory contains scripts that can be used to download the OpenWebText dataset and create TFRecords containing masked sequences and labels. The TFRecords are then used by `BertTfRecordsProcessor` and `GptTfRecordsProcessor` to produce inputs to `BertModel` and `Gpt2Model`, respectively.
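For orientation, records produced this way can be inspected with TensorFlow's standard parsing API. The sketch below is illustrative only: the file name and the feature names (`input_ids`, `labels`) and sequence length are placeholder assumptions, not the processors' actual schema.

```python
# Sketch: read one example from a TFRecord file and parse it.
# "train.tfrecord", the feature names, and the length 128 are assumptions;
# the real schema is defined by the record-writing scripts in this directory.
import tensorflow as tf

feature_spec = {
    "input_ids": tf.io.FixedLenFeature([128], tf.int64),  # assumed length
    "labels": tf.io.FixedLenFeature([128], tf.int64),
}

dataset = tf.data.TFRecordDataset("train.tfrecord")  # hypothetical path
for raw in dataset.take(1):
    example = tf.io.parse_single_example(raw, feature_spec)
    print({k: v.numpy()[:8] for k, v in example.items()})
```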
To download the OWT dataset, access the following link from a browser:

https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx

and manually download the `tar.xz` file there to the location you want (the link points to Google Drive, so a manual browser download is the simplest route).
As an alternative, to automatically download and extract the files using `gdown`, run:

```bash
bash download_and_extract.sh
```

Note that `download_and_extract.sh` may take a while to complete, as it unpacks 40GB of data (8,013,770 documents). Upon completion, the script will produce an `openwebtext` folder in the same location. The folder has multiple subfolders, each containing a collection of `*.txt` files of raw text data (one document per `.txt` file).
Note that this script installs `gdown` in your Python environment, so be sure you can install packages without admin privileges.
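If the shell script does not fit your setup, the same download step can be scripted directly with the `gdown` package. This is a minimal sketch, not a replacement for `download_and_extract.sh`: the output filename `openwebtext.tar.xz` is an assumption, and the extraction step assumes the archive unpacks directly into the `openwebtext` folder described above.

```python
# Sketch: download the OWT archive from Google Drive and unpack it.
# The local filename is an assumption; download_and_extract.sh remains
# the authoritative version of this step.
import tarfile

import gdown

url = "https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx"
output = "openwebtext.tar.xz"  # assumed local filename
gdown.download(url, output, quiet=False)

# Unpack the archive into the current directory.
with tarfile.open(output, "r:xz") as tar:
    tar.extractall(".")
```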
Define metadata files that contain paths to subsets of documents in the `openwebtext` folder to be used for training or evaluation. For example, for training we use a subset of 512,000 documents; the associated metadata file can be found in `metadata/train_512k.txt`.

For evaluation, we choose 5,000 documents that are outside of the training set; the metadata file for evaluation can be found in `metadata/val_files.txt`. Users are free to create their own metadata files to define train and evaluation (as well as test) data subsets of different contents and sizes, as in the sketch below.
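Assuming a metadata file is simply a list of document paths, one per line, a custom subset might be generated as follows. The subset size, random seed, and output filename `metadata/train_custom.txt` are hypothetical choices for illustration.

```python
# Sketch: build a custom metadata file listing a random subset of documents.
# Assumes the openwebtext/ layout described above (subfolders of *.txt files)
# and one path per line; "metadata/train_custom.txt" is a hypothetical name.
import os
import random

n_docs = 10_000  # assumed subset size

# Collect all raw-text documents under openwebtext/.
paths = []
for root, _, files in os.walk("openwebtext"):
    for name in files:
        if name.endswith(".txt"):
            paths.append(os.path.join(root, name))

# Shuffle deterministically and keep the first n_docs paths.
random.seed(0)
random.shuffle(paths)

os.makedirs("metadata", exist_ok=True)
with open("metadata/train_custom.txt", "w") as f:
    f.write("\n".join(paths[:n_docs]) + "\n")
```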
After downloading the raw data, you can convert it to HDF5 files using `preprocess_data.py`. The instructions for this can be found here.
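At a high level, that conversion presumably tokenizes the raw documents and stores fixed-length sequences in HDF5. The sketch below is a rough illustration of that idea under stated assumptions; the placeholder byte-level tokenizer, dataset name, sequence length, and file paths are all inventions for this example, not the interface of `preprocess_data.py`.

```python
# Illustrative sketch only: tokenize raw .txt documents listed in a metadata
# file and store fixed-length sequences in an HDF5 file. All names and
# parameters are assumptions; see preprocess_data.py for the real pipeline.
import h5py
import numpy as np

def encode(text):
    # Placeholder tokenizer: raw byte IDs. The real preprocessing would use
    # a proper subword tokenizer; this only stands in for it.
    return list(text.encode("utf-8"))

max_seq_len = 128  # assumed sequence length
sequences = []
with open("metadata/train_custom.txt") as f:  # hypothetical metadata file
    for path in f:
        with open(path.strip(), encoding="utf-8") as doc:
            ids = encode(doc.read())
        # Split each document into fixed-length chunks, dropping the remainder.
        for i in range(0, len(ids) - max_seq_len + 1, max_seq_len):
            sequences.append(ids[i : i + max_seq_len])

with h5py.File("train.h5", "w") as h5:  # hypothetical output file
    h5.create_dataset("data", data=np.array(sequences, dtype=np.int32))
```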