1. Set up the code to run PEGASUS

Please follow the following steps in order to replicate the results presented in this paper:

Set up the code to run PEGASUS
Download and sort dataset used for our results
Use our datasets in the PEGASUS model

Details for each step are outlined below:

1. Set up the code to run PEGASUS

Please follow the instructions below which can additionally be found here.

1.1 PEGASUS library

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on arXiv. ICML 2020 accepted.

If you use this code or these models, please cite the following paper:

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

1.2 Setup

create an instance on google cloud with GPU (optional)

Please create a project first and create an instance

gcloud compute instances create \
  ${VM_NAME} \
  --zone=${ZONE} \
  --machine-type=n1-highmem-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --boot-disk-size=500GB \
  --image-project=ml-images \
  --image-family=tf-1-15 \
  --maintenance-policy TERMINATE --restart-on-failure

install library and dependencies

Clone library on github and install requirements.

git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt

Download vocab, pretrained and fine-tuned checkpoints of all experiments from Google Cloud.

Alternatively in terminal, follow the instruction and install gsutil. Then

mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/

1.3 Finetuning on downstream datasets

on existing dataset

Finetune on an existing dataset aeslc.

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

If you would like to finetune on a subset of dataset, please refer to the example of input pattern.

Evaluate on the finetuned dataset.

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Note that the above example is using a single GPU so the batch_size is much smaller than the results reported in the paper.

add new finetuning dataset

Two types of dataset format are supported: TensorFlow Datasets (TFDS) or TFRecords.

This tutorial shows how to add a new dataset in TFDS. (The fine-tuning dataset is expected to be supervised, please provide supervised_keys in dataset info).

Tfrecords format requires each record to be a tf example of {"inputs":tf.string, "targets":tf.string}.

For example, if you registered a TFDS dataset called new_tfds_dataset for training and evaluation, and have some files in tfrecord format called new_dataset_files.tfrecord* for test, they can be registered in /pegasus/params/public_params.py.

@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)

Evaluation metrics.

Evaluation results can be found in mode_dir. Summarization metrics are automatically calculated for each evaluation point.

ROUGE is the main metric for summarization quality.
BLEU is an alternative quality metric for language generation.
Extractive Fragments Coverage & Density are metrics that measures the abstractiveness of the summary.
Repetition Rates measures generation repetition failure modes.
Length statistics measures the length distribution of decodes comparing to gold summary.

Several types of output files can be found in model_dir

text_metrics-*.txt: above metrics in text format. Each row contains metric name, 95% lower bound value, mean value, 95% upper bound value.
inputs-.txt, targets-.txt, predictions-*.txt: raw text files of model inputs/outputs.

2. Download and sort dataset used for our results

User @JafferWilson has provided the processed data, which you can download here. (See discussion here about why the original authors do not provide it themselves).

Download the CNN_STORIES_TOKENIZED and DM_STORIES_TOKENIZED here.
Extract the downloaded folders and move all ".story" folders into a single folder.
Use "Sortdata.py" to create sorted datasets, see further instructions in code comments.

3. Use created datasets in the PEGASUS model

Please follow the instructions in step "1.3 Finetuning on downstream datasets", to finetune the provided pre-trained PEGASUS model on the created datasets. Further instructions can be found here.

An example registry is as follows:

@registry.register("cnndm_CLOP_complexity_bucket_1")
def cnn_dailymail(param_overrides):
  return transformer_params(
      {
          "train_pattern": "tfrecord:path/to/dataset/cnndm_CLOP_complexity_bucket_1/cnndm_CLOP_complexity_bucket_1_train.tfrecord",
          "dev_pattern": "tfrecord:path/to/dataset/cnndm_CLOP_complexity_bucket_1/cnndm_CLOP_complexity_bucket_1_validate.tfrecord",
          "test_pattern": "tfds:cnn_dailymail/plain_text-test",
          "max_input_len": 1024,
          "max_output_len": 128,
          "train_steps": 500,
          "learning_rate": 0.001,
          "batch_size": 2,
      }, param_overrides)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.idea		.idea
README.md		README.md
sortData.py		sortData.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

1. Set up the code to run PEGASUS

1.1 PEGASUS library

1.2 Setup

create an instance on google cloud with GPU (optional)

install library and dependencies

1.3 Finetuning on downstream datasets

on existing dataset

add new finetuning dataset

Evaluation metrics.

2. Download and sort dataset used for our results

3. Use created datasets in the PEGASUS model

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

1. Set up the code to run PEGASUS

1.1 PEGASUS library

1.2 Setup

create an instance on google cloud with GPU (optional)

install library and dependencies

1.3 Finetuning on downstream datasets

on existing dataset

add new finetuning dataset

Evaluation metrics.

2. Download and sort dataset used for our results

3. Use created datasets in the PEGASUS model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages