
Commit 4d8f5d1

add xlnet mems and fix merge conflicts
1 parent 710b010 · commit 4d8f5d1
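The commit title refers to XLNet's `mems` (cached hidden states carried over from previously processed segments); the hunks visible below are only the accompanying documentation wording changes. As a rough, hypothetical sketch of how `mems` are typically reused with the `transformers` XLNet API (the checkpoint name and the exact `use_mems` flag are assumptions and may differ by library version), not code taken from this commit:

```python
# Hedged sketch: carry XLNet's cached hidden states ("mems") across segments so the
# model can attend beyond the current chunk without reprocessing earlier text.
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased", mem_len=512)
model.eval()

segments = [
    "The first chunk of a long document.",
    "The second chunk, which can attend to the first one through mems.",
]

mems = None  # cached hidden states from previously processed segments
with torch.no_grad():
    for text in segments:
        inputs = tokenizer(text, return_tensors="pt")
        # the flag may be `use_cache` rather than `use_mems` in older releases
        outputs = model(**inputs, mems=mems, use_mems=True)
        mems = outputs.mems  # carry the cache forward to the next segment
```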


47 files changed, +252 -128 lines changed

docs/source/installation.md

+1 -1
@@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
 It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
 `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
 
-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
+At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
 TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
 hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!

docs/source/model_doc/bertgeneration.rst

+1 -1
@@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
 
 The abstract from the paper is the following:
 
-*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
+*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We

docs/source/model_doc/deberta.rst

+2 -2
@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
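As a side note on the disentangled attention mentioned in the DeBERTa abstract above: roughly following the paper (notation approximate, not part of this diff), the unnormalized attention score between positions i and j decomposes into content and relative-position terms:

```latex
% Sketch of DeBERTa's disentangled attention score (notation approximate).
% Q^c, K^c act on content vectors; Q^r, K^r on relative-position embeddings;
% \delta(i,j) is the bucketed relative distance between positions i and j.
A_{i,j} = \underbrace{Q^c_i {K^c_j}^{\top}}_{\text{content-to-content}}
        + \underbrace{Q^c_i {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
        + \underbrace{K^c_j {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}
```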

docs/source/model_doc/distilbert.rst

+2 -2
@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
-knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
+knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
-biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
+biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*
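The "triple loss" in the DistilBERT abstract above is, roughly, a weighted combination of the soft-target distillation loss, the masked language modeling loss, and a cosine-embedding loss aligning student and teacher hidden states. A sketch paraphrasing the paper (the α weights are hyperparameters, not values from this commit):

```latex
% Approximate form of DistilBERT's training objective: student logits z_s, teacher
% logits z_t, temperature T, student/teacher hidden states h_s and h_t.
\mathcal{L} = \alpha_{ce}\,\mathcal{L}_{ce}\big(\mathrm{softmax}(z_t/T),\, \mathrm{softmax}(z_s/T)\big)
            + \alpha_{mlm}\,\mathcal{L}_{mlm}
            + \alpha_{cos}\,\mathcal{L}_{cos}(h_s, h_t)
```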

docs/source/model_doc/electra.rst

+4 -4
@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
 
 The abstract from the paper is the following:
 
-*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
-[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
+*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
+and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
-more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
+more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
-demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
+demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
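To make the "replaced token detection" idea in the ELECTRA abstract concrete, here is a small hedged sketch using the library's pretrained discriminator (the checkpoint name is an assumption and is unrelated to this commit): the model emits one score per input token indicating whether that token looks replaced.

```python
# Hedged sketch: ELECTRA's discriminator scores every token as "original" vs. "replaced",
# instead of reconstructing masked tokens as in MLM pretraining.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"  # assumed checkpoint
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

sentence = "The chef cooked the meal yesterday."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one replaced-token score per input token

flags = (torch.sigmoid(logits) > 0.5).long().squeeze().tolist()
for token, flagged in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), flags):
    print(f"{token:>12}  {'replaced?' if flagged else 'original'}")
```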

docs/source/model_doc/flaubert.rst

+1 -1
@@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
-time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
+time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*

docs/source/model_doc/gpt.rst

+1 -1
@@ -14,7 +14,7 @@ The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
-perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
+perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our

docs/source/model_doc/layoutlm.rst

+3 -3
@@ -6,19 +6,19 @@ Overview
 
 The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
-Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
+Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding.
 
 The abstract from the paper is the following:
 
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
-widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
+widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
 which is beneficial for a great number of real-world document image understanding tasks such as information extraction
 from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
 LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
-framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
+framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
 including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
 classification (from 93.07 to 94.42).*

docs/source/model_doc/lxmert.rst

+1 -1
@@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
-pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
+pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our

docs/source/model_doc/mbart.rst

+1 -1
@@ -13,7 +13,7 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
 According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
-corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
+corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.

docs/source/model_doc/prophetnet.rst

+2 -2
@@ -17,15 +17,15 @@ the next token.
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
 step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
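As a rough sketch of the future n-gram objective described in the abstract above (paraphrasing the paper, not part of this diff): at each step t the decoder is trained to predict the next n tokens at once, with the n prediction streams combined by weights:

```latex
% Approximate ProphetNet future n-gram objective: n prediction streams, weighted by \alpha_j.
\mathcal{L} = -\sum_{j=0}^{n-1} \alpha_j \sum_{t} \log p_\theta\big(y_{t+j} \mid y_{<t},\, x\big)
```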

docs/source/model_doc/t5.rst

+1 -1
@@ -17,7 +17,7 @@ The abstract from the paper is the following:
 task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
 has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
 transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
-text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
+text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
 approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
 with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
 summarization, question answering, text classification, and more. To facilitate future work on transfer learning for

docs/source/model_doc/xlmprophetnet.rst

+2 -2
@@ -19,15 +19,15 @@ just the next token. Its architecture is identical to ProhpetNet, but the model
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
 step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

docs/source/model_summary.rst

+7 -7
@@ -527,7 +527,7 @@ Pegasus
 <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
 
 Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
-two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
+two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
 objective, called Gap Sentence Generation (GSG).
 
 * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
@@ -609,7 +609,7 @@ MT5
 `mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
 et al.
 
-The model architecture is same as T5. mT5's pre-training objective includes T5's self-supervised training, but not T5's
+The model architecture is same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
 supervised training. mT5 is trained on 101 languages.
 
 The library provides a version of this model for conditional generation.
@@ -630,8 +630,8 @@ MBart
 `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
 Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
-for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
+The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
+for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages,
 
 The library provides a version of this model for conditional generation.
@@ -658,7 +658,7 @@ ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
+ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
 future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
 time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
 to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
@@ -683,8 +683,8 @@ XLM-ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
-pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
+XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
+on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
 
 The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
 versions for headline generation and question generation, respectively.

docs/source/task_summary.rst

+1 -1
@@ -305,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
 transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
 GPT-2 with causal language modeling.
 
-Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
+Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
 domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
 on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
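As a small hypothetical illustration of the domain-shift point in the hunk above, assuming `lysandre/arxiv-nlp` is a GPT-2-style causal language model as the surrounding documentation suggests (not code from this commit):

```python
# Hedged sketch: compare text sampled from a general-purpose causal LM with text from
# the same architecture fine-tuned on a specific domain (arXiv papers).
from transformers import pipeline

prompt = "In this paper, we study the convergence of"

general = pipeline("text-generation", model="gpt2")
domain = pipeline("text-generation", model="lysandre/arxiv-nlp")  # assumed GPT-2 fine-tune

print(general(prompt, max_length=40)[0]["generated_text"])
print(domain(prompt, max_length=40)[0]["generated_text"])
```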
