
Commit 66d50ca

Merge pull request #73 from huggingface/third-release
Third release
2 parents: 8c7267f + f9f3bdd

File tree: 8 files changed (+311 −67 lines)

README.md (+28 −15)
@@ -14,7 +14,7 @@ This implementation is provided with [Google's pre-trained models](https://githu
 | [Doc](#doc) | Detailed documentation |
 | [Examples](#examples) | Detailed examples on how to fine-tune Bert |
 | [Notebooks](#notebooks) | Introduction on the provided Jupyter Notebooks |
-| [TPU](#tup) | Notes on TPU support and pretraining scripts |
+| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
 | [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint in a PyTorch dump |
 
 ## Installation
@@ -46,13 +46,14 @@ python -m pytest -sv tests/
 
 This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:
 
-- Six PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
-  - [`BertModel`](./pytorch_pretrained_bert/modeling.py#L535) - raw BERT Transformer model (**fully pre-trained**),
-  - [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L689) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
-  - [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L750) - BERT Transformer with the pre-trained next sentence prediction classifier on top (**fully pre-trained**),
-  - [`BertForPreTraining`](./pytorch_pretrained_bert/modeling.py#L618) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
-  - [`BertForSequenceClassification`](./pytorch_pretrained_bert/modeling.py#L812) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
-  - [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L877) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
+- Seven PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
+  - [`BertModel`](./pytorch_pretrained_bert/modeling.py#L537) - raw BERT Transformer model (**fully pre-trained**),
+  - [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L691) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
+  - [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L752) - BERT Transformer with the pre-trained next sentence prediction classifier on top (**fully pre-trained**),
+  - [`BertForPreTraining`](./pytorch_pretrained_bert/modeling.py#L620) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
+  - [`BertForSequenceClassification`](./pytorch_pretrained_bert/modeling.py#L814) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
+  - [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L880) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
+  - [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L946) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
 
 - Three tokenizers (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
   - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
@@ -153,7 +154,7 @@ Here is a detailed documentation of the classes in the package and how to use th
 | Sub-section | Description |
 |-|-|
 | [Loading Google AI's pre-trained weigths](#Loading-Google-AIs-pre-trained-weigths-and-PyTorch-dump) | How to load Google AI's pre-trained weight or a PyTorch saved instance |
-| [PyTorch models](#PyTorch-models) | API of the six PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering` |
+| [PyTorch models](#PyTorch-models) | API of the seven PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering` |
 | [Tokenizer: `BertTokenizer`](#Tokenizer-BertTokenizer) | API of the `BertTokenizer` class|
 | [Optimizer: `BertAdam`](#Optimizer-BertAdam) | API of the `BertAdam` class |
 
@@ -167,25 +168,31 @@ model = BERT_CLASS.from_pretrain(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
 
 where
 
-- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the six PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering`, and
+- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the seven PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification` or `BertForQuestionAnswering`, and
 - `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
 
   - the shortcut name of a Google AI's pre-trained model selected in the list:
 
     - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
     - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
     - `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
-    - `bert-base-multilingual`: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
+    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
     - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
 
   - a path or url to a pretrained model archive containing:
-
-    - `bert_config.json` a configuration file for the model, and
-    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)
+
+    - `bert_config.json` a configuration file for the model, and
+    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance `BertForPreTraining` (saved with the usual `torch.save()`)
 
 If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
 - `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information)
 
+`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.
+
+**When using an `uncased model`, make sure to pass `--do_lower_case` to the training scripts. (Or pass `do_lower_case=True` directly to FullTokenizer if you're using your own script.)**
+
 Example:
 ```python
 model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
@@ -271,7 +278,13 @@ The sequence-level classifier is a linear layer that takes as input the last hid
 
 An example on how to use this class is given in the `run_classifier.py` script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.
 
-#### 6. `BertForQuestionAnswering`
+#### 6. `BertForTokenClassification`
+
+`BertForTokenClassification` is a fine-tuning model that includes `BertModel` and a token-level classifier on top of the `BertModel`.
+
+The token-level classifier is a linear layer that takes as input the last hidden state of the sequence.
+
+#### 7. `BertForQuestionAnswering`
 
 `BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifiers on top of the full sequence of last hidden states.

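The README changes above document a new head (`BertForTokenClassification`) and several new checkpoint names. As an illustration of how they combine, here is a minimal sketch (not part of the commit); it assumes that `from_pretrained` forwards extra keyword arguments such as `num_labels` to the model constructor, and that calling the model without labels returns per-token logits.

```python
# Minimal sketch: a cased checkpoint with the new token-classification head.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForTokenClassification

# Cased model, so the text is not lowercased (see the README note on --do_lower_case).
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# Assumption: num_labels is forwarded to the constructor the same way extra kwargs
# are for the other heads; set it to the size of your tag set (e.g. NER labels).
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=9)
model.eval()

tokens = ['[CLS]'] + tokenizer.tokenize("John Smith lives in Paris") + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # Assumption: without labels the forward pass returns per-token logits of
    # shape (batch_size, sequence_length, num_labels), one row per WordPiece.
    logits = model(input_ids)
print(logits.shape)
```

Since the new section 6 describes the token-level classifier as a linear layer over the last hidden states, the logits line up one-to-one with the WordPiece tokens.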
examples/extract_features.py (+2 −1)
@@ -199,6 +199,7 @@ def main():
                             "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese.")
 
     ## Other parameters
+    parser.add_argument("--do_lower_case", default=False, action='store_true', help="Set this flag if you are using an uncased model.")
     parser.add_argument("--layers", default="-1,-2,-3,-4", type=str)
     parser.add_argument("--max_seq_length", default=128, type=int,
                         help="The maximum total input sequence length after WordPiece tokenization. Sequences longer "
@@ -227,7 +228,7 @@ def main():
 
     layer_indexes = [int(x) for x in args.layers.split(",")]
 
-    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
+    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
 
     examples = read_examples(args.input_file)

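The new `--do_lower_case` flag matters because the tokenizer must match the checkpoint. A small illustration (not from the commit) of the difference it makes; the exact WordPiece splits depend on each checkpoint's vocabulary.

```python
# Illustration: the same sentence tokenized for an uncased and a cased checkpoint.
from pytorch_pretrained_bert import BertTokenizer

text = "John Smith lives in Paris"

# Uncased checkpoint: lowercase the input, which is what --do_lower_case enables.
uncased = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
print(uncased.tokenize(text))  # e.g. ['john', 'smith', 'lives', 'in', 'paris']

# Cased checkpoint: keep the original casing so it matches the cased vocabulary.
cased = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
print(cased.tokenize(text))    # e.g. ['John', 'Smith', 'lives', 'in', 'Paris']
```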
examples/run_classifier.py (+6 −2)
@@ -376,6 +376,10 @@ def main():
                         default=False,
                         action='store_true',
                         help="Whether to run eval on the dev set.")
+    parser.add_argument("--do_lower_case",
+                        default=False,
+                        action='store_true',
+                        help="Set this flag if you are using an uncased model.")
     parser.add_argument("--train_batch_size",
                         default=32,
                         type=int,
@@ -473,7 +477,7 @@ def main():
     processor = processors[task_name]()
     label_list = processor.get_labels()
 
-    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
+    tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
 
     train_examples = None
     num_train_steps = None
@@ -542,7 +546,7 @@ def main():
             for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                 batch = tuple(t.to(device) for t in batch)
                 input_ids, input_mask, segment_ids, label_ids = batch
-                loss, _ = model(input_ids, segment_ids, input_mask, label_ids)
+                loss = model(input_ids, segment_ids, input_mask, label_ids)
                 if n_gpu > 1:
                     loss = loss.mean() # mean() to average on multi-gpu.
                 if args.fp16 and args.loss_scale != 1.0:

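The last hunk reflects a change in the model's return value: when `label_ids` are passed, the fine-tuning head now returns just the loss. A hedged sketch of the two call patterns, with dummy tensors standing in for a real batch; the no-label call returning logits mirrors the evaluation path of the script and is an assumption here.

```python
# Sketch of the two call patterns for the fine-tuning heads after this change.
import torch
from pytorch_pretrained_bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

batch_size, seq_len = 8, 128
input_ids = torch.zeros(batch_size, seq_len, dtype=torch.long)  # dummy [PAD] batch
segment_ids = torch.zeros_like(input_ids)
input_mask = torch.ones_like(input_ids)
label_ids = torch.zeros(batch_size, dtype=torch.long)

# Training-style call, as in run_classifier.py: labels in, scalar loss out.
loss = model(input_ids, segment_ids, input_mask, label_ids)

# Inference-style call (assumed): no labels, classification logits out.
logits = model(input_ids, segment_ids, input_mask)
```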
pytorch_pretrained_bert/__init__.py (+2 −1)
@@ -1,6 +1,7 @@
 from .tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
 from .modeling import (BertConfig, BertModel, BertForPreTraining,
                        BertForMaskedLM, BertForNextSentencePrediction,
-                       BertForSequenceClassification, BertForQuestionAnswering)
+                       BertForSequenceClassification, BertForTokenClassification,
+                       BertForQuestionAnswering)
 from .optimization import BertAdam
 from .file_utils import PYTORCH_PRETRAINED_BERT_CACHE

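With the re-export above, the new head is importable from the package root as well as from the submodule; a quick illustrative check:

```python
# Both names should resolve to the same class once __init__.py re-exports it.
from pytorch_pretrained_bert import BertForTokenClassification
from pytorch_pretrained_bert.modeling import BertForTokenClassification as FromModeling

assert BertForTokenClassification is FromModeling
```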