
Questions around perplexity metric for language modeling #688

@jchwenger

Description

Hi there,

I've been fiddling with perplexity computation over the past few days, and I was struck by a few discrepancies:

  • In your manual implementation (the strided, sliding-window evaluation; see my sketch of it after this list), I read:

    When we run the above with stride = 1024, i.e. no overlap, the resulting PPL is 19.44, which is about the same as the 19.93 reported in the GPT-2 paper (note: p. 5).

  • On the reference page (see 'Examples'):
     
    import datasets
    import evaluate

    perplexity = evaluate.load("perplexity", module_type="metric")
    input_texts = datasets.load_dataset("wikitext",
                                        "wikitext-2-raw-v1",
                                        split="test")["text"][:50]
    input_texts = [s for s in input_texts if s!='']
    results = perplexity.compute(model_id='gpt2',
                                 predictions=input_texts)
    print(list(results.keys()))
    >>>['perplexities', 'mean_perplexity']
    print(round(results["mean_perplexity"], 2))
    >>>576.76
    print(round(results["perplexities"][0], 2))
    >>>889.28
    (Of course this is run on a very small sample, only the first 50 rows of the test split!)
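For reference, here is roughly how I ported that manual, strided evaluation into the notebook. This is a sketch under my own variable names and my reading of the guide, not its code verbatim, so the windowing details may differ slightly:

    import torch
    from datasets import load_dataset
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    # Join all rows so context carries across line breaks.
    encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

    max_length = model.config.n_positions  # 1024 for gpt2
    stride = 1024                          # stride == max_length -> no overlap
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # only these tokens contribute to the loss
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask the overlapping context tokens

        with torch.no_grad():
            # outputs.loss is the mean NLL over the unmasked target positions
            # (a slight approximation: the internal label shift drops one token)
            outputs = model(input_ids, labels=target_ids)
            nlls.append(outputs.loss * trg_len)

        prev_end = end
        if end == seq_len:
            break

    ppl = torch.exp(torch.stack(nlls).sum() / seq_len)
    print(f"perplexity: {ppl.item():.2f}")  # the guide reports 19.44 for stride = 1024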

Here's a Colab notebook with my experiments.

Question 1

Even when using evaluate, I get fairly different results: for the same 50-row selection, I obtain:

320.85
567.91

Has something in the implementation changed since the docs were written? What am I missing?

Question 2

There is still a humongous difference between the manual computation (the first page I quoted above, ported into the notebook, with similar results from a faster, batched implementation that uses stride in the tokenizer) and what the metric produces. For wikitext I'm expecting something on the order of 20-30, and lower for bigger models, not in the hundreds, let alone in the thousands (see the notebook for the result when joining all rows with \n\n). At first I thought it was because pad tokens were being counted, but that was wrong. I'm now at a loss as to where the difference might come from. Any ideas?
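Part of what I suspect (my own guess, I haven't confirmed it in the metric's source) is an aggregation mismatch: mean_perplexity looks like an arithmetic mean of per-row perplexities, with each short row scored without any preceding context, whereas the ~19.44 figure exponentiates the token-weighted mean NLL over the concatenated corpus. A toy illustration with made-up numbers:

    import math

    # Hypothetical (nll_sum, token_count) pairs per row; the short first row is
    # very "surprising" because it has no context, as wikitext headings would be.
    rows = [(20.0, 4), (30.0, 10), (300.0, 100)]

    # What I believe mean_perplexity does: average exp(NLL) per row.
    mean_of_row_ppls = sum(math.exp(nll / n) for nll, n in rows) / len(rows)

    # What the manual computation does: pool log-likelihoods, then exponentiate.
    corpus_ppl = math.exp(sum(nll for nll, _ in rows) / sum(n for _, n in rows))

    print(f"mean of per-row perplexities:     {mean_of_row_ppls:.2f}")  # ~62.9
    print(f"token-weighted corpus perplexity: {corpus_ppl:.2f}")        # ~21.5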

Question 3

Also, notice that I can pass the dataset either as a list of strings of varying lengths or as one long string, and it runs. The perplexity class itself, on its own, does not seem to handle the long-string case, though (no return_overflowing_tokens = True passed to the tokenizer, for instance; I ran a few tests with the class on its own), so I'm guessing the dataset is processed somewhere before being fed to it? I'm still working through the metric's whole invocation pipeline, but so far I haven't found where that happens.
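For what it's worth, this is the kind of tokenizer call I was expecting to find somewhere in that pipeline. It is purely an illustration of what I mean by handling one long string (overlapping fixed-length windows), not what the metric actually does:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    long_text = "some very long document " * 2000  # placeholder text

    # With a fast tokenizer, truncation + stride + return_overflowing_tokens
    # splits one long string into overlapping max_length-sized windows.
    enc = tokenizer(
        long_text,
        truncation=True,
        max_length=1024,
        stride=64,  # number of tokens shared between consecutive windows
        return_overflowing_tokens=True,
    )

    print(len(enc["input_ids"]))                       # number of windows
    print([len(ids) for ids in enc["input_ids"]][:3])  # all but the last are 1024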

Apologies if the first version of this issue had quite a few silly mistakes, and thanks for reading!
