Question regarding the calibration set #4

Hi!

Your paper states that you "randomly sample 256 questions from the NQ open training set". However, the `get_calib_data` function is implemented as follows:

```python
def get_calib_data(name, tokenizer, model_id, nsamples, seqlen=2048, seed=3):
    ...
    elif name == "nqopen":
        traindata = load_dataset("nq_open", split="train")
        tot_text = "\n\n".join(traindata["question"])        
    ...
    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, len(tot_text) - seqlen - 1)
        j = i + seqlen * 10
        trainenc = tokenizer(tot_text[i:j], return_tensors="pt")
        inp = trainenc.input_ids[:, :seqlen]
        attention_mask = torch.ones_like(inp)
        traindataset.append({"input_ids": inp, "attention_mask": attention_mask})
    torch.save(traindataset, cache_file)
    return traindataset
```
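Since `seqlen` is not assigned anywhere else, it keeps its default value of 2048. The code joins all questions into one string, picks a random character offset, tokenizes the following `seqlen * 10` characters, and keeps the first `seqlen` tokens. Questions in the nq_open dataset are usually quite short, so each of the `nsamples` samples spans many questions, and I think this code produces a calibration set containing far more than 256 questions.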
Am I right?
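
To make the concern concrete, here is a minimal sketch that mimics the sampling logic above and counts how many questions each sample overlaps. It is only an approximation under stated assumptions: the function name `questions_per_sample` is hypothetical, whitespace-separated words stand in for the real tokenizer's tokens, and the questions are synthetic placeholders rather than actual nq_open data.

```python
import random
import re


def questions_per_sample(questions, nsamples, seqlen=2048, seed=3):
    """Estimate how many questions each sampled window overlaps.

    Mirrors the offsets used in get_calib_data, but approximates tokens
    with whitespace-separated words (an assumption, not the real tokenizer).
    """
    rng = random.Random(seed)
    sep = "\n\n"
    tot_text = sep.join(questions)
    counts = []
    for _ in range(nsamples):
        # Same character-level window as in the quoted code.
        i = rng.randint(0, len(tot_text) - seqlen - 1)
        j = i + seqlen * 10
        window = tot_text[i:j]
        # Keep only the first `seqlen` pseudo-tokens, like input_ids[:, :seqlen].
        words = list(re.finditer(r"\S+", window))
        cut = words[seqlen - 1].end() if len(words) >= seqlen else len(window)
        truncated = window[:cut]
        # Each separator marks a boundary between two questions.
        counts.append(truncated.count(sep) + 1)
    return counts


# Synthetic short questions, used only for illustration.
demo_questions = [f"who wrote book number {k}?" for k in range(5000)]
demo_counts = questions_per_sample(demo_questions, nsamples=4)
print(demo_counts)
```

With short questions like these, every entry in the printed list is much larger than 1, which is the effect I am asking about. The exact numbers with the real tokenizer and dataset will differ.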
