Hi!
I found in your paper that you "randomly sample 256 questions from the NQ open training set". However, the get_calib_data function is implemented as follows:
def get_calib_data(name, tokenizer, model_id, nsamples, seqlen=2048, seed=3):
    ...
    elif name == "nqopen":
        traindata = load_dataset("nq_open", split="train")
        tot_text = "\n\n".join(traindata["question"])
    ...
    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, len(tot_text) - seqlen - 1)
        j = i + seqlen * 10
        trainenc = tokenizer(tot_text[i:j], return_tensors="pt")
        inp = trainenc.input_ids[:, :seqlen]
        attention_mask = torch.ones_like(inp)
        traindataset.append({"input_ids": inp, "attention_mask": attention_mask})
    torch.save(traindataset, cache_file)
    return traindataset
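Since seqlen is not set anywhere else, it takes its default value of 2048. Each of the nsamples samples is therefore a window of up to 2048 tokens cut from the concatenated question text, not a single question. Questions in the nq_open dataset are usually quite short, so I think this code produces a calibration set containing far more than 256 questions.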
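To make the concern concrete, here is a rough sketch that mimics the sampling loop above and counts how many questions each sample touches. It is not the repository's code: the helper names `questions_in_window` and `questions_per_sample` are hypothetical, whitespace splitting stands in for the real tokenizer, and the toy questions are made up, so the exact counts will differ from what the real tokenizer and nq_open give.

```python
import random


def questions_in_window(window, seqlen, sep="\n\n"):
    """Count the (possibly partial) questions covered by the first `seqlen` tokens.

    ASSUMPTION: whitespace-separated words stand in for real tokenizer tokens.
    """
    used = 0   # word "tokens" consumed so far
    count = 0  # questions touched so far
    for piece in window.split(sep):
        words = piece.split()
        if not words:
            continue
        if used >= seqlen:
            break
        used += len(words)
        count += 1
    return count


def questions_per_sample(questions, nsamples, seqlen=2048, seed=3, sep="\n\n"):
    """Mirror the character-window sampling of the quoted loop on a list of questions."""
    rng = random.Random(seed)
    tot_text = sep.join(questions)
    counts = []
    for _ in range(nsamples):
        # Same index arithmetic as the quoted code: offsets are in characters.
        i = rng.randint(0, len(tot_text) - seqlen - 1)
        j = i + seqlen * 10
        counts.append(questions_in_window(tot_text[i:j], seqlen, sep))
    return counts


# Toy stand-in for the nq_open questions (made-up data, for illustration only).
toy_questions = [f"who wrote book number {k}" for k in range(5000)]
counts = questions_per_sample(toy_questions, nsamples=256)
print("samples:", len(counts))
print("min / max questions per sample:", min(counts), max(counts))
print("total question slots across samples:", sum(counts))
```

If this reading of the loop is correct, nsamples=256 yields 256 long sequences, each packed with many short questions, rather than 256 individual questions.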
Am I right?