Operating system
Linux
Hardware
GPU with CUDA
Description
Hi, I think there is a small issue with how the eval_data_loader is instantiated when using the do_eval: True argument.
In the train.py script, the eval_data_loader is instantiated once at the beginning of the script, but since it is an Iterator object, it can only be iterated over once. Every call to the evaluate method after the first one therefore does not process the eval set and results in an eval_loss and eval_perplexity of 0.0.
A potential fix would be to instantiate it each time before the evaluate() call in the _train() method.
I tried this after noticing an eval_loss and eval_perplexity of NaN after the first eval iteration when testing on a small private training dataset.
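To illustrate the behaviour I mean, here is a minimal sketch that does not use the real training code. build_toy_loader and count_batches are hypothetical stand-ins for build_data_loader and evaluate, and I am assuming the real loader behaves like a one-shot generator:

```python
from typing import Iterator, List


def build_toy_loader(num_batches: int = 3) -> Iterator[List[int]]:
    """Hypothetical stand-in for build_data_loader: a one-shot generator."""
    for i in range(num_batches):
        yield [i, i + 1]


def count_batches(loader: Iterator[List[int]]) -> int:
    """Hypothetical stand-in for evaluate(): consumes the loader, counts batches."""
    return sum(1 for _ in loader)


# Built once and reused: the second pass sees no batches at all.
shared_loader = build_toy_loader()
first_pass = count_batches(shared_loader)
second_pass = count_batches(shared_loader)

# Rebuilt before each pass: every pass sees the full set.
rebuilt_passes = [count_batches(build_toy_loader()) for _ in range(2)]

print(first_pass, second_pass, rebuilt_passes)  # 3 0 [3, 3]
```

If evaluate averages the loss over the batches it saw, a pass over zero batches would explain the degenerate values after the first eval.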
Please tell me if I just misconfigured something or misunderstood the eval part of the training script.
Thank you in advance!
Extra information
Example of the fix in the _train() method in train.py:
def _train(args: TrainArgs, exit_stack: ExitStack):
    .............
    # Remove the initial instantiation
    # if args.do_eval:
    #     eval_data_loader = build_data_loader(
    #         instruct_tokenizer=interleaved_tokenizer,
    #         args=args.data,
    #         batch_size=args.batch_size,
    #         seed=None,
    #         rank=get_rank(),  # DDP rank
    #         world_size=get_world_size(),  # DDP world_size
    #         is_eval=True,
    #     )
    ...................
    if args.do_eval and (
        (args.eval_freq > 0 and state.step % args.eval_freq == 0) or is_last_step
    ):
        # Instantiate it at each eval call (could be done in a cleaner way I guess)
        eval_data_loader = build_data_loader(
            instruct_tokenizer=interleaved_tokenizer,
            args=args.data,
            batch_size=args.batch_size,
            seed=None,
            rank=get_rank(),
            world_size=get_world_size(),
            is_eval=True,
        )
        # write perplexity to state
        evaluate(model, eval_data_loader, state, args)
        eval_logs = get_eval_logs(
            state.step,
            avg_loss,
            state.this_eval_perplexity,
            state.this_eval_loss,
        )
    .......................
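One possibly cleaner variant, sketched here under assumptions: wrap the loader construction in a small re-iterable object so that the call site stays unchanged. ReiterableLoader and toy_batches are hypothetical names, not part of the repository, and this only helps if evaluate consumes the loader with a plain for loop rather than calling next() on it directly:

```python
from typing import Any, Callable, Iterator


class ReiterableLoader:
    """Hypothetical wrapper: calls the factory each time iteration starts."""

    def __init__(self, factory: Callable[[], Iterator[Any]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[Any]:
        # A fresh one-shot iterator is built for every new pass.
        return self._factory()


def toy_batches() -> Iterator[int]:
    """Stand-in for a one-shot data loader."""
    yield from range(3)


loader = ReiterableLoader(toy_batches)
first = list(loader)
second = list(loader)
print(first, second)  # [0, 1, 2] [0, 1, 2]
```

In train.py the factory would be something like a lambda around the build_data_loader call shown above, built once before the training loop.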
Environment
Fill in the following information on your system.
- Operating system version: Ubuntu 22.04.5 LTS
- Python version: 3.12.11
- PyTorch version: 2.7.1+cu126
- CUDA version (run python -c 'import torch; print(torch.version.cuda)'): 12.6
- GPU model and memory: NVIDIA GeForce RTX 3060 12 GB