Instantiation of the eval_set Iterator #13

Description

@NicolasBataille

Operating system

Linux

Hardware

GPU with CUDA

Description

Hi, I think there is a small issue regarding the instantiation of the eval_data_loader when using the do_eval: True argument.

In the train.py script, the eval_data_loader is instantiated once at the beginning of the script. Since it is an Iterator object, it can only be iterated over once, so every call to the evaluate method after the first one receives no batches, computes nothing on the eval set, and results in an eval_loss and eval_perplexity of 0.0.
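The behaviour can be reproduced without the training code. This is only a minimal sketch, not the repository's code: `make_batches` and `mean_loss` are hypothetical stand-ins for a generator-based data loader and an evaluation loop. Whether the second pass reports 0.0 or NaN depends on how the real evaluation code handles an empty loader; here the empty case is mapped to NaN.

```python
import math


def make_batches():
    """Hypothetical stand-in for a generator-based eval data loader."""
    for loss in (2.0, 4.0):
        yield loss


def mean_loss(batches):
    """Hypothetical eval loop: average the per-batch losses."""
    total, count = 0.0, 0
    for loss in batches:
        total += loss
        count += 1
    # With zero batches the mean is undefined; report NaN instead of dividing by zero.
    return total / count if count else float("nan")


loader = make_batches()      # built once, like eval_data_loader at start-up
first = mean_loss(loader)    # consumes the generator
second = mean_loss(loader)   # generator is already exhausted: no batches are seen

print(first, second)
```

The second call never enters the loop body, which matches the symptom of a meaningless metric on every evaluation after the first.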

I guess a potential fix would be to instantiate it each time before the evaluate() call in the _train() method.

I tried this after noticing an eval_loss and eval_perplexity of NaN after the first eval iteration when testing on a small private training dataset.
Please tell me if I just misconfigured something or misunderstood the eval part of the training script.

Thank you in advance!

Extra information

Example of the fix in the _train() method in train.py:

def _train(args: TrainArgs, exit_stack: ExitStack):
    .............
    # Remove the initial instantiation
    # if args.do_eval:
    #     eval_data_loader = build_data_loader(
    #         instruct_tokenizer=interleaved_tokenizer,
    #         args=args.data,
    #         batch_size=args.batch_size,
    #         seed=None,
    #         rank=get_rank(),  # DDP rank
    #         world_size=get_world_size(),  # DDP world_size
    #         is_eval=True,
    #     )
    ...................
        if args.do_eval and (
            (args.eval_freq > 0 and state.step % args.eval_freq == 0) or is_last_step
        ):
            # Instantiate it at each eval call (could be done in a cleaner way I guess)
            eval_data_loader = build_data_loader(
                instruct_tokenizer=interleaved_tokenizer,
                args=args.data,
                batch_size=args.batch_size,
                seed=None,
                rank=get_rank(),
                world_size=get_world_size(),
                is_eval=True,
            )
            # write perplexity to state
            evaluate(model, eval_data_loader, state, args)

            eval_logs = get_eval_logs(
                state.step,
                avg_loss,
                state.this_eval_perplexity,
                state.this_eval_loss,
            )
            .......................
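As for the "cleaner way" mentioned in the comment above, one option might be to wrap the loader construction in a small re-iterable object, so that every `for` loop over it starts from a fresh iterator. This is only a sketch under assumptions: `ReiterableLoader` and `toy_batches` are hypothetical names that do not exist in the repository, and it assumes evaluate() consumes the loader with a plain `for` loop and that the factory returns a new iterator on each call.

```python
from typing import Any, Callable, Iterator


class ReiterableLoader:
    """Hypothetical wrapper: calls the factory again on every iteration."""

    def __init__(self, factory: Callable[[], Iterator[Any]]):
        self._factory = factory

    def __iter__(self) -> Iterator[Any]:
        # A fresh iterator is created each time a loop starts.
        return self._factory()


def toy_batches():
    """Hypothetical stand-in for a one-shot, generator-based data loader."""
    yield from ([1, 2], [3, 4])


eval_loader = ReiterableLoader(toy_batches)
first_pass = list(eval_loader)
second_pass = list(eval_loader)  # not empty: the factory was called again

print(first_pass, second_pass)
```

In train.py the factory could presumably be a lambda around the existing build_data_loader(...) call with is_eval=True, which would keep the single instantiation at the top of _train() while still giving each evaluate() call a full pass over the eval set.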

Environment

Fill in the following information on your system.

  • Operating system version: Ubuntu 22.04.5 LTS
  • Python version: 3.12.11
  • PyTorch version: 2.7.1+cu126
  • CUDA version (run python -c 'import torch; print(torch.version.cuda)'): 12.6
  • GPU model and memory: NVIDIA GeForce RTX 3060 12 GB

Labels

bug (Something isn't working)