Instantiation of the eval_set Iterator

### Operating system

Linux

### Hardware

GPU with CUDA

### Description

Hi, I think there is a small issue with the instantiation of the `eval_data_loader` when using the `do_eval: True` argument.

In the `train.py` script, the `eval_data_loader` is instantiated once at the beginning of the script. Since it is an Iterator object, it can only be iterated over once, so every call to the `evaluate` method after the first one does not process the eval set and results in an `eval_loss` and `eval_perplexity` of 0.0.
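To illustrate the behaviour outside of the training script, here is a minimal sketch that uses a plain Python generator as a stand-in for the data loader. `fake_data_loader` and `mean_loss` are hypothetical names, not functions from the repository, and the 0.0 fallback is only my assumption about how an empty evaluation pass could end up reporting zero.

```python
from typing import Iterator


def fake_data_loader(num_batches: int) -> Iterator[float]:
    """Hypothetical stand-in for build_data_loader: yields one loss value per batch."""
    for i in range(num_batches):
        yield float(i + 1)


def mean_loss(loader: Iterator[float]) -> tuple[float, int]:
    """Consume the loader and return (mean loss, number of batches seen)."""
    total, count = 0.0, 0
    for loss in loader:
        total += loss
        count += 1
    # Assumption: report 0.0 when no batch was seen, instead of dividing by zero.
    return (total / count if count else 0.0), count


loader = fake_data_loader(3)
first = mean_loss(loader)   # consumes all three batches
second = mean_loss(loader)  # the generator is already exhausted, so no batch is seen

print(first)   # (2.0, 3)
print(second)  # (0.0, 0)
```

The second pass sees no batches at all, which matches the symptom of the eval metrics collapsing after the first evaluation.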

I guess a potential fix would be to instantiate it each time before the `evaluate()` call in the `_train()` method.

I tried this because I noticed an `eval_loss` and `eval_perplexity` of NaN after the first eval iteration when testing on a small private training dataset.
Please tell me if I just misconfigured something or misunderstood the eval part of the training script.

Thank you in advance!

### Extra information

Example of the fix in the `_train()` method in `train.py`:
```python
def _train(args: TrainArgs, exit_stack: ExitStack):
    .............
    # Remove the initial instantiation
    # if args.do_eval:
    #     eval_data_loader = build_data_loader(
    #         instruct_tokenizer=interleaved_tokenizer,
    #         args=args.data,
    #         batch_size=args.batch_size,
    #         seed=None,
    #         rank=get_rank(),  # DDP rank
    #         world_size=get_world_size(),  # DDP world_size
    #         is_eval=True,
    #     )
    ...................
        if args.do_eval and (
            (args.eval_freq > 0 and state.step % args.eval_freq == 0) or is_last_step
        ):
            # Instantiate it at each eval call (could be done in a cleaner way I guess)
            eval_data_loader = build_data_loader(
                instruct_tokenizer=interleaved_tokenizer,
                args=args.data,
                batch_size=args.batch_size,
                seed=None,
                rank=get_rank(),
                world_size=get_world_size(),
                is_eval=True,
            )
            # write perplexity to state
            evaluate(model, eval_data_loader, state, args)

            eval_logs = get_eval_logs(
                state.step,
                avg_loss,
                state.this_eval_perplexity,
                state.this_eval_loss,
            )
            .......................

```
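As a possibly cleaner alternative to rebuilding the loader inline, the construction could be wrapped in a small re-iterable object that calls a factory each time iteration starts. This is only a sketch: `ReIterable` and `fake_eval_batches` are hypothetical names that do not exist in the repository, and the generator stands in for `build_data_loader(..., is_eval=True)`.

```python
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


class ReIterable(Generic[T]):
    """Calls a factory on every __iter__, so each `for` loop gets a fresh iterator."""

    def __init__(self, factory: Callable[[], Iterator[T]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[T]:
        return self._factory()


def fake_eval_batches() -> Iterator[int]:
    # Hypothetical stand-in for build_data_loader(..., is_eval=True)
    yield from range(3)


eval_batches = ReIterable(fake_eval_batches)
first_pass = list(eval_batches)   # builds a fresh iterator
second_pass = list(eval_batches)  # builds another one, so nothing is lost

print(first_pass, second_pass)
```

In `_train()`, the factory could be a `lambda` or `functools.partial` around `build_data_loader` with the same arguments as above. This would only help if `evaluate()` iterates with a `for` loop (or `iter()`) rather than calling `next()` on the object directly, which I have not checked.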

### Environment

Fill in the following information on your system.
- Operating system version: Ubuntu 22.04.5 LTS
- Python version: 3.12.11
- PyTorch version: 2.7.1+cu126
- CUDA version (run `python -c 'import torch;  print(torch.version.cuda)'`): 12.6
- GPU model and memory: NVIDIA GeForce RTX 3060 12 GB

