
How to properly integrate Energon dataloader with Hugging Face Accelerate? #155

@leo1oel

Problem Description

I'm trying to use Megatron-Energon's dataloader with Hugging Face Accelerate for distributed training, but I'm running into a tensor concatenation error when I pass the dataloader to accelerator.prepare().

Custom Sample Definition

from dataclasses import dataclass

import torch
from megatron.energon import Sample, basic_sample_keys


@dataclass
class CustomSample(Sample):
    cls: torch.Tensor
    image: torch.Tensor
    latent: torch.Tensor


def cook_custom_sample(sample: dict) -> CustomSample:
    # Turn a raw (crude) webdataset sample into a typed CustomSample.
    return CustomSample(
        **basic_sample_keys(sample),
        cls=torch.tensor(sample["cls"], dtype=torch.long),
        image=sample["image.png"],
        latent=torch.from_numpy(sample["latent.npy"]),
    )
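
For context, the cooker is hooked into the task encoder via Energon's cookers list. A minimal sketch of what my CustomTaskEncoder boils down to (assuming the standard crude-data pattern; the real encoder may do more):

from megatron.energon import Cooker, DefaultTaskEncoder

class CustomTaskEncoder(DefaultTaskEncoder):
    # Register the cooker so crude samples are turned into CustomSample instances.
    cookers = [Cooker(cook_custom_sample)]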

Training Setup

ds = get_train_dataset(
    args.data_dir,
    batch_size=local_batch_size,
    shuffle_buffer_size=100,
    task_encoder=CustomTaskEncoder(),
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)
train_dataloader = get_savable_loader(ds)
# This line causes the error
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)
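
Side note: I assume WorkerConfig.default_worker_config() picks up the rank and world size from torch.distributed, i.e. roughly the equivalent of the sketch below, which is part of why I'm unsure whether Accelerate should touch the loader at all.

import torch.distributed as dist
from megatron.energon import WorkerConfig

# Sketch (assumption): building the worker config explicitly per rank;
# Energon then shards the data across ranks/workers itself.
worker_config = WorkerConfig(
    rank=dist.get_rank(),
    world_size=dist.get_world_size(),
    num_workers=4,
)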

Error Message

TypeError: Can only concatenate tensors but got <class 'int'>

The error occurs in Accelerate's internal batch-concatenation utilities once the prepared dataloader starts processing Energon batches, presumably because the batches carry non-tensor fields (e.g. the Sample metadata such as __key__ and __restore_key__) that Accelerate tries to concatenate as if they were tensors.

Does Energon handle distributed training internally, making Accelerate preparation unnecessary? What's the recommended integration pattern for using both libraries together?

Any guidance on the proper integration pattern would be greatly appreciated!
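
For reference, the workaround I'm leaning towards is to let Energon do the per-rank data sharding through its WorkerConfig and to only prepare the model and optimizer, leaving the Energon loader unwrapped. A minimal sketch (I'm not sure this is the intended pattern, hence the question):

# Only the model and optimizer go through Accelerate; the Energon loader is used as-is.
model, optimizer = accelerator.prepare(model, optimizer)

for batch in train_dataloader:
    # Since the loader is not prepared, tensors must be moved to the device manually.
    # (Field names assume the batch keeps CustomSample's fields.)
    cls = batch.cls.to(accelerator.device)
    image = batch.image.to(accelerator.device)
    latent = batch.latent.to(accelerator.device)
    # ... forward / backward / optimizer.step() as usual ...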
