Problem Description
I'm trying to use Megatron-Energon's dataloader with Hugging Face Accelerate for distributed training, but I'm hitting a tensor concatenation error when calling accelerator.prepare() on the dataloader.
Custom Sample Definition
from dataclasses import dataclass

import torch
from megatron.energon import Sample, basic_sample_keys


@dataclass
class CustomSample(Sample):
    cls: torch.Tensor
    image: torch.Tensor
    latent: torch.Tensor


def cook_custom_sample(sample: dict) -> CustomSample:
    # Build a CustomSample from the raw webdataset entry.
    return CustomSample(
        **basic_sample_keys(sample),
        cls=torch.tensor(sample["cls"], dtype=torch.long),
        image=sample["image.png"],
        latent=torch.from_numpy(sample["latent.npy"]),
    )
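For context, CustomTaskEncoder essentially just registers the cooker above. A simplified sketch of what I mean (the exact Cooker registration below is illustrative, based on my reading of Energon's crude-data cooking API, not my verbatim code):

from megatron.energon import Cooker, DefaultTaskEncoder


class CustomTaskEncoder(DefaultTaskEncoder):
    # Assumption: registering the cooker here turns raw dict samples into CustomSample.
    cookers = [Cooker(cook_custom_sample)]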
Training Setup

from megatron.energon import WorkerConfig, get_savable_loader, get_train_dataset

ds = get_train_dataset(
    args.data_dir,
    batch_size=local_batch_size,
    shuffle_buffer_size=100,
    task_encoder=CustomTaskEncoder(),
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
)
train_dataloader = get_savable_loader(ds)

# This line causes the error
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

Error Message
TypeError: Can only concatenate tensors but got <class 'int'>
The error is raised inside Accelerate's internal concatenation/gather utilities when it processes batches coming out of the Energon dataloader, presumably because the batched samples carry non-tensor metadata fields (plain ints) that Accelerate then tries to concatenate.
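As a workaround, the only pattern I can think of is to prepare just the model and optimizer and leave the Energon loader untouched. A minimal sketch of what I mean, reusing the definitions above (untested; the explicit WorkerConfig(rank=..., world_size=...) arguments are my assumption about how Energon expects per-rank sharding to be configured):

from accelerate import Accelerator
from megatron.energon import WorkerConfig, get_savable_loader, get_train_dataset

accelerator = Accelerator()

# Assumption: a rank-aware worker config lets Energon shard the dataset
# across processes itself instead of relying on Accelerate's sampler logic.
worker_config = WorkerConfig(
    rank=accelerator.process_index,
    world_size=accelerator.num_processes,
    num_workers=2,
)

ds = get_train_dataset(
    args.data_dir,
    batch_size=local_batch_size,
    shuffle_buffer_size=100,
    task_encoder=CustomTaskEncoder(),
    max_samples_per_sequence=100,
    worker_config=worker_config,
)
train_dataloader = get_savable_loader(ds)

# Only the model and optimizer go through accelerator.prepare();
# the Energon loader is iterated directly.
model, optimizer = accelerator.prepare(model, optimizer)

for batch in train_dataloader:
    ...  # forward/backward as usual

The idea is that Accelerate still handles DDP wrapping and gradient sync, while Energon owns sharding, shuffling, and the checkpointable loader state.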
Does Energon handle distributed training internally, making Accelerate preparation unnecessary? What's the recommended integration pattern for using both libraries together?
Any guidance would be greatly appreciated!