-
The accelerate library will do it for you: https://github.com/huggingface/accelerate
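Roughly, you wrap your training objects with `Accelerator.prepare` and it handles device placement and, under distributed launches, per-process data sharding. A minimal sketch of that pattern (the model, optimizer, and dataset below are placeholders, not code from any particular repo):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, only to illustrate the pattern.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

accelerator = Accelerator()
# prepare() moves everything to the right device(s); when launched with
# multiple processes it also wraps the dataloader so each process sees its own shard.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```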
-
I can run the example here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/train_lora/llama3_lora_sft_ray.yaml. After changing the storage path and modifying the storage logic, it works when scaled to 2 workers.
I wanted to verify how data is distributed across workers.
Looking at both the Ray documentation and the Ray implementation, I don't see any reference to passing the dataset to the TorchTrainer, or any use of dataset shards or data loaders:
https://docs.ray.io/en/latest/train/user-guides/data-loading-preprocessing.html
Is this not using Ray Data?
How can I verify that data is being split properly between the workers?
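For reference, the pattern I expected based on the Ray docs looks roughly like this. This is only a sketch, not LLaMA-Factory's code; the toy dataset and the per-worker row count are just an illustrative way to check the split:

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Toy dataset only to illustrate sharding; not the SFT data LLaMA-Factory uses.
train_ds = ray.data.from_items([{"x": i} for i in range(100)])

def train_loop_per_worker(config):
    # Each Ray Train worker gets its own shard of the dataset registered as "train".
    shard = ray.train.get_dataset_shard("train")
    rows = 0
    for batch in shard.iter_batches(batch_size=16):
        rows += len(batch["x"])
    # Counting rows per worker is one way to confirm the data was actually split.
    rank = ray.train.get_context().get_world_rank()
    print(f"worker rank {rank} saw {rows} rows")

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_ds},
)
trainer.fit()
```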