
Scaling up Batch Size and GPU Usage to Accelerate Training #249

@yoyowang0109

Description


Hi, and thank you for your support so far.

After setting tokens_per_batch to 8192, training runs smoothly. However, from the source code it appears that each batch is constrained to a size of 1, which also limits training to a single GPU; otherwise, the following assertion error is triggered:
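To illustrate why I think the size-1 constraint is not fundamental (this is a toy example of mine, nothing here is from the repo): accumulating gradients across several size-1 batches before each optimizer step produces exactly the same parameter update as one larger batch, shown below for a one-parameter least-squares model.

```python
# Hypothetical sketch: gradient accumulation makes k micro-batches of
# size 1 mathematically equivalent to one batch of size k, demonstrated
# on the model y = w * x with squared-error loss.

def grad(w, x, y):
    """d/dw of 0.5 * (w*x - y)**2 for a single sample."""
    return (w * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
lr = 0.01
w0 = 0.5

# (a) one "large batch" step: average gradient over all samples at once
big = w0 - lr * sum(grad(w0, x, y) for x, y in zip(xs, ys)) / len(xs)

# (b) accumulate size-1 gradients, then take a single averaged step
acc = 0.0
for x, y in zip(xs, ys):
    acc += grad(w0, x, y)
small = w0 - lr * acc / len(xs)

assert abs(big - small) < 1e-12  # the two updates are identical
```

So even without touching the multi-GPU path, an accumulation loop around the existing size-1 batches should reduce optimizer-step overhead without changing the math.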

[Screenshot: assertion error traceback]

This restriction results in a very large number of batches being processed; for example, pre-training on the MIMIC-IV dataset currently takes approximately 42–45 days to complete.

Given that Appendix C of your paper mentions using “24 Intel Xeon 2.70GHz CPU cores and 8 Nvidia V100 GPUs,” I’m wondering:

  1. Does the current version of the code support multi-GPU training and larger batch sizes?
  2. If not directly, are there recommended changes or workarounds to enable this?
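To make question 1 concrete (purely a sketch of mine; `pack_batches` and the padded-footprint rule are my assumptions, not the repo's actual logic), a token-budget batching scheme would let a tokens_per_batch of 8192 hold several sequences per step instead of one:

```python
# Hypothetical sketch: greedily pack variable-length sequences into
# batches whose padded footprint (batch_size * max_len_in_batch) stays
# within tokens_per_batch, instead of one sequence per batch.

def pack_batches(seq_lengths, tokens_per_batch):
    """Group sequence indices (shortest first) so each batch's padded
    token footprint fits within tokens_per_batch."""
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
    batches, current = [], []
    for i in order:
        candidate = current + [i]
        max_len = max(seq_lengths[j] for j in candidate)
        if len(candidate) * max_len <= tokens_per_batch:
            current = candidate
        else:
            batches.append(current)
            current = [i]
    if current:
        batches.append(current)
    return batches

lengths = [100, 2000, 150, 120, 3000, 90]
batches = pack_batches(lengths, tokens_per_batch=8192)
# every batch respects the token budget
assert all(len(b) * max(lengths[j] for j in b) <= 8192 for b in batches)
# every sequence appears exactly once
assert sorted(i for b in batches for i in b) == list(range(len(lengths)))
```

If something like this is compatible with the model's forward pass, it would also make data-parallel training across GPUs more natural, since each rank could draw its own packed batches.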

Any guidance or suggestions would be greatly appreciated.
Thank you again for your time and work on this project.
