#show heading.where(level: 1): contents => text(size: 20pt, contents)
#show heading: contents => pad(bottom: 10pt, contents)
#set quote(block: true)
#set heading(numbering: (ignore_first, ..n) => {
if (n.pos().len() != 0) {
numbering("1.1.", ..n)
}
})
#let stick-together(a, threshold: 3em) = {
block(a + v(threshold), breakable: false)
v(-1 * threshold)
}
= Summer in the Lab \@ LiGHT
Simon Lefort, supervised by Michael Zhang, Annie Hartley.
== Problem Statement
The Laboratory for Intelligent Global Health & Humanitarian Response Technologies (LiGHT) creates data-driven tools to replace missing resources or expertise in remote and resource-constrained environments.
One of these projects is Meditron, a family of open-source large language models (LLMs) for medicine. More recently, the lab developed a multimodal version of these models, MultiMeditron.
My research consisted of finding a way to scale the MultiMeditron training pipeline settings to efficiently train a 70B-parameter version of the multimodal model.
In this document, I will first present what Meditron, MultiMeditron and DeepSpeed are. In a second part, I will explain how I used profiling tools such as Nvidia Nsight (Systems and Streamer) and the DeepSpeed FLOPS profiler to increase the number of tokens/GPU/second during training, and finally present the results obtained.
=== What is MultiMeditron?
MultiMeditron is a multimodal model based on the LLaVA architecture, which in turn is based on the LLaMA architecture. It is trained to understand and generate text and images, making it suitable for various applications in the medical field, such as analyzing medical images and generating reports.
The LLaVA architecture combines a vision encoder (like CLIP) with a language model (like LLaMA). It uses the vision encoder to produce embeddings from images, which are then projected into the same space as the language model's embeddings (aligned), and sent to the LLM as part of the input.
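As an illustration, the projection step can be sketched as a small PyTorch module; the two-layer MLP and the dimensions below are illustrative assumptions, not MultiMeditron's exact configuration:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Projects vision-encoder patch embeddings into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # LLaVA-1.5-style two-layer MLP projector (illustrative choice)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(image_embeds)

# Toy dimensions: 576 CLIP-like 1024-d patches projected into a 4096-d LLM space
projector = VisionProjector(vision_dim=1024, llm_dim=4096)
patches = torch.randn(1, 576, 1024)   # (batch, num_patches, vision_dim)
image_tokens = projector(patches)     # (batch, num_patches, llm_dim)
print(image_tokens.shape)             # torch.Size([1, 576, 4096])
```

The projected `image_tokens` are then interleaved with the text embeddings before being fed to the language model.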
== Getting started: training on a single node
The simplest attempt to train MultiMeditron is to run the training on a single node (a single server).
The EPFL cluster, RCP, offers nodes with up to 8 GPUs. Thus, the first test to perform was to test whether the 70B model could fit on these 8 GPUs.
```
➜ ~ kubectl get pods
NAME READY STATUS RESTARTS AGE
meditron-basic-0-0 0/1 OOMKilled 0 16m
➜ ~
```
As shown above, after 16 minutes, the training stopped because the container ran out of memory. During training, two types of memory errors can occur:
- a Host OOM: one of the nodes running the training has run out of memory.
- a CUDA OOM: one of the GPUs connected to the nodes has run out of memory.
In this case, this was a *Host OOM*. To fix it, there are three main approaches:
- directly modify the training pipeline code to reduce memory usage
- change the HuggingFace trainer configuration
- change the DeepSpeed configuration
Indeed, the training pipeline uses the `transformers` library from HuggingFace and extends its functionality.
#block(
fill: rgb("#e3f2fd"),
stroke: (left: 3pt + rgb("#1976d2")),
radius: 4pt,
inset: 12pt,
[
#text(rgb("#1976d2"), size: 11pt)[*Not familiar with DeepSpeed?*] \
I've written a short but detailed explanation of what DeepSpeed is and the main optimizations it offers #link(label("deepspeed-explanations"))[*here*]. Feel free to check it out before continuing your reading.
]
)
== Setting up DeepSpeed ZeRO-3 in the training pipeline #label("beginning-work")
After configuring DeepSpeed to use ZeRO-3, I noticed using `nvidia-smi` and `htop` that my 8 H200 GPUs were still being loaded with about 80GB each right after loading the model, as if each one still held a full copy of the model in memory (which should not happen with ZeRO-3, since it partitions the model precisely to eliminate this redundancy and save memory).
The first part of the work was therefore to debug the pipeline to make it support DeepSpeed ZeRO-3 properly! In short: by adding some logs to the `transformers` library's code, I realized that `TrainingArguments` had to be instantiated before loading the model, even though it was not used before that point.
Memory usage was then pretty stable, at only about 20GB on each of my 8 H200 GPUs. However, once the model was loaded and training started, memory usage spiked again, this time leading to a CUDA OOM.
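The required ordering can be sketched as follows (the config path and checkpoint name are hypothetical); creating `TrainingArguments` is what parses the DeepSpeed config and lets `from_pretrained` detect ZeRO-3:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Instantiating TrainingArguments BEFORE the model parses the DeepSpeed config
# and activates ZeRO-3 detection inside `from_pretrained`, so weights are
# partitioned at load time instead of each rank holding a full copy.
training_args = TrainingArguments(
    output_dir="out",
    bf16=True,
    deepspeed="ds_zero3_config.json",  # hypothetical path to the ZeRO-3 config
)

# Only now load the model: with ZeRO-3 active, each rank materializes
# just its own shard of the parameters.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",        # illustrative checkpoint name
    torch_dtype=torch.bfloat16,
)
```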
== Setting up distributed training
After trying a bunch of different optimizations and DeepSpeed/HuggingFace configurations, I gave up on a single node and switched to distributed training across several nodes. The training finally started!
See the `scripts` folder for more details.
#figure(
image("assets/distrib.png"),
caption: [Distributed training setup showing GPU memory usage across 16 H200 GPUs]
)
My 16 H200 GPUs each used about 100-110GB, which was still fine.
== The loss issue
Another important issue at that point, which I did not notice at first, is that the initial loss of my model was around 15, which is far too high considering that we were using a pre-trained LLaMA model.
After a while, Michael found that the issue came from two missing lines in the pipeline (combined with our use of an older version of `transformers`), which caused the model's weights not to be initialized properly.
```python
import torch
import deepspeed

with deepspeed.zero.Init(dtype=torch.bfloat16):
    # call to from_pretrained
```
The `deepspeed.zero.Init` function is a context manager that ensures that when loading a pre-trained model, the weights are properly partitioned across GPUs according to ZeRO-3's strategy, rather than loading the full model on each GPU. Because of these missing lines, the model was not properly initialized and the training started from scratch (random weights), leading to a very high loss.
You can read more about it #link("https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/deepspeed#constructing-massive-models")[here].
== Nvidia Nsight
Now that the training was running, we needed some way to profile it and see where the bottleneck was. We decided to use Nvidia Nsight Systems for this purpose. We configured a HuggingFace callback to trigger the profiling between two steps.
See the `scripts` folder for more details.
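The callback can be sketched like this; the step numbers are arbitrary, and it assumes the job is launched under `nsys profile --capture-range=cudaProfilerApi` so the capture only covers the chosen steps:

```python
import torch
from transformers import TrainerCallback

class NsysProfileCallback(TrainerCallback):
    """Start/stop an Nsight Systems capture around a chosen training step.
    Assumes the job runs under: nsys profile --capture-range=cudaProfilerApi"""

    def __init__(self, start_step: int = 10, stop_step: int = 12):
        self.start_step, self.stop_step = start_step, stop_step

    def on_step_begin(self, args, state, control, **kwargs):
        if state.global_step == self.start_step:
            torch.cuda.profiler.start()   # cudaProfilerStart: nsys begins capturing

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step == self.stop_step:
            torch.cuda.profiler.stop()    # cudaProfilerStop: nsys ends capturing
```

The callback is then passed to the HuggingFace `Trainer` via its `callbacks` argument.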
=== Nvidia Nsight Streamer
We ran some tests with Nvidia Nsight on the 8B model, which went fine on my laptop, but when we got the 1.5GB `.nsys-rep` report file, we gave up on opening it locally after a day. I started looking for ways to reduce the size of the report, and found a post mentioning Nvidia Nsight Streamer, which allows running the Nvidia Nsight GUI on a remote server and accessing the streamed UI locally.
For reference, here is how I submitted the job on the RCP:
```sh
runai submit nsight-streamer \
--image nvcr.io/nvidia/devtools/nsight-streamer-nsys:2025.3.1 \
--pvc light-scratch:/lightscratch \
--large-shm \
--run-as-gid 84257 \
-e ENABLE_RESIZE=true \
-e MAX_RESOLUTION=3440x1440
```
Port forward to access it locally:
```sh
runai port-forward nsight-streamer --port 8080:8080
```
Access it from a browser at `http://nvidia:nvidia@localhost:8080`.
=== The first result
#figure(
image("assets/nvidia_nsight.png"),
caption: [Nvidia Nsight profiling results showing performance bottlenecks in DeepSpeed ZeRO-3 functions]
)
We finally opened the 70B Nsight report! We then noticed that most of the time was spent in the following DeepSpeed functions:
- `DeepSpeedZeroOptimizer_Stage3.create_reduce_and_remove_grad_hooks`
- which calls `DeepSpeedZeroOptimizer_Stage3.__reduce_and_partition_ipg_grads`
- which calls the `cudaEventSynchronize` blocking function.
After doing some research, we found that this was very probably due to the `stage3_gather_16bit_weights_on_model_save` option, which gathers the 16-bit weight shards from every single GPU in order to save them in a proper `safetensors` format.
== DeepSpeed FLOPS profiler
=== Noticing the issue with the batch size
Even after disabling `stage3_gather_16bit_weights_on_model_save`, training remained very slow, with a throughput of about 30 tokens per GPU per second.
We then turned to the DeepSpeed FLOPS profiler for more insight. We ended up running an 8B model to see what was going on, and experienced the same issue: results showed about 3-4 TFLOPS, while DeepSpeed's benchmarking on Megatron-LM (which is still far more optimized than our pipeline) reported about 35 TFLOPS.
```
-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 2:
world size: 3
data parallel size: 3
model parallel size: 1
batch size per GPU: 1
params per GPU: 527.17 M
params of model = params per GPU * mp_size: 527.17 M
fwd MACs per GPU: 6.21 TMACs
fwd flops per GPU: 12.41 T
fwd flops of model = fwd flops per GPU * mp_size: 12.41 T
fwd latency: 5.25 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 2.37 TFLOPS
bwd latency: 4.76 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 5.21 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 3.72 TFLOPS
step latency: 129.78 ms
iter latency: 10.14 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 3.67 TFLOPS
samples/second: 0.3
```
Notations:
- number of multiply-accumulate operations (MACs)
- number of floating-point operations (flops)
- floating-point operations per second (FLOPS)
- fwd latency (forward propagation latency)
- bwd latency (backward propagation latency)
- iter latency (sum of fwd, bwd and step latency)
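For reference, the profiler is enabled through the DeepSpeed config; a minimal fragment (`profile_step` chosen to match the step-2 summary above, other fields assumed):

```json
{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 2,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true
  }
}
```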
We then realized that this problem could come from the batch size per GPU being way too low, and increasing it improved our results dramatically.
=== Fixing the batch size issue
Indeed, we realized that our earlier OOMs were caused by a handful of unusually large samples that pushed the GPU past its memory limit. We then wrote our own truncation script to ensure these rare oversized samples were truncated to a maximum size. After that, we were able to increase the batch size per GPU to up to 128 on the 8B model, reaching up to 95 TFLOPS!
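The truncation logic itself is simple; a minimal sketch (the 2048-token cap is an illustrative value, the real limit depends on GPU memory):

```python
MAX_TOKENS = 2048  # illustrative cap; the real limit depends on GPU memory

def truncate_sample(token_ids: list[int], max_tokens: int = MAX_TOKENS) -> list[int]:
    """Clip the rare oversized samples so a single outlier cannot OOM a batch."""
    return token_ids[:max_tokens]

short, huge = list(range(100)), list(range(5000))
print(len(truncate_sample(short)), len(truncate_sample(huge)))  # 100 2048
```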
== Migrating from the RCP (EPFL) cluster to the CSCS (Switzerland-wide)...
By early August, the RCP cluster was completely overloaded, and we spent several days unable to submit new jobs. We therefore took a couple of days to migrate to the CSCS (Swiss National Supercomputing Centre) cluster.
Instead of using 16 GPUs (2 nodes), switching to the CSCS allowed us to run our tests on 64 GPUs (16 nodes).
Many issues occurred, including:
- the disk quota being exceeded because we couldn't save the model checkpoints at the same place
- some issues with the Python libraries being different from our RCP environment
#figure(
image("assets/quota.png"),
caption: [Disk quota exceeded error encountered during training on CSCS infrastructure]
)
See the final `scripts` folder for more details.
== Benchmarks
We then decided to run some benchmarks based on the information we had gathered so far, and here are the results.
These training runs all consisted of *50 steps* (we fixed the number of steps because we wanted to measure the throughput and not the total training time), and we approximated the number of tokens per sample to *400 tokens/sample* (using a lower bound, since most of the samples were around 430 tokens). We were using the llava-pretrain dataset. We used *16 nodes* with *4 GPUs/node*, resulting in a total of *64 GPUs* (NVIDIA GH200 120GB).
The most valuable metric is the amount of *tokens per gpu per second*.
#table(
columns: (auto, auto, auto, auto, auto, auto, auto, auto),
inset: 10pt,
align: horizon,
table.header(
[*reduce bucket size*],
[*stage3 gather 16bit weights on model save*],
[*overlap comm*],
[*per device train batch size*],
[*gradient acc. steps*],
[*Total training time \ hh:mm*],
[*Batch size (in terms of samples)*],
[*⚠️ tokens per gpu per second*]
),
"5.00E+08", "0", "1", "2", "16", "02:06", "2048", [*106*],
"5.00E+08", "0", "0", "2", "16", "02:06", "2048", [*106*],
"5.00E+08", "1", "1", "2", "16", "05:10", "2048", [*35*],
"5.00E+09", "0", "1", "2", "16", "02:13", "2048", [*106*],
"5.00E+07", "0", "1", "2", "16", "02:07", "2048", [*106*],
"5.00E+08", "0", "1", "32", "16", "03:13", "32768", [*1137*]
)
As we can see, two main factors influenced the training time:
- gathering the weights before saving the model, which considerably reduced the throughput.
- using a larger batch size per GPU (we actually ran further tests with batch sizes of 4 and 8 before trying 32, since training did not run out of memory).
=== Evaluating the trade-off
==== Gather before model save
Disabling the gathering of the weights before saving the model significantly improved the throughput. However, it means that each device saves its own checkpoint, which requires a large amount of temporary storage. In addition, these per-device checkpoints need to be merged after training is done.
When gathering is disabled, the output of the training process is a set of `.pt` files. For one epoch of LLaVA pretrain, they took about 719GB, versus 264GB for the gathered weights (about 2.7× more). Since storage is much cheaper than GPU-hours, this was a significant improvement overall.
Gathering these 719GB of weights into `.safetensors` files took approximately 20 minutes with only 4 GPUs.
#stick-together[#table(
columns: (auto, auto, auto),
inset: 10pt,
align: horizon,
table.header(
[],
[*Gather before saving*],
[*Do not gather before saving*],
),
"Initial checkpoint format",
".safetensors",
".pt (per device)",
"Required temporary storage (1 epoch)",
"264 GB",
"719 GB (~2.7× larger)",
"Training throughput",
"36 tokens/gpu/second",
"107 tokens/gpu/second (~3× faster)",
"Merging time (→ .safetensors)",
"—",
"~20 minutes with only 4 GPUs"
)]
==== Larger batch size per GPU
Increasing the batch size per GPU also had a huge impact on the training time. The trade-off here is that it mostly depends on the dataset. In the case of homogeneous samples such as llava-pretrain, it is easy to increase the batch size. However, for more diverse datasets, it may lead to a less optimal training because of the memory padding requirement: when batching samples of varying lengths, all samples in a batch must be padded to match the longest sample's size. For instance, if one sample contains 8000 tokens and another contains only 6 tokens, both will occupy 8000 tokens worth of memory. This leads to significant memory waste, as the shorter sample uses the same GPU memory as the longer one, reducing the effective batch size that could fit in memory. That is why it might be useful to implement a packing or truncating mechanism, for instance.
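As an illustration of the packing idea, a first-fit-decreasing sketch that groups sample lengths into bins of at most `max_len` tokens (not the pipeline's actual implementation):

```python
def pack_samples(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit decreasing: group sample indices into bins whose total token
    count stays under max_len, so short samples fill the padding slack."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins: list[list] = []  # each bin is [used_tokens, [sample indices]]
    for i in order:
        for b in bins:
            if b[0] + lengths[i] <= max_len:
                b[0] += lengths[i]
                b[1].append(i)
                break
        else:
            bins.append([lengths[i], [i]])
    return [indices for _, indices in bins]

# The 8000-token outlier shares a bin with tiny samples instead of wasting padding
print(pack_samples([8000, 6, 10, 4000], max_len=8192))  # [[0, 2, 1], [3]]
```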
#pagebreak()
== Fine-Tuning of the Swiss LLM (Apertus)
=== Creating an ApertusPromptTokenizer
To support Apertus (the Swiss LLM) in the MultiMeditron pipeline, I created a new tokenizer class that handles its specific chat template format. Unlike LLaMA or Qwen, Apertus uses numbered special tokens (`<SPECIAL_61>`, `<SPECIAL_62>`, etc.) to delimit conversation messages.
I implemented the `ApertusPromptTokenizer` by extending the base class and adding proper masking for:
- System messages (`<SPECIAL_61>` to `<SPECIAL_62>`)
- Developer messages (`<SPECIAL_63>` to `<SPECIAL_64>`)
- User messages (`<SPECIAL_65>` to `<SPECIAL_66>`)
This ensures the model learns only from assistant responses during training.
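The masking logic can be sketched as follows; the roles and token ids are toy values, not the actual Apertus special-token ids:

```python
IGNORE_INDEX = -100  # HuggingFace's cross-entropy loss ignores positions labeled -100

def build_labels(messages: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate tokenized messages into input_ids, masking every
    non-assistant span so the loss is computed on assistant tokens only."""
    input_ids: list[int] = []
    labels: list[int] = []
    for role, toks in messages:
        input_ids.extend(toks)
        if role == "assistant":
            labels.extend(toks)                        # learn from these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(toks))  # system/developer/user: masked
    return input_ids, labels

# Toy conversation with made-up token ids
ids, labels = build_labels([("user", [65, 11, 12, 66]), ("assistant", [67, 21, 22, 68])])
print(labels)  # [-100, -100, -100, -100, 67, 21, 22, 68]
```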
=== Running the first training
With the tokenizer in place, I ran the first training job for the Apertus 8B model with the MultiMeditron pipeline: 50 steps on the llava-pretrain dataset with 4 GPUs. The training went fine; however, during inference, I kept getting the following error:
```
Traceback (most recent call last):
File "/users/slefort/MultiMeditron/inference.py", line 78, in <module>
outputs = model.generate(input_ids=batch["input_ids"],
File "/users/slefort/MultiMeditron/src/multimeditron/model/model.py", line 472, in generate
x = torch.multinomial(
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
I then started to debug the entire SwissAI/transformers fork. It turned out that the NaN values came from the normalization layer, which received infinite values; digging further, I found that the MLP output grew far too large at some point, producing the infinity.
*Solution*: the experimental CUDA #link("https://github.com/rubber-duck-debug/xielu/commits/main/")[xIELU] activation implementation was not properly installed. The pipeline fell back to the Python implementation, which led to these errors.
I could finally start the inference, however the results were really bad:
```
User: Describe the image: [EPFL Campus image]
MultiMeditron-Apertus (1st-training): On a multi-layered scene, an enormous turtle is the centerpiece. Surrounding it are various flora and fauna. Notably, a small, almost invisible bird sits on the turtle's back, adding an unexpected touch of whimsy. The background layers are detailed with blooming flowers and lush greenery, creating a vibrant and inviting atmosphere.
```
#figure(
image("assets/EPFL_campus_2017.jpg"),
caption: [EPFL Campus image],
)
=== Running a second training
I then started a second training on the whole llava-pretrain dataset. Something I had not noticed during the first training is that it probably did not converge.
#figure(
image("assets/1st-training-apertus.png"),
caption: [1st training of MultiMeditron-Apertus],
)
I also wanted to see whether results were better with the experimental CUDA xIELU activation implementation, so I made sure to install it properly this time.
The results were still bad. We figured out that using a batch size larger than 1 completely broke our training, leading to a loss of about 0.5, which was way too low.
#figure(
image("assets/2nd-training-apertus.png"),
caption: [2nd training of MultiMeditron-Apertus],
)
The resulting model was also completely broken, behaving like the first one.
=== Running a third training
We then started a training with a batch size equal to 1 while working on a fix.
#figure(
image("assets/cat.png", width: 80%),
caption: [Cat image sent to the LLM],
)
This time the loss was pretty stable and the LLM replied the following:
```
User: Describe the image: [Cat image]
MultiMeditron-Apertus (3rd-training): The image features a cat sitting on a wooden fence, looking out at a grassy field. The cat has a distinctive white and black coat with a white belly and black stripes on its back. The fence is made of wooden planks, and the cat's posture suggests a relaxed and contented mood. The background is a lush green field, adding a serene atmosphere to the scene. The cat's expression and the setting create a charming and peaceful image.
```
The model could now slightly understand the image!
=== Running a fourth training
This happened two days after I left the lab. My supervisor Michael Zhang realized that the batch size issue was caused by padding positions being labeled with padding tokens instead of `-100`, so the model was actually training on them, which produced an artificially low loss as the model constantly predicted padding.
We could therefore get the same results as before with a much higher throughput.
=== Next steps
We should now make more attempts with more data and see what it can bring to the model!
#pagebreak()
== In-between work
Because I had a lot of idle time during my internship while waiting for trainings to complete, I used it to help with the lab's other projects.
=== My work on the MMORE project
MMORE (Massive Multimodal Open RAG & Extraction) is a project aiming to provide a multimodal pipeline for processing, indexing, and querying multimodal documents.
First, I fixed the project's Docker image to make it support GPUs, added a cache for models so they would not be downloaded every time and added more documentation for that part. I also created a GitHub CI to build and push the Docker image to our registry.
Then, I created a production-ready config folder and added a utility script to support environment variables inside these config files.
=== My work on the MOOVE gateway
MOOVE (Massive Open Online Validation & Evaluation) is a project aiming to provide an online evaluation platform where doctors can rate our models and compare them with state-of-the-art closed models (via a blind evaluation process).
To connect this frontend to the models, a gateway was built. However, all new deployments were done manually: connecting via SSH to the VM, pulling, rebuilding the Docker images if needed, and restarting the gateway.
I then automated this process by creating a deployment script that pulls the latest changes from the `main` and `dev` branches, rebuilds the Docker images, restarts the gateway and sends a status update to Slack.
However, that required quite a lot of refactoring in the gateway. It was using a Docker-in-Docker hack to let the development environment run on the same host as the production one, so I had to refactor the whole `docker-compose` file and part of the decK and Kong configuration files.
I also set up `dotenvx` to manage the secrets. These are now encrypted using a public key and pushed to GitHub so the deployment script can safely pull the changes and decrypt the updated secrets. This was mostly necessary because the model's API keys (OpenAI, Deepseek, etc.) were often rotated.
=== My (very small) work on the MOOVE platform
I pushed two very small changes to the MOOVE platform: users can now change their passwords, and I updated a minor part of the CI. ;)
#pagebreak()
== What is DeepSpeed? #label("deepspeed-explanations")
DeepSpeed is an optimization library for training large models developed by Microsoft. Here is a summary of the main optimizations it offers. Because each optimization has a trade-off, it's important to understand the context in which they are used, and to always check if they help reduce the bottleneck.
=== ZeRO (Zero Redundancy Optimizer)
ZeRO reduces memory usage by partitioning optimizer states, gradients and model weights across GPUs rather than replicating them on each GPU. This reduces the memory needed to store gradients and weights, which is crucial for training large models. However, this approach requires more data exchange between GPUs and is therefore slower.
DeepSpeed offers 3 variants of ZeRO:
- ZeRO-1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its own partition.
- ZeRO-2: The reduced 16-bit gradients for updating the model weights are also partitioned, such that each process retains only the gradients corresponding to its portion of the optimizer states.
- ZeRO-3: The 16-bit model parameters are partitioned across the processes, and ZeRO-3 automatically collects and partitions them during the forward and backward passes.
For MultiMeditron, we were always using the ZeRO-3 variant, which is the most memory efficient.
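For reference, a minimal ZeRO-3 configuration fragment in DeepSpeed's JSON format (values mirror the settings benchmarked above; `auto` lets HuggingFace fill in values from `TrainingArguments`):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "reduce_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```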
==== ZeRO-Offload
For large model training, optimizers (e.g. Adam) can consume a significant amount of GPU compute and memory. ZeRO-Offload reduces the GPU compute and memory requirements of such models by leveraging compute and memory resources on the host CPU to execute the optimizer instead of relying on the GPU. Furthermore, to prevent the optimizer from becoming a bottleneck, ZeRO-Offload uses DeepSpeed's highly optimized CPU implementation of Adam, #link("https://github.com/deepspeedai/DeepSpeed/tree/master/deepspeed/ops/adam")[DeepSpeedCPUAdam], which is 5-7× faster than the standard PyTorch implementation.
This offloading to host memory was probably the reason we had a Host OOM earlier.
=== Gradient checkpointing
Calculating the LLM output (the forward pass) requires $O(n)$ memory (where $n$ is the number of layers in the network). We also need $O(n)$ memory for the backward pass (we need to remember the activation function results). What we could do to optimize this memory is to forget the first activation once we've calculated the 2nd, etc. and we would thus need only $O(1)$ memory. However, for backpropagation, we would need $O(n^2)$ in computation (since we go backwards, to get the gradient at the second-to-last, we need to recalculate all the nodes before, etc.).
Gradient checkpointing consists of storing only certain results, so we need only $O(sqrt(n))$ memory. Now, we need $O(n)$ in compute.
#figure(
image("assets/checkpointing.png"),
caption: [Gradient checkpointing strategy balancing memory usage and computational complexity]
)
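PyTorch exposes this directly; a minimal sketch with `checkpoint_sequential`, which stores activations only at segment boundaries and recomputes the rest during the backward pass (toy layer sizes):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack of 16 layers split into 4 segments: activations are kept only at
# segment boundaries and recomputed inside each segment during backward.
layers = nn.Sequential(*[nn.Linear(64, 64) for _ in range(16)])
x = torch.randn(8, 64, requires_grad=True)

out = checkpoint_sequential(layers, 4, x, use_reentrant=False)
out.sum().backward()  # recomputes intra-segment activations on the fly
print(x.grad.shape)   # torch.Size([8, 64])
```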
=== Gradient accumulation
Another problem is the size of batches. Large batches are often necessary for training stability and to efficiently exploit the optimizer, but a batch that is too large can exceed the available GPU memory.
Thus, we divide a large batch into several mini-batches that fit in memory. For each mini-batch, we do a forward and backward pass, and we accumulate the gradients without updating the weights. After processing a certain number of mini-batches (equivalent to a large batch), we perform a single weight update. This allows simulating a large effective batch, while respecting memory constraints, at the cost of additional time overhead.
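A minimal sketch of this loop with toy data; note the loss is divided by the number of accumulation steps so the accumulated gradient averages over the whole effective batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4          # 4 mini-batches simulate one 4x-larger batch
updates = 0

for step in range(8):    # 8 mini-batches -> 2 optimizer updates
    x, y = torch.randn(2, 10), torch.randn(2, 1)
    loss = F.mse_loss(model(x), y) / accum_steps  # scale so accumulated grads average
    loss.backward()                               # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        opt.step()       # single weight update per "large" effective batch
        opt.zero_grad()
        updates += 1

print(updates)  # 2
```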
Click #link(label("beginning-work"))[*here*] to get redirected to the beginning of my work!
#pagebreak()
== References
- Touvron, H., et al. (2023). *LLaMA: Open and Efficient Foundation Language Models*.
arXiv:2302.13971. https://arxiv.org/abs/2302.13971
- Liu, H., et al. (2023). *Visual Instruction Tuning*.
arXiv:2304.08485. (Introduces LLaVA) https://arxiv.org/abs/2304.08485
- Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., & He, Y. (2020).
*ZeRO: Memory Optimizations Toward Training Trillion Parameter Models*.
In Proceedings of SC20. https://arxiv.org/abs/1910.02054
- Microsoft DeepSpeed Team. (2025, v0.17.5). *DeepSpeed: Extreme-scale model training for everyone*.
GitHub repository. https://github.com/microsoft/DeepSpeed
- HuggingFace. (2025, v4.55.4). *Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX*.
GitHub repository. https://github.com/huggingface/transformers
- NVIDIA. (2025, v2025.3.1). *Nsight Systems User Guide*.
NVIDIA Developer Tools. https://developer.nvidia.com/nsight-systems