Hello,
I am trying to fine-tune Qwen2.5-VL-7B on Windows 11 with a single RTX 4090 (24 GB), 95 GB of RAM, and an i9-14900K. All packages were installed according to the docs.
I started the first experiment with these arguments:
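Roughly, the YAML was along these lines (the values below are illustrative placeholders, not my exact settings; key names follow the LLaMA-Factory examples, so double-check against the repo):

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: my_mllm_dataset        # placeholder dataset name
template: qwen2_vl
cutoff_len: 2048
overwrite_cache: true
# preprocessing_num_workers omitted on purpose: dataset-worker
# multiprocessing is what triggers the pickling errors on Windows

### output
output_dir: saves/qwen2.5_vl-7b/lora/sft
logging_steps: 10
save_steps: 500

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```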
I started training with:

```
llamafactory-cli train training_args.yaml
```
from the Windows terminal. Results started appearing within 3 minutes and training moved quite fast; it was on track to finish in about a day. In the end it got stuck at step 185: I had left it training unattended and, when I came back, my machine had somehow gone into sleep mode, which stopped the GPU and disk. Understandable, so I had to stop this run.
It was impossible to use multiprocessing on Windows due to pickling issues, so I decided to give Docker with a Linux image a go, expecting even faster training. I made sure to change the power policy so the machine would not go into sleep or hibernation again, and I also enabled the XMP profile in the BIOS to run the RAM at its rated speed.
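(For reference, the equivalent power-policy change from an elevated prompt is the standard powercfg commands; 0 means never:)

```
:: Disable sleep and hibernation timeouts while on AC power
powercfg /change standby-timeout-ac 0
powercfg /change hibernate-timeout-ac 0
```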
Then I made the following changes and reran the training:
I tried to run it with:

1. Docker with volume mapping from Windows (with and without multiprocessing); a sketch of the invocation is below.
2. From WSL (after copying all the files, the data, and the model), and Docker with volume mapping from the WSL Linux filesystem (with and without multiprocessing). I could see the model loaded faster than in 1.
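The Docker runs looked roughly like this; the image tag and paths are placeholders for whatever you build from the repo's Dockerfile and wherever your files live:

```
# Illustrative invocation — image name and paths are placeholders
docker run --gpus all -it \
  -v /mnt/c/work/LLaMA-Factory:/app \
  llamafactory:latest \
  llamafactory-cli train /app/training_args.yaml
```

One caveat worth noting: bind mounts that cross from the Windows drive into WSL2 (anything under /mnt/c) go through a translation layer and are known to be much slower for heavy file I/O than paths on the native Linux filesystem, which matters for dataset and checkpoint access.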
But it took forever to display the first training log entry, and the estimate showed it would take months to finish. So I stopped it and suspected the XMP profile I had enabled; I reset it to its previous state and tried all of the above again, with the same result.
Then I decided to just rerun it on Windows as I did before, but with the configuration additions I had made, since I wanted evaluation and token accuracy, which I don't think should be a big overhead given the validation set size.
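The additions were along these lines (illustrative values, not my exact ones; key names follow the LLaMA-Factory examples):

```yaml
### eval (illustrative — my actual values differ)
val_size: 0.05                  # carve a validation split from the dataset
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
compute_accuracy: true          # report token accuracy during evaluation
```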
With those additions, it took over 2 hours to display the first log entry, and the estimate shows it would take 44 days to finish! I would like to know why this is happening. Why was the first experiment fast, while this one is so slow, even though the change is not that big (or am I wrong about that)?
Note: GPU usage is fine at 90-100%, RAM sits at just 33 GB used, and CPU stays around 20% at most. I guess the low CPU and RAM usage is because multiprocessing is disabled.
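(If it helps, those utilization numbers can be reproduced by polling the GPU directly, e.g.:)

```
# Print GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
```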