I am currently fine-tuning the Llama 3.1 8B Instruct model using CURLoRA adapters on a single RTX 4090 GPU.

Problem:
- Each training step (batch) takes ~170 seconds.
- The estimated time to complete one epoch is over 14 days.
- A full 5-epoch run would take roughly 2+ months at the current speed.
- The process crashes halfway through.
Question:
- Is this extremely slow training expected when fine-tuning Llama 3.1 8B models with CURLoRA on a 4090?
- Is there anything I can optimize further while still using CURLoRA? (e.g., sequence length, optimizer settings, etc.)
Additional Notes:
GPU utilization is high (close to 100%) during training.
VRAM usage is around 22.5 GB out of 24 GB (4090 almost fully loaded).
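For reference, here is a rough sketch of the arithmetic behind the time estimates above. The 170 s/step and 14-day figures come from the numbers in this post; the steps-per-epoch value is back-computed from them (an assumption, not read from the training logs), and the helper names are only illustrative.

```python
# Back-of-the-envelope check of the training-time estimates.
SECONDS_PER_STEP = 170            # observed time per step (from the post)
SECONDS_PER_DAY = 24 * 60 * 60
EPOCH_DAYS = 14                   # lower bound, from "over 14 days" per epoch
EPOCHS = 5


def steps_in(days, seconds_per_step=SECONDS_PER_STEP):
    """Number of steps that fit into `days` at the given step time."""
    return days * SECONDS_PER_DAY / seconds_per_step


def days_for(steps, seconds_per_step=SECONDS_PER_STEP):
    """Wall-clock days needed for `steps` at the given step time."""
    return steps * seconds_per_step / SECONDS_PER_DAY


# Assumption: steps per epoch inferred from the 14-day estimate.
steps_per_epoch = steps_in(EPOCH_DAYS)               # roughly 7,100 steps
total_days = days_for(steps_per_epoch * EPOCHS)      # about 70 days

print(f"steps per epoch (inferred): {steps_per_epoch:.0f}")
print(f"total for {EPOCHS} epochs: {total_days:.0f} days")
```

The same `days_for` helper can be called with a different `seconds_per_step` to see how much any speed-up per step would shorten the full run.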