Very slow training speed with CURLoRA on Llama 3.1 8B Instruct

I am currently fine-tuning the Llama 3.1 8B Instruct model using CURLoRA adapters on a single RTX 4090 GPU.

![Image](https://github.com/user-attachments/assets/1f22b861-25fe-40dc-86e5-9f325fc7151f)

Problem:

- It takes ~170 seconds per step (batch) during training.

- Estimated time to complete one epoch is over 14 days.

- A full 5-epoch run would therefore take more than 2 months at the current speed.

- The process crashes halfway through.
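For reference, here is a rough sketch of the arithmetic behind these estimates. The only inputs taken from my run are ~170 seconds per step, roughly 14 days per epoch, and 5 epochs; the function names are just illustrative, and because one epoch is "over 14 days" the results should be read as lower bounds.

```python
SECONDS_PER_DAY = 86_400


def implied_steps_per_epoch(seconds_per_step: float, days_per_epoch: float) -> float:
    """Back out how many optimizer steps fit in one epoch at the observed speed."""
    return days_per_epoch * SECONDS_PER_DAY / seconds_per_step


def total_training_days(seconds_per_step: float, steps_per_epoch: float, epochs: int) -> float:
    """Total wall-clock days for the whole run at a constant step time."""
    return seconds_per_step * steps_per_epoch * epochs / SECONDS_PER_DAY


# Figures reported above: ~170 s/step, ~14 days/epoch, 5 epochs.
steps = implied_steps_per_epoch(170, 14)
days = total_training_days(170, steps, 5)

print(f"steps per epoch (implied): {steps:.0f}")
print(f"total days for 5 epochs:   {days:.0f}")
```

So the run is on the order of 7,000 steps per epoch, and every second saved per step removes about two hours from each epoch.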

Questions:

- Is this extremely slow training expected when fine-tuning Llama 3.1 8B models with CURLoRA on a 4090?

- Is there anything I can optimize further while still using CURLoRA? (e.g., sequence length, optimizer settings, etc.)

Additional Notes:

- GPU utilization is high (close to 100%) during training.

- VRAM usage is around 22.5 GB out of 24 GB (4090 almost fully loaded).
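To put the VRAM number in context, here is a back-of-the-envelope sketch of how much of the 22.5 GB the frozen base weights alone might account for. It assumes roughly 8.03 billion parameters for the 8B model and 2 bytes per parameter (bf16/fp16); the precision is an assumption for illustration, and the tool reporting VRAM may count GiB rather than decimal GB, so treat the result as approximate.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


N_PARAMS = 8.03e9        # assumed approximate parameter count of the 8B model
REPORTED_USED_GB = 22.5  # VRAM usage reported above
TOTAL_GB = 24.0          # nominal RTX 4090 capacity

weights_16bit = weight_memory_gb(N_PARAMS, 2)    # assumed bf16/fp16 weights
remaining = REPORTED_USED_GB - weights_16bit     # activations, adapters, optimizer state, etc.
free_headroom = TOTAL_GB - REPORTED_USED_GB

print(f"16-bit weights:          {weights_16bit:.2f} GB")
print(f"everything else in use:  {remaining:.2f} GB")
print(f"free headroom:           {free_headroom:.2f} GB")
```

Under these assumptions the weights take most of the card, which leaves only about 1.5 GB of free headroom for longer sequences or larger batches.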
