Hello,
I am trying to fine-tune Qwen2.5-VL-7B on Windows 11 with a single RTX 4090 (24 GB), 95 GB of RAM, and an i9-14900K. All packages were installed according to the docs.
I started the first experiment with these arguments:
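Roughly, the YAML was along these lines (the values below are illustrative placeholders, not my exact settings; key names follow the LLaMA-Factory examples, so double-check against the repo):

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: my_mllm_dataset        # placeholder dataset name
template: qwen2_vl
cutoff_len: 2048
overwrite_cache: true
# preprocessing_num_workers omitted on purpose: dataset-worker
# multiprocessing is what triggers the pickling errors on Windows

### output
output_dir: saves/qwen2.5_vl-7b/lora/sft
logging_steps: 10
save_steps: 500

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```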
I started training with:

```
llamafactory-cli train training_args.yaml
```
from the Windows terminal. Results started appearing within 3 minutes and training moved quite fast; it was on track to finish in about a day. In the end it got stuck at step 185: I had left it training unattended and, when I came back, my machine had somehow gone into sleep mode, which stopped the GPU and disk. Understandable, so I had to stop this run.
It was impossible to use multiprocessing on Windows due to pickling issues, so I decided to give Docker with a Linux image a go, expecting even faster training. I made sure to change the power policy so the machine would not go into sleep or hibernation again, and I also enabled the XMP profile in the BIOS to run the RAM at its rated speed.
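(For reference, the equivalent power-policy change from an elevated prompt is the standard powercfg commands; 0 means never:)

```
:: Disable sleep and hibernation timeouts while on AC power
powercfg /change standby-timeout-ac 0
powercfg /change hibernate-timeout-ac 0
```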
Then I made the following changes and reran the training:
I tried to run it with:

1. Docker with volume mapping from Windows (with and without multiprocessing); a sketch of the invocation is below.
2. From WSL (after copying all the files, the data, and the model), and Docker with volume mapping from the WSL Linux filesystem (with and without multiprocessing). I could see the model loaded faster than in 1.
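The Docker runs looked roughly like this; the image tag and paths are placeholders for whatever you build from the repo's Dockerfile and wherever your files live:

```
# Illustrative invocation — image name and paths are placeholders
docker run --gpus all -it \
  -v /mnt/c/work/LLaMA-Factory:/app \
  llamafactory:latest \
  llamafactory-cli train /app/training_args.yaml
```

One caveat worth noting: bind mounts that cross from the Windows drive into WSL2 (anything under /mnt/c) go through a translation layer and are known to be much slower for heavy file I/O than paths on the native Linux filesystem, which matters for dataset and checkpoint access.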
But it took forever to display the first training log entry, and the estimate showed it would take months to finish. So I stopped it and suspected the XMP profile I had enabled; I reset it to its previous state and tried all of the above again, with the same result.
Then I decided to just rerun it on Windows as I did before, but with the configuration additions I had made, since I wanted evaluation and token accuracy, which I don't think should be a big overhead given the validation set size.
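The additions were along these lines (illustrative values, not my exact ones; key names follow the LLaMA-Factory examples):

```yaml
### eval (illustrative — my actual values differ)
val_size: 0.05                  # carve a validation split from the dataset
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
compute_accuracy: true          # report token accuracy during evaluation
```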
With those additions, it took over 2 hours to display the first log entry, and the estimate shows it would take 44 days to finish! I would like to know why this is happening. Why was the first experiment fast, while this one is so slow, even though the change is not that big (or am I wrong about that)?
Note: GPU usage is fine at 90-100%, RAM sits at just 33 GB used, and CPU stays around 20% at most. I guess the low CPU and RAM usage is because multiprocessing is disabled.
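(If it helps, those utilization numbers can be reproduced by polling the GPU directly, e.g.:)

```
# Print GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5
```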