Hey everyone.
I'm training for the first time and I had to cancel the job. When I started it again, I saw a spike in the loss. I read about it, and it seemed the loss should return to normal within about 200-300 steps. But that's not what's happening.
As you can see in blue, I ran it until step 6645 and then had to cancel it, with the loss at about 0.0091. I resumed from a state saved at the start of an epoch, at step 6000. As you can see, after 1299 steps the loss has only dropped to 0.0095, and that came after a huge spike that reached about 0.0104. Something is wrong, right? I don't know what. I have --save_state in the command.
Can someone help a newbie?
Thanks!