time starts without stopping #2261
base: master
Conversation
If we call train_ch11 or train_concise_ch11 with settings under which the condition is never satisfied (for example, a batch size of 1500 and only one epoch), no time interval is ever stored in the times list of the Timer object. This causes a division by zero in Timer.avg(), because that function divides by the length of the times list.
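For context, here is a minimal sketch of the failure mode, assuming a Timer that accumulates intervals in self.times and averages over len(self.times) as described above (a simplified stand-in, not the actual d2l class); the guard in avg() is one hypothetical fix:

```python
import time

class Timer:
    """Simplified stand-in for d2l's Timer, for illustration only."""
    def __init__(self):
        self.times = []
        self.tik = time.time()

    def start(self):
        self.tik = time.time()

    def stop(self):
        # An interval is recorded only if stop() is actually reached.
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        # sum(...) / len(...) raises ZeroDivisionError when stop() was
        # never called; guarding on an empty list is one possible fix.
        return sum(self.times) / len(self.times) if self.times else 0.0

timer = Timer()
# If the training loop never reaches its measurement branch, stop() is
# never called and an unguarded avg() would divide by zero.
print(timer.avg())  # 0.0 with the guard; ZeroDivisionError without it
```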
Thanks @Excelsior7, that makes sense for the timer, but I think it will still fail in that hyperparameter configuration because the animator will not record anything: n % 200 will never be 0, so the if condition is never run. Please update the PR to address that as well, so that we can merge this together.
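To make that concrete, here is a simplified sketch of the relevant loop structure (paraphrased, not the exact d2l code): with a batch size of 1500 and one epoch, n jumps from 0 straight to 1500, so n % 200 == 0 never holds and neither the timer stop nor the animator update ever runs:

```python
# Simplified shape of the measurement branch inside train_ch11.
n, batch_size, num_batches = 0, 1500, 1  # one epoch, one huge batch
for _ in range(num_batches):
    # ... forward pass, backward pass, parameter update ...
    n += batch_size
    if n % 200 == 0:  # n goes 0 -> 1500, never a multiple of 200
        # timer.stop(); animator.add(...); timer.start()
        print('recorded a point')  # never reached in this configuration
```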
Code proposal number 1 to address the timer and animator issues. At the same time, I propose some code-layout changes in order to standardize the implementations of the training functions (train_ch11, train_concise_ch11) across the frameworks (PyTorch, MXNet, TensorFlow).
To begin with, I would like to state the principles on which I based this code:
Even though the culture of code brings us together, and the above points may seem like common sense, I think it is important to formalize them. This is both to make sure I am not unknowingly operating outside an established framework of good project-management practices (since I am a new contributor and do not have much experience to rely on), and to help the reader answer some of the "why"s they may have while skimming through the code or my explanations of it. If you have any suggestions or comments, I would be happy to hear them.

The goal now is to explain the choices behind the modifications I made. To do so, I will copy and paste below the train_ch11() function of the PyTorch framework and comment on each modified line; the comments apply to the other implementations as well. Each block of the following shape indicates a change: #@save
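The modified train_ch11() itself does not survive in this excerpt, so what follows is only a hypothetical sketch of a loop that addresses both issues, under the assumption that the d2l helpers (d2l.linreg, d2l.squared_loss, d2l.Animator, d2l.Timer, d2l.evaluate_loss) behave as in the book. It is not the code actually proposed in the PR:

```python
import torch
from d2l import torch as d2l

def train_ch11_sketch(trainer_fn, states, hyperparams, data_iter,
                      feature_dim, num_epochs=2):
    """Hypothetical variant that never leaves timer.times or the
    animator empty, regardless of batch size and epoch count."""
    w = torch.normal(0.0, 0.01, size=(feature_dim, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            timer.start()
            l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            timer.stop()  # time every batch, so times is never empty
            n += X.shape[0]
            if n % 200 == 0:
                animator.add(n / X.shape[0] / len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
        # Record at least one point per epoch, so the animator is
        # populated even when n % 200 == 0 never holds.
        animator.add(n / X.shape[0] / len(data_iter),
                     (d2l.evaluate_loss(net, data_iter, loss),))
    # Timing is now per batch rather than per 200 examples.
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/batch')
    return timer.avg(), animator.Y[0][-1]
```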
Job d2l-en/PR-2261/2 is complete.
Job d2l-en/PR-2261/3 is complete.
Job d2l-en/PR-2261/4 is complete.
Hi @Excelsior7! Thanks for the very detailed explanation of the code changes. But I personally feel it makes the code a tad too complex. The code looks correct to me and should fix the issue you raised, but it would definitely be a bit harder to understand. I'll leave this to @astonzhang; please share what you feel. Do you think it is necessary to fix the corner case, or should we just keep it as is?