Add question-answer example for v2 trainer #2580
Signed-off-by: solanyn <[email protected]>
8e336ff to 75062fd
@solanyn Thanks for this amazing work! I appreciate the valuable use cases you are contributing for Trainer. I left an initial review for you. /cc @kubeflow/wg-training-leads @astefanutti
Signed-off-by: solanyn <[email protected]>
Thanks @Electronic-Waste, I've updated the branch to address your comments!
@solanyn Thanks for this! And welcome to the Kubeflow community! /lgtm
Pull Request Test Coverage Report for Build 14721592890 (Coveralls)
Thank you for this great contribution @solanyn !
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni
Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>
/ok-to-test
* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument
Signed-off-by: solanyn <[email protected]>
/rerun-all
* e2e tests fail if trainjobs launched by notebook do not finish in 3s
* extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes
Signed-off-by: solanyn <[email protected]>
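The "block and wait until timeout or trainjob completes" behaviour described in this commit message can be sketched as a generic polling loop. This is only an illustration, not the code from the PR: `wait_for_job`, the `get_status` callback, and the status strings `"Succeeded"` and `"Failed"` are hypothetical names chosen for the sketch, and the 300-second default simply mirrors the 5-minute timeout mentioned above.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real job status names may differ.
TERMINAL_STATES = ("Succeeded", "Failed")


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,  # 5 minutes, as in the commit message
    interval_s: float = 5.0,
) -> str:
    """Poll get_status() until it reports a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated job that finishes on the third poll.
    fake_statuses = iter(["Created", "Running", "Succeeded"])
    print(wait_for_job(lambda: next(fake_statuses), timeout_s=5, interval_s=0))
```

Checking the status before checking the deadline means a job that finishes exactly at the timeout is still reported as finished rather than timed out.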
* revert change to e2e-run-notebook.sh
Signed-off-by: solanyn <[email protected]>
@andreyvelich @Electronic-Waste this should be good to go, let me know if there is anything else you'd like to see in this change!
Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402
@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich
What this PR does / why we need it:
This PR adds an example for the V2 Training Operator that trains a question-answering model, based on the HuggingFace recipe and adapted to the Training Operator API.
Which issue(s) this PR fixes (optional, in "Fixes #<issue number>, #<issue number>, ..." format, will close the issue(s) when PR gets merged):
Adds a question-answer example mentioned in #2040 using the PyTorch runtime.
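For readers unfamiliar with extractive question answering, one core preprocessing step in this kind of recipe is converting the answer's character span in the context into start and end token indices, which become the training labels. The sketch below illustrates that step only and is not taken from the notebook in this PR: `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers, and whitespace splitting stands in for the subword tokenizer and its character offsets that a real recipe would use.

```python
import re
from typing import List, Optional, Tuple

Offset = Tuple[int, int]  # (start_char, end_char), end exclusive


def whitespace_offsets(text: str) -> List[Offset]:
    """Stand-in tokenizer: character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(
    offsets: List[Offset], answer_start: int, answer_end: int
) -> Optional[Tuple[int, int]]:
    """Map a character span to inclusive (start_token, end_token) indices.

    Returns None when the answer is not fully covered by the given tokens,
    e.g. because the context was truncated.
    """
    if not offsets or offsets[0][0] > answer_start or offsets[-1][1] < answer_end:
        return None
    start_tok = 0
    # Skip tokens that end before the answer begins.
    while start_tok < len(offsets) and offsets[start_tok][1] <= answer_start:
        start_tok += 1
    end_tok = len(offsets) - 1
    # Skip tokens that begin after the answer ends.
    while end_tok >= 0 and offsets[end_tok][0] >= answer_end:
        end_tok -= 1
    return start_tok, end_tok


if __name__ == "__main__":
    context = "Kubeflow Trainer runs PyTorch jobs"
    answer = "PyTorch"
    start = context.index(answer)
    print(char_span_to_token_span(whitespace_offsets(context), start, start + len(answer)))
```

Returning `None` for answers outside the tokenized window leaves it to the caller to decide how to label such examples.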