Add question-answer example for v2 trainer #2580
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Force-pushed from 8e336ff to 75062fd
@@ -0,0 +1,592 @@
{
Line #1. !pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk
Could you please add a comment like: "# Change the version of SDK when we have the first release of Trainer SDK"?
Line #8. if r.name == "torch-distributed":
Can we remove this, since torch_runtime is unused?
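For readers following along, the cell under discussion presumably selects the Torch runtime by name along these lines. This is a pure-Python stand-in, not the notebook's exact code: the real SDK client call and runtime objects are assumptions here; only the lookup pattern the reviewer flags is shown.

```python
# Stand-in for the runtime objects a trainer SDK would return;
# only the .name attribute matters for the lookup being reviewed.
class Runtime:
    def __init__(self, name: str):
        self.name = name

# Hypothetical runtime list, mimicking what the notebook iterates over.
runtimes = [Runtime("torch-distributed"), Runtime("deepspeed-distributed")]

# The pattern the reviewer flags: a runtime is selected by name here,
# but the resulting torch_runtime variable is never used afterwards.
torch_runtime = None
for r in runtimes:
    if r.name == "torch-distributed":
        torch_runtime = r

print(torch_runtime.name)
```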
Line #17. # Uncomment this to distribute the TrainJob using GPU nodes.
It might be better to remove this comment:)
@solanyn Thanks for this amazing work! Appreciate your valuable use cases for Trainer. I left my initial reviews for you. /cc @kubeflow/wg-training-leads @astefanutti
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Thanks @Electronic-Waste, I've updated the branch to address your comments!
@solanyn Thanks for this! And welcome to the Kubeflow community! /lgtm
Pull Request Test Coverage Report for Build 14674591093
Warning: this coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
Thank you for this great contribution @solanyn !
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni
" resources_per_node={\n",
" \"cpu\": \"3\",\n",
" \"memory\": \"8Gi\",\n",
" \"nvidia.com/gpu\": 1,\n",
Is there a way for us to test the script on CPU, maybe just for 1 epoch, or is the distilbert-base-uncased model too large for CPU testing?
Yeah, I can try it out on CPU and make the changes!
Thanks, it would be nice to see if we could run this example as part of our E2Es: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml#L55
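A CPU-only variant of the job configuration for such an E2E run could look roughly like this. The values are hypothetical; only the resources_per_node shape mirrors the snippet quoted elsewhere in this PR, and the keyword names are assumptions rather than the SDK's confirmed API.

```python
# Hypothetical CPU-only training configuration for a short smoke test.
cpu_test_config = {
    "num_nodes": 1,
    "resources_per_node": {
        "cpu": "2",
        "memory": "8Gi",
        # Omitting "nvidia.com/gpu" lets the TrainJob schedule on CPU-only nodes.
    },
    "epochs": 1,  # a single epoch keeps the CI run short
}

print("GPU requested:", "nvidia.com/gpu" in cpu_test_config["resources_per_node"])
```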
"id": "c31bc8f2",
"metadata": {},
"source": [
"# Install the KubeFlow SDK and dependencies"
Let's remove the installation of Kubeflow SDK as prerequisite for this Notebook.
It is better to redirect users to this guide to get started with Kubeflow Trainer:
https://www.kubeflow.org/docs/components/trainer/getting-started/
" \n",
" trainer.train()\n",
"\n",
" CloudPath(f'gs://{args[\"BUCKET\"]}/{args[\"MODEL_NAME\"]}').upload_from(args[\"MODEL_NAME\"])"
Can we put the location (e.g. gs://bucket-name) under args, so users can use s3 or gs with the CloudPath API?
"\n",
" squad = squad.train_test_split(test_size=0.2)\n",
" \n",
" tokenizer = AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\")\n",
Can we use the MODEL_NAME arg for the model here?
- " tokenizer = AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\")\n",
+ " tokenizer = AutoTokenizer.from_pretrained(\"distilbert/args[\"MODEL_NAME\"]\")\n",
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
New changes are detected. LGTM label has been removed.
/ok-to-test
What this PR does / why we need it:
This PR adds an example for the v2 Training Operator that trains a question-answering model, based on the Hugging Face recipe adapted to the Training Operator API.
Which issue(s) this PR fixes (optional, in "Fixes #<issue number>, #<issue number>, ..." format, will close the issue(s) when PR gets merged):
Adds the question-answer example mentioned in #2040 using the PyTorch runtime.
Checklist: