Add question-answer example for v2 trainer #2580

Open: wants to merge 3 commits into base: master

@solanyn solanyn commented Apr 1, 2025

What this PR does / why we need it:

This PR adds a V2 Training Operator example that trains a question-answering model. It is based on the Hugging Face recipe, adapted to the Training Operator API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Adds the question-answer example mentioned in #2040, using the PyTorch runtime.

Checklist:

  • Docs included if any changes are user facing

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


@google-oss-prow google-oss-prow bot requested a review from jinchihe April 1, 2025 16:19
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
@solanyn solanyn force-pushed the solanyn/question-answer-example branch from 8e336ff to 75062fd Compare April 1, 2025 16:20
@@ -0,0 +1,592 @@
{
Member

@Electronic-Waste Electronic-Waste Apr 2, 2025

Line #1.    !pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

Could you please add a comment like: "# Change the version of SDK when we have the first release of Trainer SDK"?


@@ -0,0 +1,592 @@
{
Member

@Electronic-Waste Electronic-Waste Apr 2, 2025

Line #8.        if r.name == "torch-distributed":

Can we remove this, since torch_runtime is unused?


@@ -0,0 +1,592 @@
{
Member

@Electronic-Waste Electronic-Waste Apr 2, 2025

Line #17.                # Uncomment this to distribute the TrainJob using GPU nodes.

It might be better to remove this comment:)


@Electronic-Waste
Member

@solanyn Thanks for this amazing work! I appreciate your valuable use cases for Trainer. I left my initial review comments for you.

/cc @kubeflow/wg-training-leads @astefanutti

@google-oss-prow google-oss-prow bot requested review from astefanutti and a team April 2, 2025 04:00
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
@solanyn
Author

solanyn commented Apr 2, 2025

Thanks @Electronic-Waste, I've updated the branch to address your comments!

@Electronic-Waste
Member

@solanyn Thanks for this! And welcome to the Kubeflow community!

/lgtm
/assign @kubeflow/wg-training-leads @astefanutti

@coveralls

coveralls commented Apr 2, 2025

Pull Request Test Coverage Report for Build 14674591093

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.9%) to 67.356%

Totals Coverage Status
Change from base Build 14160634151: 0.9%
Covered Lines: 1758
Relevant Lines: 2610

💛 - Coveralls

Member

@andreyvelich andreyvelich left a comment

Thank you for this great contribution, @solanyn!
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni

" resources_per_node={\n",
" \"cpu\": \"3\",\n",
" \"memory\": \"8Gi\",\n",
" \"nvidia.com/gpu\": 1,\n",
Member

Is there a way for us to test this script on CPU, maybe for just 1 epoch, or is the distilbert-base-uncased model too large for CPU testing?

Author

Yeah, I can try it out on CPU and make the changes!

Member

Thanks, it would be nice to see if we could run this example as part of our E2Es: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml#L55
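
One way to read the CPU-testing request in this thread, as a minimal sketch only: keep the CPU and memory values quoted in the diff, add the GPU entry only on request, and pass a single epoch for a smoke test. The helper name `build_job_settings` and the `NUM_EPOCHS` argument are hypothetical and do not appear in the PR.

```python
def build_job_settings(use_gpu: bool, num_epochs: int = 1) -> dict:
    """Build illustrative settings for a CPU smoke test or a GPU run."""
    # CPU and memory values are the ones quoted in the diff under review.
    resources = {"cpu": "3", "memory": "8Gi"}
    if use_gpu:
        # Request a GPU only when asked, so the same example can run on CPU nodes.
        resources["nvidia.com/gpu"] = 1
    return {
        "resources_per_node": resources,
        # NUM_EPOCHS is a hypothetical argument name used for illustration.
        "args": {"NUM_EPOCHS": str(num_epochs)},
    }


cpu_settings = build_job_settings(use_gpu=False)
gpu_settings = build_job_settings(use_gpu=True, num_epochs=3)
print(cpu_settings["resources_per_node"])
print(gpu_settings["resources_per_node"])
```

Whether one epoch of distilbert-base-uncased fits within an E2E time budget on CPU would still need to be measured.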

"id": "c31bc8f2",
"metadata": {},
"source": [
"# Install the KubeFlow SDK and dependencies"
Member

Let's remove the installation of the Kubeflow SDK as a prerequisite for this Notebook.
It is better to redirect users to this guide to get started with Kubeflow Trainer:
https://www.kubeflow.org/docs/components/trainer/getting-started/

" \n",
" trainer.train()\n",
"\n",
" CloudPath(f'gs://{args[\"BUCKET\"]}/{args[\"MODEL_NAME\"]}').upload_from(args[\"MODEL_NAME\"])"
Member

Can we put the location (e.g. gs://bucket-name) under args, so users can use either s3 or gs with the CloudPath API?
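
A minimal sketch of what this request could look like, assuming the full storage location is passed as an argument instead of hard-coding the `gs://` scheme. The key `OUTPUT_LOCATION` and the helper `output_uri` are hypothetical names; only `MODEL_NAME` and the `CloudPath(...).upload_from(...)` call come from the diff.

```python
def output_uri(args: dict) -> str:
    """Join a user-supplied storage location with the model name."""
    # OUTPUT_LOCATION is a hypothetical key, e.g. "gs://bucket-name" or "s3://bucket-name".
    base = args["OUTPUT_LOCATION"].rstrip("/")
    return f"{base}/{args['MODEL_NAME']}"


example_args = {
    "OUTPUT_LOCATION": "s3://bucket-name/",
    "MODEL_NAME": "distilbert-base-uncased",
}
print(output_uri(example_args))

# In the training function, the upload line from the diff would then become:
# CloudPath(output_uri(args)).upload_from(args["MODEL_NAME"])
```

Because the scheme travels with the argument, the training function itself no longer needs to know which cloud provider is in use.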

"\n",
" squad = squad.train_test_split(test_size=0.2)\n",
" \n",
" tokenizer = AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\")\n",
Member

Can we use the MODEL_NAME arg for the model here?

Suggested change
" tokenizer = AutoTokenizer.from_pretrained(\"distilbert/distilbert-base-uncased\")\n",
" tokenizer = AutoTokenizer.from_pretrained(\"distilbert/args["MODEL_NAME"]\")\n",

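Note that the suggested line, taken literally, places `args["MODEL_NAME"]` inside a string literal, so the name would not be substituted. A minimal sketch of the apparent intent, assuming `MODEL_NAME` holds the model name without the `distilbert/` organization prefix (an assumption; the PR does not say), is to interpolate with an f-string. The helper `tokenizer_model_id` is a hypothetical name.

```python
def tokenizer_model_id(args: dict) -> str:
    """Build the Hugging Face model ID from the MODEL_NAME argument."""
    # Assumes MODEL_NAME excludes the "distilbert/" organization prefix.
    return f"distilbert/{args['MODEL_NAME']}"


args = {"MODEL_NAME": "distilbert-base-uncased"}

# Without interpolation the placeholder text is passed through unchanged.
literal = 'distilbert/args["MODEL_NAME"]'

print(literal)
print(tokenizer_model_id(args))

# The notebook line would then read:
# tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id(args))
```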
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Apr 25, 2025
New changes are detected. LGTM label has been removed.

@andreyvelich
Member

/ok-to-test
