Add question-answer example for v2 trainer #2580

Merged

@solanyn (Contributor) commented Apr 1, 2025

What this PR does / why we need it:

This PR adds an example for the V2 Training Operator to train a question-answer model based on the HuggingFace recipe adapted to the training operator API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Adds a question-answer example mentioned in #2040 using the pytorch runtime.

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow google-oss-prow bot requested a review from jinchihe April 1, 2025 16:19
@solanyn solanyn force-pushed the solanyn/question-answer-example branch from 8e336ff to 75062fd Compare April 1, 2025 16:20
@Electronic-Waste (Member)

@solanyn Thanks for this amazing work! I appreciate your valuable use case for Trainer. I have left my initial review comments for you.

/cc @kubeflow/wg-training-leads @astefanutti

@google-oss-prow google-oss-prow bot requested review from astefanutti and a team April 2, 2025 04:00
@solanyn (Contributor Author) commented Apr 2, 2025

Thanks @Electronic-Waste, I've updated the branch to address your comments!

@Electronic-Waste (Member)

@solanyn Thanks for this! And welcome to the Kubeflow community!

/lgtm
/assign @kubeflow/wg-training-leads @astefanutti

@coveralls commented Apr 2, 2025

Pull Request Test Coverage Report for Build 14721592890

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 26.207%

Totals
Change from base Build 14713473393: 0.0%
Covered Lines: 684
Relevant Lines: 2610

@andreyvelich (Member) left a comment

Thank you for this great contribution, @solanyn!
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrew Chen <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Apr 25, 2025
@andreyvelich (Member)

/ok-to-test

* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument

Signed-off-by: solanyn <[email protected]>
@andreyvelich (Member)

/rerun-all

* e2e tests fail if TrainJobs launched by a notebook do not finish within 3s
* extend the timeout to 5min so the test blocks until the TrainJob completes or the timeout expires

Signed-off-by: solanyn <[email protected]>
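
For readers following along, the "block until the TrainJob completes or the timeout expires" behaviour described in the commit above can be sketched as a generic polling loop. This is a minimal illustration only, not the code from this PR: the helper `wait_for_completion`, the `get_status` callable, and the status strings `"Succeeded"` and `"Failed"` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, FrozenSet


def wait_for_completion(
    get_status: Callable[[], str],   # hypothetical: returns the job's current status
    timeout_s: float = 300.0,        # 5 minutes, matching the timeout named in the commit
    poll_interval_s: float = 5.0,
    done_states: FrozenSet[str] = frozenset({"Succeeded", "Failed"}),  # assumed names
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it reports a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"job still in state {status!r} after {timeout_s} seconds"
            )
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, max(0.0, deadline - clock())))
```

Passing `clock` and `sleep` as parameters is a design choice for testability: a test can substitute a no-op `sleep` so the loop runs without real waiting.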
@solanyn (Contributor Author) commented Apr 29, 2025

@andreyvelich @Electronic-Waste this should be good to go, let me know if there is anything else you'd like to see in this change!

@andreyvelich (Member) left a comment

Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

@andreyvelich (Member)

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

@google-oss-prow google-oss-prow bot merged commit 074d8b8 into kubeflow:master May 9, 2025
17 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.0 milestone May 9, 2025
6 participants