feat: Add training progression tracking feature for RHAITrainer type implementation #20

Merged
abhijeet-dhumal merged 13 commits into opendatahub-io:main from abhijeet-dhumal:add-rhai-progression
Nov 25, 2025

Conversation

@abhijeet-dhumal
Member

@abhijeet-dhumal abhijeet-dhumal commented Nov 12, 2025

What this PR does / why we need it:

RHOAIENG-38273
Implement controller support for polling and tracking training job progression from HTTP metrics endpoints exposed by experimental trainers (e.g., TransformersTrainer).

Related to:
opendatahub-io/kubeflow-sdk#21

Sample TrainJob used for testing:
wrapper-test.yaml

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  annotations:
    trainer.opendatahub.io/framework: transformers
    trainer.opendatahub.io/metrics-poll-interval: '30'
    trainer.opendatahub.io/metrics-port: '28080'
    trainer.opendatahub.io/progression-tracking: enabled
    trainer.opendatahub.io/trainerStatus: '{
            "progressPercentage": 35,
            "estimatedRemainingSeconds": 354,
            "estimatedRemainingTimeSummary": "5 minutes",
            "currentStep": 132,
            "totalSteps": 375,
            "currentEpoch": 1,
            "totalEpochs": 3,
            "trainMetrics": {
              "grad_norm": 0.034348633140325546,
              "learning_rate": 0.000013120000000000001,
              "loss": 0.0015
            },
            "evalMetrics": {
              "eval_loss": 0.0012287567369639874,
              "eval_runtime": 9.2987,
              "eval_samples_per_second": 21.508,
              "eval_steps_per_second": 2.689
            },
            "lastUpdatedTime": "2025-11-20T09:50:13Z"
  }'
...
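
As an illustration of how a consumer might read the `trainer.opendatahub.io/trainerStatus` annotation shown in the sample, the following Go sketch decodes its JSON value into a struct. The `TrainerStatus` type, its field types, and the `parseTrainerStatus` helper are hypothetical and inferred only from the sample; the controller's actual types may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation key taken from the sample TrainJob.
const trainerStatusAnnotation = "trainer.opendatahub.io/trainerStatus"

// TrainerStatus mirrors the JSON fields in the sample annotation.
// Field types are guesses based on the sample values only.
type TrainerStatus struct {
	ProgressPercentage            int                `json:"progressPercentage"`
	EstimatedRemainingSeconds     int                `json:"estimatedRemainingSeconds"`
	EstimatedRemainingTimeSummary string             `json:"estimatedRemainingTimeSummary"`
	CurrentStep                   int                `json:"currentStep"`
	TotalSteps                    int                `json:"totalSteps"`
	CurrentEpoch                  int                `json:"currentEpoch"`
	TotalEpochs                   int                `json:"totalEpochs"`
	TrainMetrics                  map[string]float64 `json:"trainMetrics"`
	EvalMetrics                   map[string]float64 `json:"evalMetrics"`
	LastUpdatedTime               string             `json:"lastUpdatedTime"`
}

// parseTrainerStatus extracts and decodes the status annotation.
func parseTrainerStatus(annotations map[string]string) (*TrainerStatus, error) {
	raw, ok := annotations[trainerStatusAnnotation]
	if !ok {
		return nil, fmt.Errorf("annotation %q not found", trainerStatusAnnotation)
	}
	var status TrainerStatus
	if err := json.Unmarshal([]byte(raw), &status); err != nil {
		return nil, fmt.Errorf("decoding %q: %w", trainerStatusAnnotation, err)
	}
	return &status, nil
}

func main() {
	// A shortened version of the sample annotation value.
	annotations := map[string]string{
		trainerStatusAnnotation: `{"progressPercentage": 35, "currentStep": 132, "totalSteps": 375, "trainMetrics": {"loss": 0.0015}}`,
	}
	status, err := parseTrainerStatus(annotations)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d%% (step %d of %d)\n", status.ProgressPercentage, status.CurrentStep, status.TotalSteps)
}
```

Fields missing from the JSON are left at their zero values, so a partially populated annotation still decodes without error.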

Sample progression metrics in the status of a succeeded TrainJob (the annotation was changed slightly in the picture below so that all annotations are visible):
Screenshot 2025-11-20 at 3 43 41 PM

Pre-stop hook injected by the controller:
Screenshot 2025-11-20 at 1 40 09 PM

Checklist:

  • Docs included if any changes are user facing

@coderabbitai

coderabbitai Bot commented Nov 12, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-rhai-progression branch 2 times, most recently from 5c72c15 to 35d006e Compare November 12, 2025 19:53
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review November 12, 2025 19:54
@abhijeet-dhumal abhijeet-dhumal changed the title feat: Add training progression tracking feature for experimental impl… feat: Add training progression tracking feature for RHAITrainer type implementation Nov 12, 2025
@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft November 13, 2025 11:00
Comment thread manifests/rhoai/params.env Outdated
Comment thread manifests/rhoai/rbac_progression_patch.yaml Outdated
Comment thread pkg/rhai/constants/constants.go Outdated
Comment thread pkg/rhai/controller/progression_controller.go Outdated
Comment thread pkg/rhai/controller/progression_controller.go Outdated
Comment thread cmd/trainer-controller-manager/main.go Outdated
Comment thread pkg/rhai/progression/progression.go
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
…ementation

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
…ring trainjob termination

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review November 20, 2025 11:49
@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft November 20, 2025 11:50
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review November 21, 2025 09:41
Comment thread cmd/trainer-controller-manager/main.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/controller/trainjob_controller.go Outdated
@robert-bell
Collaborator

A few small nits, but lgtm otherwise. I'm happy for this to be merged after those nits are resolved.
/lgtm

Comment thread manifests/rhoai/rbac_progression_patch.yaml Outdated
Comment thread manifests/rhoai/params.env Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Comment thread pkg/rhai/progression/progression.go Outdated
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

@astefanutti astefanutti left a comment

Awesome work!

@astefanutti

Thanks @abhijeet-dhumal!

@abhijeet-dhumal
Member Author

Thanks a million @astefanutti @robert-bell 🙌

@abhijeet-dhumal abhijeet-dhumal merged commit bdf239e into opendatahub-io:main Nov 25, 2025
9 of 11 checks passed
4 participants