Conversation
Proposing framework-aware trainer classes (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery via the trainer.kubeflow.org/framework label, and a RuntimeConfig dataclass to separate per-job environment settings from training logic. Issue: kubeflow#285 Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
Pull request overview
This PR adds a comprehensive design proposal for specialized trainer abstractions and a RuntimeConfig dataclass to the Kubeflow SDK. The proposal addresses current limitations in the SDK's trainer subsystem by introducing framework-aware trainer classes that bridge the gap between the generic CustomTrainer and the highly specific BuiltinTrainer.
Changes:
- Adds a detailed design proposal document describing a new BaseTrainer abstract interface and specialized framework trainers (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer)
- Proposes a RuntimeConfig dataclass to cleanly separate runtime environment settings from training logic
- Includes comprehensive documentation covering motivation, design details, API examples, migration strategy, test plan, and alternatives considered
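For readers skimming the thread, here is a rough sketch of how the proposed pieces could fit together from a user's point of view. It is assembled from the proposal excerpts quoted below; the exact field names (`func`, `num_nodes`, `env`, `index_url`) are illustrative assumptions, not final signatures.

```python
from kubeflow.trainer import TrainerClient
# Hypothetical imports: the final module layout is defined by the KEP, not this sketch.
from kubeflow.trainer import TorchTrainer, RuntimeConfig, PipConfig

def train_fn():
    # User-defined PyTorch DDP training logic that runs inside each pod.
    ...

client = TrainerClient()
client.train(
    trainer=TorchTrainer(
        func=train_fn,      # training function shipped to the runtime (assumed field name)
        num_nodes=2,        # scaling parameters stay on the trainer (assumed field name)
    ),
    runtime_config=RuntimeConfig(  # per-job environment, separate from training logic
        packages_to_install=["torchvision"],
        env={"NCCL_DEBUG": "INFO"},                                  # assumed field name
        pip_config=PipConfig(index_url="https://pypi.org/simple"),   # assumed field name
    ),
)
```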
| 3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
|    Specialized trainers are an additional option, not a replacement.
| 4. **Tier 2 trainer implementations.** This proposal defines the extension mechanism
|    and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" throughout the document. This applies to references in text and comments, though the class name "HuggingFaceTrainer" would be correct as Python class names don't use spaces.
Suggested change:
- and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
+ and interface. Concrete Tier 2 implementations (Hugging Face, DeepSpeed, Unsloth,
| # Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)
|
| @dataclass
| class TransformersTrainer(BaseTrainer):
|     """Trainer for HuggingFace Transformers training.
|
|     Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the comment and docstring text.
Suggested change:
- # Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)
- @dataclass
- class TransformersTrainer(BaseTrainer):
-     """Trainer for HuggingFace Transformers training.
-     Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
+ # Example: future Hugging Face trainer (NOT part of this proposal's implementation scope)
+ @dataclass
+ class TransformersTrainer(BaseTrainer):
+     """Trainer for Hugging Face Transformers training.
+     Wraps Hugging Face's Trainer API and maps to a PyTorch runtime.
|       │
| ┌─────┴──────────┐
| │                │
| HuggingFace      DeepSpeed
The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the diagram text.
Suggested change:
- HuggingFace      DeepSpeed
+ Hugging Face     DeepSpeed
@szaher @andreyvelich — this proposal is really well thought out, especially the separation between BaseTrainer and framework-specific trainers along with RuntimeConfig. I had a question regarding TorchTrainer extensibility and runtime selection: Given that multiple torch-based runtimes may coexist (as discussed earlier in #287), how do you envision selecting the appropriate runtime for a given TorchTrainer instance? One possible approach could be:
This might help keep the API simple while still supporting multiple backends (e.g., TorchTune vs custom PEFT/TRL runtimes for LLM workflows). Curious if something along these lines aligns with the intended direction. Happy to explore this further or prototype once the design is clearer.
kramaranya
left a comment
Thanks @szaher!
Looks great to me, and it should be a great improvement to the user experience in Kubeflow SDK!
/assign @andreyvelich @astefanutti @briangallagher @Fiona-Waters @MStokluska
| 2. **`RuntimeConfig` dataclass** — A dedicated configuration object that cleanly separates
|    per-job runtime environment settings (packages, pip config, environment variables) from
|    training logic and scaling parameters. This replaces the current pattern where
|    `CustomTrainer` conflates runtime concerns with trainer concerns.
Would this require runtime/controller changes?
No, it shouldn't require any backend changes, but it should align closely with them.
| 3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
|    Specialized trainers are an additional option, not a replacement.
Is the plan to eventually deprecate those or do we want to always maintain both options?
This is a non-goal, not a goal of the KEP.
| if runtime.trainer.framework not in self.supported_frameworks:
|     raise ValueError(
|         f"{type(self).__name__} supports frameworks "
|         f"{self.supported_frameworks}, but runtime '{runtime.name}' "
|         f"has framework '{runtime.trainer.framework}'"
|     )
We would also need to validate runtime.trainer.trainer_type.
is that supported by a backend label or annotation?
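To make the `trainer_type` suggestion concrete, here is a hedged sketch of how the framework check quoted above could be extended, assuming the backend surfaces the trainer type on the runtime object (the `supported_trainer_types` list is an assumption, not part of the proposal):

```python
def validate_runtime(self, runtime) -> None:
    # Framework compatibility check, as quoted from the proposal above.
    if runtime.trainer.framework not in self.supported_frameworks:
        raise ValueError(
            f"{type(self).__name__} supports frameworks "
            f"{self.supported_frameworks}, but runtime '{runtime.name}' "
            f"has framework '{runtime.trainer.framework}'"
        )
    # Hypothetical extension discussed here: also validate the trainer type,
    # if the backend exposes it (e.g. via a label or annotation on the runtime).
    trainer_type = getattr(runtime.trainer, "trainer_type", None)
    if trainer_type is not None and trainer_type not in self.supported_trainer_types:
        raise ValueError(
            f"{type(self).__name__} supports trainer types "
            f"{self.supported_trainer_types}, but runtime '{runtime.name}' "
            f"has trainer type '{trainer_type}'"
        )
```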
| def get_framework_args(self) -> dict:
|     args = {}
|     if self.max_restarts is not None:
|         args["max-restarts"] = str(self.max_restarts)
|     if self.monitor_interval is not None:
|         args["monitor-interval"] = str(self.monitor_interval)
|     return args
Where do these new args go in the TrainJob spec?
Those aren't backend args, they're framework args. They get passed via the entrypoint to the script that runs in the pods.
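As an illustration of "passed via the entrypoint": if the TorchTrainer runtime wraps the user script with `torchrun`, the dict returned by `get_framework_args()` could be rendered as command-line flags roughly like this. The helper below is a hypothetical sketch; how these keys map onto the actual launcher's flag spelling is a backend detail not settled in this thread.

```python
def build_entrypoint(trainer, script: str = "/workspace/train.py") -> list[str]:
    # Hypothetical helper: render framework args as launcher flags.
    cmd = ["torchrun"]
    for flag, value in trainer.get_framework_args().items():
        cmd.append(f"--{flag}={value}")  # e.g. --max-restarts=3, --monitor-interval=5
    cmd.append(script)
    return cmd

# TorchTrainer(max_restarts=3) would roughly yield:
# ["torchrun", "--max-restarts=3", "/workspace/train.py"]
```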
@kramaranya: GitHub didn't allow me to assign the following users: MStokluska. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
+1 on the points around validation and argument placement; I had a related question while reading through this. For the framework-specific args (e.g.
This also seems tied to whether
Clarifying this mapping would help understand how far the abstraction goes (SDK-only vs API/CRD impact).
Thanks @szaher - it looks really good to me!
/lgtm
astefanutti
left a comment
Thanks @szaher!
/lgtm
/assign @kubeflow/kubeflow-sdk-team
andreyvelich
left a comment
Thanks @szaher! I left a few comments, sorry for the delay.
| <!--
| This proposal targets the kubeflow/sdk repository.
| Directory: docs/proposals/285-specialized-trainers/README.md
| -->
|
| | | |
| | -------------- | ------------------------------------------------------------ |
| | **Authors** | @szaher |
| | **Status** | Draft |
| | **Created** | 2026-02-11 |
| | **Reviewers** | |
| | **Supersedes** | N/A |
| | **Relevant Issues** | https://github.com/kubeflow/sdk/issues/285 |
Instead of this, can we consider adding a kep.yaml in the format we use for k8s? Check: https://github.com/kubernetes-sigs/jobset/blob/main/keps/463-ElasticJobsets/kep.yaml
Also, you can add implementation history as here: https://github.com/kubeflow/trainer/blob/master/docs/proposals/2170-kubeflow-trainer-v2/README.md#implementation-history
| (`kubeflow/sdk`) trainer subsystem:
|
| 1. **Specialized, framework-aware trainer abstractions** — A new `BaseTrainer` abstract
|    interface and a suite of framework-specific implementations (`TorchTrainer`, `MPITrainer`,
Why are MPI and Torch in the same category?
MPI is just a technology we use for distributed workloads.
As you can see, we create dedicated runtimes like DeepSpeed Distributed and MLX Distributed which leverage MPI: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes
It is similar to TorchTrainer: it's a sort of custom trainer where we pass your code to the dedicated runtime as is.
But we don't support an MPI Runtime upstream; we only deploy DeepSpeedRuntime and MLXRuntime, which are MPI-based, so I am not sure if we should have a dedicated MPITrainer.
Can we have a DeepSpeed runtime based on the Torch plugin and another runtime based on MPI?
DeepSpeed can be bootstrapped via mpirun or torchrun: https://www.deepspeed.ai/getting-started/
It depends on how users want to configure it, but today we serve only an MPI-based runtime for DeepSpeed in Trainer: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/deepspeed_distributed.yaml
Shall we start with a dedicated DeepSpeedTrainer which will be MPI-based for now?
@astefanutti Any thoughts?
Maybe @kuizhiqing or @tenzen-y has more information when users want to use mpirun or torchrun to run DeepSpeed workloads?
|    `JAXTrainer`, etc.) that automatically discover and validate the correct
|    `ClusterTrainingRuntime` using the `trainer.kubeflow.org/framework` label. This fills the
|    "missing middle" between the overly generic `CustomTrainer` and the overly narrow
|    `BuiltinTrainer`.
Do we need to make any changes to BuiltinTrainer after this? IIRC, we said that this will be the foundation for Builtins as well.
cc @Sapthagiri777 @Electronic-Waste @khushiiagrawal
This connects directly to the TRL/RLHF integration discussed in #2839: a TRLTrainer (or more generally, a post-training specialized trainer) would sit in Tier 2 here, built on top of BaseTrainer. The BuiltinTrainer today has the hardcoded isinstance(trainer.config, TorchTuneConfig) check; a dynamic registry pattern (as proposed in #2839) would let TRL and other frameworks plug in without modifying that dispatch logic. Happy to draft what that integration point would look like if useful.
I think we don't need the BuiltinTrainer, but I am keeping it for backward compatibility. If we don't care about that, I can update the KEP to remove it.
I suppose the question may be how we handle config-driven trainers for post-training LLM fine-tuning that currently fall under the scope of BuiltInTrainers.
If the BaseTrainer hierarchy is purely for function-based trainers then do we segregate config-driven trainers entirely outside of the BaseTrainer scope?
Or if the BaseTrainer hierarchy is meant to be a catch-all for everything then do we classify these "dynamic LLM trainers" as Tier-2 trainers, and if so, how do we differentiate them from function-based Tier 2 trainers such as the proposed TransformersTrainer?
I am assuming either way we may need to retain an interface for config-driven trainers. But the appropriate placement for them may help to determine the scope of kubeflow/trainer#2839 within the broader scope for this KEP.
Many post-training frameworks (TRL, Unsloth, Axolotl, etc.) are effectively config-driven trainers, where the training entrypoint is a framework trainer object (e.g. SFTTrainer) rather than a user-defined train() function.
While experimenting with a small prototype around the backend registry idea from #2839, one pattern that seemed to work well was resolving the backend based on the trainer config type, e.g.:
@register_backend(TRLConfig)
class TRLBackend(RuntimeBackend):
    ...

This keeps the core BaseTrainer abstraction simple while allowing config-driven frameworks to plug in dynamically without expanding SDK dispatch logic.
Conceptually it would allow both styles to coexist:
- function-based trainers (TorchTrainer(train_fn=...))
- config-driven trainers (TorchTrainer(config=TRLConfig(...)))
with the backend registry selecting the appropriate runtime adapter.
Curious if this aligns with the intended direction for the dynamic trainer framework.
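For context, a minimal sketch of what such a decorator-based backend registry could look like. `register_backend`, `RuntimeBackend`, and `TRLConfig` are names from the commenter's prototype and kubeflow/trainer#2839, not from this KEP, and the implementation below is purely illustrative.

```python
_BACKENDS: dict[type, type] = {}

def register_backend(config_type: type):
    """Associate a trainer config type (e.g. TRLConfig) with a runtime backend class."""
    def decorator(backend_cls: type) -> type:
        _BACKENDS[config_type] = backend_cls
        return backend_cls
    return decorator

def resolve_backend(config):
    """Dispatch on the config's type instead of hardcoded isinstance checks."""
    try:
        return _BACKENDS[type(config)]
    except KeyError:
        raise ValueError(f"No backend registered for config type {type(config).__name__}")

# Usage, mirroring the snippet above:
#   @register_backend(TRLConfig)
#   class TRLBackend(RuntimeBackend): ...
```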
@tariq-hasan is right. @szaher I am trying to understand how we are going to refactor the BuiltinTrainer interface once we implement the BaseTrainer. And how can we dynamically register new LLM fine-tuning framework backends?
| For the majority of distributed training workloads — "run this PyTorch DDP function on
| N nodes" or "run this MPI script across a cluster" — neither abstraction fits well.
| Users must either use the low-level `CustomTrainer` with manual runtime wiring, or
| fall back to raw YAML.
Would love to hear feedback from @vsoch on this proposal.
Do you know if we have any interest from HPC users to leverage our Kubeflow Python SDK for TrainJob submission to manage HPC tasks on k8s?
| 2. Implement Tier 1 framework-specific trainers (`TorchTrainer`, `MPITrainer`,
|    `JAXTrainer`, `XGBoostTrainer`) that auto-discover runtimes by the
|    `trainer.kubeflow.org/framework` label and validate runtime compatibility.
Who will define which frameworks we will support?
For example, we also have DeepSpeed and MLX framework: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/deepspeed_distributed.yaml#L6
@astefanutti @tenzen-y Any thoughts on exposing framework in the Runtime API directly, or it is fine to use labels for now?
The supported frameworks are defined on the trainer side by the training runtimes. Also, my understanding is that we want it to be extensible so users can bring their own runtimes / frameworks / trainers.
I think using the trainer.kubeflow.org/framework is OK. I remember we discussed whether it could be in the TrainingRuntime API when we introduced it: #31 (comment)
Now that we are aligning the SDK to rely on that information, it may warrant promoting it to a proper spec field.
Incidentally, field indexing is now supported for custom resources: https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/#custom-resources-fields
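For illustration, discovering runtimes by that label with the Kubernetes Python client could look roughly like the sketch below; the API group/version and plural are assumptions based on the Trainer v2 CRDs, and the SDK's own client would presumably wrap this.

```python
from kubernetes import client, config

def find_runtimes_by_framework(framework: str) -> list[str]:
    """Return names of ClusterTrainingRuntimes carrying the given framework label."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    result = api.list_cluster_custom_object(
        group="trainer.kubeflow.org",      # assumed API group
        version="v1alpha1",                # assumed API version
        plural="clustertrainingruntimes",
        label_selector=f"trainer.kubeflow.org/framework={framework}",
    )
    return [item["metadata"]["name"] for item in result.get("items", [])]
```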
That sounds great! We can later refactor this to a dedicated Runtime API field if that is needed.
Also, my understanding is that we want it to be extensible so users can bring their own runtimes / frameworks / trainers.
Yeah, that makes sense, and users can bring their own Trainers via Tier 1 and Tier 2 extensions as @szaher mentioned in the KEP.
|   SDK codebase.
| - `PipConfig` is a separate dataclass rather than inline fields, because pip
|   configuration is a distinct concern with its own options.
| - `packages` replaces `packages_to_install` for brevity.
That will make us inconsistent with KFP.
Do we see any issues since @MStokluska is working on PipelinesClient?
#125
cc @kubeflow/wg-pipeline-leads
The proposal here groups all pip configuration under a specific class which is passed to the train function. Pipelines uses a decorator to capture that, so in their case it might be easier to keep all parameters flat for ease of use, as I think most of the dsl.component parameters have a safe default value.
@dsl.component(base_image="quay.io/my-org/training-image:latest",
               packages_to_install=[packages for customizing pipelines task runtime])
def submit_training(model_path: str, num_epochs: int = 3, nproc_per_node: int = 1):
    from kubeflow.trainer import TrainerClient

    client = TrainerClient()
    client.train(..., pip_config=PipConfig(extra packages, enable verbose pip, extra index),
                 runtime_config=RuntimeConfig(extra packages that go on the training runtime))
    client.wait_for_job_status("train-job")
I am fine with moving these fields under RuntimeConfig, but I suggest keeping the parameter name as packages_to_install.
sure, we can keep it as it is.
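Reflecting the outcome of this exchange (fields grouped under `RuntimeConfig`, parameter name kept as `packages_to_install`), the dataclasses could end up looking roughly like the sketch below; fields other than those named in the thread are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PipConfig:
    """Pip-specific options, kept separate since pip configuration is its own concern."""
    index_url: Optional[str] = None                          # placeholder field
    extra_index_urls: list = field(default_factory=list)     # placeholder field
    verbose: bool = False                                    # "enable verbose pip" above

@dataclass
class RuntimeConfig:
    """Per-job runtime environment settings, separate from training logic."""
    packages_to_install: list = field(default_factory=list)  # name kept for KFP consistency
    env: dict = field(default_factory=dict)                  # placeholder field
    pip_config: Optional[PipConfig] = None
```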
| "BaseTrainer", # NEW: accepts any specialized trainer | ||
| ] | ||
| ] = None, | ||
| runtime_config: Optional["RuntimeConfig"] = None, # NEW |
Shall the runtime_config be part of Trainers?
I would imagine a use case where users want to set custom env for the initializers as well.
cc @akshaychitneni
We can move it to trainers if needed. Do we also need to move the pip config?
Do we need to have a separate PipConfig type under RuntimeConfig?
Another feature request I got from users is that sometimes they want to install custom tools on top of the base image (e.g. using apt install <package_name>).
We don't support it today, and users have to rebuild the Docker image manually.
I think this one might need a separate KEP. We can use something like olot
@astefanutti WDYT?
Yes, it would probably make sense for workspace snapshotting (#48) to be part of the dynamic runtime configuration.
Would dynamic runtime configuration make sense for "built-in" / tier-2 trainers? Would we be able to guarantee their runtimes can be configured dynamically?
If not, runtime_config might have to be part of Trainers conceptually, and maybe Initializers eventually.
Yeah, I think this use case should be covered by custom trainers, where users have more control over the training script. But I agree, we can discuss it in follow-up proposals.
| Each Tier 1 trainer maps 1:1 to a framework identified by the
| `trainer.kubeflow.org/framework` label value.
What happens if users deploy two ClusterTrainingRuntimes with the same framework label?
That case is covered in a later section, which specifies that the SDK fails fast and the user has to pass the name explicitly.
But users can still use TorchTrainer if they provide the runtime name, right?
We just validate that Runtime has the correct framework label?
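A small sketch of the resolution behaviour being described: auto-discovery fails fast on an ambiguous framework label, while an explicitly named runtime is accepted and only checked for label compatibility (helper and attribute names are illustrative, not from the proposal text):

```python
def resolve_runtime(trainer, runtimes, runtime_name=None):
    if runtime_name is not None:
        runtime = next(r for r in runtimes if r.name == runtime_name)
        trainer.validate_runtime(runtime)  # still verify the framework label matches
        return runtime

    matches = [r for r in runtimes if r.trainer.framework in trainer.supported_frameworks]
    if not matches:
        raise ValueError(f"No runtime found for frameworks {trainer.supported_frameworks}")
    if len(matches) > 1:
        # Fail fast: the user must disambiguate by passing the runtime name explicitly.
        raise ValueError(
            f"Multiple runtimes match {trainer.supported_frameworks}: "
            f"{[r.name for r in matches]}; pass the runtime name explicitly"
        )
    return matches[0]
```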
| 1. **Framework label check** (in `BaseTrainer.validate_runtime()`): Ensures the
|    runtime's `trainer.kubeflow.org/framework` label value is in the trainer's
|    `supported_frameworks` list.
|
| 2. **Framework-specific checks** (in subclass overrides): For example, `MPITrainer`
|    could verify that the runtime's MPI policy source is configured correctly.
I think we should talk more about the validation. Usually, we offload this to the control plane (e.g. webhook).
If that is only framework compatibility, we can directly fetch the desired runtime using the label.
Sure, I think we need to validate the job before submitting too, and the backend can validate that everything is intact before running the job. SDK validation could just raise warnings.
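If SDK-side checks stay advisory as suggested here, they could surface as warnings while the control plane (webhook) remains the authoritative validator; a small sketch under that assumption:

```python
import warnings

def validate_before_submit(trainer, runtime) -> None:
    """Advisory SDK-side check; the control plane still performs the strict validation."""
    if runtime.trainer.framework not in trainer.supported_frameworks:
        warnings.warn(
            f"Runtime '{runtime.name}' has framework '{runtime.trainer.framework}', "
            f"which {type(trainer).__name__} does not list as supported; "
            "the control plane may reject this TrainJob.",
            UserWarning,
        )
```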
| ```python
| @dataclass
| class TorchTrainer(BaseTrainer):
@kubeflow/kubeflow-sdk-team @Fiona-Waters @abhijeet-dhumal @akshaychitneni Did we ever explore what capabilities Ray Train implements in their trainers, e.g. TorchTrainer?
I know that they do some changes to the dataset and model to attach it to the Distributor.
Additionally (as a future extension), it would be interesting to validate users code in a way that it is correctly configured for distributed training with Torch or any other framework. For example, we can use AI Agents to analyze users' code before submission and suggest changes/enhancements, since we have a lot of context around distributed configuration on k8s.
Ray trainers have a different philosophy: the trainer gets everything (runtime config, scale config, job config, job script(s), worker-specific config) and then calls trainer.fit(), so there is no client initialization similar to Kubeflow's (we need that since we don't have a pre-deployed cluster).
We can discuss it if we want.
Signed-off-by: Saad Zaher <szaher@redhat.com>
New changes are detected. LGTM label has been removed.
Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher,
Hi @szaher, great work on KEP-285! I wanted to flag a related KEP that's complementary to this proposal. KEP-2839: Dynamic LLM Trainer Framework (tracking issue: kubeflow/trainer#2839) introduces a pluggable
In your KEP-285 terminology, these would be Tier 2 config-driven trainers. The two KEPs are designed to be compatible:
This also relates to @tariq-hasan's and @krishdef7's questions about how config-driven trainers fit into the
Happy to coordinate on the design. The KEP PR is here: kubeflow/trainer#3263