Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
70 changes: 68 additions & 2 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,69 @@
# Copilot Instructions for Kubeflow SDK
# GitHub Copilot Code Review Instructions

**IMPORTANT**: Always refer to the `AGENTS.md` file in the root directory for comprehensive project guidelines, coding standards, and architectural decisions.
## Review Philosophy

- Only comment when you have HIGH CONFIDENCE (>80%) that an issue exists
- Be concise: one sentence per comment when possible
- Focus on actionable feedback, not observations
- When reviewing text, only comment on clarity issues if the text is genuinely confusing or could lead to errors.
"Could be clearer" is not the same as "is confusing" - stay silent unless HIGH confidence it will cause problems

## Priority Areas (Review These)

### Security & Safety

- Unsafe code blocks without justification
- Command injection risks (shell commands, user input)
- Path traversal vulnerabilities
- Credential exposure or hardcoded secrets
- Missing input validation on external data
- Improper error handling that could leak sensitive info

### Correctness Issues

- Logic errors that could cause panics or incorrect behavior
- Resource leaks (files, connections, memory)
- Off-by-one errors or boundary conditions
- Optional types that don't need to be optional
- Booleans that should default to false but are set as optional
- Overly defensive code that adds unnecessary checks
- Unnecessary comments that just restate what the code already shows (remove them)

### Architecture & Patterns

- Code that violates existing patterns in the codebase
- Missing error handling
- Code that is not following [Python PEP8](https://peps.python.org/pep-0008/) guidelines.

## Project-Specific Context

- See [AGENTS.md](../AGENTS.md) in the root directory for project guidelines and architecture decisions.

## CI Pipeline Context

**Important**: You review PRs immediately, before CI completes. Do not flag issues that CI will catch.

### What Our CI Checks

- `.github/workflows/test-python.yaml` - linting, unit, and integration tests
- `.github/workflows/test-e2e.yaml` - e2e tests

## Skip These (IMPORTANT)

Do not comment on:

- **Style/formatting** - CI handles this (ruff, ty, uv)
- **Test failures** - CI handles this (full test suite)
- **Missing dependencies** - CI handles this

## Response Format

When you identify an issue:

1. **State the problem** (1 sentence)
2. **Why it matters** (1 sentence, only if not obvious)
3. **Suggested fix** (code snippet or specific action)

## When to Stay Silent

If you're uncertain whether something is an issue, don't comment. False positives create noise and reduce trust in the review process.
201 changes: 30 additions & 171 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,15 +31,34 @@ Before writing code, agents should:
## Repository Map

```
kubeflow/trainer/ # Trainer component
├── backends/kubernetes/ # K8s backend implementation + tests
├── backends/localprocess/ # Local process backend
├── api/ # Client API, TrainerClient
├── types/ # Pydantic v2 data models
└── utils/ # Shared helpers + tests
docs/ # Diagrams and proposals
scripts/ # Project scripts (e.g., changelog)
Root files: AGENTS.md, README.md, pyproject.toml, Makefile, CI workflows
.github/ # GitHub actions for CI/CD
docs/ # Kubeflow SDK documentation
examples/ # Kubeflow SDK examples
kubeflow/ # Main Python package
├── common/ # Shared utilities and types across all projects
|
├── trainer/ # Kubeflow Trainer
│ ├── api/ # TrainerClient - main user interface
│ ├── backends/ # Execution backend implementations
│ │ ├── kubernetes/ # Kubernetes backend
│ │ │ ├── backend.py
│ │ ├── container/ # Container backend for local development
│ │ │ ├── backend.py
│ │ │ └── adapters/ # Docker & Podman adapter implementations
│ │ └── localprocess/ # Subprocess backend for quick prototyping
│ ├── constants/ # Common trainer constants and defaults
│ ├── options/ # Backend configuration options (KubernetesOptions, etc.)
│ ├── types/ # Common trainer types (e.g. TrainJob, CustomTrainer, BuiltinTrainer)
|
├── optimizer/ # Kubeflow Optimizer
│ ├── api/ # OptimizerClient - main user interface
│ ├── backends/ # Execution backend implementations
│ │ └── kubernetes/ # Kubernetes backend
│ ├── types/ # Common optimizer types (e.g. OptimizationJob, Search)
│ └── constants/ # Common optimizer constants and defaults
|
└── hub/ # Kubeflow Hub
└── api/ # ModelRegistryClient - main user interface
```

## Environment & Tooling
Expand All @@ -50,7 +69,7 @@ Root files: AGENTS.md, README.md, pyproject.toml, Makefile, CI workflows
- **Build**: Hatchling (optional `uv build`)
- **Pre-commit**: Config provided and enforced in CI

## Quick Start
## Commands

<!-- BEGIN: AGENT_COMMANDS -->

Expand Down Expand Up @@ -202,32 +221,7 @@ def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
- Include `name`, `expected_status`, `config`, `expected_output/error` fields
- Print test execution status for debugging
- Handle both success and exception cases in the same test function

**Test Quality Checklist:**

- [ ] Tests fail when your new logic is broken
- [ ] Happy path is covered
- [ ] Edge cases and error conditions are tested
- [ ] Use fixtures/mocks for external dependencies
- [ ] Tests are deterministic (no flaky tests)

**Test Examples:**

Simple test:

```python
def test_filter_completed_jobs():
"""Test filtering completed jobs from a list."""
jobs = ["job-1", "job-2", "job-3"]
completed = {"job-1", "job-2"}

result = filter_completed_jobs(jobs, completed)

assert result == ["job-3"]
assert len(result) == 1
```

Parametrized test cases (preferred for multiple scenarios):
- Use `pytest.mark.parametrize` with `TestCase` dataclass for multiple test scenarios:

```python
@pytest.mark.parametrize(
Expand Down Expand Up @@ -320,138 +314,3 @@ def submit_job(name: str, config: dict, *, priority: str = "normal") -> str:
- Document all parameters, return values, and exceptions
- Keep descriptions concise but clear
- Use Pydantic v2 models in `kubeflow.trainer.types` for schemas

### 6. Architectural Improvements

**When you encounter code that could be improved, suggest better designs:**

❌ **Poor Design:**

```python
def process_training(data, k8s_client, storage, logger):
# Function doing too many things
validated = validate_data(data)
job = k8s_client.create_job(validated)
storage.save_metadata(job)
logger.info(f"Created job {job.name}")
return job
```

✅ **Better Design:**

```python
@dataclass
class TrainingJobResult:
"""Result of training job submission."""
job_id: str
status: str
created_at: datetime

class TrainingJobManager:
"""Handles training job lifecycle operations."""

def __init__(self, k8s_client: KubernetesClient, storage: Storage):
self.k8s = k8s_client
self.storage = storage

def submit_job(self, config: TrainingConfig) -> TrainingJobResult:
"""Submit and track a new training job."""
validated_config = self._validate_config(config)
job = self._create_k8s_job(validated_config)
self._save_job_metadata(job)
return TrainingJobResult(
job_id=job.name,
status=job.status,
created_at=job.created_at
)
```

## Component: Trainer

**Client entrypoints**: `kubeflow.trainer.api.TrainerClient` and trainer definitions such as `CustomTrainer`

**Trainer Types**:

**CustomTrainer** (`kubeflow.trainer.types.CustomTrainer`):

- **Purpose**: For custom, self-contained training functions that you write yourself
- **Flexibility**: Complete control over the training process
- **Use case**: "Bring your own training code" - maximum flexibility
- **Key attributes**: `func` (your training function), `func_args`, `packages_to_install`, `pip_index_urls`, `num_nodes`, `resources_per_node`, `env`

**CustomTrainerContainer** (`kubeflow.trainer.types.CustomTrainerContainer`):

- **Purpose**: For custom, self-contained container image that you create yourself
- **Flexibility**: Complete control over the training process
- **Use case**: "Bring your own training image" - maximum flexibility
- **Key attributes**: `num_nodes`, `resources_per_node`, `env`

**BuiltinTrainer** (`kubeflow.trainer.types.BuiltinTrainer`):

- **Purpose**: For pre-built training frameworks with existing fine-tuning logic
- **Convenience**: Just configure parameters, training logic is already implemented
- **Use case**: "Use our pre-built trainers" - convenience for common scenarios
- **Key attributes**: `config` (currently only supports `TorchTuneConfig` for LLM fine-tuning with TorchTune)

**Backends**:

- `localprocess`: local execution for fast iteration
- `kubernetes`: K8s-backed jobs, see `backends/kubernetes`

**Typical flow**:

1. Get runtime, define trainer, submit with `TrainerClient().train(...)`
2. `wait_for_job_status(...)` then fetch logs with `get_job_logs(...)`
3. For full example, see README "Run your first PyTorch distributed job"

**Integration patterns**:

- Follow existing patterns in `kubeflow.trainer.backends` for new backends
- Use `kubeflow.trainer.types` for data models and type definitions
- Implement proper error handling and resource cleanup
- Include comprehensive tests for backend implementations

## CI & PRs

**PR Requirements**:

- Title must follow Conventional Commits:
- Types: `chore`, `fix`, `feat`, `revert`
- Scopes: `ci`, `docs`, `examples`, `scripts`, `test`, `trainer`
- CI runs `make verify` and tests on Python 3.9/3.11
- Keep changes focused and minimal; align with existing style

## Releasing

**Version management**:

```bash
make release VERSION=X.Y.Z # Updates kubeflow/__init__.py and generates changelog
```

- Do not commit secrets; verify coverage and lint pass before tagging

## Troubleshooting

- **`uv` not found**: run `make uv` or re-run `make install-dev`
- **Ruff not installed**: `make install-dev` ensures tools; or `uv tool install ruff`
- **Virtualenv issues**: remove `.venv` and re-run `make install-dev`
- **Tests failing locally but not in CI**: run `make verify` to match CI formatting and lint rules

## Quick Reference Checklist

Before submitting code changes:

- [ ] **Breaking Changes**: Verified no public API changes without deprecation
- [ ] **Type Hints**: All functions have complete type annotations and return types
- [ ] **Tests**: New functionality is fully tested with unit tests
- [ ] **Security**: No dangerous patterns (eval, bare except, resource leaks, etc.)
- [ ] **Documentation**: Google-style docstrings for public functions
- [ ] **Code Quality**: `make verify` passes (lint and format)
- [ ] **Architecture**: Suggested improvements where applicable
- [ ] **Commit Message**: Follows Conventional Commits format

## Community & Support

- **Issues/Discussions**: https://github.com/kubeflow/sdk
- **Contributing**: see CONTRIBUTING.md
Loading