Skip to content

Commit e4a2be8

Browse files
authored
chore(docs): Update Copilot Instructions and AGENTS.md (#248)
* chore(docs): Update Copilot Instructions Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update AGENTS.md Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Trainer dir Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent 92eb7d0 commit e4a2be8

File tree

2 files changed

+98
-173
lines changed

2 files changed

+98
-173
lines changed

.github/copilot-instructions.md

Lines changed: 68 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,69 @@
1-
# Copilot Instructions for Kubeflow SDK
1+
# GitHub Copilot Code Review Instructions
22

3-
**IMPORTANT**: Always refer to the `AGENTS.md` file in the root directory for comprehensive project guidelines, coding standards, and architectural decisions.
3+
## Review Philosophy
4+
5+
- Only comment when you have HIGH CONFIDENCE (>80%) that an issue exists
6+
- Be concise: one sentence per comment when possible
7+
- Focus on actionable feedback, not observations
8+
- When reviewing text, only comment on clarity issues if the text is genuinely confusing or could lead to errors.
9+
"Could be clearer" is not the same as "is confusing" - stay silent unless HIGH confidence it will cause problems
10+
11+
## Priority Areas (Review These)
12+
13+
### Security & Safety
14+
15+
- Unsafe code blocks without justification
16+
- Command injection risks (shell commands, user input)
17+
- Path traversal vulnerabilities
18+
- Credential exposure or hardcoded secrets
19+
- Missing input validation on external data
20+
- Improper error handling that could leak sensitive info
21+
22+
### Correctness Issues
23+
24+
- Logic errors that could cause panics or incorrect behavior
25+
- Resource leaks (files, connections, memory)
26+
- Off-by-one errors or boundary conditions
27+
- Optional types that don't need to be optional
28+
- Booleans that should default to false but are set as optional
29+
- Overly defensive code that adds unnecessary checks
30+
- Unnecessary comments that just restate what the code already shows (remove them)
31+
32+
### Architecture & Patterns
33+
34+
- Code that violates existing patterns in the codebase
35+
- Missing error handling
36+
- Code that is not following [Python PEP8](https://peps.python.org/pep-0008/) guidelines.
37+
38+
## Project-Specific Context
39+
40+
- See [AGENTS.md](../AGENTS.md) in the root directory for project guidelines and architecture decisions.
41+
42+
## CI Pipeline Context
43+
44+
**Important**: You review PRs immediately, before CI completes. Do not flag issues that CI will catch.
45+
46+
### What Our CI Checks
47+
48+
- `.github/workflows/test-python.yaml` - linting, unit, and integration tests
49+
- `.github/workflows/test-e2e.yaml` - e2e tests
50+
51+
## Skip These (IMPORTANT)
52+
53+
Do not comment on:
54+
55+
- **Style/formatting** - CI handles this (ruff, ty, uv)
56+
- **Test failures** - CI handles this (full test suite)
57+
- **Missing dependencies** - CI handles this
58+
59+
## Response Format
60+
61+
When you identify an issue:
62+
63+
1. **State the problem** (1 sentence)
64+
2. **Why it matters** (1 sentence, only if not obvious)
65+
3. **Suggested fix** (code snippet or specific action)
66+
67+
## When to Stay Silent
68+
69+
If you're uncertain whether something is an issue, don't comment. False positives create noise and reduce trust in the review process.

AGENTS.md

Lines changed: 30 additions & 171 deletions
Original file line numberDiff line numberDiff line change
@@ -31,15 +31,34 @@ Before writing code, agents should:
3131
## Repository Map
3232

3333
```
34-
kubeflow/trainer/ # Trainer component
35-
├── backends/kubernetes/ # K8s backend implementation + tests
36-
├── backends/localprocess/ # Local process backend
37-
├── api/ # Client API, TrainerClient
38-
├── types/ # Pydantic v2 data models
39-
└── utils/ # Shared helpers + tests
40-
docs/ # Diagrams and proposals
41-
scripts/ # Project scripts (e.g., changelog)
42-
Root files: AGENTS.md, README.md, pyproject.toml, Makefile, CI workflows
34+
.github/ # GitHub actions for CI/CD
35+
docs/ # Kubeflow SDK documentation
36+
examples/ # Kubeflow SDK examples
37+
kubeflow/ # Main Python package
38+
├── common/ # Shared utilities and types across all projects
39+
|
40+
├── trainer/ # Kubeflow Trainer
41+
│ ├── api/ # TrainerClient - main user interface
42+
│ ├── backends/ # Execution backend implementations
43+
│ │ ├── kubernetes/ # Kubernetes backend
44+
│ │ │ ├── backend.py
45+
│ │ ├── container/ # Container backend for local development
46+
│ │ │ ├── backend.py
47+
│ │ │ └── adapters/ # Docker & Podman adapter implementations
48+
│ │ └── localprocess/ # Subprocess backend for quick prototyping
49+
│ ├── constants/ # Common trainer constants and defaults
50+
│ ├── options/ # Backend configuration options (KubernetesOptions, etc.)
51+
│ ├── types/ # Common trainer types (e.g. TrainJob, CustomTrainer, BuiltinTrainer)
52+
|
53+
├── optimizer/ # Kubeflow Optimizer
54+
│ ├── api/ # OptimizerClient - main user interface
55+
│ ├── backends/ # Execution backend implementations
56+
│ │ └── kubernetes/ # Kubernetes backend
57+
│ ├── types/ # Common optimizer types (e.g. OptimizationJob, Search)
58+
│ └── constants/ # Common optimizer constants and defaults
59+
|
60+
└── hub/ # Kubeflow Hub
61+
└── api/ # ModelRegistryClient - main user interface
4362
```
4463

4564
## Environment & Tooling
@@ -50,7 +69,7 @@ Root files: AGENTS.md, README.md, pyproject.toml, Makefile, CI workflows
5069
- **Build**: Hatchling (optional `uv build`)
5170
- **Pre-commit**: Config provided and enforced in CI
5271

53-
## Quick Start
72+
## Commands
5473

5574
<!-- BEGIN: AGENT_COMMANDS -->
5675

@@ -202,32 +221,7 @@ def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
202221
- Include `name`, `expected_status`, `config`, `expected_output/error` fields
203222
- Print test execution status for debugging
204223
- Handle both success and exception cases in the same test function
205-
206-
**Test Quality Checklist:**
207-
208-
- [ ] Tests fail when your new logic is broken
209-
- [ ] Happy path is covered
210-
- [ ] Edge cases and error conditions are tested
211-
- [ ] Use fixtures/mocks for external dependencies
212-
- [ ] Tests are deterministic (no flaky tests)
213-
214-
**Test Examples:**
215-
216-
Simple test:
217-
218-
```python
219-
def test_filter_completed_jobs():
220-
"""Test filtering completed jobs from a list."""
221-
jobs = ["job-1", "job-2", "job-3"]
222-
completed = {"job-1", "job-2"}
223-
224-
result = filter_completed_jobs(jobs, completed)
225-
226-
assert result == ["job-3"]
227-
assert len(result) == 1
228-
```
229-
230-
Parametrized test cases (preferred for multiple scenarios):
224+
- Use `pytest.mark.parametrize` with `TestCase` dataclass for multiple test scenarios:
231225

232226
```python
233227
@pytest.mark.parametrize(
@@ -320,138 +314,3 @@ def submit_job(name: str, config: dict, *, priority: str = "normal") -> str:
320314
- Document all parameters, return values, and exceptions
321315
- Keep descriptions concise but clear
322316
- Use Pydantic v2 models in `kubeflow.trainer.types` for schemas
323-
324-
### 6. Architectural Improvements
325-
326-
**When you encounter code that could be improved, suggest better designs:**
327-
328-
**Poor Design:**
329-
330-
```python
331-
def process_training(data, k8s_client, storage, logger):
332-
# Function doing too many things
333-
validated = validate_data(data)
334-
job = k8s_client.create_job(validated)
335-
storage.save_metadata(job)
336-
logger.info(f"Created job {job.name}")
337-
return job
338-
```
339-
340-
**Better Design:**
341-
342-
```python
343-
@dataclass
344-
class TrainingJobResult:
345-
"""Result of training job submission."""
346-
job_id: str
347-
status: str
348-
created_at: datetime
349-
350-
class TrainingJobManager:
351-
"""Handles training job lifecycle operations."""
352-
353-
def __init__(self, k8s_client: KubernetesClient, storage: Storage):
354-
self.k8s = k8s_client
355-
self.storage = storage
356-
357-
def submit_job(self, config: TrainingConfig) -> TrainingJobResult:
358-
"""Submit and track a new training job."""
359-
validated_config = self._validate_config(config)
360-
job = self._create_k8s_job(validated_config)
361-
self._save_job_metadata(job)
362-
return TrainingJobResult(
363-
job_id=job.name,
364-
status=job.status,
365-
created_at=job.created_at
366-
)
367-
```
368-
369-
## Component: Trainer
370-
371-
**Client entrypoints**: `kubeflow.trainer.api.TrainerClient` and trainer definitions such as `CustomTrainer`
372-
373-
**Trainer Types**:
374-
375-
**CustomTrainer** (`kubeflow.trainer.types.CustomTrainer`):
376-
377-
- **Purpose**: For custom, self-contained training functions that you write yourself
378-
- **Flexibility**: Complete control over the training process
379-
- **Use case**: "Bring your own training code" - maximum flexibility
380-
- **Key attributes**: `func` (your training function), `func_args`, `packages_to_install`, `pip_index_urls`, `num_nodes`, `resources_per_node`, `env`
381-
382-
**CustomTrainerContainer** (`kubeflow.trainer.types.CustomTrainerContainer`):
383-
384-
- **Purpose**: For custom, self-contained container image that you create yourself
385-
- **Flexibility**: Complete control over the training process
386-
- **Use case**: "Bring your own training image" - maximum flexibility
387-
- **Key attributes**: `num_nodes`, `resources_per_node`, `env`
388-
389-
**BuiltinTrainer** (`kubeflow.trainer.types.BuiltinTrainer`):
390-
391-
- **Purpose**: For pre-built training frameworks with existing fine-tuning logic
392-
- **Convenience**: Just configure parameters, training logic is already implemented
393-
- **Use case**: "Use our pre-built trainers" - convenience for common scenarios
394-
- **Key attributes**: `config` (currently only supports `TorchTuneConfig` for LLM fine-tuning with TorchTune)
395-
396-
**Backends**:
397-
398-
- `localprocess`: local execution for fast iteration
399-
- `kubernetes`: K8s-backed jobs, see `backends/kubernetes`
400-
401-
**Typical flow**:
402-
403-
1. Get runtime, define trainer, submit with `TrainerClient().train(...)`
404-
2. `wait_for_job_status(...)` then fetch logs with `get_job_logs(...)`
405-
3. For full example, see README "Run your first PyTorch distributed job"
406-
407-
**Integration patterns**:
408-
409-
- Follow existing patterns in `kubeflow.trainer.backends` for new backends
410-
- Use `kubeflow.trainer.types` for data models and type definitions
411-
- Implement proper error handling and resource cleanup
412-
- Include comprehensive tests for backend implementations
413-
414-
## CI & PRs
415-
416-
**PR Requirements**:
417-
418-
- Title must follow Conventional Commits:
419-
- Types: `chore`, `fix`, `feat`, `revert`
420-
- Scopes: `ci`, `docs`, `examples`, `scripts`, `test`, `trainer`
421-
- CI runs `make verify` and tests on Python 3.9/3.11
422-
- Keep changes focused and minimal; align with existing style
423-
424-
## Releasing
425-
426-
**Version management**:
427-
428-
```bash
429-
make release VERSION=X.Y.Z # Updates kubeflow/__init__.py and generates changelog
430-
```
431-
432-
- Do not commit secrets; verify coverage and lint pass before tagging
433-
434-
## Troubleshooting
435-
436-
- **`uv` not found**: run `make uv` or re-run `make install-dev`
437-
- **Ruff not installed**: `make install-dev` ensures tools; or `uv tool install ruff`
438-
- **Virtualenv issues**: remove `.venv` and re-run `make install-dev`
439-
- **Tests failing locally but not in CI**: run `make verify` to match CI formatting and lint rules
440-
441-
## Quick Reference Checklist
442-
443-
Before submitting code changes:
444-
445-
- [ ] **Breaking Changes**: Verified no public API changes without deprecation
446-
- [ ] **Type Hints**: All functions have complete type annotations and return types
447-
- [ ] **Tests**: New functionality is fully tested with unit tests
448-
- [ ] **Security**: No dangerous patterns (eval, bare except, resource leaks, etc.)
449-
- [ ] **Documentation**: Google-style docstrings for public functions
450-
- [ ] **Code Quality**: `make verify` passes (lint and format)
451-
- [ ] **Architecture**: Suggested improvements where applicable
452-
- [ ] **Commit Message**: Follows Conventional Commits format
453-
454-
## Community & Support
455-
456-
- **Issues/Discussions**: https://github.com/kubeflow/sdk
457-
- **Contributing**: see CONTRIBUTING.md

0 commit comments

Comments
 (0)