Commit a561896

Merge pull request #6 from tarun-etikala/poc-e2e
poc-tekton
2 parents beef880 + 36a9c01 commit a561896

8 files changed

Lines changed: 2119 additions & 0 deletions

.github/workflows/e2e-tekton.yml

Lines changed: 469 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
# Tekton-based E2E Testing for Knowledge-Tuning

This directory contains Tekton Pipeline resources for running E2E tests on Red Hat OpenShift AI (RHOAI), with Tekton Dashboard visibility and built-in retry capabilities.

## Two Execution Modes

### Single-Pod Mode (Default, Recommended)

All 6 notebooks run in a **single Pod** with step-by-step visibility:

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ Single Pod (Tekton Task) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Step 1 │─▶│ Step 2 │─▶│ Step 3 │─▶│ Step 4 │─▶│ Step 5 │─▶│ Step 6 │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │
│ ✅ Fast startup (single Pod) ✅ Tekton Dashboard visibility │
│ ✅ Simple debugging (one log stream) ✅ Built-in retry at Task level │
│ ✅ Shared GPU/filesystem ✅ No PVC required │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Multi-Pod Mode (Optional)

Each notebook runs in a **separate Pod** for step isolation:

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Pod 1 │──▶│ Pod 2 │──▶│ Pod 3 │──▶│ Pod 4 │──▶│ Pod 5 │──▶ 6 │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ↓ ↓ ↓ ↓ ↓ │
│ ═══════════════ Shared PVC for data/outputs ═══════════════════ │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Mode Comparison

| Feature | Single-Pod | Multi-Pod |
|---------|------------|-----------|
| Startup time | ✅ Fast | ❌ Slower (Pod per step) |
| Debugging | ✅ Simple | ❌ Multiple logs |
| Step retry | ✅ Task level | ✅ Per-step |
| Dashboard visibility | ✅ Yes | ✅ Yes |
| Step isolation | ❌ Shared | ✅ Isolated |
| PVC required | ❌ No | ✅ Yes |

## Files

| File | Description |
|------|-------------|
| `resources.yaml` | Namespace, PVCs, ServiceAccount |
| `task-e2e-single-pod.yaml` | Single-pod Task (all steps in one Pod) |
| `pipeline-single-pod.yaml` | Pipeline wrapper for single-pod mode |
| `task-notebook-runner.yaml` | Multi-pod Task (one notebook per call) |
| `pipeline-e2e.yaml` | Multi-pod Pipeline (6 TaskRuns) |

## Prerequisites

1. **OpenShift Pipelines Operator** installed on the RHOAI cluster
2. **GPU nodes** available with the label `nvidia.com/gpu.present=true`
3. **GitHub Secrets** configured:
   - `OPENSHIFT_SERVER`: API server URL
   - `OPENSHIFT_TOKEN`: Service account token
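
A quick preflight check can catch missing tooling before a run is triggered. The sketch below is not part of this directory: `missing_commands` is a hypothetical helper that only reports which of the CLIs used in this README (`oc`, `tkn`, `gh`) are absent from `PATH`.

```bash
# Print the name of every command in "$@" that is not on PATH, one per line.
# Returns 0 when all commands are present and 1 when at least one is missing.
missing_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `missing_commands oc tkn gh || echo "install the tools listed above"`. GPU node availability can then be confirmed with `oc get nodes -l nvidia.com/gpu.present=true`.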

## Quick Start

### Option 1: Via GitHub Actions

```bash
# Single-pod mode (default, recommended)
gh workflow run e2e-tekton.yml \
  -f profile=minimal \
  -f mode=single-pod

# Multi-pod mode (step isolation)
gh workflow run e2e-tekton.yml \
  -f profile=minimal \
  -f mode=multi-pod
```

### Option 2: Using the trigger script

```bash
# Single-pod mode (default)
./trigger-pipeline.sh --profile minimal

# Multi-pod mode
./trigger-pipeline.sh --profile minimal --mode multi-pod

# With a different profile, a feature branch, and steps 1 and 6 skipped
./trigger-pipeline.sh --profile standard --branch feature-branch --skip "1,6"
```

### Option 3: Manual Setup on RHOAI

```bash
# 1. Apply resources (single-pod mode)
oc apply -f resources.yaml
oc apply -f task-e2e-single-pod.yaml
oc apply -f pipeline-single-pod.yaml

# 2. Start a PipelineRun
#    (generateName only works with `oc create`; `oc apply` rejects it)
cat <<EOF | oc create -f -
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: e2e-manual-
  namespace: e2e-tests
spec:
  pipelineRef:
    name: e2e-knowledge-tuning-single-pod
  serviceAccountName: e2e-pipeline-sa
  params:
    - name: test-profile
      value: "minimal"
  podTemplate:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      nvidia.com/gpu.present: "true"
EOF

# 3. Watch progress of the most recent run
tkn pipelinerun logs --last -f -n e2e-tests
```
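
Because the manifest uses `generateName`, the final run name is only known once the object exists. The following is a minimal sketch for capturing that name and waiting on the run, assuming the `e2e-tests` namespace used above. `create_run` and `wait_for_run` are hypothetical helpers, and the `2h` default timeout is an arbitrary placeholder.

```bash
# Create a PipelineRun from a manifest on stdin and print its generated name.
# `oc create -o name` prints "<resource>/<name>", so keep only the part
# after the last slash.
create_run() {
  local ref
  ref="$(oc create -n e2e-tests -o name -f -)" || return 1
  echo "${ref##*/}"
}

# Block until the run reports condition Succeeded=True, or until the timeout.
# A failed run never satisfies this condition, so it also ends in a timeout.
wait_for_run() {
  oc wait "pipelinerun/$1" -n e2e-tests \
    --for=condition=Succeeded --timeout="${2:-2h}"
}
```

With the manifest from step 2 saved to a file (say, a hypothetical `pipelinerun.yaml`), usage could look like `run="$(create_run < pipelinerun.yaml)" && wait_for_run "$run" 90m`.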

## Monitoring

### Tekton CLI

```bash
# List PipelineRuns
tkn pipelinerun list -n e2e-tests

# Watch logs
tkn pipelinerun logs <name> -f -n e2e-tests

# Describe run
tkn pipelinerun describe <name> -n e2e-tests
```

### OpenShift Console

1. Navigate to **Pipelines** → **PipelineRuns**
2. Select namespace: `e2e-tests`
3. Click on a PipelineRun to see task status

### Tekton Dashboard (if installed)

```bash
# Port-forward to dashboard
oc port-forward svc/tekton-dashboard 9097:9097 -n openshift-pipelines

# Open http://localhost:9097
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `git-url` | GitHub repo | Repository to clone |
| `git-revision` | `main` | Branch or tag |
| `test-profile` | `minimal` | Test profile |
| `student-model` | SmolLM2-135M | Model for testing |
| `teacher-model` | SmolLM2-135M | Teacher model |
| `skip-steps` | `""` | Steps to skip |
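
The format of `skip-steps` is not spelled out here. The trigger-script example passes `--skip "1,6"`, which suggests a comma-separated list of step numbers. Under that assumption, a membership check could look like the sketch below. `is_skipped` and `steps_to_run` are illustrative names, not functions taken from the Tasks in this directory, and the sketch does not handle whitespace inside the list.

```bash
# Return 0 if step number $1 appears in the comma-separated list $2.
# Wrapping both sides in commas keeps "1" from matching inside "16".
is_skipped() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Print the steps (1-6) that would still run for a given skip list.
steps_to_run() {
  local step out=""
  for step in 1 2 3 4 5 6; do
    is_skipped "$step" "$1" || out="${out:+$out }$step"
  done
  echo "$out"
}
```

With this sketch, `steps_to_run "1,6"` prints `2 3 4 5`, and an empty list leaves all six steps in place.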

## Retry Configuration

Each Pipeline task has built-in retry, declared alongside its timeout. For example, in the multi-pod Pipeline:

```yaml
tasks:
  - name: step-02-data-processing
    retries: 1  # Retry once on failure
    timeout: "30m"
```

To change the retry count, edit `pipeline-e2e.yaml`.

## Troubleshooting

### Task stuck in Pending

```bash
# Check pod status
oc get pods -n e2e-tests -l tekton.dev/pipelineRun=<name>

# Check events
oc get events -n e2e-tests --sort-by='.lastTimestamp'
```
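
When a Pod cannot be placed, the scheduler records a `FailedScheduling` event that states the reason (for example, no node matching the GPU node selector, or an untolerated taint). As a sketch, the event listing can be narrowed to those events with a field selector. `scheduling_failures` is a hypothetical wrapper, and the namespace defaults to `e2e-tests`.

```bash
# List only FailedScheduling events in the given namespace (default: e2e-tests),
# oldest first, so the most recent scheduling failure is at the bottom.
scheduling_failures() {
  oc get events -n "${1:-e2e-tests}" \
    --field-selector reason=FailedScheduling \
    --sort-by='.lastTimestamp'
}
```

Calling `scheduling_failures` with no arguments checks `e2e-tests`; pass another namespace as the first argument if the resources live elsewhere.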

### GPU not allocated

```bash
# Check GPU node availability
oc get nodes -l nvidia.com/gpu.present=true

# Check pod tolerations
oc describe pod <pod-name> -n e2e-tests
```
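
A node can carry the GPU label and still advertise no schedulable GPUs. The sketch below prints the allocatable GPU count per labeled node, assuming the NVIDIA device plugin exposes GPUs under the `nvidia.com/gpu` resource name. `gpu_capacity` is a hypothetical wrapper.

```bash
# Show each GPU-labeled node with its allocatable nvidia.com/gpu count.
# The dot inside the resource name is escaped so the column path treats
# "nvidia.com/gpu" as a single key.
gpu_capacity() {
  oc get nodes -l nvidia.com/gpu.present=true \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

A `<none>` entry in the `GPUS` column means the node reports no allocatable GPUs, which would leave GPU-requesting Pods unscheduled.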

### View task logs

```bash
# Get specific task logs
tkn taskrun logs <taskrun-name> -n e2e-tests
```

## Cleanup

```bash
# Delete old PipelineRuns (keep last 5)
tkn pipelinerun delete -n e2e-tests --keep 5

# Delete all resources: Pipelines and Tasks first (whichever mode was applied),
# then the Namespace, PVCs, and ServiceAccount
oc delete --ignore-not-found -f pipeline-single-pod.yaml -f task-e2e-single-pod.yaml
oc delete --ignore-not-found -f pipeline-e2e.yaml -f task-notebook-runner.yaml
oc delete -f resources.yaml
```
