| 1 | +# Tekton-based E2E Testing for Knowledge-Tuning |
| 2 | + |
| 3 | +This directory contains Tekton Pipeline resources for running E2E tests on Red Hat OpenShift AI (RHOAI), with Tekton Dashboard visibility and built-in retries. |
| 4 | + |
| 5 | +## Two Execution Modes |
| 6 | + |
| 7 | +### Single-Pod Mode (Default, Recommended) |
| 8 | + |
| 9 | +All 6 notebooks run in a **single Pod** with step-by-step visibility: |
| 10 | + |
| 11 | +```text |
| 12 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 13 | +│ Single Pod (Tekton Task) │ |
| 14 | +├─────────────────────────────────────────────────────────────────────────────┤ |
| 15 | +│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ |
| 16 | +│ │ Step 1 │─▶│ Step 2 │─▶│ Step 3 │─▶│ Step 4 │─▶│ Step 5 │─▶│ Step 6 │ │ |
| 17 | +│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │ |
| 18 | +│ │ |
| 19 | +│ ✅ Fast startup (single Pod) ✅ Tekton Dashboard visibility │ |
| 20 | +│ ✅ Simple debugging (one log stream) ✅ Built-in retry at Task level │ |
| 21 | +│ ✅ Shared GPU/filesystem ✅ No PVC required │ |
| 22 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 23 | +``` |
| 24 | + |
| 25 | +### Multi-Pod Mode (Optional) |
| 26 | + |
| 27 | +Each notebook runs in a **separate Pod** for step isolation: |
| 28 | + |
| 29 | +```text |
| 30 | +┌─────────────────────────────────────────────────────────────────────────────┐ |
| 31 | +│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ |
| 32 | +│ │ Pod 1 │──▶│ Pod 2 │──▶│ Pod 3 │──▶│ Pod 4 │──▶│ Pod 5 │──▶ 6 │ |
| 33 | +│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ |
| 34 | +│ ↓ ↓ ↓ ↓ ↓ │ |
| 35 | +│ ═══════════════ Shared PVC for data/outputs ═══════════════════ │ |
| 36 | +└─────────────────────────────────────────────────────────────────────────────┘ |
| 37 | +``` |
| 38 | + |
| 39 | +## Mode Comparison |
| 40 | + |
| 41 | +| Feature | Single-Pod | Multi-Pod | |
| 42 | +|---------|------------|-----------| |
| 43 | +| Startup time | ✅ Fast | ❌ Slower (Pod per step) | |
| 44 | +| Debugging | ✅ Simple | ❌ Multiple logs | |
| 45 | +| Step retry | ✅ Task level | ✅ Per-step | |
| 46 | +| Dashboard visibility | ✅ Yes | ✅ Yes | |
| 47 | +| Step isolation | ❌ Shared | ✅ Isolated | |
| 48 | +| PVC required | ❌ No | ✅ Yes | |
| 49 | + |
| 50 | +## Files |
| 51 | + |
| 52 | +| File | Description | |
| 53 | +|------|-------------| |
| 54 | +| `resources.yaml` | Namespace, PVCs, ServiceAccount | |
| 55 | +| `task-e2e-single-pod.yaml` | Single-pod Task (all steps in one Pod) | |
| 56 | +| `pipeline-single-pod.yaml` | Pipeline wrapper for single-pod mode | |
| 57 | +| `task-notebook-runner.yaml` | Multi-pod Task (one notebook per call) | |
| 58 | +| `pipeline-e2e.yaml` | Multi-pod Pipeline (6 TaskRuns) | |
| 59 | + |
| 60 | +## Prerequisites |
| 61 | + |
| 62 | +1. **OpenShift Pipelines Operator** installed on the RHOAI cluster |
| 63 | +2. **GPU nodes** available with the label `nvidia.com/gpu.present=true` |
| 64 | +3. **GitHub Secrets** configured: |
| 65 | +   - `OPENSHIFT_SERVER`: API server URL |
| 66 | +   - `OPENSHIFT_TOKEN`: Service account token |
| 67 | + |
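The commands in this README rely on the `oc`, `tkn`, and `gh` CLIs. A minimal sketch of a local preflight check follows; `missing_tools` is a hypothetical helper name and not part of this repository.

```shell
# Hypothetical helper: prints the name of each given command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: list which of the CLIs used in this README are missing.
# missing_tools oc tkn gh
```

An empty result means every listed command was found.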
| 68 | +## Quick Start |
| 69 | + |
| 70 | +### Option 1: Via GitHub Actions |
| 71 | + |
| 72 | +```bash |
| 73 | +# Single-pod mode (default, recommended) |
| 74 | +gh workflow run e2e-tekton.yml \ |
| 75 | + -f profile=minimal \ |
| 76 | + -f mode=single-pod |
| 77 | + |
| 78 | +# Multi-pod mode (step isolation) |
| 79 | +gh workflow run e2e-tekton.yml \ |
| 80 | + -f profile=minimal \ |
| 81 | + -f mode=multi-pod |
| 82 | +``` |
| 83 | + |
| 84 | +### Option 2: Using trigger script |
| 85 | + |
| 86 | +```bash |
| 87 | +# Single-pod mode (default) |
| 88 | +./trigger-pipeline.sh --profile minimal |
| 89 | + |
| 90 | +# Multi-pod mode |
| 91 | +./trigger-pipeline.sh --profile minimal --mode multi-pod |
| 92 | + |
| 93 | +# Standard profile on a feature branch, skipping steps 1 and 6 |
| 94 | +./trigger-pipeline.sh --profile standard --branch feature-branch --skip "1,6" |
| 95 | +``` |
| 96 | + |
| 97 | +### Option 3: Manual Setup on RHOAI |
| 98 | + |
| 99 | +```bash |
| 100 | +# 1. Apply resources (single-pod mode) |
| 101 | +oc apply -f resources.yaml |
| 102 | +oc apply -f task-e2e-single-pod.yaml |
| 103 | +oc apply -f pipeline-single-pod.yaml |
| 104 | + |
| 105 | +# 2. Start a PipelineRun |
| 106 | +cat <<EOF | oc create -f - |
| 107 | +apiVersion: tekton.dev/v1beta1 |
| 108 | +kind: PipelineRun |
| 109 | +metadata: |
| 110 | +  generateName: e2e-manual- |
| 111 | +  namespace: e2e-tests |
| 112 | +spec: |
| 113 | +  pipelineRef: |
| 114 | +    name: e2e-knowledge-tuning-single-pod |
| 115 | +  serviceAccountName: e2e-pipeline-sa |
| 116 | +  params: |
| 117 | +    - name: test-profile |
| 118 | +      value: "minimal" |
| 119 | +  podTemplate: |
| 120 | +    tolerations: |
| 121 | +      - key: "nvidia.com/gpu" |
| 122 | +        operator: "Exists" |
| 123 | +        effect: "NoSchedule" |
| 124 | +    nodeSelector: |
| 125 | +      nvidia.com/gpu.present: "true" |
| 126 | +EOF |
| 127 | + |
| 128 | +# 3. Watch progress |
| 129 | +tkn pipelinerun logs --last -f -n e2e-tests |
| 130 | +``` |
| 131 | + |
| 132 | +## Monitoring |
| 133 | + |
| 134 | +### Tekton CLI |
| 135 | + |
| 136 | +```bash |
| 137 | +# List PipelineRuns |
| 138 | +tkn pipelinerun list -n e2e-tests |
| 139 | + |
| 140 | +# Watch logs |
| 141 | +tkn pipelinerun logs <name> -f -n e2e-tests |
| 142 | + |
| 143 | +# Describe run |
| 144 | +tkn pipelinerun describe <name> -n e2e-tests |
| 145 | +``` |
| 146 | + |
| 147 | +### OpenShift Console |
| 148 | + |
| 149 | +1. Navigate to **Pipelines** → **PipelineRuns** |
| 150 | +2. Select namespace: `e2e-tests` |
| 151 | +3. Click on a PipelineRun to see task status |
| 152 | + |
| 153 | +### Tekton Dashboard (if installed) |
| 154 | + |
| 155 | +```bash |
| 156 | +# Port-forward to dashboard |
| 157 | +oc port-forward svc/tekton-dashboard 9097:9097 -n openshift-pipelines |
| 158 | + |
| 159 | +# Open http://localhost:9097 |
| 160 | +``` |
| 161 | + |
| 162 | +## Parameters |
| 163 | + |
| 164 | +| Parameter | Default | Description | |
| 165 | +|-----------|---------|-------------| |
| 166 | +| `git-url` | GitHub repo | Repository to clone | |
| 167 | +| `git-revision` | `main` | Branch or tag | |
| 168 | +| `test-profile` | `minimal` | Test profile | |
| 169 | +| `student-model` | SmolLM2-135M | Model for testing | |
| 170 | +| `teacher-model` | SmolLM2-135M | Teacher model | |
| 171 | +| `skip-steps` | `""` | Comma-separated step numbers to skip (e.g. `1,6`) | |
| 172 | + |
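The parameters above are passed in the `params` list of a PipelineRun. The sketch below renders such a manifest from shell arguments. `render_pipelinerun` is a hypothetical helper, and it assumes the single-pod Pipeline accepts `git-revision` and `skip-steps` as listed in the table. The `podTemplate` GPU settings from the manual setup are omitted for brevity and would still be needed on a real cluster.

```shell
# Hypothetical helper: prints a PipelineRun manifest for the single-pod Pipeline.
# Usage: render_pipelinerun [test-profile] [git-revision] [skip-steps]
# The function body runs in a subshell so its variables do not leak.
render_pipelinerun() (
  profile="${1:-minimal}"
  revision="${2:-main}"
  skip="${3:-}"
  cat <<EOF
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: e2e-manual-
  namespace: e2e-tests
spec:
  pipelineRef:
    name: e2e-knowledge-tuning-single-pod
  serviceAccountName: e2e-pipeline-sa
  params:
    - name: test-profile
      value: "${profile}"
    - name: git-revision
      value: "${revision}"
    - name: skip-steps
      value: "${skip}"
EOF
)

# Example (uses `oc create` because the manifest relies on generateName):
# render_pipelinerun standard feature-branch "1,6" | oc create -f -
```

Parameters left out of `params` fall back to the Pipeline's defaults, so only the values that differ need to be listed.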
| 173 | +## Retry Configuration |
| 174 | + |
| 175 | +Each Pipeline task is configured with a retry count and a timeout, for example: |
| 176 | + |
| 177 | +```yaml |
| 178 | +tasks: |
| 179 | +  - name: step-02-data-processing |
| 180 | +    retries: 1  # Retry once on failure |
| 181 | +    timeout: "30m" |
| 182 | +``` |
| 183 | + |
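| 184 | +To change the retry count for multi-pod mode, edit `pipeline-e2e.yaml`. In single-pod mode a retry applies to the whole Task, so all steps run again. |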
| 185 | + |
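With `retries: 1`, a task gets one initial attempt plus at most one retry, so it runs at most twice. The sketch below mimics that attempt count in plain shell. It is an illustration only, not how Tekton implements retries, and `run_with_retries` is a hypothetical name.

```shell
# Illustration only: run a command once, then retry up to N more times on failure.
# Usage: run_with_retries <retries> <command> [args...]
run_with_retries() {
  _retries="$1"
  shift
  _attempt=0
  while :; do
    _attempt=$((_attempt + 1))
    if "$@"; then
      echo "succeeded on attempt ${_attempt}"
      return 0
    fi
    # Stop once the initial attempt plus all retries are used up.
    if [ "$_attempt" -gt "$_retries" ]; then
      echo "failed after ${_attempt} attempts"
      return 1
    fi
  done
}

# Example: run_with_retries 1 false   # runs `false` twice, then reports failure
```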
| 186 | +## Troubleshooting |
| 187 | + |
| 188 | +### Task stuck in Pending |
| 189 | + |
| 190 | +```bash |
| 191 | +# Check pod status |
| 192 | +oc get pods -n e2e-tests -l tekton.dev/pipelineRun=<name> |
| 193 | + |
| 194 | +# Check events |
| 195 | +oc get events -n e2e-tests --sort-by='.lastTimestamp' |
| 196 | +``` |
| 197 | + |
| 198 | +### GPU not allocated |
| 199 | + |
| 200 | +```bash |
| 201 | +# Check GPU node availability |
| 202 | +oc get nodes -l nvidia.com/gpu.present=true |
| 203 | + |
| 204 | +# Check pod tolerations |
| 205 | +oc describe pod <pod-name> -n e2e-tests |
| 206 | +``` |
| 207 | + |
| 208 | +### View task logs |
| 209 | + |
| 210 | +```bash |
| 211 | +# Get specific task logs |
| 212 | +tkn taskrun logs <taskrun-name> -n e2e-tests |
| 213 | +``` |
| 214 | + |
| 215 | +## Cleanup |
| 216 | + |
| 217 | +```bash |
| 218 | +# Delete old PipelineRuns (keep last 5) |
| 219 | +tkn pipelinerun delete -n e2e-tests --keep 5 |
| 220 | + |
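| 221 | +# Delete all resources (Pipelines and Tasks first, then the namespace-level resources) |
| 222 | +oc delete -f pipeline-single-pod.yaml -f task-e2e-single-pod.yaml --ignore-not-found |
| 223 | +oc delete -f pipeline-e2e.yaml -f task-notebook-runner.yaml --ignore-not-found |
| 224 | +oc delete -f resources.yaml |
| 225 | +``` |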