Commit e790594: refactoring
1 parent 1475566 commit e790594

File tree: 4 files changed, +224 -62 lines changed

REQUIREMENTS.md

Lines changed: 157 additions & 7 deletions
@@ -98,13 +98,163 @@ All tests passing ✅
 
 ## Moving from direct scenario creation to CRD approach
 
-- I want to change the current implementation of the KRKN job creation: I want a CRD called `KrknChaosJob` that holds
-  all the details currently defined in the `createScenarioJob` method in internal/api/handlers.go to instantiate a job
-- I'd like the reconcile loop to keep track of the Job status and update the `KrknChaosJob` accordingly
-- I want a controller able to reconcile the `KrknChaosJob` and instantiate the chaos job as `createScenarioJob` does
-- I want the current /scenarios/run method to create the new CR `KrknChaosJob` and return the job uuid
-- I want `GetScenarioRunStatus` to eventually be adapted to this new behaviour
+### ✅ Implemented
+- Changed from direct Pod creation to a CRD-based approach with `KrknScenarioRun`
+- Controller reconciles `KrknScenarioRun` and creates jobs for each target cluster
+- API endpoints updated to use `scenarioRunName` as the primary identifier
+- Multi-cluster support with aggregated status
+
+### API Endpoints Structure (Nested Approach)
+
+```
+POST /api/v1/scenarios/run
+  → Creates KrknScenarioRun CR
+  → Returns: {scenarioRunName, clusterNames, totalTargets}
+
+GET /api/v1/scenarios/run/{scenarioRunName}
+  → Returns aggregated status with a list of clusterJobs (each with a jobId)
+
+GET /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}
+  → Returns status of a single job (TODO)
+
+GET /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs
+  → WebSocket stream of logs for a specific job (TODO - currently uses clusterName)
+
+DELETE /api/v1/scenarios/run/{scenarioRunName}
+  → Deletes the entire scenario run (all jobs)
+
+DELETE /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}
+  → Terminates a single job (TODO)
+```
+
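The nested flow above can be exercised end to end: create a run, then poll it by the returned `scenarioRunName`. A minimal client sketch, assuming the server listens on localhost:8080; the request body fields (`scenarioName`, `targetClusters`) are illustrative assumptions, only the response shape comes from this commit:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// RunResponse mirrors the documented response: {scenarioRunName, clusterNames, totalTargets}.
type RunResponse struct {
	ScenarioRunName string   `json:"scenarioRunName"`
	ClusterNames    []string `json:"clusterNames"`
	TotalTargets    int      `json:"totalTargets"`
}

func main() {
	// Hypothetical request body; the actual schema is not shown in this commit.
	body, _ := json.Marshal(map[string]any{
		"scenarioName":   "pod-scenarios",
		"targetClusters": []string{"cluster-a", "cluster-b"},
	})
	resp, err := http.Post("http://localhost:8080/api/v1/scenarios/run",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var run RunResponse
	if err := json.NewDecoder(resp.Body).Decode(&run); err != nil {
		panic(err)
	}
	// The aggregated status lives at /api/v1/scenarios/run/{scenarioRunName}.
	fmt.Printf("created %s targeting %d clusters\n", run.ScenarioRunName, run.TotalTargets)
}
```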
+### 🔧 TODO: Pod Recreation and Retry Logic
+
+#### Current Behavior
+- Controller creates one pod per cluster when KrknScenarioRun is created
+- No automatic retry on pod failure
+- No distinction between user-initiated deletion and failure
+
+#### Requirements
+
+**1. Automatic Retry on Failure**
+- When a pod fails (phase=Failed), the controller should retry creating a new pod
+- Maximum number of retry attempts should be configurable (suggested: 3)
+- Retry attempts should be tracked in ClusterJobStatus
+- Exponential backoff between retries (suggested: 10s, 30s, 60s; see the sketch below)
+
+**2. Manual Cancellation vs Failure**
+- User-initiated job deletion (DELETE /jobs/{jobId}) should NOT trigger a retry
+- Need to distinguish between "pod failed" and "user cancelled"
+- Proposed solution: add a field to ClusterJobStatus to track cancellation intent
+
+**3. Job Lifecycle States**
+```
+Pending → Running → Succeeded (terminal)
+                  → Failed → Retrying → Running → ...
+                  → Cancelled (terminal, no retry)
+                  → MaxRetriesExceeded (terminal)
+```
+
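A minimal sketch of the suggested 10s/30s/60s schedule, assuming the controller applies it by requeueing the reconcile request after a delay (the usual controller-runtime pattern) rather than sleeping; the helper name is illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns the wait before retry attempt n (1-based), following the
// suggested 10s/30s/60s schedule and capping at the final step.
func retryDelay(attempt int) time.Duration {
	schedule := []time.Duration{10 * time.Second, 30 * time.Second, 60 * time.Second}
	switch {
	case attempt < 1:
		attempt = 1
	case attempt > len(schedule):
		attempt = len(schedule)
	}
	return schedule[attempt-1]
}

func main() {
	// With maxRetries=3 the controller would wait 10s, 30s, then 60s.
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("retry %d after %s\n", attempt, retryDelay(attempt))
	}
}
```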
+#### Proposed Solution Options
+
+##### Option A: Cancellation Field in Status (Recommended)
+```go
+type ClusterJobStatus struct {
+	ClusterName string
+	JobId       string
+	PodName     string
+	Phase       string // Pending, Running, Succeeded, Failed, Cancelled, MaxRetriesExceeded
+
+	// NEW FIELDS
+	RetryCount      int          `json:"retryCount,omitempty"`
+	MaxRetries      int          `json:"maxRetries,omitempty"` // Default: 3
+	CancelRequested bool         `json:"cancelRequested,omitempty"`
+	LastRetryTime   *metav1.Time `json:"lastRetryTime,omitempty"`
+
+	StartTime      *metav1.Time
+	CompletionTime *metav1.Time
+	Message        string
+}
+```
+
+**Controller Logic**:
+```go
+// In updateClusterJobStatuses()
+if pod.Status.Phase == corev1.PodFailed {
+	if job.CancelRequested {
+		job.Phase = "Cancelled" // Terminal, no retry
+	} else if job.RetryCount < job.MaxRetries {
+		job.Phase = "Retrying"
+		job.RetryCount++
+		job.LastRetryTime = now
+		// Create new pod with new jobId
+		createClusterJob(ctx, scenarioRun, clusterName)
+	} else {
+		job.Phase = "MaxRetriesExceeded" // Terminal
+	}
+}
+```
+
+**DELETE /jobs/{jobId} Handler**:
+```go
+func (h *Handler) DeleteJob(w http.ResponseWriter, r *http.Request) {
+	// 1. Find KrknScenarioRun containing this jobId
+	// 2. Set job.CancelRequested = true in status
+	// 3. Delete the pod
+	// 4. Controller sees CancelRequested → does NOT retry
+}
+```
+
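The handler above is only step comments; a fuller sketch of Option A's cancellation path, assuming the controller-runtime client and `h.namespace` field already present in internal/api/handlers.go, the `krkn-job-id` pod label used by the logs endpoint, and the status fields from the struct above (routing that extracts `scenarioRunName`/`jobId` from the path is omitted):

```go
// Sketch only: imports and Handler fields match the handlers.go diff below.
func (h *Handler) deleteJob(w http.ResponseWriter, r *http.Request, scenarioRunName, jobId string) {
	ctx := r.Context()

	// 1. Find the KrknScenarioRun that owns this jobId.
	var run krknv1alpha1.KrknScenarioRun
	if err := h.client.Get(ctx, client.ObjectKey{Name: scenarioRunName, Namespace: h.namespace}, &run); err != nil {
		http.Error(w, "scenario run not found", http.StatusNotFound)
		return
	}

	// 2. Record cancellation intent in status so the controller skips the retry.
	found := false
	for i := range run.Status.ClusterJobs {
		if run.Status.ClusterJobs[i].JobId == jobId {
			run.Status.ClusterJobs[i].CancelRequested = true
			found = true
			break
		}
	}
	if !found {
		http.Error(w, "job not found", http.StatusNotFound)
		return
	}
	if err := h.client.Status().Update(ctx, &run); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// 3. Delete the pod(s) labeled with this jobId; the controller observes
	//    CancelRequested and marks the job Cancelled instead of retrying.
	if err := h.client.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(h.namespace), client.MatchingLabels{"krkn-job-id": jobId}); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// 4. Return success.
	w.WriteHeader(http.StatusAccepted)
}
```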
+##### Option B: Finalizers for Cancellation Tracking
+- Add finalizer `krkn.krkn-chaos.dev/job-cancellation` to ClusterJobStatus
+- When user deletes job, add finalizer before deleting pod
+- Controller checks for finalizer → skips retry
+- More complex, but leverages K8s patterns
+
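Since finalizers live in object metadata rather than in a status struct, this variant would presumably carry the finalizer on the chaos pod that ClusterJobStatus points at. A sketch using controller-runtime's controllerutil helpers (the two function names are illustrative, and the bool returns assume controller-runtime v0.12+):

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const cancelFinalizer = "krkn.krkn-chaos.dev/job-cancellation"

// markCancelled is what the DELETE handler would call: tag the pod, then delete it.
func markCancelled(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	if controllerutil.AddFinalizer(pod, cancelFinalizer) {
		if err := c.Update(ctx, pod); err != nil {
			return err
		}
	}
	return c.Delete(ctx, pod)
}

// onPodTerminating runs in the reconcile loop: a terminating pod carrying the
// finalizer was cancelled by the user, so release it and report "no retry".
func onPodTerminating(ctx context.Context, c client.Client, pod *corev1.Pod) (retry bool, err error) {
	cancelled := controllerutil.ContainsFinalizer(pod, cancelFinalizer)
	if controllerutil.RemoveFinalizer(pod, cancelFinalizer) {
		if err := c.Update(ctx, pod); err != nil {
			return false, err
		}
	}
	return !cancelled, nil
}
```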
+##### Option C: Separate CancellationRequest CR
+- Create a new CRD `KrknJobCancellation` to track cancellation intent
+- Controller watches both KrknScenarioRun and KrknJobCancellation
+- More decoupled, but adds complexity
+
+#### Implementation Plan (TODO)
+
+1. **Phase 1: Add Retry Fields to CRD**
+   - Update `ClusterJobStatus` with retry tracking fields
+   - Add `maxRetries` to `KrknScenarioRunSpec` (default: 3)
+   - Regenerate manifests
+
+2. **Phase 2: Implement Retry Logic in Controller**
+   - Detect pod failure vs cancellation
+   - Implement retry with exponential backoff
+   - Update job phase to reflect retry state
+   - Create new pod with new jobId on retry
+
+3. **Phase 3: Add DELETE /jobs/{jobId} Endpoint**
+   - Parse scenarioRunName and jobId from path
+   - Set CancelRequested flag in CR status
+   - Delete pod
+   - Return success
+
+4. **Phase 4: Update GET /jobs/{jobId} and Logs Endpoints**
+   - Change the logs key from `{clusterName}` to `{jobId}`
+   - Support nested path: `/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs`
+   - Add endpoint: `GET /scenarios/run/{scenarioRunName}/jobs/{jobId}` for single job status
+
+#### Configuration
+
+Add to KrknScenarioRunSpec:
+```yaml
+apiVersion: krkn.krkn-chaos.dev/v1alpha1
+kind: KrknScenarioRun
+spec:
+  # ... existing fields ...
+
+  # Retry configuration
+  maxRetries: 3             # Default: 3, set to 0 to disable retry
+  retryBackoff: exponential # exponential or fixed
+  retryDelay: 10s           # Initial delay for exponential, fixed delay for fixed
+```
 
 ### Overview
-TODO
+The CRD-based approach provides better state management, automatic reconciliation, and improved observability compared to direct Pod creation.
 
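For reference, the YAML above might map onto the spec roughly as follows. This is a sketch only: the kubebuilder markers and the `metav1.Duration` type are the usual conventions for such fields, not taken from this commit.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type KrknScenarioRunSpec struct {
	// ... existing fields ...

	// MaxRetries is the per-job retry budget; 0 disables retries.
	// +kubebuilder:default=3
	// +optional
	MaxRetries int `json:"maxRetries,omitempty"`

	// RetryBackoff selects the schedule: "exponential" or "fixed".
	// +kubebuilder:validation:Enum=exponential;fixed
	// +optional
	RetryBackoff string `json:"retryBackoff,omitempty"`

	// RetryDelay is the initial delay for exponential backoff, or the fixed delay.
	// +optional
	RetryDelay *metav1.Duration `json:"retryDelay,omitempty"`
}
```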

config/rbac/role.yaml

Lines changed: 27 additions & 0 deletions
@@ -44,3 +44,30 @@ rules:
   - get
   - patch
   - update
+- apiGroups:
+  - krkn.krkn-chaos.dev
+  resources:
+  - krknoperatortargets
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - krkn.krkn-chaos.dev
+  resources:
+  - krkntargetrequests
+  verbs:
+  - create
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - krkn.krkn-chaos.dev
+  resources:
+  - krkntargetrequests/status
+  verbs:
+  - get
+  - patch
+  - update
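In kubebuilder-based operators, role.yaml is normally regenerated from RBAC markers in the controller source; markers of roughly this shape would yield the new rules. Placement and grouping here are an assumption, since the markers themselves are not part of this diff.

```go
// Assumed markers, typically placed above the controller's Reconcile method
// and rendered into config/rbac/role.yaml by `make manifests`.
//+kubebuilder:rbac:groups=krkn.krkn-chaos.dev,resources=krknoperatortargets,verbs=get;list;watch
//+kubebuilder:rbac:groups=krkn.krkn-chaos.dev,resources=krkntargetrequests,verbs=create;get;list;patch;update;watch
//+kubebuilder:rbac:groups=krkn.krkn-chaos.dev,resources=krkntargetrequests/status,verbs=get;patch;update
```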

config/rbac/role_binding.yaml

Lines changed: 2 additions & 3 deletions
@@ -1,14 +1,13 @@
 apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
+kind: ClusterRoleBinding
 metadata:
   labels:
     app.kubernetes.io/name: krkn-operator
     app.kubernetes.io/managed-by: kustomize
   name: manager-rolebinding
-  namespace: system
 roleRef:
   apiGroup: rbac.authorization.k8s.io
-  kind: Role
+  kind: ClusterRole
   name: manager-role
 subjects:
 - kind: ServiceAccount

internal/api/handlers.go

Lines changed: 38 additions & 52 deletions
@@ -1048,7 +1048,7 @@ var upgrader = websocket.Upgrader{
 	},
 }
 
-// GetScenarioRunLogs handles GET /api/v1/scenarios/run/{scenarioRunName}/logs/{clusterName} endpoint
+// GetScenarioRunLogs handles GET /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs endpoint
 // It streams the stdout/stderr logs of a running or completed job via WebSocket
 func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 	logger := log.Log.WithName("websocket-logs")
@@ -1064,8 +1064,8 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 	}
 	defer conn.Close()
 
-	// Extract scenarioRunName and clusterName from path
-	// Path format: /api/v1/scenarios/run/{scenarioRunName}/logs/{clusterName}
+	// Extract scenarioRunName and jobId from path
+	// Path format: /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs
 	path := r.URL.Path
 	prefix := "/api/v1/scenarios/run/"
 
@@ -1078,68 +1078,54 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 	// Remove prefix
 	remainder := path[len(prefix):]
 
-	// Split by "/logs/"
-	parts := strings.Split(remainder, "/logs/")
+	// Split by "/jobs/" and "/logs"
+	parts := strings.Split(remainder, "/jobs/")
 	if len(parts) != 2 {
 		logger.Error(nil, "Invalid logs endpoint path format", "path", path)
-		conn.WriteMessage(websocket.TextMessage, []byte("ERROR: Invalid path format. Expected: /api/v1/scenarios/run/{scenarioRunName}/logs/{clusterName}"))
+		conn.WriteMessage(websocket.TextMessage, []byte("ERROR: Invalid path format. Expected: /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs"))
 		return
 	}
 
 	scenarioRunName := parts[0]
-	clusterName := parts[1]
+	jobIdAndLogs := parts[1]
 
-	if scenarioRunName == "" || clusterName == "" {
-		logger.Error(nil, "Empty scenarioRunName or clusterName in request path", "path", path)
-		conn.WriteMessage(websocket.TextMessage, []byte("ERROR: scenarioRunName and clusterName cannot be empty"))
+	// Extract jobId (remove "/logs" suffix)
+	if !strings.HasSuffix(jobIdAndLogs, "/logs") {
+		logger.Error(nil, "Invalid logs endpoint path format", "path", path)
+		conn.WriteMessage(websocket.TextMessage, []byte("ERROR: Invalid path format. Expected: /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs"))
 		return
 	}
 
-	logger.Info("WebSocket connection established", "scenarioRunName", scenarioRunName, "clusterName", clusterName, "client_ip", r.RemoteAddr)
-
-	ctx := context.Background()
+	jobId := strings.TrimSuffix(jobIdAndLogs, "/logs")
 
-	// Fetch the KrknScenarioRun CR
-	var scenarioRun krknv1alpha1.KrknScenarioRun
-	if err := h.client.Get(ctx, client.ObjectKey{
-		Name:      scenarioRunName,
-		Namespace: h.namespace,
-	}, &scenarioRun); err != nil {
-		logger.Error(err, "Failed to fetch scenario run", "scenarioRunName", scenarioRunName)
-		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Scenario run '%s' not found", scenarioRunName)))
+	if scenarioRunName == "" || jobId == "" {
+		logger.Error(nil, "Empty scenarioRunName or jobId in request path", "path", path)
+		conn.WriteMessage(websocket.TextMessage, []byte("ERROR: scenarioRunName and jobId cannot be empty"))
 		return
 	}
 
-	// Find the job for the requested cluster
-	var jobId, podName string
-	for _, job := range scenarioRun.Status.ClusterJobs {
-		if job.ClusterName == clusterName {
-			jobId = job.JobId
-			podName = job.PodName
-			break
-		}
-	}
+	logger.Info("WebSocket connection established", "scenarioRunName", scenarioRunName, "jobId", jobId, "client_ip", r.RemoteAddr)
 
-	if jobId == "" {
-		logger.Error(nil, "Cluster not found in scenario run",
-			"scenarioRunName", scenarioRunName,
-			"clusterName", clusterName)
-		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Cluster '%s' not found in scenario run '%s'", clusterName, scenarioRunName)))
+	ctx := context.Background()
+
+	// Find pod by jobId label (no need to fetch the CR)
+	var podList corev1.PodList
+	if err := h.client.List(ctx, &podList, client.InNamespace(h.namespace), client.MatchingLabels{
+		"krkn-job-id": jobId,
+	}); err != nil {
+		logger.Error(err, "Failed to list pods", "jobId", jobId)
+		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Failed to list pods: %s", err.Error())))
 		return
 	}
 
-	// Fetch the pod
-	var pod corev1.Pod
-	if err := h.client.Get(ctx, client.ObjectKey{
-		Name:      podName,
-		Namespace: h.namespace,
-	}, &pod); err != nil {
-		logger.Error(err, "Failed to fetch pod", "podName", podName)
-		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Pod '%s' not found", podName)))
+	if len(podList.Items) == 0 {
+		logger.Error(nil, "Job not found", "jobId", jobId)
+		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Job with ID '%s' not found", jobId)))
 		return
 	}
 
-	logger.Info("Found pod for cluster", "scenarioRunName", scenarioRunName, "clusterName", clusterName, "podName", pod.Name, "podPhase", pod.Status.Phase)
+	pod := podList.Items[0]
+	logger.Info("Found pod for job", "scenarioRunName", scenarioRunName, "jobId", jobId, "podName", pod.Name, "podPhase", pod.Status.Phase)
 
 	// Parse query parameters
 	follow := r.URL.Query().Get("follow") == "true"
@@ -1163,7 +1149,7 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 
 	logger.Info("Opening log stream",
 		"scenarioRunName", scenarioRunName,
-		"clusterName", clusterName,
+		"jobId", jobId,
 		"podName", pod.Name,
 		"follow", follow,
 		"timestamps", timestamps)
@@ -1174,15 +1160,15 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 	if err != nil {
 		logger.Error(err, "Failed to open log stream",
 			"scenarioRunName", scenarioRunName,
-			"clusterName", clusterName,
+			"jobId", jobId,
 			"podName", pod.Name,
 			"namespace", h.namespace)
 		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Failed to open log stream: %s", err.Error())))
 		return
 	}
 	defer stream.Close()
 
-	logger.Info("Streaming logs started", "scenarioRunName", scenarioRunName, "clusterName", clusterName, "podName", pod.Name)
+	logger.Info("Streaming logs started", "scenarioRunName", scenarioRunName, "jobId", jobId, "podName", pod.Name)
 
 	// Read logs line by line and send via WebSocket
 	scanner := bufio.NewScanner(stream)
@@ -1193,7 +1179,7 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 		if err != nil {
 			logger.Error(err, "Failed to write log line to WebSocket, client likely disconnected",
 				"scenarioRunName", scenarioRunName,
-				"clusterName", clusterName,
+				"jobId", jobId,
 				"podName", pod.Name,
 				"linesStreamed", lineCount)
 			return
@@ -1205,7 +1191,7 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 	if err := scanner.Err(); err != nil {
 		logger.Error(err, "Log stream scanner error",
 			"scenarioRunName", scenarioRunName,
-			"clusterName", clusterName,
+			"jobId", jobId,
 			"podName", pod.Name,
 			"linesStreamed", lineCount)
 		conn.WriteMessage(websocket.TextMessage, []byte(fmt.Sprintf("ERROR: Log stream error: %s", err.Error())))
@@ -1214,7 +1200,7 @@ func (h *Handler) GetScenarioRunLogs(w http.ResponseWriter, r *http.Request) {
 
 	logger.Info("Log streaming completed",
 		"scenarioRunName", scenarioRunName,
-		"clusterName", clusterName,
+		"jobId", jobId,
 		"podName", pod.Name,
 		"totalLines", lineCount)
 
@@ -1383,8 +1369,8 @@ func (h *Handler) ScenariosRunRouter(w http.ResponseWriter, r *http.Request) {
 	}
 
 	if strings.HasPrefix(path, "/api/v1/scenarios/run/") {
-		// Check for logs endpoint: /api/v1/scenarios/run/{scenarioRunName}/logs/{clusterName}
-		if strings.Contains(path, "/logs/") && r.Method == http.MethodGet {
+		// Check for logs endpoint: /api/v1/scenarios/run/{scenarioRunName}/jobs/{jobId}/logs
+		if strings.Contains(path, "/jobs/") && strings.HasSuffix(path, "/logs") && r.Method == http.MethodGet {
 			h.GetScenarioRunLogs(w, r)
 		} else if r.Method == http.MethodGet {
 			h.GetScenarioRunStatus(w, r)
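The new routing splits on "/jobs/" and trims the "/logs" suffix. A small table-test sketch of that logic, extracted into a hypothetical parseLogsPath helper (the handler itself inlines this parsing):

```go
package api

import (
	"strings"
	"testing"
)

// parseLogsPath mirrors the handler's parsing, shown here only to make the
// accepted and rejected path shapes explicit.
func parseLogsPath(path string) (scenarioRunName, jobId string, ok bool) {
	const prefix = "/api/v1/scenarios/run/"
	if !strings.HasPrefix(path, prefix) {
		return "", "", false
	}
	parts := strings.Split(path[len(prefix):], "/jobs/")
	if len(parts) != 2 || !strings.HasSuffix(parts[1], "/logs") {
		return "", "", false
	}
	scenarioRunName = parts[0]
	jobId = strings.TrimSuffix(parts[1], "/logs")
	return scenarioRunName, jobId, scenarioRunName != "" && jobId != ""
}

func TestParseLogsPath(t *testing.T) {
	cases := []struct {
		path     string
		run, job string
		ok       bool
	}{
		{"/api/v1/scenarios/run/my-run/jobs/abc-123/logs", "my-run", "abc-123", true},
		{"/api/v1/scenarios/run/my-run/logs/cluster-a", "", "", false}, // old shape, rejected
		{"/api/v1/scenarios/run/my-run/jobs/abc-123", "", "", false},   // missing /logs suffix
	}
	for _, c := range cases {
		run, job, ok := parseLogsPath(c.path)
		if run != c.run || job != c.job || ok != c.ok {
			t.Errorf("parseLogsPath(%q) = (%q, %q, %v), want (%q, %q, %v)",
				c.path, run, job, ok, c.run, c.job, c.ok)
		}
	}
}
```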
