# Memgraph Operator - Required Fixes

## Background

During debugging of KB pod crashes (2025-12-05), we identified several issues in the memgraph-operator that cause high CPU usage on Memgraph MAIN instances and incorrect health reporting.

## Issue Summary

1. **SHOW REPLICAS output format changed in Memgraph v2.21.0** - Parser is incompatible
2. **Status updates trigger immediate re-reconciliation** - Backoff is ineffective
3. **Multiple queries per reconciliation** - Overwhelms Memgraph when reconciliation is frequent

---
| 14 | + |
| 15 | +## Issue 1: SHOW REPLICAS Parser Incompatibility (HIGH PRIORITY) |
| 16 | + |
| 17 | +### Problem |
| 18 | + |
| 19 | +The `parseShowReplicasOutput` function in `internal/memgraph/client.go:334-368` expects the old Memgraph output format: |
| 20 | + |
| 21 | +``` |
| 22 | +| name | host | port | mode | status | |
| 23 | +``` |
| 24 | + |
| 25 | +But **Memgraph v2.21.0** returns a different format: |
| 26 | + |
| 27 | +``` |
| 28 | +| name | socket_address | sync_mode | system_info | data_info | |
| 29 | +``` |
| 30 | + |
| 31 | +The parser reads `parts[5]` (0-indexed) as the status field, but that's now `data_info` which contains `{}` (empty JSON object). This doesn't match "ready" or "replicating", so all replicas are incorrectly marked as **unhealthy**. |
| 32 | + |
### Evidence

```
Actual output:
| "oteldemo1_scout_kb_graph_0" | "oteldemo1-scout-kb-graph-0...10000" | "async" | Null | {} |

Operator log:
{"level":"warn","msg":"unhealthy replica detected","replica":"oteldemo1_scout_kb_graph_0","status":"{}"}
```

### Impact

- All replicas are reported as unhealthy even when replication is working correctly
- Triggers frequent reconciliation attempts
- Emits spurious warning events
### Required Changes

**File:** `internal/memgraph/client.go`

1. Update `parseShowReplicasOutput` to handle the new column format:
   - Column 1: `name`
   - Column 2: `socket_address` (contains host:port combined)
   - Column 3: `sync_mode`
   - Column 4: `system_info` (JSON with system state)
   - Column 5: `data_info` (JSON with replication state)

2. Parse the `system_info` and `data_info` JSON fields to determine replica health:
   - `system_info` may contain: `{ts: <timestamp>, behind: <count>, status: "ready"|"syncing"|...}`
   - `data_info` contains per-database replication state

3. Update the `ReplicaInfo` struct to include the new fields:
   ```go
   type ReplicaInfo struct {
       Name          string
       SocketAddress string                 // Combined host:port
       SyncMode      string
       SystemInfo    *ReplicaSystemInfo     // Parsed JSON
       DataInfo      map[string]interface{} // Parsed JSON
   }

   type ReplicaSystemInfo struct {
       Timestamp int64  `json:"ts"`
       Behind    int64  `json:"behind"`
       Status    string `json:"status"`
   }
   ```

4. Update health check logic in `internal/controller/replication.go:222-233` to use the parsed SystemInfo:
   ```go
   // Prefer SystemInfo.Status when available; otherwise treat a
   // populated DataInfo as a sign that replication is active.
   status := "unknown"
   if replica.SystemInfo != nil {
       status = strings.ToLower(replica.SystemInfo.Status)
   } else if len(replica.DataInfo) > 0 {
       status = "ready"
   }
   ```

**File:** `internal/memgraph/client_test.go`

5. Update `TestParseShowReplicasOutput` with test cases for the new format

### Testing

- Test with the Memgraph v2.21.0 output format
- Verify backward compatibility with older Memgraph versions if needed
- Confirm replicas show as healthy when replication is working
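
The row parsing described in steps 1-3 could be sketched as follows. This is a standalone sketch, not the operator's actual code: `parseReplicaRow`, its return shape, the sample addresses, and the handling of the literal `Null` are assumptions based on the observed v2.21.0 output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ReplicaSystemInfo mirrors the struct proposed above.
type ReplicaSystemInfo struct {
	Timestamp int64  `json:"ts"`
	Behind    int64  `json:"behind"`
	Status    string `json:"status"`
}

// parseReplicaRow parses one pipe-delimited row of the v2.21.0 format:
// | name | socket_address | sync_mode | system_info | data_info |
// With the leading pipe, strings.Split yields an empty parts[0], so
// fields start at index 1.
func parseReplicaRow(row string) (name, addr, mode string, sys *ReplicaSystemInfo, err error) {
	parts := strings.Split(row, "|")
	if len(parts) < 6 {
		return "", "", "", nil, fmt.Errorf("expected 5 columns, got %d", len(parts)-1)
	}
	trim := func(s string) string { return strings.Trim(strings.TrimSpace(s), `"`) }
	name, addr, mode = trim(parts[1]), trim(parts[2]), trim(parts[3])
	// system_info may be the literal Null, as in the evidence above.
	if raw := trim(parts[4]); raw != "" && raw != "Null" {
		sys = &ReplicaSystemInfo{}
		if jerr := json.Unmarshal([]byte(raw), sys); jerr != nil {
			return "", "", "", nil, jerr
		}
	}
	return name, addr, mode, sys, nil
}

func main() {
	row := `| "replica_0" | "replica-0.svc:10000" | "async" | Null | {} |`
	name, addr, mode, sys, err := parseReplicaRow(row)
	fmt.Println(name, addr, mode, sys == nil, err == nil)
	// replica_0 replica-0.svc:10000 async true true
}
```

Keeping the old-format branch alongside this one (selected by column count) would preserve backward compatibility with pre-v2.21 servers.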

---

## Issue 2: Status Updates Trigger Immediate Re-reconciliation (MEDIUM PRIORITY)

### Problem

Every reconciliation ends with `r.Status().Update(ctx, cluster)` at `internal/controller/memgraphcluster_controller.go:516`.

In controller-runtime, updating the status of a watched resource triggers a watch event, which schedules an immediate reconciliation. This bypasses the `RequeueAfter` delay we return.

### Evidence

```
Expected: Reconciliation every 30 seconds (requeueAfterLong)
Actual:   Reconciliation every ~1-7 seconds
```

The backoff fix added at lines 273-279 is ineffective because:
1. We return `ctrl.Result{RequeueAfter: 30*time.Second}`
2. But then `updateStatus()` modifies the cluster
3. The status update triggers a watch event
4. Controller-runtime schedules reconciliation immediately

### Impact

- 4-8x more reconciliations than intended
- Each reconciliation makes multiple Memgraph queries
- Overwhelms the MAIN instance's CPU

### Required Changes

**File:** `internal/controller/memgraphcluster_controller.go`

Option A: **Skip status update when nothing changed**
```go
func (r *MemgraphClusterReconciler) updateStatus(...) error {
    // Compare old vs new status
    if reflect.DeepEqual(oldStatus, newStatus) {
        return nil // Skip update if nothing changed
    }
    return r.Status().Update(ctx, cluster)
}
```

Option B: **Use rate limiting / debouncing**
- Track the last status update time
- Only update status if more than N seconds have passed since the last update
- Or use controller-runtime's built-in rate limiting

Option C: **Separate status update frequency from reconciliation frequency**
- Only update status on significant changes
- Use annotations or a separate mechanism for tracking reconciliation state

### Recommended Approach

Option A is the simplest: implement a deep comparison of the status before updating. Most reconciliations don't change the status, so this eliminates most spurious updates.

---

## Issue 3: Multiple Memgraph Queries Per Reconciliation (MEDIUM PRIORITY)

### Problem

Each reconciliation makes 4+ queries to the MAIN instance:

1. `ensureMainRole()` → `GetReplicationRole()` (line 135 in replication.go)
2. `ConfigureReplication()` → `ShowReplicas()` (line 63)
3. For each replica: `ensureReplicaRole()` → `GetReplicationRole()` (line 160)
4. `CheckReplicationHealth()` → `ShowReplicas()` (line 212)
5. `collectStorageMetrics()` → `GetStorageInfo()` for each pod

With 2 replicas and frequent reconciliation, this is ~6 queries every few seconds.

### Impact

- High query load on the MAIN instance
- CPU saturation when combined with the actual workload
- Slow response times for application queries

### Required Changes

**File:** `internal/controller/memgraphcluster_controller.go`

1. **Cache query results within a reconciliation:**
   ```go
   type reconcileContext struct {
       replicationRole map[string]string // podName -> role
       replicas        []ReplicaInfo
       storageInfo     map[string]*StorageInfo
   }
   ```

2. **Reduce redundant queries:**
   - `ShowReplicas()` is called twice - once in `ConfigureReplication` and once in `CheckReplicationHealth`. Call it once and reuse the result.
   - `GetReplicationRole()` is called for MAIN and each REPLICA. These calls could be batched or cached.

3. **Make storage metrics collection optional or less frequent:**
   - Only collect every N reconciliations
   - Or make it configurable via the CRD spec
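
The caching idea in step 1 could be sketched as follows. This is a standalone illustration: `queryClient`, the stub types, and the lazy `Replicas()` accessor are assumptions, not the operator's actual API.

```go
package main

import "fmt"

// ReplicaInfo stands in for the real client type.
type ReplicaInfo struct{ Name string }

// queryClient is a hypothetical narrow interface over the Memgraph client.
type queryClient interface {
	ShowReplicas() ([]ReplicaInfo, error)
}

// reconcileContext caches query results for the duration of one
// reconciliation, so ConfigureReplication and CheckReplicationHealth
// share a single SHOW REPLICAS query.
type reconcileContext struct {
	client   queryClient
	replicas []ReplicaInfo
	fetched  bool
}

// Replicas returns the cached SHOW REPLICAS result, querying at most once.
func (rc *reconcileContext) Replicas() ([]ReplicaInfo, error) {
	if rc.fetched {
		return rc.replicas, nil
	}
	out, err := rc.client.ShowReplicas()
	if err != nil {
		return nil, err // don't cache failures; retry on the next access
	}
	rc.replicas, rc.fetched = out, true
	return out, nil
}

// countingClient counts how often the underlying query actually runs.
type countingClient struct{ calls *int }

func (c countingClient) ShowReplicas() ([]ReplicaInfo, error) {
	*c.calls++
	return []ReplicaInfo{{Name: "replica_0"}, {Name: "replica_1"}}, nil
}

func main() {
	calls := 0
	rc := &reconcileContext{client: countingClient{calls: &calls}}
	rc.Replicas() // ConfigureReplication path
	rc.Replicas() // CheckReplicationHealth path hits the cache
	fmt.Println(calls) // 1
}
```

Because the context lives for exactly one reconciliation, there is no staleness concern beyond what a single reconciliation already tolerates.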

**File:** `internal/controller/replication.go`

4. Refactor `ConfigureReplication` to return replica info for reuse:
   ```go
   func (rm *ReplicationManager) ConfigureReplication(...) ([]ReplicaInfo, error) {
       replicas, err := rm.client.ShowReplicas(...)
       // ... configure replication ...
       return replicas, nil
   }
   ```

5. Remove the separate `CheckReplicationHealth` call and integrate it into `ConfigureReplication`
---

## Issue 4: Validation Test Writes to Database (LOW PRIORITY)

### Problem

The `testReplicationLag` function in `internal/controller/validation.go:105-172` performs invasive health checks:
- Writes a test node to MAIN
- Reads it from each REPLICA
- Deletes the test node

### Impact

- Adds write load to the database during health checks
- Could interfere with the actual workload
- Leaves orphan data if cleanup fails

### Required Changes

**File:** `internal/controller/validation.go`

1. Consider using read-only health checks:
   - Query node/edge counts
   - Compare counts between MAIN and REPLICAs
   - Use Memgraph's built-in replication lag metrics if available
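
A read-only comparison could be sketched like this. The sketch is standalone: `counter` stubs out whatever call would actually run a count query (such as `MATCH (n) RETURN count(n)`) against one instance, and all names are illustrative.

```go
package main

import "fmt"

// counter stands in for a call that runs a read-only node-count query
// against one instance endpoint.
type counter func(endpoint string) (int64, error)

// laggingReplicas compares MAIN's node count against each replica's and
// returns the endpoints that disagree. A mismatch may just mean the
// replica is momentarily behind, so callers should tolerate brief
// windows rather than alerting immediately.
func laggingReplicas(main string, replicas []string, count counter) ([]string, error) {
	want, err := count(main)
	if err != nil {
		return nil, err
	}
	var lagging []string
	for _, r := range replicas {
		got, err := count(r)
		if err != nil || got != want {
			lagging = append(lagging, r)
		}
	}
	return lagging, nil
}

func main() {
	counts := map[string]int64{"main": 100, "replica-0": 100, "replica-1": 97}
	stub := func(ep string) (int64, error) { return counts[ep], nil }
	lagging, _ := laggingReplicas("main", []string{"replica-0", "replica-1"}, stub)
	fmt.Println(lagging) // [replica-1]
}
```

Unlike the write-test, this adds no write load and cannot leave orphan data behind.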

2. Make validation configurable:
   ```yaml
   spec:
     validation:
       enabled: true
       mode: "readonly" # or "write-test"
       interval: 60s
   ```

3. Add a circuit breaker for validation:
   - Skip validation if the previous N attempts failed
   - Reduce frequency when the cluster is under load
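
The circuit breaker in step 3 could be sketched as follows; `validationBreaker` and its thresholds are illustrative, not existing operator code.

```go
package main

import "fmt"

// validationBreaker skips the expensive validation pass after maxFailures
// consecutive failures, then lets one probe attempt through every
// skipBudget reconciliations to detect recovery.
type validationBreaker struct {
	maxFailures int
	skipBudget  int
	failures    int
	skipped     int
}

// allow reports whether validation should run this reconciliation.
func (b *validationBreaker) allow() bool {
	if b.failures < b.maxFailures {
		return true
	}
	b.skipped++
	if b.skipped >= b.skipBudget {
		// half-open: let one attempt through to probe recovery
		b.skipped = 0
		return true
	}
	return false
}

// record feeds the outcome of a validation attempt back into the breaker.
func (b *validationBreaker) record(ok bool) {
	if ok {
		b.failures = 0
	} else {
		b.failures++
	}
}

func main() {
	b := &validationBreaker{maxFailures: 3, skipBudget: 5}
	for i := 0; i < 3; i++ {
		b.allow()
		b.record(false) // three consecutive failures trip the breaker
	}
	fmt.Println(b.allow()) // false: breaker is open, validation skipped
}
```

The "reduce frequency under load" variant would feed a load signal into `allow` instead of (or in addition to) the failure count.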

---

## Implementation Order

1. **Issue 1** (SHOW REPLICAS parser) - Must be fixed first; it makes every other issue worse
2. **Issue 2** (Status updates triggering reconciliation) - Significant impact reduction
3. **Issue 3** (Query reduction) - Optimization
4. **Issue 4** (Validation) - Nice to have

## Estimated Effort

| Issue | Effort | Files Changed |
|-------|--------|---------------|
| Issue 1 | 2-3 hours | client.go, client_test.go, replication.go |
| Issue 2 | 1-2 hours | memgraphcluster_controller.go |
| Issue 3 | 2-3 hours | memgraphcluster_controller.go, replication.go |
| Issue 4 | 1-2 hours | validation.go |

**Total: ~6-10 hours**

## Version Compatibility

After fixing Issue 1, consider:
- Testing with Memgraph v2.18, v2.19, v2.20, and v2.21
- The SHOW REPLICAS format may have changed between these versions
- Version detection and format-specific parsing may be needed