Commit 551b142 ("adding tests", parent cd176e3)

FIXES-REQUIRED.md: 276 additions, 0 deletions
# Memgraph Operator - Required Fixes

## Background

During debugging of KB pod crashes (2025-12-05), we identified several issues in the memgraph-operator that cause high CPU usage on Memgraph MAIN instances and incorrect health reporting.

## Issue Summary

1. **SHOW REPLICAS output format changed in Memgraph v2.21.0** - Parser is incompatible
2. **Status updates trigger immediate re-reconciliation** - Backoff is ineffective
3. **Multiple queries per reconciliation** - Overwhelms Memgraph under frequent reconciliation
4. **Validation test writes to the database** - Adds write load during health checks

---
## Issue 1: SHOW REPLICAS Parser Incompatibility (HIGH PRIORITY)

### Problem

The `parseShowReplicasOutput` function in `internal/memgraph/client.go:334-368` expects the old Memgraph output format:

```
| name | host | port | mode | status |
```

But **Memgraph v2.21.0** returns a different format:

```
| name | socket_address | sync_mode | system_info | data_info |
```

The parser reads `parts[5]` (0-indexed) as the status field, but in the new format that column is `data_info`, which contains `{}` (an empty JSON object). Since `{}` matches neither "ready" nor "replicating", all replicas are incorrectly marked as **unhealthy**.

### Evidence

```
Actual output:
| "oteldemo1_scout_kb_graph_0" | "oteldemo1-scout-kb-graph-0...10000" | "async" | Null | {} |

Operator log:
{"level":"warn","msg":"unhealthy replica detected","replica":"oteldemo1_scout_kb_graph_0","status":"{}"}
```

### Impact

- All replicas are reported as unhealthy even when replication is working correctly
- Triggers frequent reconciliation attempts
- Emits spurious warning events
### Required Changes

**File:** `internal/memgraph/client.go`

1. Update `parseShowReplicasOutput` to handle the new column format:
   - Column 1: `name`
   - Column 2: `socket_address` (host and port combined as `host:port`)
   - Column 3: `sync_mode`
   - Column 4: `system_info` (JSON with system state)
   - Column 5: `data_info` (JSON with replication state)

2. Parse the `system_info` and `data_info` JSON fields to determine replica health:
   - `system_info` may contain: `{ts: <timestamp>, behind: <count>, status: "ready"|"syncing"|...}`
   - `data_info` contains per-database replication state

3. Update the `ReplicaInfo` struct to include the new fields:

```go
type ReplicaInfo struct {
	Name          string
	SocketAddress string                 // Combined host:port
	SyncMode      string
	SystemInfo    *ReplicaSystemInfo     // Parsed JSON
	DataInfo      map[string]interface{} // Parsed JSON
}

type ReplicaSystemInfo struct {
	Timestamp int64  `json:"ts"`
	Behind    int64  `json:"behind"`
	Status    string `json:"status"`
}
```

4. Update the health check logic in `internal/controller/replication.go:222-233` to use the parsed SystemInfo:

```go
// Check SystemInfo.Status if available, otherwise check if DataInfo is populated
if replica.SystemInfo != nil {
	status = strings.ToLower(replica.SystemInfo.Status)
}
```

**File:** `internal/memgraph/client_test.go`

5. Update `TestParseShowReplicasOutput` with test cases for the new format
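The parser update in step 1 could look roughly like the following. `parseReplicaRow` is a hypothetical, self-contained sketch that assumes the row arrives as a pipe-delimited string (as the formats above are shown); it is not the operator's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ReplicaSystemInfo mirrors the struct proposed in step 3.
type ReplicaSystemInfo struct {
	Timestamp int64  `json:"ts"`
	Behind    int64  `json:"behind"`
	Status    string `json:"status"`
}

// parseReplicaRow (hypothetical helper) splits one "|"-delimited row of the
// v2.21.0 SHOW REPLICAS output and decodes its system_info JSON column.
// A "Null" system_info column yields a nil pointer.
func parseReplicaRow(row string) (name, addr, mode string, sys *ReplicaSystemInfo, err error) {
	cols := strings.Split(strings.Trim(row, "| "), "|")
	if len(cols) < 5 {
		return "", "", "", nil, fmt.Errorf("expected 5 columns, got %d", len(cols))
	}
	// Strip surrounding whitespace and the quotes Memgraph puts around strings.
	clean := func(s string) string { return strings.Trim(strings.TrimSpace(s), `"`) }
	name, addr, mode = clean(cols[0]), clean(cols[1]), clean(cols[2])
	if raw := clean(cols[3]); raw != "Null" && raw != "" {
		sys = &ReplicaSystemInfo{}
		if err := json.Unmarshal([]byte(raw), sys); err != nil {
			return "", "", "", nil, err
		}
	}
	return name, addr, mode, sys, nil
}

func main() {
	row := `| "kb_graph_0" | "kb-graph-0:10000" | "async" | {"ts":12,"behind":0,"status":"ready"} | {} |`
	name, addr, mode, sys, err := parseReplicaRow(row)
	if err == nil && sys != nil {
		fmt.Println(name, addr, mode, sys.Status)
	}
}
```

Treating `Null` as "no system info" rather than an error keeps the evidence row above parseable; health logic can then fall back to inspecting `data_info`, as step 4 suggests.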
### Testing

- Test with the Memgraph v2.21.0 output format
- Verify backward compatibility with older Memgraph versions if needed
- Confirm replicas show as healthy when replication is working

---
## Issue 2: Status Updates Trigger Immediate Re-reconciliation (MEDIUM PRIORITY)

### Problem

Every reconciliation ends with `r.Status().Update(ctx, cluster)` at `internal/controller/memgraphcluster_controller.go:516`.

In controller-runtime, updating the status of a watched resource triggers a watch event, which schedules an immediate reconciliation. This bypasses the `RequeueAfter` delay we return.

### Evidence

```
Expected: Reconciliation every 30 seconds (requeueAfterLong)
Actual:   Reconciliation every ~1-7 seconds
```

The backoff fix added at lines 273-279 is ineffective because:
1. We return `ctrl.Result{RequeueAfter: 30*time.Second}`
2. But then `updateStatus()` modifies the cluster
3. The status update triggers a watch event
4. Controller-runtime schedules a reconciliation immediately

### Impact

- 4-8x more reconciliations than intended
- Each reconciliation makes multiple Memgraph queries
- Overwhelms the MAIN instance CPU
### Required Changes

**File:** `internal/controller/memgraphcluster_controller.go`

Option A: **Skip the status update when nothing changed**

```go
func (r *MemgraphClusterReconciler) updateStatus(...) error {
	// Compare old vs new status
	if reflect.DeepEqual(oldStatus, newStatus) {
		return nil // Skip the update if nothing changed
	}
	return r.Status().Update(ctx, cluster)
}
```

Option B: **Use rate limiting / debouncing**
- Track the last status update time
- Only update the status if more than N seconds have passed since the last update
- Or use controller-runtime's built-in rate limiting

Option C: **Separate status update frequency from reconciliation frequency**
- Only update the status on significant changes
- Use annotations or a separate mechanism for tracking reconciliation state

### Recommended Approach

Option A is simplest. Implement a deep comparison of the status before updating. Most reconciliations don't change the status, so this eliminates most spurious updates.
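A minimal, self-contained sketch of Option A, using a simplified `ClusterStatus` stand-in; the field names are illustrative, not the operator's actual status schema:

```go
package main

import (
	"fmt"
	"reflect"
)

// ClusterStatus is a simplified stand-in for the real CRD status type.
type ClusterStatus struct {
	Phase           string
	HealthyReplicas int
	Conditions      []string
}

// statusChanged reports whether the recomputed status differs from the one
// snapshotted (e.g. via DeepCopy) at the start of the reconciliation.
// Only when it differs should r.Status().Update(...) be called.
func statusChanged(old, cur ClusterStatus) bool {
	return !reflect.DeepEqual(old, cur)
}

func main() {
	before := ClusterStatus{Phase: "Running", HealthyReplicas: 2, Conditions: []string{"Ready"}}
	after := before // recomputed status happens to be identical
	fmt.Println(statusChanged(before, after))

	after.HealthyReplicas = 1
	fmt.Println(statusChanged(before, after))
}
```

In the real controller this means snapshotting `cluster.Status` with `DeepCopy()` before mutating it; separately, controller-runtime's `predicate.GenerationChangedPredicate` can filter status-only events on the watch side, which is one way to realize Option B.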
---

## Issue 3: Multiple Memgraph Queries Per Reconciliation (MEDIUM PRIORITY)

### Problem

Each reconciliation makes four or more queries against the MAIN instance:

1. `ensureMainRole()` → `GetReplicationRole()` (line 135 in replication.go)
2. `ConfigureReplication()` → `ShowReplicas()` (line 63)
3. For each replica: `ensureReplicaRole()` → `GetReplicationRole()` (line 160)
4. `CheckReplicationHealth()` → `ShowReplicas()` (line 212)
5. `collectStorageMetrics()` → `GetStorageInfo()` for each pod

With 2 replicas and frequent reconciliation, this amounts to roughly 6 queries every few seconds.

### Impact

- High query load on the MAIN instance
- CPU saturation when combined with the actual workload
- Slow response times for application queries
### Required Changes

**File:** `internal/controller/memgraphcluster_controller.go`

1. **Cache query results within a reconciliation:**

```go
type reconcileContext struct {
	replicationRole map[string]string // podName -> role
	replicas        []ReplicaInfo
	storageInfo     map[string]*StorageInfo
}
```

2. **Reduce redundant queries:**
   - `ShowReplicas()` is called twice, once in `ConfigureReplication` and once in `CheckReplicationHealth`. Call it once and reuse the result.
   - `GetReplicationRole()` is called for MAIN and each REPLICA. These calls could be batched or cached.

3. **Make storage metrics collection optional or less frequent:**
   - Only collect every N reconciliations
   - Or make it configurable via the CRD spec

**File:** `internal/controller/replication.go`

4. Refactor `ConfigureReplication` to return the replica info for reuse:

```go
func (rm *ReplicationManager) ConfigureReplication(...) ([]ReplicaInfo, error) {
	replicas, err := rm.client.ShowReplicas(...)
	// ... configure replication ...
	return replicas, nil
}
```

5. Remove the separate `CheckReplicationHealth` call and integrate it into `ConfigureReplication`
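The caching idea in point 1 can be sketched as a per-reconciliation memo. `fetch` stands in for the real `ShowReplicas` client call, and the names are illustrative:

```go
package main

import "fmt"

// ReplicaInfo is a trimmed-down stand-in for the operator's struct.
type ReplicaInfo struct {
	Name     string
	SyncMode string
}

// reconcileContext caches query results for the duration of a single
// reconciliation so that repeated consumers hit Memgraph only once.
type reconcileContext struct {
	fetched  bool
	queries  int // number of real SHOW REPLICAS queries issued
	replicas []ReplicaInfo
	fetch    func() []ReplicaInfo // the real client call, injected
}

// Replicas returns the cached SHOW REPLICAS result, querying at most once.
func (rc *reconcileContext) Replicas() []ReplicaInfo {
	if !rc.fetched {
		rc.queries++
		rc.replicas = rc.fetch()
		rc.fetched = true
	}
	return rc.replicas
}

func main() {
	rc := &reconcileContext{fetch: func() []ReplicaInfo {
		return []ReplicaInfo{{Name: "kb_graph_0", SyncMode: "async"}}
	}}
	// ConfigureReplication and CheckReplicationHealth would both read this:
	_ = rc.Replicas()
	_ = rc.Replicas()
	fmt.Println(rc.queries) // 1: a single underlying query despite two consumers
}
```

The `fetched` flag (rather than a nil check on the slice) keeps the cache valid even when the cluster legitimately has zero replicas.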
---

## Issue 4: Validation Test Writes to Database (LOW PRIORITY)

### Problem

The `testReplicationLag` function in `internal/controller/validation.go:105-172` performs invasive health checks:
- Writes a test node to MAIN
- Reads it back from each REPLICA
- Deletes the test node
### Impact

- Adds write load to the database during health checks
- Could interfere with the actual workload
- Leaves orphan data if cleanup fails

### Required Changes

**File:** `internal/controller/validation.go`

1. Consider using read-only health checks:
   - Query node/edge counts
   - Compare counts between MAIN and the REPLICAs
   - Use Memgraph's built-in replication lag metrics if available

2. Make validation configurable:

```yaml
spec:
  validation:
    enabled: true
    mode: "readonly" # or "write-test"
    interval: 60s
```

3. Add a circuit breaker for validation:
   - Skip validation if the previous N attempts failed
   - Reduce frequency when the cluster is under load
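The read-only check in point 1 could be sketched as below. `countFn` abstracts running a count query (e.g. `MATCH (n) RETURN count(n)`) against one instance; the signature is an assumption, not the operator's actual client API:

```go
package main

import "fmt"

// countFn abstracts a read-only node-count query against one instance;
// in the operator this would go through the Bolt client.
type countFn func(instance string) (int64, error)

// replicasInSync compares the node count on MAIN against each replica and
// returns the replicas that have not caught up (or could not be queried).
// Counts only approximate replication lag, but they require no writes.
func replicasInSync(count countFn, main string, replicas []string) ([]string, error) {
	want, err := count(main)
	if err != nil {
		return nil, err
	}
	var behind []string
	for _, r := range replicas {
		got, err := count(r)
		if err != nil || got != want {
			behind = append(behind, r)
		}
	}
	return behind, nil
}

func main() {
	counts := map[string]int64{"main": 100, "replica-0": 100, "replica-1": 97}
	count := func(inst string) (int64, error) { return counts[inst], nil }
	behind, _ := replicasInSync(count, "main", []string{"replica-0", "replica-1"})
	fmt.Println(behind) // only the lagging replica is reported
}
```

A count match does not prove the replica holds identical data, so this is a weaker signal than the write test; pairing it with Memgraph's own lag metrics (the `behind` field from Issue 1, if populated) would strengthen it.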
---

## Implementation Order

1. **Issue 1** (SHOW REPLICAS parser) - Must be fixed first; it makes all of the other issues worse
2. **Issue 2** (Status update triggering reconciliation) - Significant impact reduction
3. **Issue 3** (Query reduction) - Optimization
4. **Issue 4** (Validation) - Nice to have

## Estimated Effort

| Issue | Effort | Files Changed |
|-------|--------|---------------|
| Issue 1 | 2-3 hours | client.go, client_test.go, replication.go |
| Issue 2 | 1-2 hours | memgraphcluster_controller.go |
| Issue 3 | 2-3 hours | memgraphcluster_controller.go, replication.go |
| Issue 4 | 1-2 hours | validation.go |

**Total: ~6-10 hours**
## Version Compatibility

After fixing Issue 1, consider:
- Testing with Memgraph v2.18, v2.19, v2.20, and v2.21
- The SHOW REPLICAS format may have changed between versions
- Version detection and format-specific parsing may be needed
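One way to sketch the format detection suggested above: since both layouts shown under Issue 1 have five columns, dispatch on the header names rather than the column count or the server version string. `detectReplicasFormat` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// replicasFormat distinguishes the two SHOW REPLICAS layouts from Issue 1.
type replicasFormat int

const (
	formatLegacy replicasFormat = iota // | name | host | port | mode | status |
	formatV221                         // | name | socket_address | sync_mode | system_info | data_info |
)

// detectReplicasFormat (hypothetical helper) inspects the header row: both
// layouts have five columns, but only the v2.21.0 layout names one of them
// "socket_address".
func detectReplicasFormat(header string) replicasFormat {
	if strings.Contains(header, "socket_address") {
		return formatV221
	}
	return formatLegacy
}

func main() {
	fmt.Println(detectReplicasFormat(`| name | socket_address | sync_mode | system_info | data_info |`) == formatV221)
	fmt.Println(detectReplicasFormat(`| name | host | port | mode | status |`) == formatLegacy)
}
```

Keying off the header makes the parser robust to versions we have not tested, at the cost of assuming the header row is available to the parsing code.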
