Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Bug Description
Currently, MongoDB holds connections from several components: the platform-connectors instances and the fault-quarantine, node-drainer, and fault-remediation modules. At times, platform-connectors (or another module that uses store_connector or store_watcher to connect to MongoDB) restarts, and a new connection gets established without the previous, now-stale connection being torn down. Because of this, in large clusters where these components keep getting restarted, every new connection consumes roughly 1 MB of additional memory, and as soon as the memory limit of the MongoDB pod is exhausted, the pod crashes.
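To illustrate the failure mode with a minimal, standalone sketch (not NVSentinel code; the URI and the loop are illustrative only): every call to mongo.Connect creates a new client with its own connection pool, and that pool stays open on the server until Disconnect is called on that exact client. If a component reconnects on every restart or retry without disconnecting the old client, server-side connections accumulate:

package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    // Leaky pattern: every (re)connect creates a new client with its own pool,
    // and the previous client is never disconnected, so server-side connections
    // accumulate across restarts/reconnects.
    var client *mongo.Client
    for i := 0; i < 3; i++ {
        c, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        client = c // the client from the previous iteration (and its pool) is dropped without Disconnect
    }

    // Correct pattern: release the pool once the client is no longer needed.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := client.Disconnect(shutdownCtx); err != nil {
        log.Printf("failed to disconnect: %v", err)
    }
}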
Please refer to the code changes below, which need to be made in order to fix this.
https://github.com/NVIDIA/NVSentinel/blob/main/store-client/pkg/storewatcher/watch_store.go#L358
=> this change will impact the fault-quarantine (FQ), node-drainer, and fault-remediation modules.
For platform-connectors, we also need to call disconnect here: https://github.com/NVIDIA/NVSentinel/blob/main/platform-connectors/pkg/connectors/store/store_connector.go#L132
// CRITICAL: Disconnect the MongoDB client to release connections back to the pool
// Without this, the client's connection pool remains open, causing connection accumulation
// Note: When using ClientManager, this is safe because:
// 1. If ClientManager is used, this client is shared and Disconnect is a no-op for pooled clients
// 2. If ClientManager is not used (legacy), this properly closes the standalone client
if w.client != nil {
    if disconnectErr := w.client.Disconnect(ctx); disconnectErr != nil {
        slog.Warn("Failed to disconnect MongoDB client", "client", w.clientName, "error", disconnectErr)
        // Don't override original error if changeStream.Close() failed
        if err == nil {
            err = disconnectErr
        }
    } else {
        slog.Info("Successfully disconnected MongoDB client", "client", w.clientName)
    }
}
This code should be added in both places so that connections get released back to the pool. A sketch of what the platform-connectors side could look like follows.
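For the platform-connectors side, the surrounding structure of store_connector.go is not reproduced here, so the following is only a minimal sketch, assuming the connector keeps its mongo.Client in a struct field and has a shutdown hook; the StoreConnector type, the client field, and the Close method are illustrative names, not the actual NVSentinel API:

package store

import (
    "context"
    "log/slog"

    "go.mongodb.org/mongo-driver/mongo"
)

// Illustrative only: the struct name, field names, and the Close method are
// assumptions, not the actual store_connector.go API.
type StoreConnector struct {
    client *mongo.Client
}

// Close releases the MongoDB connection pool when the connector shuts down.
// Without the Disconnect call, each restart of platform-connectors leaves a
// stale connection pool open on the MongoDB server.
func (s *StoreConnector) Close(ctx context.Context) error {
    if s.client == nil {
        return nil
    }
    if err := s.client.Disconnect(ctx); err != nil {
        slog.Warn("Failed to disconnect MongoDB client", "error", err)
        return err
    }
    slog.Info("Successfully disconnected MongoDB client")
    return nil
}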
Component
Core Service
Steps to Reproduce
I did not reproduce this in a local cluster. Theoretically, the steps should be:
- Create a k8s cluster.
- Continuously delete the platform-connector, fault-quarantine, and node-drainer pods, let them come back up, and let them re-establish connections with MongoDB (a sketch of one way to automate this follows this list).
- At some point, MongoDB should get OOM-killed, because every time a new pod comes up and establishes a connection, the previous connection is not terminated properly.
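One possible way to automate the pod deletions in the second step (a sketch only; the namespace and label selector are assumptions and must be adapted to the actual Helm release):

package main

import (
    "context"
    "log"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // Assumed namespace and label selector; adjust to the actual deployment.
    namespace := "nvsentinel"
    selector := "app.kubernetes.io/name in (platform-connectors,fault-quarantine,node-drainer)"

    // Repeatedly delete the pods and let their controllers recreate them,
    // forcing fresh MongoDB connections on every iteration.
    for i := 0; i < 100; i++ {
        err := clientset.CoreV1().Pods(namespace).DeleteCollection(
            context.Background(),
            metav1.DeleteOptions{},
            metav1.ListOptions{LabelSelector: selector},
        )
        if err != nil {
            log.Printf("delete iteration %d failed: %v", i, err)
        }
        time.Sleep(30 * time.Second) // give the pods time to come back up
    }
}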
Environment
- NVSentinel version: v0.3.0
- Kubernetes version: N/A
- Deployment method: Helm
Logs/Output
No response