[Bug]: mongodb pod is getting OOM-killed in large clusters #645

@deesharma24

Description

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

Currently, mongodb holds connections from several components: the platform-connectors instances and the fault-quarantine, node-drainer, and fault-remediation modules. When platform-connectors restarts, or when any other module that uses store_connector or storewatcher to talk to mongodb restarts, a new connection is established without tearing down the previous, now-stale one. In large clusters where these components restart frequently, each leaked connection costs mongodb roughly 1 MB of extra memory, and once the pod's memory limit is exhausted, the mongodb pod gets OOM-killed.
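
For illustration, the leaky shape looks like this (a minimal sketch, assuming the official go.mongodb.org/mongo-driver v1 client; the watchWithRetry and watch names are hypothetical, not the actual NVSentinel code):

package watcher

import (
    "context"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// watch stands in for the real change-stream loop; it returns once the
// stream breaks and the caller has to reconnect.
func watch(ctx context.Context, client *mongo.Client) {}

// watchWithRetry shows the leak: every retry dials a fresh client, but the
// previous client is never disconnected, so its connection pool stays open
// on the mongod side (~1 MB per leaked connection).
func watchWithRetry(ctx context.Context, uri string) error {
    for {
        client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
        if err != nil {
            return err
        }
        watch(ctx, client)
        // BUG: missing client.Disconnect(ctx) here, so each iteration
        // orphans another connection pool.
    }
}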

The code changes below need to be made to fix this.

https://github.com/NVIDIA/NVSentinel/blob/main/store-client/pkg/storewatcher/watch_store.go#L358
=> this change affects the fault-quarantine (FQ), node-drainer, and fault-remediation modules

For platform-connectors, we also need to call Disconnect here: https://github.com/NVIDIA/NVSentinel/blob/main/platform-connectors/pkg/connectors/store/store_connector.go#L132

// CRITICAL: Disconnect the MongoDB client to release connections back to the pool
// Without this, the client's connection pool remains open, causing connection accumulation
// Note: When using ClientManager, this is safe because:
// 1. If ClientManager is used, this client is shared and Disconnect is a no-op for pooled clients
// 2. If ClientManager is not used (legacy), this properly closes the standalone client
if w.client != nil {
    if disconnectErr := w.client.Disconnect(ctx); disconnectErr != nil {
        slog.Warn("Failed to disconnect MongoDB client", "client", w.clientName, "error", disconnectErr)
        // Don't override original error if changeStream.Close() failed
        if err == nil {
            err = disconnectErr
        }
    } else {
        slog.Info("Successfully disconnected MongoDB client", "client", w.clientName)
    }
}

This code should be added in both places so that connections are released back to the pool.
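
Concretely, the fixed shape is the inverse of the leak above (a minimal sketch with hypothetical StoreWatcher and reconnect names, assuming mongo-driver v1; only the Disconnect handling mirrors the snippet above):

package watcher

import (
    "context"
    "log/slog"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// StoreWatcher is a hypothetical stand-in for the real watcher struct.
type StoreWatcher struct {
    client *mongo.Client
}

// reconnect releases the previous client's pool before dialing a new one,
// so pools cannot accumulate across change-stream restarts.
func (w *StoreWatcher) reconnect(ctx context.Context, uri string) error {
    if w.client != nil {
        // Log but tolerate a failed disconnect so a half-dead client
        // cannot block recovery.
        if err := w.client.Disconnect(ctx); err != nil {
            slog.Warn("Failed to disconnect previous MongoDB client", "error", err)
        }
    }
    client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
    if err != nil {
        return err
    }
    w.client = client
    return nil
}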

Component

Core Service

Steps to Reproduce

I did not reproduce this in a local cluster. Theoretically, the steps should be:

  1. Create a k8s cluster.
  2. Continuously delete the platform-connector, fq, and node-drainer pods, letting them come back up and re-establish connections with mongodb.
  3. At some point, mongodb should get OOM-killed, because every time a new pod comes up and establishes a connection, the previous connection is not terminated properly (see the in-process simulation sketched below).
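
The restart churn in step 2 can also be approximated in-process (a minimal sketch, assuming mongo-driver v1 and a reachable MongoDB URI; the URI value is hypothetical, adjust it to the cluster's mongodb service). Run it against the mongodb pod and watch db.serverStatus().connections and the pod's resident memory climb:

package main

import (
    "context"
    "fmt"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()
    uri := "mongodb://localhost:27017" // hypothetical; point at the mongodb service

    for i := 0; i < 1000; i++ {
        // Each iteration simulates a component restart: a new client (and
        // connection pool) that is intentionally never disconnected.
        client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
        if err != nil {
            panic(err)
        }
        if err := client.Ping(ctx, nil); err != nil { // force a real connection
            panic(err)
        }
        fmt.Printf("leaked client %d\n", i)
        time.Sleep(100 * time.Millisecond)
        // Intentionally no client.Disconnect(ctx) -- this is the bug.
    }
}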

Environment

  • NVSentinel version: v0.3.0
  • Kubernetes version: n/a
  • Deployment method: Helm

Logs/Output

No response

Metadata

Labels

bug (Something isn't working)
