[BUG] Rescheduling of opensearch-cluster-bootstrap can lead to cluster going down #1071

@cbn-targit

What is the bug?

Whenever my opensearch-cluster-bootstrap-0 pod is rescheduled onto another node, for any reason, it enters a state where it repeatedly logs:

[ERROR][o.o.s.a.BackendRegistry ] [opensearch-cluster-bootstrap-0] OpenSearch Security not initialized. (you may need to run securityadmin)

If the bootstrap pod is in this state and one of my OpenSearch nodes has to restart for any reason, that node enters the same state and my OpenSearch cluster goes down.

If I then manually run the opensearch-cluster-securityconfig-update job, the cluster recovers.
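A minimal sketch for checking whether a pod is in the state described above, by scanning its recent logs for the error line. It assumes `kubectl` is on the PATH and pointed at the affected cluster; the helper names (`security_uninitialized`, `pod_logs`) and the `tail` default are illustrative, not part of the operator.

```python
import subprocess

# Marker text taken from the log line quoted in this issue.
ERROR_MARKER = "OpenSearch Security not initialized"


def security_uninitialized(log_text: str) -> bool:
    """Return True if any log line contains the BackendRegistry error."""
    return any(ERROR_MARKER in line for line in log_text.splitlines())


def pod_logs(pod: str, namespace: str = "opensearch", tail: int = 200) -> str:
    """Fetch the last `tail` log lines of a pod via kubectl (assumed installed)."""
    result = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace, f"--tail={tail}"],
        capture_output=True,
        text=True,
        check=True,  # raise if kubectl fails, e.g. the pod does not exist
    )
    return result.stdout
```

For example, `security_uninitialized(pod_logs("opensearch-cluster-bootstrap-0"))` would report whether the bootstrap pod is currently logging the error.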

How can one reproduce the bug?

  • Apply my OpenSearchCluster resource (below)
  • Wait for the OpenSearch cluster to be ready
  • Manually kill opensearch-cluster-bootstrap-0
  • Kill opensearch-cluster-masters-0
  • See that the restarted node logs the same error and the cluster is down
  • Run the opensearch-cluster-securityconfig-update job
  • See that the cluster is working again

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
  namespace: opensearch
spec:
  security:
    config:
    tls:
      http:
        generate: true
      transport:
        generate: true
        perNode: true
  general:
    httpPort: 9200
    serviceName: opensearch-cluster
    version: 3.1.0
    drainDataNodes: true
    setVMMaxMapCount: true
  dashboards:
    tls:
      enable: true
      generate: true
    version: 3.1.0
    enable: true
    replicas: 1
    resources:
      requests:
        memory: "512Mi"
        cpu: "200m"
      limits:
        memory: "512Mi"
        cpu: "200m"
  nodePools:
    - component: masters
      replicas: 3
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
      roles:
        - "data"
        - "cluster_manager"
      persistence:
        emptyDir: {}
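
The pod-killing part of the reproduction steps can be sketched as a short script. This is only an illustration under stated assumptions: the cluster from the resource above is already deployed and ready, `kubectl` is on the PATH, and the node pod is named `opensearch-cluster-masters-0` (inferred from the cluster name and the `masters` node pool, not confirmed). The function names and the pause length are hypothetical. How the opensearch-cluster-securityconfig-update job is re-run is not stated in this report, so that step is left out.

```python
import subprocess
import time

NAMESPACE = "opensearch"
BOOTSTRAP_POD = "opensearch-cluster-bootstrap-0"
# Assumption: node pods are named <cluster>-<component>-<ordinal>.
NODE_POD = "opensearch-cluster-masters-0"


def repro_commands(namespace: str = NAMESPACE) -> list[list[str]]:
    """Build the kubectl commands for the pod-killing steps, in order."""
    return [
        # Kill the bootstrap pod.
        ["kubectl", "delete", "pod", BOOTSTRAP_POD, "-n", namespace],
        # Kill one OpenSearch node.
        ["kubectl", "delete", "pod", NODE_POD, "-n", namespace],
        # Inspect the recreated node's logs for the error.
        ["kubectl", "logs", NODE_POD, "-n", namespace, "--tail=50"],
    ]


def run_repro(pause_seconds: float = 30.0) -> None:
    """Run each command, pausing so deleted pods have time to be recreated."""
    for command in repro_commands():
        print("$", " ".join(command))
        # check=False: the logs call can fail while the pod is still starting.
        subprocess.run(command, check=False)
        time.sleep(pause_seconds)
```

Calling `run_repro()` executes the three steps; the pause is a rough guess and may need to be longer on a slow cluster.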

What is the expected behavior?

The cluster should survive routine Kubernetes rescheduling of its pods, including the bootstrap pod.

What is your host/environment?

AKS - Kubernetes v1.31.8
OpenSearch Operator: 2.8.0
OpenSearch Cluster: 3.1.0
