TiDB operator cannot update nor contact the TiDB cluster after the TLS config is changed

## Bug Report

**What version of Kubernetes are you using?**

1.28.1

**What version of TiDB Operator are you using?**

1.6.0

**What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?**

standard

**What's the status of the TiDB cluster pods?**


**What did you do?**

We tried to change the TLS configuration of the TiDB cluster and found that the operator can no longer contact the cluster after `tlsCluster.enabled` is changed. This happens in both directions: from `true` to `false` and from `false` to `true`.
The root cause is a mismatch between the TLS configuration of the operator's PD client and the configuration the PD server is actually running with.

To reproduce this bug:
1. First, create the certificate secrets according to the documentation: https://docs.pingcap.com/tidb-in-kubernetes/stable/enable-tls-between-components#using-cert-manager
2. Install the TiDB cluster by applying the CR with `tlsCluster.enabled` set to `true`:
```
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: true
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}
```
3. Change `tlsCluster.enabled` to `false` and apply the CR again:
```
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: false
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}
```

**What did you expect to see?**
The operator should be able to reconfigure the TLS config for the TiDB cluster.

**What did you see instead?**
The operator fails with the following error logs:
```
E0830 20:39:35.006167       1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:35.020171       1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing
E0830 20:39:41.747446       1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:41.760679       1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing
```

The root cause is the PD client's TLS configuration. After the `tlsCluster` configuration is changed, the operator switches to a client whose TLS configuration differs from the one the TiDB cluster is currently running with. The health check then fails with a client error, even though the cluster itself is healthy.
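
To illustrate how the scheme in the failing URL can follow the desired spec rather than the running cluster, here is a minimal sketch. `healthURL` is a hypothetical helper written for this issue, not the operator's actual code; the host and path layout are copied from the log lines above.

```go
package main

import "fmt"

// healthURL is a hypothetical helper, not the operator's actual code.
// It derives the PD health endpoint purely from the desired spec flag,
// without looking at what the running PD pods actually serve.
func healthURL(namespace, clusterName string, tlsEnabled bool) string {
	scheme := "http"
	if tlsEnabled {
		scheme = "https"
	}
	return fmt.Sprintf("%s://%s-pd.%s:2379/pd/api/v1/health", scheme, clusterName, namespace)
}

func main() {
	// Spec says TLS is on: the URL switches to https as soon as the CR is edited.
	fmt.Println(healthURL("default", "advanced-tidb", true))
	// Spec says TLS is off: the URL switches back to http.
	fmt.Println(healthURL("default", "advanced-tidb", false))
}
```

If the flag flips before the PD pods are restarted with the new setting, the URL built this way no longer matches what the server speaks.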

https://github.com/pingcap/tidb-operator/blob/ff467a6e7a0563b31a3ace2fe5d060774012780d/pkg/pdapi/pd_control.go#L241C1-L252C3
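
The failure mode can be reproduced outside Kubernetes with a small, self-contained sketch. It assumes a local test server as a stand-in for a PD member that still serves plain HTTP; `probeHealth` and `probeBoth` are hypothetical names, and the exact error text will likely differ from the `EOF` in the operator logs.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// probeHealth sends a GET to url and returns an error unless the reply is 200 OK.
func probeHealth(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// probeBoth starts a plain-HTTP stand-in for PD and probes it twice:
// once with the matching scheme and once with https.
func probeBoth() (matchErr, mismatchErr error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Demo only: skip certificate verification so the failure shown
			// here comes from the scheme mismatch, not from certificate checks.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	plainURL := srv.URL + "/pd/api/v1/health"
	tlsURL := strings.Replace(plainURL, "http://", "https://", 1)

	return probeHealth(client, plainURL), probeHealth(client, tlsURL)
}

func main() {
	matchErr, mismatchErr := probeBoth()
	fmt.Println("matching scheme error:", matchErr)
	fmt.Println("mismatched scheme error:", mismatchErr)
}
```

The server is healthy in both probes; only the client's choice of scheme decides whether the health check succeeds.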
