Bug Report
What version of Kubernetes are you using?
1.28.1
What version of TiDB Operator are you using?
1.6.0
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
standard
What's the status of the TiDB cluster pods?
What did you do?
We tried to change the TLS configuration of the TiDB cluster, but found that the operator fails to contact the TiDB cluster after we change the `tlsCluster.enabled` option. The bug occurs in both directions: changing the option from `true` to `false` and from `false` to `true`.
The root cause is a mismatch between the TLS configuration of the operator's PD client and the configuration the PD server is actually running with.
To reproduce this bug:
- First, create the certificate secrets according to the documentation: https://docs.pingcap.com/tidb-in-kubernetes/stable/enable-tls-between-components#using-cert-manager
- Install the TiDB cluster by applying the following CR with `tlsCluster.enabled` set to `true`:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: true
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}
```
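The manifest can be applied with, for example, `kubectl apply -f basic.yaml` (the file name is arbitrary; the certificate secrets from the previous step must already exist).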
- Change `tlsCluster.enabled` to `false` and apply the updated CR:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v8.1.0
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: alpine:3.16.0
  tlsCluster:
    enabled: false
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    replicas: 1
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    # If only 1 TiKV is deployed, the TiKV region leader
    # cannot be transferred during upgrade, so we have
    # to configure a short timeout
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "1Gi"
    config:
      storage:
        # In basic examples, we set this to avoid using too much storage.
        reserve-space: "0MB"
        engine: "partitioned-raft-kv"
      rocksdb:
        # In basic examples, we set this to avoid the following error in some Kubernetes clusters:
        # "the maximum number of open file descriptors is too small, got 1024, expect greater or equal to 82920"
        max-open-files: 256
      raftdb:
        max-open-files: 256
  tidb:
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 1
    service:
      type: ClusterIP
    config: {}
```
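Re-applying the updated manifest with `kubectl apply` triggers reconciliation. The operator's reaction can then be watched in its controller logs, e.g. `kubectl logs deployment/tidb-controller-manager -n tidb-admin -f` (the deployment and namespace names assume a default Helm installation; adjust them to your setup).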
What did you expect to see?
The operator should be able to reconfigure the TLS config for the TiDB cluster.
What did you see instead?
The operator fails with the following error logs:
```
E0830 20:39:35.006167 1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:35.020171 1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing
E0830 20:39:41.747446 1 pd_member_manager.go:201] failed to sync TidbCluster: [default/advanced-tidb]'s status, error: Get "https://advanced-tidb-pd.default:2379/pd/api/v1/health": EOF
E0830 20:39:41.760679 1 tidb_cluster_controller.go:144] TidbCluster: default/advanced-tidb, sync failed tidbcluster: [default/advanced-tidb]'s pd status sync failed, can not to be upgraded, requeuing
```
The root cause is the PD client's TLS configuration. After the `tlsCluster` configuration is changed, the operator builds its PD client from the new spec, so the client's TLS configuration no longer matches what the PD pods are currently running with. The health check therefore fails with a client-side error even though the cluster itself is healthy.
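To make the failure mode concrete, here is a minimal, self-contained Go sketch (not the operator's actual code) that reproduces the same class of error: an HTTPS client, standing in for the operator's PD client after the spec is flipped to `tlsCluster.enabled: true`, probing a listener that still speaks plain HTTP, standing in for a PD pod started without cluster TLS.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

func main() {
	// A plain-HTTP server stands in for a PD pod that was started
	// with cluster TLS disabled and has not been restarted yet.
	pd := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"health": true}`) // hypothetical health payload
	}))
	defer pd.Close()

	// An HTTPS client stands in for the operator's PD client after
	// tlsCluster.enabled is flipped to true in the spec.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	// Same host and port, but the client now insists on TLS.
	url := strings.Replace(pd.URL, "http://", "https://", 1) + "/pd/api/v1/health"
	if _, err := client.Get(url); err != nil {
		// The TLS handshake fails against the plain-HTTP listener,
		// mirroring the operator's failed health check above.
		fmt.Println("health check failed:", err)
	}
}
```

The reverse direction fails in the same way: a plain-HTTP client cannot complete a request against a listener that only accepts TLS, so the health check errors out regardless of the cluster's actual state.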