171 changes: 171 additions & 0 deletions content/docs/guides/databases/database-troubleshooting-guide.en.mdx
@@ -0,0 +1,171 @@
---
title: Database Troubleshooting Guide
description: A step-by-step guide to troubleshooting KubeBlocks database issues on Sealos
---

This guide walks you through a systematic approach to diagnosing and resolving database issues managed by KubeBlocks on Sealos.

## 1. Check Cluster Status

KubeBlocks follows a layered design: Cluster → Component → InstanceSet → Pod. Cluster is the top-level CRD, and its `status.phase` aggregates the statuses of every layer beneath it.

### Cluster is Running

This means KubeBlocks considers everything normal; the issue is most likely at the application layer (an incorrect connection string, missing permissions, and so on). Check whether you can connect to the database and execute commands (for example, writes may fail because you are connected to a read-only replica in a primary-replica setup).

#### Connect to the Database

**Method 1: Using kbcli**

```bash
kbcli cluster connect <cluster-name> -n <ns>
```

**Method 2: Using kubectl**

**Step 1: Retrieve information**

```bash
# Get Service name
kubectl get svc -n <ns>

# Get password
kubectl get secret -n <ns> | grep <cluster-name>
kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.password}' | base64 -d
```
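
Secret values are base64-encoded, not encrypted, so `base64 -d` recovers the plaintext. A quick local sketch (the encoded value here is made up, standing in for what the `jsonpath` query above returns):

```bash
# 'c3VwZXJzZWNyZXQ=' is a made-up example value, not a real credential.
encoded='c3VwZXJzZWNyZXQ='

# base64 -d reverses the encoding Kubernetes applies to Secret data
password=$(echo "$encoded" | base64 -d)
echo "$password"   # prints: supersecret
```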

**Step 2: Connect**

Connect directly after entering the Pod:

```bash
# Enter the Pod
kubectl exec -it <pod> -n <ns> -- bash

# Use the appropriate connection command for your database
# MySQL
mysql -u root -p

# MongoDB
mongosh -u root -p

# Redis
redis-cli -a <password>

# PostgreSQL
psql -U postgres
```

Or connect via the Sealos Terminal:

![Connect via Sealos Terminal](../images/database-troubleshooting-en.png)

```bash
# MySQL
mysql -h <service>.<ns>.svc -P 3306 -u root -p<password>

# MongoDB
mongosh 'mongodb://root:<password>@<service>.<ns>.svc:27017'

# Redis
redis-cli -u redis://default:<password>@<service>.<ns>.svc:6379

# PostgreSQL
psql 'postgresql://postgres:<password>@<service>.<ns>.svc:5432'
```
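
All four connection strings share the in-cluster DNS form `<service>.<ns>.svc`. A small sketch of how that host is assembled (the helper and the example names are ours, not part of kbcli or Sealos):

```bash
# Hypothetical helper, for illustration only: builds the in-cluster
# DNS name <service>.<namespace>.svc used in the commands above.
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc"
}

# Example with made-up names: service "mypg-postgresql" in namespace "ns-abc"
host=$(svc_fqdn mypg-postgresql ns-abc)
echo "postgresql://postgres:<password>@${host}:5432"
```

Within the same namespace the bare service name also resolves; the fully qualified form works from any namespace in the cluster.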

### Cluster Status is Not Running

You need to:

- Describe the Cluster (check Events and Status)

```bash
kubectl describe cluster <cluster-name> -n <ns>
```

- Proceed to **Step 2** to check Pod status

## 2. Check Pod Status

- A Pod that is **not Running** indicates an infrastructure-layer issue — scheduling, storage, image pulling, resource quotas, etc.
- A Pod that is **Running** while the service is still abnormal means the infrastructure is fine; the issue is at the application layer — database configuration, permissions, primary-replica replication logic, etc.
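
That decision can be sketched as a tiny triage helper (the function is only an illustration of the branching above; the phase strings are real Kubernetes pod phases):

```bash
# Illustrative only: route the investigation from a pod phase string.
triage() {
  case "$1" in
    Running)                echo "check database logs (application layer)" ;;
    Pending|Failed|Unknown) echo "describe the pod (infrastructure layer)" ;;
    *)                      echo "unhandled phase: $1" ;;
  esac
}

triage Pending   # prints: describe the pod (infrastructure layer)
```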

### Pod is Running

Check the database logs:

```bash
# Enter the Pod
kubectl exec -it mysql1-mysql-0 -n <ns> -- bash

# View database logs
cd /data/mysql/log
cat mysqld-error.log
```

Log paths for different databases:

| Database | Log Path |
|------------|-------------------------------------------------------|
| MySQL | `/data/mysql/log/mysqld-error.log` |
| MongoDB | `/var/log/mongodb/mongodb.log` |
| PostgreSQL | `/home/postgres/pgdata/pgroot/pg_log/postgresql.log` |
| Redis | `/data/running.log` |
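
The table above can be folded into a small lookup helper, handy when scripting across several engines (the function is a sketch of ours, not a KubeBlocks tool; the paths come straight from the table):

```bash
# Sketch: map a database engine to its log path from the table above.
db_log_path() {
  case "$1" in
    mysql)      echo "/data/mysql/log/mysqld-error.log" ;;
    mongodb)    echo "/var/log/mongodb/mongodb.log" ;;
    postgresql) echo "/home/postgres/pgdata/pgroot/pg_log/postgresql.log" ;;
    redis)      echo "/data/running.log" ;;
    *)          echo "unknown database: $1" >&2; return 1 ;;
  esac
}

db_log_path mysql   # prints: /data/mysql/log/mysqld-error.log
```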

### Pod is Not Running

**1. Use `describe` and `logs` to find out why the Pod cannot start:**

- **describe**: Events recorded by Kubernetes itself. Every resource in K8s has an Events list that records all operations performed by controllers and kubelet on the resource, such as scheduling failures, image pull failures, disk mount failures, etc. Use this to identify the root cause.
- **logs**: The stdout/stderr output of the container. This includes the container's own errors as well as some database errors. Use this for detailed log information.

```bash
# View the stdout from the previous container exit. Useful when the container is in a restart loop.
kubectl logs <pod> -n <ns> --previous
```
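
A high `restartCount` confirms a restart loop. In a cluster you would read it with `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].restartCount}'`; the sketch below parses the same field from a trimmed, made-up status document:

```bash
# Made-up, trimmed pod status standing in for the API server's response.
pod_status='{"containerStatuses":[{"name":"mysql","restartCount":7}]}'

# Extract restartCount with sed (no jq dependency).
restarts=$(echo "$pod_status" | sed -n 's/.*"restartCount":\([0-9]*\).*/\1/p')
echo "restarts: $restarts"   # prints: restarts: 7
```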

**2. Check the database's own logs**

When a Pod is not Running, the container logs under `/var/log/containers/` on the node are subject to log rotation and cleanup. However, the log files written by the database to the PV are persistent.

**A. Find the PVC and Node corresponding to the Pod:**

```bash
# Check which node the Pod is on
kubectl get pod -n <ns> -owide

# Check PVC
kubectl get pvc -n <ns> -owide
```

**B. After obtaining the PVC and Node information:**

I. SSH into the node:

```bash
ssh <node-name>
```

II. Use the PVC name to locate the mount directory and check its allocated size, usage, and remaining space:

```bash
# Filter the node's mounts by volume name to find the mount path
# and its size/used/available/usage ratio
df -h | grep <pvc-name>
```
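
A full volume is a common reason a database Pod cannot start, so the `Use%` column deserves special attention. A sketch that flags a nearly-full mount (the `df` line is made up for illustration):

```bash
# Made-up sample line standing in for real `df -h | grep <pvc-name>` output.
df_line="/dev/vdb  20G  19G  1.0G  95% /var/lib/kubelet/pods/abc123/volumes/pvc-1234"

# Column 5 of df output is Use%; strip the trailing '%' to compare it.
use_pct=$(echo "$df_line" | awk '{print $5}' | tr -d '%')
if [ "$use_pct" -ge 90 ]; then
  echo "volume almost full (${use_pct}%)"
fi
```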

III. Then view the logs based on the mount path.

## 3. Check KB Controller

When the Cluster status is abnormal but Pod and database logs show no obvious errors, check the KB Controller logs.

Get the Pods in the `kb-system` namespace and then check the controller Pod's logs:

- If the controller Pod itself is **Running** and not crashing, then `describe` will not yield useful information. Use `logs` to check the controller's internal business logic.

```bash
kubectl get pods -n kb-system
kubectl logs <controller-pod> -n kb-system
```

- If the controller Pod itself is in an abnormal state, such as persistent **CrashLoopBackOff**: first use `describe` to identify the type of issue, then use `logs` to view detailed logs.
171 changes: 171 additions & 0 deletions content/docs/guides/databases/database-troubleshooting-guide.zh-cn.mdx
@@ -0,0 +1,171 @@
---
title: 数据库排障指南
description: 在 Sealos 上排查 KubeBlocks 数据库问题的分步指南
---

本指南将带你系统性地诊断和解决 Sealos 上由 KubeBlocks 管理的数据库问题。

## 1. 查看 Cluster 状态

KubeBlocks 核心设计:Cluster → Component → InstanceSet → Pod。Cluster 是最顶层的 CRD,它的 `status.phase` 是所有下层状态的聚合。

### Cluster 状态为 Running

说明 KubeBlocks 认为一切正常,问题可能在应用层(连接串写错、权限等)。检查能否连上数据库,能否执行命令(比如主从原因导致不可写入)。

#### 连接数据库

**方法 1:使用 kbcli**

```bash
kbcli cluster connect <cluster-name> -n <ns>
```

**方法 2:使用 kubectl**

**Step 1:获取信息**

```bash
# 获取 Service 名
kubectl get svc -n <ns>

# 获取密码
kubectl get secret -n <ns> | grep <cluster-name>
kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.password}' | base64 -d
```

**Step 2:连接**

进入 Pod 之后直接连接:

```bash
# 进入 Pod
kubectl exec -it <pod> -n <ns> -- bash

# 根据不同数据库使用不同连接命令
# MySQL
mysql -u root -p

# MongoDB
mongosh -u root -p

# Redis
redis-cli -a <password>

# PostgreSQL
psql -U postgres
```

或通过 Sealos 终端连接:

![通过 Sealos 终端连接数据库](../images/database-troubleshooting.zh-cn.png)

```bash
# MySQL
mysql -h <service>.<ns>.svc -P 3306 -u root -p<password>

# MongoDB
mongosh 'mongodb://root:<password>@<service>.<ns>.svc:27017'

# Redis
redis-cli -u redis://default:<password>@<service>.<ns>.svc:6379

# PostgreSQL
psql 'postgresql://postgres:<password>@<service>.<ns>.svc:5432'
```

### Cluster 状态非 Running

需要:

- Describe Cluster(查看 Events 和 Status)

```bash
kubectl describe cluster <cluster-name> -n <ns>
```

- 到**第 2 步**查看 Pod 状态

## 2. 查看 Pod 状态

- Pod **不 Running** 意味着问题在基础设施层——调度、存储、镜像、资源配额等。
- Pod **Running** 但业务不正常,说明基础设施没问题,问题在应用层——数据库自身的配置、权限、主从复制逻辑等。

### Pod 状态为 Running

查看数据库日志:

```bash
# 进入 Pod
kubectl exec -it mysql1-mysql-0 -n <ns> -- bash

# 查看数据库日志
cd /data/mysql/log
cat mysqld-error.log
```

不同数据库的日志路径:

| 数据库 | 日志路径 |
|------------|--------------------------------------------------------|
| MySQL | `/data/mysql/log/mysqld-error.log` |
| MongoDB | `/var/log/mongodb/mongodb.log` |
| PostgreSQL | `/home/postgres/pgdata/pgroot/pg_log/postgresql.log` |
| Redis | `/data/running.log` |

### Pod 状态非 Running

**1. 通过 `describe` 和 `logs` 查看 Pod 为什么起不来:**

- **describe**:K8s 自己记录的事件。K8s 里每个资源都有 Events 列表,记录了控制器和 kubelet 对这个资源做的所有操作,比如调度失败、镜像拉取失败、磁盘挂不上等。作为定性原因。
- **logs**:容器的 stdout/stderr 输出,包括容器自己的报错以及数据库的一部分报错。作为详细日志。

```bash
# 查看上一次该容器退出前的标准输出,适用于容器频繁重启的情况
kubectl logs <pod> -n <ns> --previous
```

**2. 查看数据库自己的日志**

Pod 非 Running 的情况下,节点上 `/var/log/containers/` 下的容器日志会被轮转清理。但是数据库写到 PV 里的日志文件是持久化的。

**A. 查看 Pod 对应的 PVC 和 Node:**

```bash
# 查看 Pod 在哪个节点上
kubectl get pod -n <ns> -owide

# 查看 PVC
kubectl get pvc -n <ns> -owide
```

**B. 得到 PVC 和 Node 信息之后:**

I. SSH 连接上 Node:

```bash
ssh <node-name>
```

II. 通过 PVC 可以找到挂载目录以及分配/占用/剩余/占用比:

```bash
# 按卷名过滤当前节点的挂载点,找到挂载路径及容量/占用情况
df -h | grep <pvc-name>
```

III. 再根据挂载路径查看日志。

## 3. 查看 KB Controller

当 Cluster 状态异常、但 Pod 和数据库日志均无明显错误时,查看 KB Controller 日志。

获取 `kb-system` 命名空间下的 Pod,然后查看 Controller Pod 的日志:

- 如果 Controller Pod 本身 **Running** 没有崩溃,那么 `describe` 找不到有用信息。使用 `logs` 查看 Controller 内部业务逻辑。

```bash
kubectl get pods -n kb-system
kubectl logs <controller-pod> -n kb-system
```

- 如果 Controller Pod 本身状态不正常,比如持续 **CrashLoopBackOff**:先 `describe` 看是什么类型的问题,再 `logs` 看具体日志。
2 changes: 1 addition & 1 deletion content/docs/guides/databases/meta.en.json
@@ -1,4 +1,4 @@
{
"title": "Databases",
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus"]
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus", "database-troubleshooting-guide"]
}
2 changes: 1 addition & 1 deletion content/docs/guides/databases/meta.zh-cn.json
@@ -1,4 +1,4 @@
{
"title": "数据库",
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus"]
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus", "database-troubleshooting-guide"]
}