171 changes: 171 additions & 0 deletions content/docs/guides/databases/database-troubleshooting-guide.en.mdx
@@ -0,0 +1,171 @@
---
title: Database Troubleshooting Guide
description: A step-by-step guide to troubleshooting KubeBlocks database issues on Sealos
---

This guide walks you through a systematic approach to diagnosing and resolving database issues managed by KubeBlocks on Sealos.

## 1. Check Cluster Status

KubeBlocks follows a layered design: Cluster → Component → InstanceSet → Pod. Cluster is the top-level CRD, and its `status.phase` aggregates the statuses of every layer beneath it.

### Cluster is Running

This means KubeBlocks considers everything normal; the issue is most likely at the application layer (an incorrect connection string, missing permissions, and so on). Check whether you can connect to the database and execute commands (for example, writes may fail because you are connected to a read-only replica in a primary-replica setup).

#### Connect to the Database

**Method 1: Using kbcli**

```bash
kbcli cluster connect <cluster-name> -n <ns>
```

**Method 2: Using kubectl**

**Step 1: Retrieve information**

```bash
# Get Service name
kubectl get svc -n <ns>

# Get password
kubectl get secret -n <ns> | grep <cluster-name>
kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.password}' | base64 -d
```
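
Secret values are base64-encoded, not encrypted, so `base64 -d` recovers the plaintext. A quick local sketch (the encoded value here is made up, standing in for what the `jsonpath` query above returns):

```bash
# 'c3VwZXJzZWNyZXQ=' is a made-up example value, not a real credential.
encoded='c3VwZXJzZWNyZXQ='

# base64 -d reverses the encoding Kubernetes applies to Secret data
password=$(echo "$encoded" | base64 -d)
echo "$password"   # prints: supersecret
```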

**Step 2: Connect**

Connect directly after entering the Pod:

```bash
# Enter the Pod
kubectl exec -it <pod> -n <ns> -- bash

# Use the appropriate connection command for your database
# MySQL
mysql -u root -p

# MongoDB
mongosh -u root -p

# Redis
redis-cli -a <password>

# PostgreSQL
psql -U postgres
```

Or connect via the Sealos Terminal:

![Connect via Sealos Terminal](../images/database-troubleshooting-en.png)

```bash
# MySQL
mysql -h <service>.<ns>.svc -P 3306 -u root -p<password>

# MongoDB
mongosh 'mongodb://root:<password>@<service>.<ns>.svc:27017'

# Redis
redis-cli -u redis://default:<password>@<service>.<ns>.svc:6379

# PostgreSQL
psql 'postgresql://postgres:<password>@<service>.<ns>.svc:5432'
```
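
All four connection strings share the in-cluster DNS form `<service>.<ns>.svc`. A small sketch of how that host is assembled (the helper and the example names are ours, not part of kbcli or Sealos):

```bash
# Hypothetical helper, for illustration only: builds the in-cluster
# DNS name <service>.<namespace>.svc used in the commands above.
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc"
}

# Example with made-up names: service "mypg-postgresql" in namespace "ns-abc"
host=$(svc_fqdn mypg-postgresql ns-abc)
echo "postgresql://postgres:<password>@${host}:5432"
```

Within the same namespace the bare service name also resolves; the fully qualified form works from any namespace in the cluster.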

### Cluster Status is Not Running

You need to:

- Describe the Cluster (check Events and Status)

```bash
kubectl describe cluster <cluster-name> -n <ns>
```

- Proceed to **Step 2** to check Pod status

## 2. Check Pod Status

- A Pod that is **not Running** indicates an infrastructure-layer issue — scheduling, storage, image pulling, resource quotas, etc.
- A Pod that is **Running** while the service is still abnormal means the infrastructure is fine; the issue is at the application layer — database configuration, permissions, primary-replica replication logic, etc.
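
That decision can be sketched as a tiny triage helper (the function is only an illustration of the branching above; the phase strings are real Kubernetes pod phases):

```bash
# Illustrative only: route the investigation from a pod phase string.
triage() {
  case "$1" in
    Running)                echo "check database logs (application layer)" ;;
    Pending|Failed|Unknown) echo "describe the pod (infrastructure layer)" ;;
    *)                      echo "unhandled phase: $1" ;;
  esac
}

triage Pending   # prints: describe the pod (infrastructure layer)
```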

### Pod is Running

Check the database logs:

```bash
# Enter the Pod
kubectl exec -it mysql1-mysql-0 -n <ns> -- bash

# View database logs
cd /data/mysql/log
cat mysqld-error.log
```

Log paths for different databases:

| Database | Log Path |
|------------|-------------------------------------------------------|
| MySQL | `/data/mysql/log/mysqld-error.log` |
| MongoDB | `/var/log/mongodb/mongodb.log` |
| PostgreSQL | `/home/postgres/pgdata/pgroot/pg_log/postgresql.log` |
| Redis | `/data/running.log` |
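
The table above can be folded into a small lookup helper, handy when scripting across several engines (the function is a sketch of ours, not a KubeBlocks tool; the paths come straight from the table):

```bash
# Sketch: map a database engine to its log path from the table above.
db_log_path() {
  case "$1" in
    mysql)      echo "/data/mysql/log/mysqld-error.log" ;;
    mongodb)    echo "/var/log/mongodb/mongodb.log" ;;
    postgresql) echo "/home/postgres/pgdata/pgroot/pg_log/postgresql.log" ;;
    redis)      echo "/data/running.log" ;;
    *)          echo "unknown database: $1" >&2; return 1 ;;
  esac
}

db_log_path mysql   # prints: /data/mysql/log/mysqld-error.log
```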

### Pod is Not Running

**1. Use `describe` and `logs` to find out why the Pod cannot start:**

- **describe**: Events recorded by Kubernetes itself. Every resource in K8s has an Events list that records all operations performed by controllers and kubelet on the resource, such as scheduling failures, image pull failures, disk mount failures, etc. Use this to identify the root cause.
- **logs**: The stdout/stderr output of the container. This includes the container's own errors as well as some database errors. Use this for detailed log information.

```bash
# View the stdout from the previous container exit. Useful when the container is in a restart loop.
kubectl logs <pod> -n <ns> --previous
```
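
A high `restartCount` confirms a restart loop. In a cluster you would read it with `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].restartCount}'`; the sketch below parses the same field from a trimmed, made-up status document:

```bash
# Made-up, trimmed pod status standing in for the API server's response.
pod_status='{"containerStatuses":[{"name":"mysql","restartCount":7}]}'

# Extract restartCount with sed (no jq dependency).
restarts=$(echo "$pod_status" | sed -n 's/.*"restartCount":\([0-9]*\).*/\1/p')
echo "restarts: $restarts"   # prints: restarts: 7
```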

**2. Check the database's own logs**

When a Pod is not Running, the container logs under `/var/log/containers/` on the node are subject to log rotation and cleanup. However, the log files written by the database to the PV are persistent.

**A. Find the PVC and Node corresponding to the Pod:**

```bash
# Check which node the Pod is on
kubectl get pod -n <ns> -owide

# Check PVC
kubectl get pvc -n <ns> -owide
```

**B. After obtaining the PVC and Node information:**

I. SSH into the node:

```bash
ssh <node-name>
```

II. Use the PVC name to locate the mount directory and check its allocated size, usage, and remaining space:

```bash
# Filter the node's mounts by volume name to find the mount path
# and its size/used/available/usage ratio
df -h | grep <pvc-name>
```
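
A full volume is a common reason a database Pod cannot start, so the `Use%` column deserves special attention. A sketch that flags a nearly-full mount (the `df` line is made up for illustration):

```bash
# Made-up sample line standing in for real `df -h | grep <pvc-name>` output.
df_line="/dev/vdb  20G  19G  1.0G  95% /var/lib/kubelet/pods/abc123/volumes/pvc-1234"

# Column 5 of df output is Use%; strip the trailing '%' to compare it.
use_pct=$(echo "$df_line" | awk '{print $5}' | tr -d '%')
if [ "$use_pct" -ge 90 ]; then
  echo "volume almost full (${use_pct}%)"
fi
```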

III. Then view the logs based on the mount path.

## 3. Check KB Controller

When the Cluster status is abnormal but Pod and database logs show no obvious errors, check the KB Controller logs.

Get the Pods in the `kb-system` namespace and then check the controller Pod's logs:

- If the controller Pod itself is **Running** and not crashing, then `describe` will not yield useful information. Use `logs` to check the controller's internal business logic.

```bash
kubectl get pods -n kb-system
kubectl logs <controller-pod> -n kb-system
```

- If the controller Pod itself is in an abnormal state, such as persistent **CrashLoopBackOff**: first use `describe` to identify the type of issue, then use `logs` to view detailed logs.
171 changes: 171 additions & 0 deletions content/docs/guides/databases/database-troubleshooting-guide.zh-cn.mdx
@@ -0,0 +1,171 @@
---
title: 数据库排障指南
description: 在 Sealos 上排查 KubeBlocks 数据库问题的分步指南
---

本指南将带你系统性地诊断和解决 Sealos 上由 KubeBlocks 管理的数据库问题。

## 1. 查看 Cluster 状态

KubeBlocks 核心设计:Cluster → Component → InstanceSet → Pod。Cluster 是最顶层的 CRD,它的 `status.phase` 是所有下层状态的聚合。

### Cluster 状态为 Running

说明 KubeBlocks 认为一切正常,问题可能在应用层(连接串写错、权限等)。检查能否连上数据库,能否执行命令(比如主从原因导致不可写入)。

#### 连接数据库

**方法 1:使用 kbcli**

```bash
kbcli cluster connect <cluster-name> -n <ns>
```

**方法 2:使用 kubectl**

**Step 1:获取信息**

```bash
# 获取 Service 名
kubectl get svc -n <ns>

# 获取密码
kubectl get secret -n <ns> | grep <cluster-name>
kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.password}' | base64 -d
```

**Step 2:连接**

进入 Pod 之后直接连接:

```bash
# 进入 Pod
kubectl exec -it <pod> -n <ns> -- bash

# 根据不同数据库使用不同连接命令
# MySQL
mysql -u root -p

# MongoDB
mongosh -u root -p

# Redis
redis-cli -a <password>

# PostgreSQL
psql -U postgres
```

或通过 Sealos 终端连接:

![通过 Sealos 终端连接数据库](../images/database-troubleshooting.zh-cn.png)

```bash
# MySQL
mysql -h <service>.<ns>.svc -P 3306 -u root -p<password>

# MongoDB
mongosh 'mongodb://root:<password>@<service>.<ns>.svc:27017'

# Redis
redis-cli -u redis://default:<password>@<service>.<ns>.svc:6379

# PostgreSQL
psql 'postgresql://postgres:<password>@<service>.<ns>.svc:5432'
```

### Cluster 状态非 Running

需要:

- Describe Cluster(查看 Events 和 Status)

```bash
kubectl describe cluster <cluster-name> -n <ns>
```

- 到**第 2 步**查看 Pod 状态

## 2. 查看 Pod 状态

- Pod **不 Running** 意味着问题在基础设施层——调度、存储、镜像、资源配额等。
- Pod **Running** 但业务不正常,说明基础设施没问题,问题在应用层——数据库自身的配置、权限、主从复制逻辑等。

### Pod 状态为 Running

查看数据库日志:

```bash
# 进入 Pod
kubectl exec -it mysql1-mysql-0 -n <ns> -- bash

# 查看数据库日志
cd /data/mysql/log
cat mysqld-error.log
```

不同数据库的日志路径:

| 数据库 | 日志路径 |
|------------|--------------------------------------------------------|
| MySQL | `/data/mysql/log/mysqld-error.log` |
| MongoDB | `/var/log/mongodb/mongodb.log` |
| PostgreSQL | `/home/postgres/pgdata/pgroot/pg_log/postgresql.log` |
| Redis | `/data/running.log` |

### Pod 状态非 Running

**1. 通过 `describe` 和 `logs` 查看 Pod 为什么起不来:**

- **describe**:K8s 自己记录的事件。K8s 里每个资源都有 Events 列表,记录了控制器和 kubelet 对这个资源做的所有操作,比如调度失败、镜像拉取失败、磁盘挂不上等。作为定性原因。
- **logs**:容器的 stdout/stderr 输出,包括容器自己的报错以及数据库的一部分报错。作为详细日志。

```bash
# 查看上一次该容器退出前的标准输出,适用于容器频繁重启的情况
kubectl logs <pod> -n <ns> --previous
```

**2. 查看数据库自己的日志**

Pod 非 Running 的情况下,节点上 `/var/log/containers/` 下的容器日志会被轮转清理。但是数据库写到 PV 里的日志文件是持久化的。

**A. 查看 Pod 对应的 PVC 和 Node:**

```bash
# 查看 Pod 在哪个节点上
kubectl get pod -n <ns> -owide

# 查看 PVC
kubectl get pvc -n <ns> -owide
```

**B. 得到 PVC 和 Node 信息之后:**

I. SSH 连接上 Node:

```bash
ssh <node-name>
```

II. 通过 PVC 可以找到挂载目录以及分配/占用/剩余/占用比:

```bash
# 按卷名过滤当前节点的挂载点,找到挂载路径及容量/占用情况
df -h | grep <pvc-name>
```

III. 再根据挂载路径查看日志。

## 3. 查看 KB Controller

当 Cluster 状态异常、但 Pod 和数据库日志均无明显错误时,查看 KB Controller 日志。

获取 `kb-system` 命名空间下的 Pod,然后查看 Controller Pod 的日志:

- 如果 Controller Pod 本身 **Running** 没有崩溃,那么 `describe` 找不到有用信息。使用 `logs` 查看 Controller 内部业务逻辑。

```bash
kubectl get pods -n kb-system
kubectl logs <controller-pod> -n kb-system
```

- 如果 Controller Pod 本身状态不正常,比如持续 **CrashLoopBackOff**:先 `describe` 看是什么类型的问题,再 `logs` 看具体日志。
2 changes: 1 addition & 1 deletion content/docs/guides/databases/meta.en.json
@@ -1,4 +1,4 @@
{
"title": "Databases",
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus"]
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus", "database-troubleshooting-guide"]
}
2 changes: 1 addition & 1 deletion content/docs/guides/databases/meta.zh-cn.json
@@ -1,4 +1,4 @@
{
"title": "数据库",
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus"]
"pages": ["postgresql", "mysql", "redis", "mongodb", "kafka", "milvus", "database-troubleshooting-guide"]
}