23 commits
2836160
feat(repair): add design doc and ignore patterns for data repair tool
hzcheng Mar 3, 2026
4433b1b
feat(repair): implement runtime context and vnode filtering
hzcheng Mar 3, 2026
b28f3ce
feat(repair): add precheck for data directory, disk space, and target…
hzcheng Mar 3, 2026
a2fa5eb
feat(repair): implement backup directory creation and session tracking
hzcheng Mar 3, 2026
055f6af
feat(repair): add progress reporting and summary generation for repai…
hzcheng Mar 3, 2026
758c098
feat(repair): add session resume capability for interrupted repairs
hzcheng Mar 3, 2026
906834f
feat(repair): add force-mode wal scheduling for taosd -r
hzcheng Mar 3, 2026
9604247
refactor(dnode): extract repair workflow from dmMain main
hzcheng Mar 3, 2026
e88de5f
feat(repair): add WAL pre-repair backup and rollback protection
hzcheng Mar 3, 2026
164a906
feat(repair): add TSDB scan, block analysis, and rebuild MVP
hzcheng Mar 3, 2026
07f77cd
feat(repair): add force+tsdb workflow and tsdb repair smoke script
hzcheng Mar 3, 2026
986fe0c
feat(repair): implement force tsdb/meta workflows and add repair smok…
hzcheng Mar 4, 2026
114f83d
feat(repair): close force+meta rebuild loop and wire replica dispatch…
hzcheng Mar 4, 2026
0c3ba94
feat(repair): implement replica vnode degrade flow for mode=replica
hzcheng Mar 4, 2026
d02d21d
feat(repair): complete replica rollback flow and add copy-mode mock t…
hzcheng Mar 4, 2026
69c6df7
feat(repair): add SSH/SCP copy workflow for copy mode
hzcheng Mar 4, 2026
b0d1a47
feat(repair): complete copy-mode hardening with owner fix and rollback
hzcheng Mar 4, 2026
668e53d
feat(ci): add repair fixture generator for wal/tsdb/meta corruption c…
hzcheng Mar 4, 2026
be524fb
feat(ci): add repair mode matrix script for force/replica/copy accept…
hzcheng Mar 4, 2026
2da91c0
docs(maintenance): document taosd -r file-level repair in zh/en guides
hzcheng Mar 4, 2026
1fc2db3
chore(plans): finalize P8 progress and add repair release checklist
hzcheng Mar 4, 2026
8cb9fef
fix(repair): make resume step-aware and harden copy shell inputs
hzcheng Mar 5, 2026
d2ed042
refactor(dnode): unify repair long-option parsing and remove magic le…
hzcheng Mar 5, 2026
63 changes: 63 additions & 0 deletions .gitignore
@@ -73,3 +73,66 @@ test/screenlog*
test/output.tmp

CMakeUserPresets.json
.agents/skills/brainstorming/SKILL.md
.agents/skills/dispatching-parallel-agents/SKILL.md
.agents/skills/executing-plans/SKILL.md
.agents/skills/finishing-a-development-branch/SKILL.md
.agents/skills/planning-with-files/examples.md
.agents/skills/planning-with-files/reference.md
.agents/skills/planning-with-files/SKILL.md
.agents/skills/planning-with-files/scripts/check-complete.ps1
.agents/skills/planning-with-files/scripts/check-complete.sh
.agents/skills/planning-with-files/scripts/init-session.ps1
.agents/skills/planning-with-files/scripts/init-session.sh
.agents/skills/planning-with-files/scripts/session-catchup.py
.agents/skills/planning-with-files/templates/findings.md
.agents/skills/planning-with-files/templates/progress.md
.agents/skills/planning-with-files/templates/task_plan.md
.agents/skills/receiving-code-review/SKILL.md
.agents/skills/requesting-code-review/code-reviewer.md
.agents/skills/requesting-code-review/SKILL.md
.agents/skills/subagent-driven-development/code-quality-reviewer-prompt.md
.agents/skills/subagent-driven-development/implementer-prompt.md
.agents/skills/subagent-driven-development/SKILL.md
.agents/skills/subagent-driven-development/spec-reviewer-prompt.md
.agents/skills/systematic-debugging/condition-based-waiting-example.ts
.agents/skills/systematic-debugging/condition-based-waiting.md
.agents/skills/systematic-debugging/CREATION-LOG.md
.agents/skills/systematic-debugging/defense-in-depth.md
.agents/skills/systematic-debugging/find-polluter.sh
.agents/skills/systematic-debugging/root-cause-tracing.md
.agents/skills/systematic-debugging/SKILL.md
.agents/skills/systematic-debugging/test-academic.md
.agents/skills/systematic-debugging/test-pressure-1.md
.agents/skills/systematic-debugging/test-pressure-2.md
.agents/skills/systematic-debugging/test-pressure-3.md
.agents/skills/test-driven-development/SKILL.md
.agents/skills/test-driven-development/testing-anti-patterns.md
.agents/skills/using-git-worktrees/SKILL.md
.agents/skills/using-superpowers/SKILL.md
.agents/skills/verification-before-completion/SKILL.md
.agents/skills/writing-plans/SKILL.md
.agents/skills/writing-skills/anthropic-best-practices.md
.agents/skills/writing-skills/graphviz-conventions.dot
.agents/skills/writing-skills/persuasion-principles.md
.agents/skills/writing-skills/render-graphs.js
.agents/skills/writing-skills/SKILL.md
.agents/skills/writing-skills/testing-skills-with-subagents.md
.agents/skills/writing-skills/examples/CLAUDE_MD_TESTING.md
.claude/settings.local.json
.claude/skills/brainstorming
.claude/skills/dispatching-parallel-agents
.claude/skills/executing-plans
.claude/skills/finishing-a-development-branch
.claude/skills/planning-with-files
.claude/skills/receiving-code-review
.claude/skills/requesting-code-review
.claude/skills/subagent-driven-development
.claude/skills/systematic-debugging
.claude/skills/test-driven-development
.claude/skills/using-git-worktrees
.claude/skills/using-superpowers
.claude/skills/verification-before-completion
.claude/skills/writing-plans
.claude/skills/writing-skills
skills-lock.json
21 changes: 21 additions & 0 deletions AGENTS.md
@@ -0,0 +1,21 @@
# TDengine Session Conventions

## Progress Reporting (Persistent Rule)

- Every progress report must include a visual progress bar.
- Use the task table in `task_plan.md` as the source of truth.
- Show at least:
  - overall percentage
  - completed/total tasks
  - bar visualization

## Required Progress Bar Format

- Use this format in each report:
  - `进度: <percent>% [<bar>] <done>/<total>`
- Bar width: 20 characters.
- Filled: `#`
- Empty: `-`

## Calculation Rule

- `done`: number of tasks with status `completed`.
- `total`: number of tasks with status in `{completed, in_progress, pending}`.
- `percent = done / total * 100` (keep one decimal place).
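The format and calculation rules above can be sketched as a small helper. This is only an illustration: the function name `render_progress` is hypothetical, and rounding the number of filled cells down is an assumption, since AGENTS.md does not say how partial cells are handled.

```python
COUNTED_STATUSES = {"completed", "in_progress", "pending"}


def render_progress(statuses, width=20):
    """Render one progress line from a list of task status strings."""
    # Only the three statuses named in the calculation rule count toward total.
    counted = [s for s in statuses if s in COUNTED_STATUSES]
    total = len(counted)
    done = sum(1 for s in counted if s == "completed")

    # Guard against an empty task table to avoid dividing by zero.
    percent = (done / total * 100) if total else 0.0

    # Assumption: partially completed cells are rounded down.
    filled = (done * width // total) if total else 0
    bar = "#" * filled + "-" * (width - filled)

    return f"进度: {percent:.1f}% [{bar}] {done}/{total}"
```

Statuses outside the three listed values (for example a hypothetical `blocked`) are ignored rather than counted, which follows the definition of `total` above.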
55 changes: 55 additions & 0 deletions docs/en/08-operation/04-maintenance.md
@@ -98,6 +98,61 @@ restore qnode on dnode <dnode_id>; # Restore qnode on dnode
- This feature is based on the recovery of existing replication capabilities, not disaster recovery or backup recovery. Therefore, for the mnode and vnode to be recovered, the prerequisite for using this command is that the other two replicas of the mnode or vnode can still function normally.
- This command cannot repair individual files in the data directory that are damaged or lost. For example, if individual files or data in an mnode or vnode are damaged, it is not possible to recover a specific file or block of data individually. In this case, you can choose to completely clear the data of that mnode/vnode and then perform recovery.

## File-Level Repair (`taosd -r`)

For file-level corruption under a vnode directory (`wal/tsdb/meta`), you can use `taosd -r` for offline repair. This workflow complements `restore dnode`, which is focused on node-level recovery.

### Supported Scope

- `--node-type`: currently `vnode`
- `--file-type`: `wal`, `tsdb`, `meta`
- `--mode`: `force`, `replica`, `copy`

### Common Command Examples

```bash
# 1) force: run local file repair for the target vnode (example: WAL)
taosd -r \
--node-type vnode \
--file-type wal \
--vnode-id 2 \
--mode force \
--backup-path /var/lib/taos/repair-backup

# 2) replica: degrade the local bad replica and trigger replication recovery
taosd -r \
--node-type vnode \
--file-type wal \
--vnode-id 2 \
--mode replica \
--backup-path /var/lib/taos/repair-backup

# 3) copy: recover by copying files from a specified replica node (requires ssh/scp)
taosd -r \
--node-type vnode \
--file-type wal \
--vnode-id 2 \
--mode copy \
--replica-node 192.168.1.24:/var/lib/taos \
--backup-path /var/lib/taos/repair-backup
```

Review comment (Contributor, priority: medium), on the `--replica-node` line of the copy example:

The example for copy mode uses a hardcoded IP address and path (192.168.1.24:/var/lib/taos). This can be confusing and lead to copy-paste errors. It would be clearer and safer to use placeholders.

Suggested change: replace `--replica-node 192.168.1.24:/var/lib/taos \` with `--replica-node <replica_host>:/path/to/remote/taos/data \`.

### Operations Validation and Troubleshooting

- During execution, the process prints `repair progress` and a final `repair summary`.
- Each run writes session artifacts under the backup path:
  - `repair.log`: human-readable detail log
  - `repair.state.json`: machine-readable checkpoint state (used for resume)
- Recommended checks:
  - Ensure `repair.log` includes expected step details, such as `copy replica detail` or `replica restore detail`.
  - Verify `step/status/doneVnodes/totalVnodes` in `repair.state.json`.
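The state-file check above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `check_repair_state` is hypothetical, and it assumes the four fields are top-level keys of a JSON object, which the guide does not state explicitly.

```python
import json

# Field names taken from the guide; their exact location in the file is assumed.
REQUIRED_FIELDS = ("step", "status", "doneVnodes", "totalVnodes")


def check_repair_state(path):
    """Load repair.state.json and report which expected fields are present."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)

    missing = [key for key in REQUIRED_FIELDS if key not in state]
    fields = {key: state.get(key) for key in REQUIRED_FIELDS}
    return {"missing": missing, "fields": fields}
```

A non-empty `missing` list suggests the file layout differs from this assumption, or that the run stopped before the first checkpoint was written.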

### Notes

- `copy` mode requires `--replica-node=<host>:<absolute-path>` and reachable `ssh/scp`.
- The repair flow creates a backup first and rolls back on failure. Always setting `--backup-path` is recommended.
- For full node/logical-node failures, prefer the `restore dnode` workflow above.
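To illustrate the `<host>:<absolute-path>` shape that `copy` mode expects, here is a minimal validation sketch. The function `parse_replica_node` is hypothetical and is not the parser in `taosd`; it splits on the first colon, so it assumes the host is a hostname or IPv4 address.

```python
import posixpath


def parse_replica_node(value):
    """Split '<host>:<absolute-path>' into (host, path), or raise ValueError."""
    # Assumption: the first ':' separates host and path (no IPv6 literals).
    host, sep, path = value.partition(":")
    if not sep or not host:
        raise ValueError(f"expected <host>:<absolute-path>, got {value!r}")
    if not posixpath.isabs(path):
        raise ValueError(f"replica path must be absolute, got {path!r}")
    return host, path
```

For example, `parse_replica_node("192.168.1.24:/var/lib/taos")` yields the host and the remote data directory separately, while a relative path such as `host:data` is rejected before any `ssh/scp` call would be attempted.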

## Splitting Virtual Groups

When a vgroup is overloaded with CPU or Disk resource usage due to too many subtables, after adding a dnode, you can split the vgroup into two virtual groups using the `split vgroup` command. After the split, the newly created two vgroups will undertake the read and write services originally provided by one vgroup. This command was first released in version 3.0.6.0, and it is recommended to use the latest version whenever possible.
144 changes: 144 additions & 0 deletions docs/plans/2026-03-03-data-repair-tool-design.md
@@ -0,0 +1,144 @@
# TDengine Data Repair Tool Design Document (`taosd -r` Extension)

## 1. Background and Goals

- Source of requirements: `/Projects/work/TDengine/.vscode/dev/数据修复工具 - RS.md`
- Goal: extend `taosd -r` into a controllable, traceable, and resumable data repair tool without adding a separate program.
- Initial scope: `--node-type=vnode`, `--file-type=wal|tsdb|meta`, with orchestration of the three modes `force/replica/copy`.
- Terminology: in this document, `META` means "time-series data metadata" (called `TDB` in earlier documents).

## 2. Options Compared (2-3 Options)

### Option A: Incrementally extend `taosd -r` (recommended)

- Approach:
  - Add parsing and validation of the new parameters in `dmMain.c`;
  - Add a repair session orchestration layer;
  - Reuse the existing `walCheckAndRepair*` for WAL repair;
  - Fill in the TSDB/META repair handlers step by step.
- Pros:
  - Short change path that reuses the existing startup chain and vnode lifecycle;
  - Suits phased delivery (WAL first, then TSDB/META);
  - A single operations entry point (as the requirements ask).
- Cons:
  - The startup flow and the repair flow are coupled, so the regression risk on the normal startup path must be handled carefully.

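### Option B: Add a standalone repair program

- Approach: add a `taosrepair`-style tool that bypasses the `taosd` startup path.
- Pros:
  - Clear module boundaries that make isolated testing easy.
- Cons:
  - Conflicts with the requirement to "extend `taosd -r`";
  - Requires re-integrating a large number of existing internal modules and the configuration parsing logic.

### Option C: Drive repair through SQL management commands (similar to restore dnode)

- Approach: go through mnode transactions and remotely drive the dnode to perform file repair.
- Pros:
  - A unified cluster control plane.
- Cons:
  - File-level repair semantics do not suit purely remote transactions;
  - The Community and Enterprise editions diverge significantly, so delivery would take a long time.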

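## 3. Recommended Option

- Adopt Option A.
- Rationale: it is closest to the requirements, reuses the most existing code, and can be delivered incrementally in one-hour task slices.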

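## 4. Target Architecture

### 4.1 Overall Modules

- `CLI Parser`: parses `--node-type/--file-type/--vnode-id/--backup-path/--mode/--replica-node`.
- `Validator`: validates parameter legality and combination rules.
- `Repair Session`: session context, task decomposition, concurrency control, state persistence.
- `Preflight`: disk space check, file existence check, permission check.
- `Backup Manager`: backs up the original files before repair.
- `Mode Handler`:
  - `force` -> `wal/tsdb/meta` sub-handlers;
  - `replica` -> replica recovery trigger flow;
  - `copy` -> remote file copy flow.
- `Reporter`: progress output, repair.log, summary output.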
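The `Mode Handler` routing described above can be pictured as a small dispatch table. This is a sketch only: every function name here is hypothetical, the handlers just return a label instead of doing work, and ignoring `file_type` for `replica` and `copy` is a simplification rather than the tool's confirmed behavior.

```python
def repair_wal(vnode_id):
    return f"force/wal vnode={vnode_id}"


def repair_tsdb(vnode_id):
    return f"force/tsdb vnode={vnode_id}"


def repair_meta(vnode_id):
    return f"force/meta vnode={vnode_id}"


def trigger_replica_restore(vnode_id):
    return f"replica vnode={vnode_id}"


def copy_from_replica(vnode_id):
    return f"copy vnode={vnode_id}"


# force mode fans out by file type; the other modes have a single flow here.
FORCE_HANDLERS = {"wal": repair_wal, "tsdb": repair_tsdb, "meta": repair_meta}
MODE_HANDLERS = {"replica": trigger_replica_restore, "copy": copy_from_replica}


def dispatch(mode, file_type, vnode_id):
    """Route one repair task to its handler, rejecting unknown combinations."""
    if mode == "force":
        if file_type not in FORCE_HANDLERS:
            raise ValueError(f"unsupported file type: {file_type!r}")
        return FORCE_HANDLERS[file_type](vnode_id)
    if mode in MODE_HANDLERS:
        return MODE_HANDLERS[mode](vnode_id)
    raise ValueError(f"unsupported mode: {mode!r}")
```

Keeping the routing in tables makes the unit test for "mode dispatch routing" listed in the test strategy a matter of checking each key once.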

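### 4.2 Data Flow (Simplified)

1. `taosd -r ...` starts.
2. Parse the arguments -> validate them.
3. Build the `repair session` and locate the list of target vnodes.
4. Run preflight, then create the backup directory and the state file.
5. Dispatch handlers by `mode + file-type`.
6. Continuously write `repair.log` and `repair.state.json`.
7. Output the summary: succeeded/failed vnodes, recovered entries, corrupted entries, elapsed time.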

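## 5. Per-Mode Design

### 5.1 force mode

- `wal`:
  - Prefer reusing `walCheckAndRepairMeta/Idx`;
  - Add "pre-repair backup + structured logging";
  - Archive the results of the replayability check.
- `tsdb`:
  - Enumerate `data/head/sma/stt`;
  - Verify block-level integrity;
  - Keep recoverable blocks and discard unrecoverable ones;
  - Rebuild a minimal usable structure.
- `meta`:
  - Parse the readable metadata;
  - Derive missing metadata from WAL and TSDB together;
  - Flag items that cannot be derived and raise a warning.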

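### 5.2 replica mode

- Goal: trigger a full sync of the currently corrupted vnode from a healthy replica.
- Design direction:
  - Put the local corrupted replica into a state where it can be neither read nor written;
  - Trigger the sync through a version/term strategy;
  - Reuse the existing restore/vgroup transaction actions (the Community Edition path still needs to be evaluated).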

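### 5.3 copy mode

- Goal: when the data volume is large, recover quickly by "offline copying of replica files".
- Core steps:
  - Parse `--replica-node`;
  - Establish the remote connection;
  - Copy all files of the target vnode directory;
  - Sync permissions and owner;
  - Run a consistency check after completion.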

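## 6. Safety and Consistency Design

- A backup must be completed before any write operation (if `--backup-path=none` is used as an exception, a warning must be issued).
- If preflight fails, stop the repair and do not enter any destructive step.
- Write state checkpoints at key steps so that the repair can resume after an abnormal exit.
- Default to "conservative first, aggressive later": give priority to keeping data that can be confirmed correct.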

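## 7. Session Interruption Recovery (Development and Runtime Layers)

### 7.1 Recovery during development

- Use `task_plan.md/findings.md/progress.md` in the repository root as persistent working memory.
- Update the status and logs immediately after each task is completed.
- On recovery, go straight to the `in_progress` task and continue.

### 7.2 Recovery of a runtime repair

- Each repair run generates:
  - `repair.log`: human-readable log;
  - `repair.state.json`: machine-readable state checkpoint.
- The next run of the same task can read the state file, skip the completed steps, and continue with the unfinished ones.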
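The skip-completed-steps behavior described above can be sketched as a checkpointed step runner. This is an illustration, not the `taosd` implementation: the function `run_with_resume` and the `completed` key are hypothetical, and the real `repair.state.json` schema may differ.

```python
import json
import os


def run_with_resume(state_path, steps):
    """Run ordered (name, callable) steps, skipping those already checkpointed."""
    state = {"completed": []}  # hypothetical schema, for illustration only
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as f:
            state = json.load(f)

    for name, action in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run
        action()  # an exception here leaves the last checkpoint intact
        state["completed"].append(name)

        # Write to a temp file and rename so a crash cannot leave a torn file.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)

    return state["completed"]
```

Checkpointing after every step, rather than once at the end, is what lets an interrupted run pick up at the first unfinished step instead of repeating the backup or an already completed repair.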

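## 8. Test Strategy

- Unit tests:
  - Argument parsing and validation;
  - Backup path generation and state file read/write;
  - Mode dispatch routing.
- Component tests:
  - WAL repair samples (corrupted idx, truncated log).
  - TSDB block corruption samples.
  - META missing-metadata samples.
- System tests:
  - Single-replica force scenario.
  - Three-replica replica/copy scenarios.
  - Fault injection: insufficient disk space, missing files, unreachable replica.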

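## 9. Risks and Mitigations

- Risk: TSDB/META repair is highly complex, and the first version is unlikely to achieve "full recovery" in one go.
  - Mitigation: deliver the WAL MVP first, then extend the recovery depth in phases.
- Risk: recovery capabilities diverge between the Community and Enterprise editions.
  - Mitigation: add capability detection and clear error messages on the `replica/copy` paths.
- Risk: the repair logic affects the normal startup path.
  - Mitigation: the repair logic is enabled only when `-r` is given explicitly, so the default path is unaffected.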

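## 10. Design Points to Confirm

- Whether to proceed in the priority order `force+wal -> force+tsdb -> force+meta -> replica -> copy`.
- Whether the first version's `--node-type` should support only `vnode`, with other values returning `not supported` for now.
- Whether "session resume" should be part of the first batch of infrastructure (rather than added later).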