Conversation
Add comprehensive design documentation for extending `taosd -r` into a controlled, traceable, and recoverable data repair tool. The design outlines three implementation approaches with detailed technical specifications, API designs, and test plans.

Key additions:
- Design document (`docs/plans/2026-03-03-data-repair-tool-design.md`) covering:
  - Background and objectives for data repair functionality
  - Three implementation approaches with pros/cons analysis
  - Detailed technical design including CLI interface, session management, and repair handlers
  - API specifications, error handling, and recovery mechanisms
  - Testing strategy and phased delivery plan
- Gitignore patterns for agent skills and development artifacts to exclude from version control:
  - Agent skill documentation and templates
  - Claude configuration and skill directories
  - Development scripts and testing materials

The design focuses on incremental extension of existing `taosd -r` functionality to support vnode repair with WAL, TSDB, and META file types using force/replica/copy operation modes.
- Add `SRepairCtx` structure to store repair session context including session ID, start time, and runtime parameters
- Implement `tRepairInitCtx()` to initialize repair context after CLI argument validation in `dmMain.c`
- Add vnode filtering capability by parsing `--vnode-id` into integer array with deduplication and validation
- Introduce `tRepairShouldRepairVnode()` as entry point for vnode traversal filtering
- Reject invalid `--vnode-id` values (e.g., "2,a") during context initialization
- Enforce code constraints by forbidding `strtol/strtoll` usage and requiring `taosStr2Int32/taosStr2Int64` wrappers
- Fix `strtok_r` mutation issue by using temporary buffer for vnode ID parsing
… paths
- Implement `tRepairPrecheck()` to validate repair prerequisites before initialization
- Check data directory existence, backup path (if configured), and disk available space threshold
- Verify vnode subdirectories (wal/tsdb/meta) exist for vnode node type
- Integrate precheck into `dmMain.c` after config loading and before `dmInit()`
- Use `tsDataSpace.reserved` as minimum disk space threshold; skip check if zero
- Fail fast with descriptive error message when precheck fails
- Update documentation and progress tracking for T2.3 completion
- Add `tRepairPrepareBackupDir()` to pre-create backup directories per vnode with path rules for explicit `--backup-path` or default `<dataDir>/backup/`
- Add session file management functions: `tRepairPrepareSessionFiles()`, `tRepairAppendSessionLog()`, `tRepairWriteSessionState()` for repair.log and repair.state.json
- Implement atomic state file writing using *.tmp + rename to reduce corruption risk
- Integrate backup directory creation and session file initialization into `dmMain.c` with fail-fast behavior on errors
- Update progress documentation to reflect completion of T2.4 and T2.5 tasks
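The *.tmp + rename pattern behind `tRepairWriteSessionState()` can be sketched as below. `rename()` is atomic within a POSIX filesystem, so a reader of `repair.state.json` sees either the old state or the complete new one, never a torn write. This is an illustrative sketch only; the real implementation presumably adds `fsync` and the project's error-code conventions:

```c
#include <stdio.h>
#include <string.h>

// Write `data` to `path` atomically: write to "<path>.tmp", flush, then
// rename() over the target. On any failure the temporary file is removed
// and the existing state file is left untouched.
static int writeStateFileAtomic(const char *path, const char *data) {
  char tmp[512];
  if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp)) return -1;

  FILE *fp = fopen(tmp, "w");
  if (fp == NULL) return -1;
  if (fputs(data, fp) < 0 || fflush(fp) != 0) {
    fclose(fp);
    remove(tmp);  // discard the partial temporary file
    return -1;
  }
  fclose(fp);
  return rename(tmp, path) == 0 ? 0 : -1;  // atomic replacement
}
```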
…r sessions
- Add `tRepairNeedReportProgress()` to throttle progress reports based on time intervals
- Add `tRepairBuildProgressLine()` to construct progress output with session/step/vnode completion metrics
- Add `tRepairBuildSummaryLine()` to generate final summary with status and statistics
- Integrate progress reporting in `dmMain.c` for precheck, backup, and preflight stages
- Update `repair.state.json` to track step transitions from precheck → backup → preflight(ready)
- Add `AGENTS.md` documentation for session conventions and progress bar formatting rules
- Update progress tracking in `progress.md` and `findings.md` to reflect T2.6 completion
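The interval-based throttling behind `tRepairNeedReportProgress()` might look like this minimal sketch (the name and signature are assumptions; the real function likely also forces a report on step completion):

```c
#include <stdint.h>

// Time-based progress throttle: report only when at least `intervalMs`
// has elapsed since the last report. `lastReportMs` is caller-owned state,
// initialized to 0 so the very first call always reports.
static int needReportProgress(int64_t nowMs, int64_t *lastReportMs,
                              int64_t intervalMs) {
  if (*lastReportMs == 0 || nowMs - *lastReportMs >= intervalMs) {
    *lastReportMs = nowMs;  // record the time of this report
    return 1;
  }
  return 0;
}
```

The caller then only pays the cost of building a progress line (session/step/vnode metrics) when the predicate returns 1.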
- Implement `tRepairTryResumeSession()` to scan backup directories for incomplete repair sessions
- Resume sessions from latest `repair.state.json` with status `initialized` or `running`
- Skip already processed vnodes using `doneVnodes` counter from saved state
- Update `dmMain.c` to attempt session recovery after precheck phase
- Document JSON parsing nuance: use `tjsonGetObjectItem()` to check field existence instead of relying on `tjsonGetStringValue2()` return code
- Update progress tracking to reflect completion of T2.7 task
- add tRepairNeedRunWalForceRepair and tRepairBuildVnodeTargetPath
- wire force+wal scheduling in dmMain via walInit and walOpen/walClose per vnode
- persist wal step progress/state/log with existing repair session flow
- add common tests for wal-force predicate and vnode target path builder
- update task/progress/findings docs for T3.1 completion

- move repair precheck/resume/backup/progress logic into dmRunRepairWorkflow()
- isolate force+wal scheduling in dmRunForceWalRepair()
- keep main focused on startup/shutdown orchestration and unified cleanup
- preserve existing repair error handling and return semantics

- add tRepairBackupVnodeTarget and tRepairRollbackVnodeTarget APIs
- backup vnode WAL target before walOpen in force-repair workflow
- rollback WAL target on walOpen failure to avoid partial damage
- add commonTests coverage for backup/rollback success and invalid args
- update task_plan/progress/findings for T3.2 completion

- add TSDB recursive scan API and integrate tsdb precheck validation
- add TSDB block analysis report with recoverable/corrupted counters
- add TSDB rebuild API that keeps recoverable blocks (head+data only)
- add unit tests for scan/analyze/rebuild and invalid-arg scenarios

- add tRepairNeedRunTsdbForceRepair and unit test coverage
- wire dmRunForceTsdbRepair into dmRunRepairWorkflow
- implement tsdb analyze/rebuild/switch flow with rollback on failure
- add tests/ci/repair_tsdb_force.sh for mixed recoverable/corrupted tsdb scenario
- sync task_plan.md, progress.md, and findings.md for T4.4/T4.5 completion

…e coverage
- add TSDB force repair scheduling in dmMain with analyze/rebuild/switch flow
- add rollback guard and session log/state updates for TSDB repair failures
- add META scan model and APIs (required/optional files) in trepair
- add force+meta scheduling predicate and workflow integration in dmRunRepairWorkflow
- extend precheck to validate META directory integrity before execution
- add common tests for TSDB/META scheduling and META scan behavior
- add tests/ci/repair_tsdb_force.sh to validate mixed recoverable/corrupted TSDB scenario
- update task_plan/progress/findings for T4.4/T4.5 completion and T5.1 completion

… stub
- add tRepairRebuildMetaFiles to rebuild meta output and backfill required files
- integrate meta rebuild -> rename activation with rollback and rebuild detail logging
- add tRepairNeedRunReplicaRepair and dmRunReplicaRepair for explicit mode=replica dispatch
- add CI smoke script tests/ci/repair_meta_force.sh for partial/complete meta corruption cases
- extend common tests for RebuildMetaFiles* and NeedRunReplicaRepair*
- sync task/progress/findings docs for T5.4, T5.5, and T6.1 completion
- add tRepairDegradeReplicaVnode() to atomically write vnode degrade marker
(replica.degrade.marker.json) with availability/sync/version/term policies
- upgrade dmRunReplicaRepair() from dispatch stub to per-vnode degrade execution
- append replica degrade detail logs and keep step=replica progress/state updates
- add DegradeReplicaVnode test coverage in commonTests
- update task_plan/progress/findings: mark T6.2 completed and move T6.3 in_progress
Tested:
- cmake --build debug -j8 --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter='RepairOptionParseTest.NeedRunReplicaRepair*:RepairOptionParseTest.DegradeReplicaVnode*'
- ASAN_OPTIONS=detect_leaks=0 ctest --test-dir debug -R commonTest --output-on-failure
- cmake --build debug -j8 --target taosd
- replica smoke with TAOS_DATA_DIR (exit code 47, marker and logs verified)
…ransfer foundation
- add replica restore hint writer and degraded-marker rollback helper
- wire replica failure branches to unified artifact rollback and restore-detail logging
- enforce copy-mode --replica-node format (<host>:<absolute-path>)
- add replica endpoint parser and local mock vnode copy API for follow-up SCP integration
- extend commonTests for replica hint/rollback and copy endpoint/mock transfer paths
- update task_plan/progress/findings for T6.3~T7.2 completion

- add copy-mode scheduling and SSH/SCP command builders in trepair
- implement SSH/SCP vnode target copy execution with exit-code parsing
- integrate dmRunCopyRepair into dmRunRepairWorkflow with state/log/progress updates
- add TDD coverage for copy scheduling and command construction
- update task_plan/progress/findings for T7.3 completion

- restore local uid/gid/mode after ssh/scp copy
- add remote/local digest verification for post-copy consistency
- add per-vnode backup and rollback on copy failure
- extend copy tests for permission fix and consistency mismatch detection
- update task/progress/findings docs for T7.4 and T7.5 completion
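The remote/local digest comparison mentioned above could be sketched as comparing only the digest token of `sha256sum`-style output from each side, since the local and remote paths necessarily differ. This is a hypothetical helper; the commit message does not show which digest command or parsing the real code uses:

```c
#include <stdio.h>
#include <string.h>

// Compare two "sha256sum"-style output lines ("<hex-digest>  <path>") by
// their first token (the digest) only, ignoring the differing paths.
// Returns 1 when the digests match, 0 otherwise or on malformed input.
static int digestsMatch(const char *localOut, const char *remoteOut) {
  char a[128] = {0};
  char b[128] = {0};
  if (localOut == NULL || remoteOut == NULL) return 0;
  if (sscanf(localOut, "%127s", a) != 1) return 0;   // extract local digest
  if (sscanf(remoteOut, "%127s", b) != 1) return 0;  // extract remote digest
  return strcmp(a, b) == 0;
}
```

A mismatch would then trigger the per-vnode rollback path described in the same commit.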
Summary of Changes

This pull request significantly enhances TDengine's data robustness by introducing a new file-level data recovery tool accessible via `taosd -r`.

Highlights
Code Review
This pull request introduces a significant new feature: a data repair tool integrated into taosd -r. The changes are extensive, covering design documentation, implementation plans, the core C logic, command-line integration, and both unit and system-level tests. The overall structure is well-thought-out, with a clear separation of concerns between parsing, validation, and the repair workflow itself.
My review has identified a critical security vulnerability related to command injection in the copy mode, which must be addressed. I've also included a couple of medium-severity suggestions regarding a hardcoded limit and improving documentation examples for better usability.
Given the complexity and the critical nature of a data repair tool, thorough testing beyond the added scripts is highly recommended, especially focusing on edge cases and failure/rollback scenarios.
Note: Security Review did not run due to the size of the PR.
```c
int32_t len = tsnprintf(cmd, cmdSize, "%s -o BatchMode=yes -o StrictHostKeyChecking=no %s \"test -d '%s'\"",
                        sshBin, replicaHost, remoteTargetPath);
```
Potential command injection vulnerability. The `replicaHost` variable, which comes from user input, is directly embedded into a shell command string. While the validation in `tRepairValidateReplicaNodeEndpoint` checks for spaces, it's not sufficient to prevent all forms of command injection. A malicious value for `--replica-node` (e.g., `somehost;reboot`) could lead to arbitrary command execution on the server. This vulnerability exists in `tRepairBuildCopySshProbeCmd`, `tRepairBuildCopyScpCmd`, `tRepairBuildCopySshStatCmd`, and `tRepairBuildCopySshDigestCmd`. You should implement stricter validation for the hostname part of the `--replica-node` parameter, allowing only characters that are valid in a hostname (e.g., alphanumeric, `-`, and `.`).
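An allow-list check along the lines the review suggests might look like this (an illustrative sketch, not the in-tree validator): only alphanumerics, `-`, and `.` pass, which rules out shell metacharacters such as `;`, `$`, and backticks before the value ever reaches an ssh/scp command line.

```c
#include <ctype.h>
#include <string.h>

// Allow-list validation for the host part of --replica-node.
// Returns 1 for a plausible hostname/IPv4, 0 otherwise.
static int isValidReplicaHost(const char *host) {
  size_t len = host ? strlen(host) : 0;
  if (len == 0 || len > 253) return 0;            // RFC 1035 length cap
  if (host[0] == '-' || host[0] == '.') return 0; // no leading separator
  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char)host[i];
    if (!isalnum(c) && c != '-' && c != '.') return 0;  // reject metacharacters
  }
  return 1;
}
```

Note this deliberately rejects IPv6 literals and `user@host` forms; if those must be supported, the safer route is to avoid the shell entirely (e.g., `execvp` with an argument vector) rather than widen the character set.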
```shell
--file-type wal \
--vnode-id 2 \
--mode copy \
--replica-node 192.168.1.24:/var/lib/taos \
```
The example for copy mode uses a hardcoded IP address and path (192.168.1.24:/var/lib/taos). This can be confusing and lead to copy-paste errors. It would be clearer and safer to use placeholders.
```diff
- --replica-node 192.168.1.24:/var/lib/taos \
+ --replica-node <replica_host>:/path/to/remote/taos/data \
```
```c
} SRepairCliArgs;

#define REPAIR_SESSION_ID_LEN 64
#define REPAIR_MAX_VNODE_IDS  128
```
The number of vnodes that can be repaired in a single command is limited by the `REPAIR_MAX_VNODE_IDS` constant, which is set to 128. For very large clusters, users might need to repair more vnodes at once. Using a fixed-size array for `vnodeIds` in `SRepairCtx` could be a limitation. Consider using a dynamic array (like `SArray`, which is used elsewhere in the codebase) to make this more flexible and avoid an arbitrary limit.
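In-tree, the codebase's `SArray` would be the natural choice; as a standalone illustration of the suggestion, a minimal growable id list could look like this:

```c
#include <stdint.h>
#include <stdlib.h>

// Growable vnode-id list replacing a fixed REPAIR_MAX_VNODE_IDS array.
typedef struct {
  int32_t *ids;
  int32_t  num;
  int32_t  cap;
} SVnodeIdList;

// Append one id, doubling capacity as needed (amortized O(1)).
// Returns 0 on success, -1 on allocation failure.
static int vnodeIdListPush(SVnodeIdList *list, int32_t id) {
  if (list->num == list->cap) {
    int32_t  cap = list->cap == 0 ? 16 : list->cap * 2;
    int32_t *p = realloc(list->ids, (size_t)cap * sizeof(int32_t));
    if (p == NULL) return -1;
    list->ids = p;
    list->cap = cap;
  }
  list->ids[list->num++] = id;
  return 0;
}
```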
Pull request overview
This PR extends taosd -r into a resumable “data repair” workflow with new repair CLI options, session/state artifacts, and CI coverage scripts for multiple repair modes (force/replica/copy), plus supporting WAL repair stats and related documentation/planning notes.
Changes:
- Add repair CLI options and a repair workflow orchestration in `taosd` (`--node-type`/`--file-type`/`--vnode-id`/`--backup-path`/`--mode`/`--replica-node`) with session log/state output.
- Add WAL repair statistics export (`walGetRepairStats`) and new WAL unit tests validating stats tracking.
- Add CI scripts to generate fixtures and validate repair flows (force TSDB/META + replica + copy) and update ops docs/plans.
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 3 comments.
Summary per file
| File | Description |
|---|---|
| `source/dnode/mgmt/exe/dmMain.c` | Adds repair CLI parsing/help and implements the repair workflow orchestration (precheck/backup + mode handlers + session logging/state). |
| `include/common/trepair.h` | Introduces the common repair API/types used by the CLI/workflow. |
| `include/libs/wal/wal.h` | Adds WAL repair stats struct + API and stores stats on the WAL handle. |
| `source/libs/wal/src/walMgmt.c` | Implements `walGetRepairStats()` and zeros stats on open. |
| `source/libs/wal/src/walMeta.c` | Updates WAL repair routines to accumulate repair statistics. |
| `source/libs/wal/test/walMetaTest.cpp` | Adds unit tests for WAL repair stats behavior. |
| `tests/ci/repair_tsdb_force.sh` | Adds a TSDB force-repair CI verification script. |
| `tests/ci/repair_meta_force.sh` | Adds a META force-repair CI verification script. |
| `tests/ci/repair_mode_matrix.sh` | Adds a matrix runner validating force/meta/replica/copy modes. |
| `tests/ci/repair_fixture_generator.sh` | Adds a fixture generator for reproducible corruption scenarios. |
| `docs/zh/08-operation/05-maintenance.md` | Documents file-level repair usage under `taosd -r` (Chinese). |
| `docs/en/08-operation/04-maintenance.md` | Documents file-level repair usage under `taosd -r` (English). |
| `task_plan.md`, `progress.md`, `findings.md` | Project planning/status documents for the repair tool effort. |
| `docs/plans/2026-03-03-data-repair-tool-*.md` | Design/implementation plan docs. |
| `docs/plans/2026-03-04-data-repair-release-checklist.md` | Release checklist and verification commands/results. |
| `AGENTS.md` | Adds session/progress reporting conventions document. |
| `.gitignore` | Ignores local agent/Claude tooling directories and related files. |
```shell
if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
  echo "required force-mode scripts are missing or not executable"
```
This script requires the referenced force-mode scripts to be executable (`-x`), but it later invokes them explicitly via `bash …`, so the execute bit is not actually needed at runtime. This extra requirement can cause CI failures if the repo file mode isn't preserved (e.g., when checking out on some systems). Consider checking for existence/readability (`-f`/`-r`) instead, or invoke them directly without `bash` if executability is truly required.
```diff
- if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
-   echo "required force-mode scripts are missing or not executable"
+ if [[ ! -r "$TSDB_FORCE_SCRIPT" || ! -r "$META_FORCE_SCRIPT" ]]; then
+   echo "required force-mode scripts are missing or not readable"
```
```c
if (code != TSDB_CODE_SUCCESS) {
  taosCleanupCfg();
  taosCloseLog();
  taosCleanupArgs();
  taosConvDestroy();
```
When repair mode is enabled (`global.repairCtx.enabled`), the process runs `dmRunRepairWorkflow()` but then continues into the normal daemon startup path (`dmInit`/`dmRun`). This makes `taosd -r --node-type/--file-type/...` behave non-offline, and it can hang or cause side effects in CI and ops usage if startup succeeds. Consider exiting cleanly after a successful repair workflow (with proper cleanup) when `repairCtx.enabled` is true, instead of proceeding to start services.
```diff
- if (code != TSDB_CODE_SUCCESS) {
-   taosCleanupCfg();
-   taosCloseLog();
-   taosCleanupArgs();
-   taosConvDestroy();
+ taosCleanupCfg();
+ taosCloseLog();
+ taosCleanupArgs();
+ taosConvDestroy();
+ if (code == TSDB_CODE_SUCCESS) {
+   return 0;
+ } else {
```
```c
#define DM_REPAIR              "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
#define DM_REPAIR_NODE_TYPE    "Repair target node type. Options: vnode, mnode, dnode, snode."
#define DM_REPAIR_FILE_TYPE    "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
#define DM_REPAIR_VNODE_ID     "Target vnode id list, separated by comma (required when --node-type=vnode)."
#define DM_REPAIR_BACKUP_PATH  "Backup path for corrupted files before repair."
#define DM_REPAIR_MODE         "Repair mode. Options: force, replica, copy."
#define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
```
The help text advertises multiple node/file repair targets (mnode/dnode/snode and file types like data/config/checkpoint), but the repair workflow in `dmMain` currently only performs vnode-level handlers (wal/tsdb/meta). With the current validator allowing these combinations, a non-vnode invocation can become a no-op that still reports a successful summary. Recommend either restricting `--node-type` to vnode for now (and updating help strings accordingly) or implementing the additional node/file-type handlers before advertising them.
```diff
- #define DM_REPAIR "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
- #define DM_REPAIR_NODE_TYPE "Repair target node type. Options: vnode, mnode, dnode, snode."
- #define DM_REPAIR_FILE_TYPE "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
- #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
- #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
- #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
- #define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
+ #define DM_REPAIR "Enable repair mode. Currently supports vnode-level repair via --node-type, --file-type, --mode and other repair options."
+ #define DM_REPAIR_NODE_TYPE "Repair target node type. Currently supported option: vnode."
+ #define DM_REPAIR_FILE_TYPE "Repair target file type for vnode. Supported: wal | meta | tsdb."
+ #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
+ #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
+ #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
+ #define DM_REPAIR_REPLICA_NODE "Replica vnode node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
```
- extend tRepairTryResumeSession to return resume step and validate step buffer args
- add tRepairResolveResumePlan and reuse it in dmRunRepairWorkflow
- support step-aware vnode start index for replica/copy/wal/tsdb/meta phases
- skip backup preparation when resuming from mode steps
- ensure successful workflow summary/state uses doneVnodes=totalVnodes
- tighten replica endpoint and copy command token validation to block shell metacharacter injection
- add/extend RepairOptionParseTest coverage for resume step, resume plan, and copy endpoint/cmd invalid inputs

Verification:
- cmake --build debug --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
- cmake --build debug --target taosd
…ngths
- centralize repair option names in dmMain.c
- add dmMatchLongOption/dmParseRepairCliLongOption to deduplicate CLI parsing
- add tRepairExtractLongOptionValue for "--opt value" and "--opt=value" forms
- replace duplicated parsing branches for node-type/file-type/vnode-id/backup-path/mode/replica-node
- remove hardcoded length literals (12/11/14/7/15) to reduce maintenance risk
- add unit tests for long-option extraction and invalid-arg handling

Test:
- cmake --build debug -j8 --target taosd
- cmake --build debug -j8 --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
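The `tRepairExtractLongOptionValue` idea — accepting both `--opt value` and `--opt=value` without hardcoded length literals — might be sketched as follows (the name and getopt-style `argv`/index convention are assumptions based on the commit message):

```c
#include <string.h>

// Extract the value of a long option in either "--opt value" or
// "--opt=value" form. `*idx` points at the current argv entry and is
// advanced when the value is consumed from the next entry.
// Returns NULL when the option does not match or the value is missing.
static const char *extractLongOptionValue(const char *name, int argc,
                                          char **argv, int *idx) {
  const char *arg = argv[*idx];
  size_t      nameLen = strlen(name);  // no hardcoded length literal
  if (strncmp(arg, name, nameLen) != 0) return NULL;
  if (arg[nameLen] == '=') return arg + nameLen + 1;  // --opt=value
  if (arg[nameLen] == '\0') {                         // --opt value
    if (*idx + 1 >= argc) return NULL;                // value missing
    *idx += 1;
    return argv[*idx];
  }
  return NULL;  // e.g. "--modex" must not match "--mode"
}
```

A single dispatch loop over an option-name table can then replace the duplicated per-option parsing branches.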
Pull request overview
Copilot reviewed 19 out of 22 changed files in this pull request and generated 2 comments.
```c
if (global.repairCtx.enabled) {
  code = dmRunRepairWorkflow();
  if (code != TSDB_CODE_SUCCESS) {
    taosCleanupCfg();
    taosCloseLog();
    taosCleanupArgs();
    taosConvDestroy();
    return code;
  }
}

osSetProcPath(argc, (char **)argv);
taosCleanupArgs();
```
dmRunRepairWorkflow() returns success and execution continues into the normal dnode startup path. For a repair invocation (taosd -r --node-type ...), this can cause the process to turn into a long-running daemon (or hang under CI unless it fails later), and it can also trigger unrelated side effects (e.g., vnode open) after a potentially destructive repair. Consider exiting the process after the repair summary is written (with a success exit code), or gating the post-repair startup behind an explicit flag.
```c
int32_t           totalVnodes = global.repairCtx.nodeType == REPAIR_NODE_TYPE_VNODE ? global.repairCtx.vnodeIdNum : 0;
int32_t           doneVnodes = 0;
SRepairResumePlan resumePlan = {0};
```
Non-vnode repair targets are accepted by CLI validation, but the workflow currently sets totalVnodes to 0 when nodeType != vnode and then proceeds to report success without performing any repair work. If mnode/dnode/snode repair is not implemented yet, this should fail fast with a clear "not supported" error; otherwise totalVnodes should be computed appropriately and the corresponding handlers added.
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.