feat: data recovery #34668

Open
hzcheng wants to merge 23 commits into 3.0 from feat/hzcheng/data_recovery


hzcheng commented Mar 4, 2026

Description

Issue(s)

  • Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

hzcheng added 21 commits March 3, 2026 18:50
Add comprehensive design documentation for extending `taosd -r` into a controlled, traceable, and recoverable data repair tool. The design outlines three implementation approaches with detailed technical specifications, API designs, and test plans.

Key additions:
- Design document (`docs/plans/2026-03-03-data-repair-tool-design.md`) covering:
  - Background and objectives for data repair functionality
  - Three implementation approaches with pros/cons analysis
  - Detailed technical design including CLI interface, session management, and repair handlers
  - API specifications, error handling, and recovery mechanisms
  - Testing strategy and phased delivery plan
- Gitignore patterns for agent skills and development artifacts to exclude from version control:
  - Agent skill documentation and templates
  - Claude configuration and skill directories
  - Development scripts and testing materials

The design focuses on incremental extension of existing `taosd -r` functionality to support vnode repair with WAL, TSDB, and META file types using force/replica/copy operation modes.
- Add `SRepairCtx` structure to store repair session context including session ID, start time, and runtime parameters
- Implement `tRepairInitCtx()` to initialize repair context after CLI argument validation in `dmMain.c`
- Add vnode filtering capability by parsing `--vnode-id` into integer array with deduplication and validation
- Introduce `tRepairShouldRepairVnode()` as entry point for vnode traversal filtering
- Reject invalid `--vnode-id` values (e.g., "2,a") during context initialization
- Enforce code constraints by forbidding `strtol/strtoll` usage and requiring `taosStr2Int32/taosStr2Int64` wrappers
- Fix `strtok_r` mutation issue by using temporary buffer for vnode ID parsing
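The `--vnode-id` handling described in the bullets above (integer array, deduplication, rejection of values such as "2,a") could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `parse_vnode_ids` is a hypothetical name, the sketch scans the input read-only instead of tokenizing a temporary copy with `strtok_r`, and it parses digits by hand where the PR mandates the `taosStr2Int32` wrapper.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): parse "2,3,5" into ids[],
 * dropping duplicates. Returns 0 on success and -1 on an invalid or
 * empty token (e.g. "2,a", "2,"), on int32 overflow, or when more than
 * `cap` distinct ids are given. The input string is never modified. */
static int parse_vnode_ids(const char *list, int32_t *ids, int32_t cap, int32_t *count) {
  if (list == NULL || ids == NULL || count == NULL || cap <= 0) return -1;

  int32_t     n = 0;
  const char *p = list;
  for (;;) {
    /* Every token must start with a decimal digit. */
    if (!isdigit((unsigned char)*p)) return -1;

    int64_t value = 0;
    while (isdigit((unsigned char)*p)) {
      value = value * 10 + (*p - '0');
      if (value > INT32_MAX) return -1;
      p++;
    }

    /* Keep the id only if it has not been seen before. */
    int seen = 0;
    for (int32_t i = 0; i < n; i++) {
      if (ids[i] == (int32_t)value) {
        seen = 1;
        break;
      }
    }
    if (!seen) {
      if (n >= cap) return -1;
      ids[n++] = (int32_t)value;
    }

    if (*p == '\0') break;     /* end of list */
    if (*p != ',') return -1;  /* anything but a comma is invalid */
    p++;
  }

  *count = n;
  return 0;
}
```

Scanning with a read-only pointer avoids the mutation problem altogether, at the cost of not reusing the project's string-to-integer wrappers.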
… paths

- Implement `tRepairPrecheck()` to validate repair prerequisites before initialization
- Check data directory existence, backup path (if configured), and disk available space threshold
- Verify vnode subdirectories (wal/tsdb/meta) exist for vnode node type
- Integrate precheck into `dmMain.c` after config loading and before `dmInit()`
- Use `tsDataSpace.reserved` as minimum disk space threshold; skip check if zero
- Fail fast with descriptive error message when precheck fails
- Update documentation and progress tracking for T2.3 completion
- Add `tRepairPrepareBackupDir()` to pre-create backup directories per vnode with path rules for explicit `--backup-path` or default `<dataDir>/backup/`
- Add session file management functions: `tRepairPrepareSessionFiles()`, `tRepairAppendSessionLog()`, `tRepairWriteSessionState()` for repair.log and repair.state.json
- Implement atomic state file writing using *.tmp + rename to reduce corruption risk
- Integrate backup directory creation and session file initialization into dmMain.c with fail-fast behavior on errors
- Update progress documentation to reflect completion of T2.4 and T2.5 tasks
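The "*.tmp + rename" pattern mentioned above for `repair.state.json` can be illustrated with a minimal sketch. It assumes a POSIX file system where `rename()` replaces the target in one step; `write_state_atomic` is a hypothetical name and the code uses only standard C I/O, so it is not the PR's `tRepairWriteSessionState()`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the PR): write `json` to "<path>.tmp"
 * first, then rename it over `path`, so readers never observe a
 * half-written state file. Returns 0 on success, -1 on failure. */
static int write_state_atomic(const char *path, const char *json) {
  char tmp[4096];
  int  n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
  if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

  FILE *fp = fopen(tmp, "w");
  if (fp == NULL) return -1;

  size_t len = strlen(json);
  int    ok = (fwrite(json, 1, len, fp) == len);
  if (fflush(fp) != 0) ok = 0;
  if (fclose(fp) != 0) ok = 0;
  if (!ok) {
    remove(tmp); /* do not leave a broken temp file behind */
    return -1;
  }

  /* Swap the complete file into place. */
  if (rename(tmp, path) != 0) {
    remove(tmp);
    return -1;
  }
  return 0;
}
```

A hardened version would probably also sync the file and its directory to disk before and after the rename; that step is omitted here to stay within standard C.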
…r sessions

- Add `tRepairNeedReportProgress()` to throttle progress reports based on time intervals
- Add `tRepairBuildProgressLine()` to construct progress output with session/step/vnode completion metrics
- Add `tRepairBuildSummaryLine()` to generate final summary with status and statistics
- Integrate progress reporting in `dmMain.c` for precheck, backup, and preflight stages
- Update `repair.state.json` to track step transitions from precheck → backup → preflight(ready)
- Add `AGENTS.md` documentation for session conventions and progress bar formatting rules
- Update progress tracking in `progress.md` and `findings.md` to reflect T2.6 completion
- Implement `tRepairTryResumeSession()` to scan backup directories for incomplete repair sessions
- Resume sessions from latest `repair.state.json` with status `initialized` or `running`
- Skip already processed vnodes using `doneVnodes` counter from saved state
- Update `dmMain.c` to attempt session recovery after precheck phase
- Document JSON parsing nuance: use `tjsonGetObjectItem()` to check field existence instead of relying on `tjsonGetStringValue2()` return code
- Update progress tracking to reflect completion of T2.7 task
  - add tRepairNeedRunWalForceRepair and tRepairBuildVnodeTargetPath
  - wire force+wal scheduling in dmMain via walInit and walOpen/walClose per vnode
  - persist wal step progress/state/log with existing repair session flow
  - add common tests for wal-force predicate and vnode target path builder
  - update task/progress/findings docs for T3.1 completion
  - move repair precheck/resume/backup/progress logic into dmRunRepairWorkflow()
  - isolate force+wal scheduling in dmRunForceWalRepair()
  - keep main focused on startup/shutdown orchestration and unified cleanup
  - preserve existing repair error handling and return semantics
  - add tRepairBackupVnodeTarget and tRepairRollbackVnodeTarget APIs
  - backup vnode WAL target before walOpen in force-repair workflow
  - rollback WAL target on walOpen failure to avoid partial damage
  - add commonTests coverage for backup/rollback success and invalid args
  - update task_plan/progress/findings for T3.2 completion
  - add TSDB recursive scan API and integrate tsdb precheck validation
  - add TSDB block analysis report with recoverable/corrupted counters
  - add TSDB rebuild API that keeps recoverable blocks (head+data only)
  - add unit tests for scan/analyze/rebuild and invalid-arg scenarios
  - add tRepairNeedRunTsdbForceRepair and unit test coverage
  - wire dmRunForceTsdbRepair into dmRunRepairWorkflow
  - implement tsdb analyze/rebuild/switch flow with rollback on failure
  - add tests/ci/repair_tsdb_force.sh for mixed recoverable/corrupted tsdb scenario
  - sync task_plan.md, progress.md, and findings.md for T4.4/T4.5 completion
…e coverage

  - add TSDB force repair scheduling in dmMain with analyze/rebuild/switch flow
  - add rollback guard and session log/state updates for TSDB repair failures
  - add META scan model and APIs (required/optional files) in trepair
  - add force+meta scheduling predicate and workflow integration in dmRunRepairWorkflow
  - extend precheck to validate META directory integrity before execution
  - add common tests for TSDB/META scheduling and META scan behavior
  - add tests/ci/repair_tsdb_force.sh to validate mixed recoverable/corrupted TSDB scenario
  - update task_plan/progress/findings for T4.4/T4.5 completion and T5.1 completion
… stub

  - add tRepairRebuildMetaFiles to rebuild meta output and backfill required files
  - integrate meta rebuild -> rename activation with rollback and rebuild detail logging
  - add tRepairNeedRunReplicaRepair and dmRunReplicaRepair for explicit mode=replica dispatch
  - add CI smoke script tests/ci/repair_meta_force.sh for partial/complete meta corruption cases
  - extend common tests for RebuildMetaFiles* and NeedRunReplicaRepair*
  - sync task/progress/findings docs for T5.4, T5.5, and T6.1 completion
  - add tRepairDegradeReplicaVnode() to atomically write vnode degrade marker
    (replica.degrade.marker.json) with availability/sync/version/term policies
  - upgrade dmRunReplicaRepair() from dispatch stub to per-vnode degrade execution
  - append replica degrade detail logs and keep step=replica progress/state updates
  - add DegradeReplicaVnode test coverage in commonTests
  - update task_plan/progress/findings: mark T6.2 completed and move T6.3 in_progress

  Tested:
  - cmake --build debug -j8 --target commonTest
  - ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter='RepairOptionParseTest.NeedRunReplicaRepair*:RepairOptionParseTest.DegradeReplicaVnode*'
  - ASAN_OPTIONS=detect_leaks=0 ctest --test-dir debug -R commonTest --output-on-failure
  - cmake --build debug -j8 --target taosd
  - replica smoke with TAOS_DATA_DIR (exit code 47, marker and logs verified)
…ransfer foundation

  - add replica restore hint writer and degraded-marker rollback helper
  - wire replica failure branches to unified artifact rollback and restore-detail logging
  - enforce copy-mode --replica-node format (<host>:<absolute-path>)
  - add replica endpoint parser and local mock vnode copy API for follow-up SCP integration
  - extend commonTests for replica hint/rollback and copy endpoint/mock transfer paths
  - update task_plan/progress/findings for T6.3~T7.2 completion
  - add copy-mode scheduling and SSH/SCP command builders in trepair
  - implement SSH/SCP vnode target copy execution with exit-code parsing
  - integrate dmRunCopyRepair into dmRunRepairWorkflow with state/log/progress updates
  - add TDD coverage for copy scheduling and command construction
  - update task_plan/progress/findings for T7.3 completion
  - restore local uid/gid/mode after ssh/scp copy
  - add remote/local digest verification for post-copy consistency
  - add per-vnode backup and rollback on copy failure
  - extend copy tests for permission fix and consistency mismatch detection
  - update task/progress/findings docs for T7.4 and T7.5 completion
Copilot AI review requested due to automatic review settings March 4, 2026 02:15
@hzcheng hzcheng requested review from a team, dapan1121, guanshengliang and zitsen as code owners March 4, 2026 02:15
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances TDengine's data robustness by introducing a new file-level data recovery tool accessible via taosd -r. The tool provides flexible repair options for vnode WAL, TSDB, and META files, supporting various scenarios from local corruption fixes to replica-based recovery. A key feature is the ability to resume repair operations, ensuring resilience against interruptions. The changes also include extensive internal documentation and new system tests to validate the functionality.

Highlights

  • New Data Recovery Tool: Introduced a comprehensive data recovery tool integrated into the taosd -r command, enabling file-level repair for vnode WAL, TSDB, and META files.
  • Multiple Repair Modes: Implemented three distinct repair modes: force for single-replica self-recovery, replica for triggering full synchronization from healthy replicas, and copy for direct file transfer from a specified replica node.
  • Resumable Repair Sessions: Added functionality for resumable repair sessions, allowing the tool to pick up from where it left off after an interruption by persisting session state to repair.state.json.
  • Enhanced Command-Line Options: Extended taosd -r with new command-line options such as --node-type, --file-type, --vnode-id, --backup-path, --mode, and --replica-node for granular control over the repair process.
  • Comprehensive Documentation and Testing: Provided detailed design, implementation plans, and release checklists, along with new system-level test scripts and updated user manuals in both English and Chinese to cover the new data recovery features.


Changelog
  • .gitignore
    • Added numerous entries to ignore .agents and .claude related files, and skills-lock.json.
  • AGENTS.md
    • Added a new document outlining TDengine Session Conventions, including progress reporting and required progress bar format.
  • docs/en/08-operation/04-maintenance.md
    • Updated to include a new section on File-Level Repair (taosd -r), detailing supported scopes, command examples, validation, troubleshooting, and notes.
  • docs/plans/2026-03-03-data-repair-tool-design.md
    • Added a new design document in Chinese for the TDengine data repair tool, covering background, goals, solution comparison, recommended architecture, mode-level design, safety, consistency, session recovery, testing strategy, risks, and confirmation points.
  • docs/plans/2026-03-03-data-repair-tool-implementation.md
    • Added a new implementation plan document for the TDengine data repair tool, detailing the goal, architecture, tech stack, execution notes, and a task-by-task breakdown for implementation.
  • docs/plans/2026-03-04-data-repair-release-checklist.md
    • Added a new release checklist document for the TDengine data repair feature, including scope, verification commands, results, release risks, and sign-off.
  • docs/zh/08-operation/05-maintenance.md
    • Updated to include a new section on File-Level Repair (taosd -r) in Chinese, detailing supported scopes, command examples, operational validation, troubleshooting, and notes.
  • findings.md
    • Added a new document detailing the findings from the data repair tool requirements research, including terminology clarification, key requirements, existing code analysis, reusable capabilities, main gaps, architectural recommendations, and discovered issues during execution.
  • include/common/trepair.h
    • Added a new header file defining data structures and function prototypes for the data repair tool, including enums for node types, file types, and repair modes, CLI arguments, repair context, and various repair-related operations.
  • include/libs/wal/wal.h
    • Modified to add SWalRepairStats struct and walGetRepairStats function prototype.
    • Added repairStats member to SWal struct.
  • progress.md
    • Added a new document logging the development progress of the data repair tool, including checkpoints, session logs for each task, completed tasks, and risk warnings.
  • source/common/src/trepair.c
    • Added a new C source file implementing the data repair tool functionalities, including parsing CLI options, validating arguments, initializing repair context, handling different repair modes (force, replica, copy), managing backup and rollback, session state persistence, progress reporting, and meta/tsdb/wal file scanning and rebuilding.
  • source/dnode/mgmt/exe/dmMain.c
    • Modified to integrate new command-line arguments for the data repair tool.
    • Validated new command-line arguments and initialized the repair context.
    • Orchestrated the repair workflow based on the selected mode and file type.
  • source/libs/wal/src/walMeta.c
    • Modified to increment corruptedSegments and rebuiltIdxEntries in SWalRepairStats during WAL meta and index repair.
  • source/libs/wal/src/walMgmt.c
    • Modified to add walGetRepairStats function.
    • Initialized repairStats in walOpen.
  • source/libs/wal/test/walMetaTest.cpp
    • Modified to add new test cases for walGetRepairStats.
    • Added tests to track corrupted segments and rebuilt index entries during WAL repair.
  • task_plan.md
    • Added a new document outlining the overall task plan for the TDengine data repair tool, including goals, scope, current status, phase milestones, detailed 1-hour task breakdown, interruption recovery mechanism, and error records.
  • tests/ci/repair_fixture_generator.sh
    • Added a new shell script to generate reproducible corruption fixtures for WAL, TSDB, and META files for testing the repair tool.
  • tests/ci/repair_meta_force.sh
    • Added a new shell script to test the force mode repair for META files, including partial and complete corruption scenarios.
  • tests/ci/repair_mode_matrix.sh
    • Added a new shell script to run a matrix of system tests for the repair tool across force (TSDB/META), replica, and copy modes.
  • tests/ci/repair_tsdb_force.sh
    • Added a new shell script to test the force mode repair for TSDB files, including mixed recoverable and corrupted blocks.
Activity
  • The author hzcheng has meticulously documented the development process, including task breakdowns, progress logs, and findings, indicating a structured and thorough approach.
  • The development followed a Test-Driven Development (TDD) methodology, with failing tests written before implementation and subsequent verification.
  • Extensive internal documentation, including design and implementation plans, has been created to guide the feature's development and future maintenance.
  • New system-level test scripts have been added to ensure the reliability and correctness of the data recovery tool across various scenarios and modes.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: a data repair tool integrated into taosd -r. The changes are extensive, covering design documentation, implementation plans, the core C logic, command-line integration, and both unit and system-level tests. The overall structure is well-thought-out, with a clear separation of concerns between parsing, validation, and the repair workflow itself.

My review has identified a critical security vulnerability related to command injection in the copy mode, which must be addressed. I've also included a couple of medium-severity suggestions regarding a hardcoded limit and improving documentation examples for better usability.

Given the complexity and the critical nature of a data repair tool, thorough testing beyond the added scripts is highly recommended, especially focusing on edge cases and failure/rollback scenarios.

Note: Security Review did not run due to the size of the PR.

Comment on lines +1652 to +1653
int32_t len = tsnprintf(cmd, cmdSize, "%s -o BatchMode=yes -o StrictHostKeyChecking=no %s \"test -d '%s'\"",
sshBin, replicaHost, remoteTargetPath);

critical

Potential command injection vulnerability. The replicaHost variable, which comes from user input, is directly embedded into a shell command string. While the validation in tRepairValidateReplicaNodeEndpoint checks for spaces, it's not sufficient to prevent all forms of command injection. A malicious value for --replica-node (e.g., somehost;reboot) could lead to arbitrary command execution on the server. This vulnerability exists in tRepairBuildCopySshProbeCmd, tRepairBuildCopyScpCmd, tRepairBuildCopySshStatCmd, and tRepairBuildCopySshDigestCmd. You should implement stricter validation for the hostname part of the replica-node parameter, allowing only characters that are valid in a hostname (e.g., alphanumeric, -, and .).
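The stricter hostname check recommended above could be sketched as an allow-list. This is only an illustration: `is_safe_hostname` is a hypothetical name, the 253-character cap and the rejection of a leading `-` or `.` are assumptions added here, and the rules in the PR's `tRepairValidateReplicaNodeEndpoint` may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allow-list check (not from the PR): accept only ASCII
 * letters, digits, '-' and '.', so shell metacharacters such as ';',
 * '$', '`', '|' or spaces can never reach the command string. */
static bool is_safe_hostname(const char *host) {
  if (host == NULL) return false;

  size_t len = strlen(host);
  if (len == 0 || len > 253) return false;

  /* A leading '-' could be interpreted by ssh/scp as an option. */
  if (host[0] == '-' || host[0] == '.') return false;

  for (size_t i = 0; i < len; i++) {
    char c = host[i];
    bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
              (c >= '0' && c <= '9') || c == '-' || c == '.';
    if (!ok) return false;
  }
  return true;
}
```

An allow-list is easier to reason about than a deny-list of dangerous characters. Avoiding the shell altogether (building an argument vector and executing the binary directly) would remove this class of problem, but that is a larger change than the review asks for.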

--file-type wal \
--vnode-id 2 \
--mode copy \
--replica-node 192.168.1.24:/var/lib/taos \

medium

The example for copy mode uses a hardcoded IP address and path (192.168.1.24:/var/lib/taos). This can be confusing and lead to copy-paste errors. It would be clearer and safer to use placeholders.

Suggested change
- --replica-node 192.168.1.24:/var/lib/taos \
+ --replica-node <replica_host>:/path/to/remote/taos/data \

--file-type wal \
--vnode-id 2 \
--mode copy \
--replica-node 192.168.1.24:/var/lib/taos \

medium

The example for copy mode uses a hardcoded IP address and path (192.168.1.24:/var/lib/taos). This can be confusing and lead to copy-paste errors. It would be clearer and safer to use placeholders.

Suggested change
- --replica-node 192.168.1.24:/var/lib/taos \
+ --replica-node <replica-host>:/absolute-path-of-remote-taos-data-directory \

} SRepairCliArgs;

#define REPAIR_SESSION_ID_LEN 64
#define REPAIR_MAX_VNODE_IDS 128

medium

The number of vnodes that can be repaired in a single command is limited by the REPAIR_MAX_VNODE_IDS constant, which is set to 128. For very large clusters, users might need to repair more vnodes at once. Using a fixed-size array for vnodeIds in SRepairCtx could be a limitation. Consider using a dynamic array (like SArray, which is used elsewhere in the codebase) to make this more flexible and avoid an arbitrary limit.
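A growable array in place of the fixed `REPAIR_MAX_VNODE_IDS` buffer could be sketched as follows. The review suggests TDengine's `SArray`; because its API is not shown on this page, the sketch uses plain `realloc` with hypothetical names (`VnodeIdList`, `vnode_id_list_push`, `vnode_id_list_free`).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable list of vnode ids (not from the PR). */
typedef struct {
  int32_t *data;
  size_t   size;
  size_t   capacity;
} VnodeIdList;

/* Appends `id`, doubling the capacity when the buffer is full.
 * Returns 0 on success, -1 if memory cannot be allocated. */
static int vnode_id_list_push(VnodeIdList *list, int32_t id) {
  if (list->size == list->capacity) {
    size_t   newCap = list->capacity ? list->capacity * 2 : 8;
    int32_t *p = realloc(list->data, newCap * sizeof(*p));
    if (p == NULL) return -1; /* the old buffer is still valid */
    list->data = p;
    list->capacity = newCap;
  }
  list->data[list->size++] = id;
  return 0;
}

/* Releases the buffer and resets the list to its empty state. */
static void vnode_id_list_free(VnodeIdList *list) {
  free(list->data);
  list->data = NULL;
  list->size = 0;
  list->capacity = 0;
}
```

A zero-initialized `VnodeIdList` is a valid empty list, so callers can write `VnodeIdList ids = {0};` and push without a separate init step.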

Copilot AI left a comment

Pull request overview

This PR extends taosd -r into a resumable “data repair” workflow with new repair CLI options, session/state artifacts, and CI coverage scripts for multiple repair modes (force/replica/copy), plus supporting WAL repair stats and related documentation/planning notes.

Changes:

  • Add repair CLI options and a repair workflow orchestration in taosd (--node-type/--file-type/--vnode-id/--backup-path/--mode/--replica-node) with session log/state output.
  • Add WAL repair statistics export (walGetRepairStats) and new WAL unit tests validating stats tracking.
  • Add CI scripts to generate fixtures and validate repair flows (force TSDB/META + replica + copy) and update ops docs/plans.

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| source/dnode/mgmt/exe/dmMain.c | Adds repair CLI parsing/help and implements the repair workflow orchestration (precheck/backup + mode handlers + session logging/state). |
| include/common/trepair.h | Introduces the common repair API/types used by the CLI/workflow. |
| include/libs/wal/wal.h | Adds WAL repair stats struct + API and stores stats on the WAL handle. |
| source/libs/wal/src/walMgmt.c | Implements walGetRepairStats() and zeros stats on open. |
| source/libs/wal/src/walMeta.c | Updates WAL repair routines to accumulate repair statistics. |
| source/libs/wal/test/walMetaTest.cpp | Adds unit tests for WAL repair stats behavior. |
| tests/ci/repair_tsdb_force.sh | Adds a TSDB force-repair CI verification script. |
| tests/ci/repair_meta_force.sh | Adds a META force-repair CI verification script. |
| tests/ci/repair_mode_matrix.sh | Adds a matrix runner validating force/meta/replica/copy modes. |
| tests/ci/repair_fixture_generator.sh | Adds a fixture generator for reproducible corruption scenarios. |
| docs/zh/08-operation/05-maintenance.md | Documents file-level repair usage under taosd -r (Chinese). |
| docs/en/08-operation/04-maintenance.md | Documents file-level repair usage under taosd -r (English). |
| task_plan.md, progress.md, findings.md | Project planning/status documents for the repair tool effort. |
| docs/plans/2026-03-03-data-repair-tool-*.md | Design/implementation plan docs. |
| docs/plans/2026-03-04-data-repair-release-checklist.md | Release checklist and verification commands/results. |
| AGENTS.md | Adds session/progress reporting conventions document. |
| .gitignore | Ignores local agent/Claude tooling directories and related files. |


Comment on lines +17 to +18
if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
echo "required force-mode scripts are missing or not executable"
Copilot AI Mar 4, 2026

This script requires the referenced force-mode scripts to be executable (-x), but it later invokes them explicitly via bash ..., so the execute bit is not actually needed at runtime. This extra requirement can cause CI failures if the repo file mode isn't preserved (e.g., when checking out on some systems). Consider checking for existence/readability (-f/-r) instead, or invoke them directly without bash if executability is truly required.

Suggested change
- if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
- echo "required force-mode scripts are missing or not executable"
+ if [[ ! -r "$TSDB_FORCE_SCRIPT" || ! -r "$META_FORCE_SCRIPT" ]]; then
+ echo "required force-mode scripts are missing or not readable"

Copilot uses AI. Check for mistakes.
Comment on lines +2065 to +2069
if (code != TSDB_CODE_SUCCESS) {
taosCleanupCfg();
taosCloseLog();
taosCleanupArgs();
taosConvDestroy();
Copilot AI Mar 4, 2026

When repair mode is enabled (global.repairCtx.enabled), the process runs dmRunRepairWorkflow() but then continues into the normal daemon startup path (dmInit/dmRun). As a result, taosd -r --node-type/--file-type/... is no longer an offline operation: if startup succeeds, the process keeps running, which can hang CI and cause side effects in ops usage. Consider exiting cleanly after a successful repair workflow (with proper cleanup) when repairCtx.enabled is true, instead of proceeding to start services.

Suggested change
- if (code != TSDB_CODE_SUCCESS) {
- taosCleanupCfg();
- taosCloseLog();
- taosCleanupArgs();
- taosConvDestroy();
+ taosCleanupCfg();
+ taosCloseLog();
+ taosCleanupArgs();
+ taosConvDestroy();
+ if (code == TSDB_CODE_SUCCESS) {
+ return 0;
+ } else {

Comment on lines +51 to +57
#define DM_REPAIR "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
#define DM_REPAIR_NODE_TYPE "Repair target node type. Options: vnode, mnode, dnode, snode."
#define DM_REPAIR_FILE_TYPE "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
#define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
#define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
#define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
#define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
Copilot AI Mar 4, 2026

The help text advertises multiple node/file repair targets (mnode/dnode/snode and file types like data/config/checkpoint), but the repair workflow in dmMain currently only performs vnode-level handlers (wal/tsdb/meta). With the current validator allowing these combinations, a non-vnode invocation can become a no-op that still reports a successful summary. Recommend either restricting --node-type to vnode for now (and updating help strings accordingly) or implementing the additional node/file-type handlers before advertising them.

Suggested change
- #define DM_REPAIR "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
- #define DM_REPAIR_NODE_TYPE "Repair target node type. Options: vnode, mnode, dnode, snode."
- #define DM_REPAIR_FILE_TYPE "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
- #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
- #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
- #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
- #define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
+ #define DM_REPAIR "Enable repair mode. Currently supports vnode-level repair via --node-type, --file-type, --mode and other repair options."
+ #define DM_REPAIR_NODE_TYPE "Repair target node type. Currently supported option: vnode."
+ #define DM_REPAIR_FILE_TYPE "Repair target file type for vnode. Supported: wal | meta | tsdb."
+ #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
+ #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
+ #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
+ #define DM_REPAIR_REPLICA_NODE "Replica vnode node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."

hzcheng added 2 commits March 5, 2026 14:18
  - extend tRepairTryResumeSession to return resume step and validate step buffer args
  - add tRepairResolveResumePlan and reuse it in dmRunRepairWorkflow
  - support step-aware vnode start index for replica/copy/wal/tsdb/meta phases
  - skip backup preparation when resuming from mode steps
  - ensure successful workflow summary/state uses doneVnodes=totalVnodes
  - tighten replica endpoint and copy command token validation to block shell metacharacter injection
  - add/extend RepairOptionParseTest coverage for resume step, resume plan, and copy endpoint/cmd invalid inputs

  Verification:
  - cmake --build debug --target commonTest
  - ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
  - cmake --build debug --target taosd
…ngths

  - centralize repair option names in dmMain.c
  - add dmMatchLongOption/dmParseRepairCliLongOption to deduplicate CLI parsing
  - add tRepairExtractLongOptionValue for "--opt value" and "--opt=value" forms
  - replace duplicated parsing branches for node-type/file-type/vnode-id/backup-path/mode/replica-node
  - remove hardcoded length literals (12/11/14/7/15) to reduce maintenance risk
  - add unit tests for long-option extraction and invalid-arg handling

  Test:
  - cmake --build debug -j8 --target taosd
  - cmake --build debug -j8 --target commonTest
  - ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
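Accepting both "--opt value" and "--opt=value" without hardcoded length literals could be sketched as below. `extract_long_option` is a hypothetical name and signature chosen for illustration; the real `tRepairExtractLongOptionValue` is not shown on this page and may behave differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the PR): if argv[*idx] is the long
 * option `name` (e.g. "--mode"), return its value. Accepts both
 * "--mode force" and "--mode=force"; in the first form *idx is advanced
 * past the consumed value. Returns NULL when the argument is a different
 * option or the value is missing. */
static const char *extract_long_option(int argc, char **argv, int *idx, const char *name) {
  const char *arg = argv[*idx];
  size_t      nameLen = strlen(name); /* derived, not a literal like 12 or 7 */

  if (strncmp(arg, name, nameLen) != 0) return NULL;

  if (arg[nameLen] == '=') {
    /* "--opt=value": the value follows the '=' in the same argument. */
    return arg[nameLen + 1] != '\0' ? arg + nameLen + 1 : NULL;
  }
  if (arg[nameLen] != '\0') return NULL; /* e.g. "--modex" is not "--mode" */

  /* "--opt value": the value is the next argument, if there is one. */
  if (*idx + 1 >= argc) return NULL;
  *idx += 1;
  return argv[*idx];
}
```

Deriving the prefix length from the option name keeps the comparison correct when an option is renamed, which is the maintenance risk the commit message points at.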
Copilot AI review requested due to automatic review settings March 5, 2026 07:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 22 changed files in this pull request and generated 2 comments.



Comment on lines +2073 to 2086
if (global.repairCtx.enabled) {
code = dmRunRepairWorkflow();
if (code != TSDB_CODE_SUCCESS) {
taosCleanupCfg();
taosCloseLog();
taosCleanupArgs();
taosConvDestroy();
return code;
}
}

osSetProcPath(argc, (char **)argv);
taosCleanupArgs();

Copilot AI Mar 5, 2026

dmRunRepairWorkflow() returns success and execution continues into the normal dnode startup path. For a repair invocation (taosd -r --node-type ...), this can cause the process to turn into a long-running daemon (or hang under CI unless it fails later), and it can also trigger unrelated side effects (e.g., vnode open) after a potentially destructive repair. Consider exiting the process after the repair summary is written (with a success exit code), or gating the post-repair startup behind an explicit flag.

Comment on lines +1622 to +1624
int32_t totalVnodes = global.repairCtx.nodeType == REPAIR_NODE_TYPE_VNODE ? global.repairCtx.vnodeIdNum : 0;
int32_t doneVnodes = 0;
SRepairResumePlan resumePlan = {0};
Copilot AI Mar 5, 2026

Non-vnode repair targets are accepted by CLI validation, but the workflow currently sets totalVnodes to 0 when nodeType != vnode and then proceeds to report success without performing any repair work. If mnode/dnode/snode repair is not implemented yet, this should fail fast with a clear "not supported" error; otherwise totalVnodes should be computed appropriately and the corresponding handlers added.
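The fail-fast behaviour requested here might be as small as the following sketch. The enum, the error codes, and the message text are all hypothetical stand-ins; the PR's own types (`REPAIR_NODE_TYPE_VNODE` and related definitions in `trepair.h`) are only partially visible on this page.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the PR's node-type enum and error codes. */
typedef enum { NODE_TYPE_VNODE, NODE_TYPE_MNODE, NODE_TYPE_DNODE, NODE_TYPE_SNODE } NodeType;
enum { REPAIR_OK = 0, REPAIR_ERR_NOT_SUPPORTED = -1 };

/* Rejects node types that have no repair handler yet, so an invocation
 * can never end as a silent no-op that reports success. */
static int check_repair_target(NodeType type, const char **reason) {
  if (type == NODE_TYPE_VNODE) {
    if (reason != NULL) *reason = NULL;
    return REPAIR_OK;
  }
  if (reason != NULL) {
    *reason = "repair is currently supported for --node-type=vnode only";
  }
  return REPAIR_ERR_NOT_SUPPORTED;
}
```

Running such a check during argument validation, before any backup directory or session file is created, would keep unsupported invocations free of side effects.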
