Conversation
Add comprehensive design documentation for extending `taosd -r` into a controlled, traceable, and recoverable data repair tool. The design outlines three implementation approaches with detailed technical specifications, API designs, and test plans.

Key additions:
- Design document (`docs/plans/2026-03-03-data-repair-tool-design.md`) covering:
  - Background and objectives for data repair functionality
  - Three implementation approaches with pros/cons analysis
  - Detailed technical design including CLI interface, session management, and repair handlers
  - API specifications, error handling, and recovery mechanisms
  - Testing strategy and phased delivery plan
- Gitignore patterns for agent skills and development artifacts to exclude from version control:
  - Agent skill documentation and templates
  - Claude configuration and skill directories
  - Development scripts and testing materials

The design focuses on incremental extension of existing `taosd -r` functionality to support vnode repair with WAL, TSDB, and META file types using force/replica/copy operation modes.
- Add `SRepairCtx` structure to store repair session context including session ID, start time, and runtime parameters
- Implement `tRepairInitCtx()` to initialize repair context after CLI argument validation in `dmMain.c`
- Add vnode filtering capability by parsing `--vnode-id` into integer array with deduplication and validation
- Introduce `tRepairShouldRepairVnode()` as entry point for vnode traversal filtering
- Reject invalid `--vnode-id` values (e.g., "2,a") during context initialization
- Enforce code constraints by forbidding `strtol/strtoll` usage and requiring `taosStr2Int32/taosStr2Int64` wrappers
- Fix `strtok_r` mutation issue by using temporary buffer for vnode ID parsing
… paths
- Implement `tRepairPrecheck()` to validate repair prerequisites before initialization
- Check data directory existence, backup path (if configured), and disk available space threshold
- Verify vnode subdirectories (wal/tsdb/meta) exist for vnode node type
- Integrate precheck into `dmMain.c` after config loading and before `dmInit()`
- Use `tsDataSpace.reserved` as minimum disk space threshold; skip check if zero
- Fail fast with descriptive error message when precheck fails
- Update documentation and progress tracking for T2.3 completion
- Add `tRepairPrepareBackupDir()` to pre-create backup directories per vnode with path rules for explicit `--backup-path` or default `<dataDir>/backup/`
- Add session file management functions: `tRepairPrepareSessionFiles()`, `tRepairAppendSessionLog()`, `tRepairWriteSessionState()` for repair.log and repair.state.json
- Implement atomic state file writing using *.tmp + rename to reduce corruption risk
- Integrate backup directory creation and session file initialization into `dmMain.c` with fail-fast behavior on errors
- Update progress documentation to reflect completion of T2.4 and T2.5 tasks
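The *.tmp + rename pattern behind `tRepairWriteSessionState()` can be sketched as below. `rename()` is atomic within a POSIX filesystem, so a reader of `repair.state.json` sees either the old state or the complete new one, never a torn write. This is an illustrative sketch only; the real implementation presumably adds `fsync` and the project's error-code conventions:

```c
#include <stdio.h>
#include <string.h>

// Write `data` to `path` atomically: write to "<path>.tmp", flush, then
// rename() over the target. On any failure the temporary file is removed
// and the existing state file is left untouched.
static int writeStateFileAtomic(const char *path, const char *data) {
  char tmp[512];
  if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int)sizeof(tmp)) return -1;

  FILE *fp = fopen(tmp, "w");
  if (fp == NULL) return -1;
  if (fputs(data, fp) < 0 || fflush(fp) != 0) {
    fclose(fp);
    remove(tmp);  // discard the partial temporary file
    return -1;
  }
  fclose(fp);
  return rename(tmp, path) == 0 ? 0 : -1;  // atomic replacement
}
```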
…r sessions
- Add `tRepairNeedReportProgress()` to throttle progress reports based on time intervals
- Add `tRepairBuildProgressLine()` to construct progress output with session/step/vnode completion metrics
- Add `tRepairBuildSummaryLine()` to generate final summary with status and statistics
- Integrate progress reporting in `dmMain.c` for precheck, backup, and preflight stages
- Update `repair.state.json` to track step transitions from precheck → backup → preflight(ready)
- Add `AGENTS.md` documentation for session conventions and progress bar formatting rules
- Update progress tracking in `progress.md` and `findings.md` to reflect T2.6 completion
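The interval-based throttling behind `tRepairNeedReportProgress()` might look like this minimal sketch (the name and signature are assumptions; the real function likely also forces a report on step completion):

```c
#include <stdint.h>

// Time-based progress throttle: report only when at least `intervalMs`
// has elapsed since the last report. `lastReportMs` is caller-owned state,
// initialized to 0 so the very first call always reports.
static int needReportProgress(int64_t nowMs, int64_t *lastReportMs,
                              int64_t intervalMs) {
  if (*lastReportMs == 0 || nowMs - *lastReportMs >= intervalMs) {
    *lastReportMs = nowMs;  // record the time of this report
    return 1;
  }
  return 0;
}
```

The caller then only pays the cost of building a progress line (session/step/vnode metrics) when the predicate returns 1.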
- Implement `tRepairTryResumeSession()` to scan backup directories for incomplete repair sessions
- Resume sessions from latest `repair.state.json` with status `initialized` or `running`
- Skip already processed vnodes using `doneVnodes` counter from saved state
- Update `dmMain.c` to attempt session recovery after precheck phase
- Document JSON parsing nuance: use `tjsonGetObjectItem()` to check field existence instead of relying on `tjsonGetStringValue2()` return code
- Update progress tracking to reflect completion of T2.7 task
- add tRepairNeedRunWalForceRepair and tRepairBuildVnodeTargetPath
- wire force+wal scheduling in dmMain via walInit and walOpen/walClose per vnode
- persist wal step progress/state/log with existing repair session flow
- add common tests for wal-force predicate and vnode target path builder
- update task/progress/findings docs for T3.1 completion

- move repair precheck/resume/backup/progress logic into dmRunRepairWorkflow()
- isolate force+wal scheduling in dmRunForceWalRepair()
- keep main focused on startup/shutdown orchestration and unified cleanup
- preserve existing repair error handling and return semantics

- add tRepairBackupVnodeTarget and tRepairRollbackVnodeTarget APIs
- backup vnode WAL target before walOpen in force-repair workflow
- rollback WAL target on walOpen failure to avoid partial damage
- add commonTests coverage for backup/rollback success and invalid args
- update task_plan/progress/findings for T3.2 completion

- add TSDB recursive scan API and integrate tsdb precheck validation
- add TSDB block analysis report with recoverable/corrupted counters
- add TSDB rebuild API that keeps recoverable blocks (head+data only)
- add unit tests for scan/analyze/rebuild and invalid-arg scenarios

- add tRepairNeedRunTsdbForceRepair and unit test coverage
- wire dmRunForceTsdbRepair into dmRunRepairWorkflow
- implement tsdb analyze/rebuild/switch flow with rollback on failure
- add tests/ci/repair_tsdb_force.sh for mixed recoverable/corrupted tsdb scenario
- sync task_plan.md, progress.md, and findings.md for T4.4/T4.5 completion

…e coverage
- add TSDB force repair scheduling in dmMain with analyze/rebuild/switch flow
- add rollback guard and session log/state updates for TSDB repair failures
- add META scan model and APIs (required/optional files) in trepair
- add force+meta scheduling predicate and workflow integration in dmRunRepairWorkflow
- extend precheck to validate META directory integrity before execution
- add common tests for TSDB/META scheduling and META scan behavior
- add tests/ci/repair_tsdb_force.sh to validate mixed recoverable/corrupted TSDB scenario
- update task_plan/progress/findings for T4.4/T4.5 completion and T5.1 completion

… stub
- add tRepairRebuildMetaFiles to rebuild meta output and backfill required files
- integrate meta rebuild -> rename activation with rollback and rebuild detail logging
- add tRepairNeedRunReplicaRepair and dmRunReplicaRepair for explicit mode=replica dispatch
- add CI smoke script tests/ci/repair_meta_force.sh for partial/complete meta corruption cases
- extend common tests for RebuildMetaFiles* and NeedRunReplicaRepair*
- sync task/progress/findings docs for T5.4, T5.5, and T6.1 completion
- add tRepairDegradeReplicaVnode() to atomically write vnode degrade marker
(replica.degrade.marker.json) with availability/sync/version/term policies
- upgrade dmRunReplicaRepair() from dispatch stub to per-vnode degrade execution
- append replica degrade detail logs and keep step=replica progress/state updates
- add DegradeReplicaVnode test coverage in commonTests
- update task_plan/progress/findings: mark T6.2 completed and move T6.3 in_progress
Tested:
- cmake --build debug -j8 --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter='RepairOptionParseTest.NeedRunReplicaRepair*:RepairOptionParseTest.DegradeReplicaVnode*'
- ASAN_OPTIONS=detect_leaks=0 ctest --test-dir debug -R commonTest --output-on-failure
- cmake --build debug -j8 --target taosd
- replica smoke with TAOS_DATA_DIR (exit code 47, marker and logs verified)
…ransfer foundation
- add replica restore hint writer and degraded-marker rollback helper
- wire replica failure branches to unified artifact rollback and restore-detail logging
- enforce copy-mode --replica-node format (<host>:<absolute-path>)
- add replica endpoint parser and local mock vnode copy API for follow-up SCP integration
- extend commonTests for replica hint/rollback and copy endpoint/mock transfer paths
- update task_plan/progress/findings for T6.3~T7.2 completion

- add copy-mode scheduling and SSH/SCP command builders in trepair
- implement SSH/SCP vnode target copy execution with exit-code parsing
- integrate dmRunCopyRepair into dmRunRepairWorkflow with state/log/progress updates
- add TDD coverage for copy scheduling and command construction
- update task_plan/progress/findings for T7.3 completion

- restore local uid/gid/mode after ssh/scp copy
- add remote/local digest verification for post-copy consistency
- add per-vnode backup and rollback on copy failure
- extend copy tests for permission fix and consistency mismatch detection
- update task/progress/findings docs for T7.4 and T7.5 completion
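The remote/local digest comparison mentioned above could be sketched as comparing only the digest token of `sha256sum`-style output from each side, since the local and remote paths necessarily differ. This is a hypothetical helper; the commit message does not show which digest command or parsing the real code uses:

```c
#include <stdio.h>
#include <string.h>

// Compare two "sha256sum"-style output lines ("<hex-digest>  <path>") by
// their first token (the digest) only, ignoring the differing paths.
// Returns 1 when the digests match, 0 otherwise or on malformed input.
static int digestsMatch(const char *localOut, const char *remoteOut) {
  char a[128] = {0};
  char b[128] = {0};
  if (localOut == NULL || remoteOut == NULL) return 0;
  if (sscanf(localOut, "%127s", a) != 1) return 0;   // extract local digest
  if (sscanf(remoteOut, "%127s", b) != 1) return 0;  // extract remote digest
  return strcmp(a, b) == 0;
}
```

A mismatch would then trigger the per-vnode rollback path described in the same commit.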
Summary of Changes

This pull request significantly enhances TDengine's data robustness by introducing a new file-level data recovery tool accessible via `taosd -r`.

Highlights
Code Review
This pull request introduces a significant new feature: a data repair tool integrated into taosd -r. The changes are extensive, covering design documentation, implementation plans, the core C logic, command-line integration, and both unit and system-level tests. The overall structure is well-thought-out, with a clear separation of concerns between parsing, validation, and the repair workflow itself.
My review has identified a critical security vulnerability related to command injection in the copy mode, which must be addressed. I've also included a couple of medium-severity suggestions regarding a hardcoded limit and improving documentation examples for better usability.
Given the complexity and the critical nature of a data repair tool, thorough testing beyond the added scripts is highly recommended, especially focusing on edge cases and failure/rollback scenarios.
Note: Security Review did not run due to the size of the PR.
```c
int32_t len = tsnprintf(cmd, cmdSize, "%s -o BatchMode=yes -o StrictHostKeyChecking=no %s \"test -d '%s'\"",
                        sshBin, replicaHost, remoteTargetPath);
```
Potential command injection vulnerability. The `replicaHost` variable, which comes from user input, is directly embedded into a shell command string. While the validation in `tRepairValidateReplicaNodeEndpoint` checks for spaces, it's not sufficient to prevent all forms of command injection. A malicious value for `--replica-node` (e.g., `somehost;reboot`) could lead to arbitrary command execution on the server. This vulnerability exists in `tRepairBuildCopySshProbeCmd`, `tRepairBuildCopyScpCmd`, `tRepairBuildCopySshStatCmd`, and `tRepairBuildCopySshDigestCmd`. You should implement stricter validation for the hostname part of the `--replica-node` parameter, allowing only characters that are valid in a hostname (e.g., alphanumeric, `-`, and `.`).
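An allow-list check along the lines the review suggests might look like this (an illustrative sketch, not the in-tree validator): only alphanumerics, `-`, and `.` pass, which rules out shell metacharacters such as `;`, `$`, and backticks before the value ever reaches an ssh/scp command line.

```c
#include <ctype.h>
#include <string.h>

// Allow-list validation for the host part of --replica-node.
// Returns 1 for a plausible hostname/IPv4, 0 otherwise.
static int isValidReplicaHost(const char *host) {
  size_t len = host ? strlen(host) : 0;
  if (len == 0 || len > 253) return 0;            // RFC 1035 length cap
  if (host[0] == '-' || host[0] == '.') return 0; // no leading separator
  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char)host[i];
    if (!isalnum(c) && c != '-' && c != '.') return 0;  // reject metacharacters
  }
  return 1;
}
```

Note this deliberately rejects IPv6 literals and `user@host` forms; if those must be supported, the safer route is to avoid the shell entirely (e.g., `execvp` with an argument vector) rather than widen the character set.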
```shell
--file-type wal \
--vnode-id 2 \
--mode copy \
--replica-node 192.168.1.24:/var/lib/taos \
```
The example for copy mode uses a hardcoded IP address and path (192.168.1.24:/var/lib/taos). This can be confusing and lead to copy-paste errors. It would be clearer and safer to use placeholders.
```diff
- --replica-node 192.168.1.24:/var/lib/taos \
+ --replica-node <replica_host>:/path/to/remote/taos/data \
```
```c
} SRepairCliArgs;

#define REPAIR_SESSION_ID_LEN 64
#define REPAIR_MAX_VNODE_IDS  128
```
The number of vnodes that can be repaired in a single command is limited by the `REPAIR_MAX_VNODE_IDS` constant, which is set to 128. For very large clusters, users might need to repair more vnodes at once. Using a fixed-size array for `vnodeIds` in `SRepairCtx` could be a limitation. Consider using a dynamic array (like `SArray`, which is used elsewhere in the codebase) to make this more flexible and avoid an arbitrary limit.
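In-tree, the codebase's `SArray` would be the natural choice; as a standalone illustration of the suggestion, a minimal growable id list could look like this:

```c
#include <stdint.h>
#include <stdlib.h>

// Growable vnode-id list replacing a fixed REPAIR_MAX_VNODE_IDS array.
typedef struct {
  int32_t *ids;
  int32_t  num;
  int32_t  cap;
} SVnodeIdList;

// Append one id, doubling capacity as needed (amortized O(1)).
// Returns 0 on success, -1 on allocation failure.
static int vnodeIdListPush(SVnodeIdList *list, int32_t id) {
  if (list->num == list->cap) {
    int32_t  cap = list->cap == 0 ? 16 : list->cap * 2;
    int32_t *p = realloc(list->ids, (size_t)cap * sizeof(int32_t));
    if (p == NULL) return -1;
    list->ids = p;
    list->cap = cap;
  }
  list->ids[list->num++] = id;
  return 0;
}
```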
Pull request overview
This PR extends taosd -r into a resumable “data repair” workflow with new repair CLI options, session/state artifacts, and CI coverage scripts for multiple repair modes (force/replica/copy), plus supporting WAL repair stats and related documentation/planning notes.
Changes:
- Add repair CLI options and a repair workflow orchestration in `taosd` (`--node-type`/`--file-type`/`--vnode-id`/`--backup-path`/`--mode`/`--replica-node`) with session log/state output.
- Add WAL repair statistics export (`walGetRepairStats`) and new WAL unit tests validating stats tracking.
- Add CI scripts to generate fixtures and validate repair flows (force TSDB/META + replica + copy) and update ops docs/plans.
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 3 comments.
Summary per file
| File | Description |
|---|---|
| `source/dnode/mgmt/exe/dmMain.c` | Adds repair CLI parsing/help and implements the repair workflow orchestration (precheck/backup + mode handlers + session logging/state). |
| `include/common/trepair.h` | Introduces the common repair API/types used by the CLI/workflow. |
| `include/libs/wal/wal.h` | Adds WAL repair stats struct + API and stores stats on the WAL handle. |
| `source/libs/wal/src/walMgmt.c` | Implements `walGetRepairStats()` and zeros stats on open. |
| `source/libs/wal/src/walMeta.c` | Updates WAL repair routines to accumulate repair statistics. |
| `source/libs/wal/test/walMetaTest.cpp` | Adds unit tests for WAL repair stats behavior. |
| `tests/ci/repair_tsdb_force.sh` | Adds a TSDB force-repair CI verification script. |
| `tests/ci/repair_meta_force.sh` | Adds a META force-repair CI verification script. |
| `tests/ci/repair_mode_matrix.sh` | Adds a matrix runner validating force/meta/replica/copy modes. |
| `tests/ci/repair_fixture_generator.sh` | Adds a fixture generator for reproducible corruption scenarios. |
| `docs/zh/08-operation/05-maintenance.md` | Documents file-level repair usage under `taosd -r` (Chinese). |
| `docs/en/08-operation/04-maintenance.md` | Documents file-level repair usage under `taosd -r` (English). |
| `task_plan.md`, `progress.md`, `findings.md` | Project planning/status documents for the repair tool effort. |
| `docs/plans/2026-03-03-data-repair-tool-*.md` | Design/implementation plan docs. |
| `docs/plans/2026-03-04-data-repair-release-checklist.md` | Release checklist and verification commands/results. |
| `AGENTS.md` | Adds session/progress reporting conventions document. |
| `.gitignore` | Ignores local agent/Claude tooling directories and related files. |
```shell
if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
  echo "required force-mode scripts are missing or not executable"
```
This script requires the referenced force-mode scripts to be executable (`-x`), but it later invokes them explicitly via `bash …`, so the execute bit is not actually needed at runtime. This extra requirement can cause CI failures if the repo file mode isn't preserved (e.g., when checking out on some systems). Consider checking for existence/readability (`-f`/`-r`) instead, or invoke them directly without `bash` if executability is truly required.
```diff
- if [[ ! -x "$TSDB_FORCE_SCRIPT" || ! -x "$META_FORCE_SCRIPT" ]]; then
-   echo "required force-mode scripts are missing or not executable"
+ if [[ ! -r "$TSDB_FORCE_SCRIPT" || ! -r "$META_FORCE_SCRIPT" ]]; then
+   echo "required force-mode scripts are missing or not readable"
```
```c
if (code != TSDB_CODE_SUCCESS) {
  taosCleanupCfg();
  taosCloseLog();
  taosCleanupArgs();
  taosConvDestroy();
```
When repair mode is enabled (`global.repairCtx.enabled`), the process runs `dmRunRepairWorkflow()` but then continues into the normal daemon startup path (`dmInit`/`dmRun`). This makes `taosd -r --node-type/--file-type/...` behave non-offline, and it can hang or cause side effects in CI and ops usage if startup succeeds. Consider exiting cleanly after a successful repair workflow (with proper cleanup) when `repairCtx.enabled` is true, instead of proceeding to start services.
```diff
- if (code != TSDB_CODE_SUCCESS) {
-   taosCleanupCfg();
-   taosCloseLog();
-   taosCleanupArgs();
-   taosConvDestroy();
+ taosCleanupCfg();
+ taosCloseLog();
+ taosCleanupArgs();
+ taosConvDestroy();
+ if (code == TSDB_CODE_SUCCESS) {
+   return 0;
+ } else {
```
```c
#define DM_REPAIR              "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
#define DM_REPAIR_NODE_TYPE    "Repair target node type. Options: vnode, mnode, dnode, snode."
#define DM_REPAIR_FILE_TYPE    "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
#define DM_REPAIR_VNODE_ID     "Target vnode id list, separated by comma (required when --node-type=vnode)."
#define DM_REPAIR_BACKUP_PATH  "Backup path for corrupted files before repair."
#define DM_REPAIR_MODE         "Repair mode. Options: force, replica, copy."
#define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
```
The help text advertises multiple node/file repair targets (mnode/dnode/snode and file types like data/config/checkpoint), but the repair workflow in `dmMain` currently only performs vnode-level handlers (wal/tsdb/meta). With the current validator allowing these combinations, a non-vnode invocation can become a no-op that still reports a successful summary. Recommend either restricting `--node-type` to vnode for now (and updating help strings accordingly) or implementing the additional node/file-type handlers before advertising them.
```diff
- #define DM_REPAIR "Enable repair mode. Works with --node-type/--file-type/--mode and other repair options."
- #define DM_REPAIR_NODE_TYPE "Repair target node type. Options: vnode, mnode, dnode, snode."
- #define DM_REPAIR_FILE_TYPE "Repair target file type. Examples: vnode->wal|meta|tsdb; mnode->wal|data; dnode->config; snode->checkpoint."
- #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
- #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
- #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
- #define DM_REPAIR_REPLICA_NODE "Replica node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
+ #define DM_REPAIR "Enable repair mode. Currently supports vnode-level repair via --node-type, --file-type, --mode and other repair options."
+ #define DM_REPAIR_NODE_TYPE "Repair target node type. Currently supported option: vnode."
+ #define DM_REPAIR_FILE_TYPE "Repair target file type for vnode. Supported: wal | meta | tsdb."
+ #define DM_REPAIR_VNODE_ID "Target vnode id list, separated by comma (required when --node-type=vnode)."
+ #define DM_REPAIR_BACKUP_PATH "Backup path for corrupted files before repair."
+ #define DM_REPAIR_MODE "Repair mode. Options: force, replica, copy."
+ #define DM_REPAIR_REPLICA_NODE "Replica vnode node endpoint for copy mode. Format: <ip>:<dataDir>, required when --mode=copy."
```
- extend tRepairTryResumeSession to return resume step and validate step buffer args
- add tRepairResolveResumePlan and reuse it in dmRunRepairWorkflow
- support step-aware vnode start index for replica/copy/wal/tsdb/meta phases
- skip backup preparation when resuming from mode steps
- ensure successful workflow summary/state uses doneVnodes=totalVnodes
- tighten replica endpoint and copy command token validation to block shell metacharacter injection
- add/extend RepairOptionParseTest coverage for resume step, resume plan, and copy endpoint/cmd invalid inputs

Verification:
- cmake --build debug --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
- cmake --build debug --target taosd
…ngths
- centralize repair option names in dmMain.c
- add dmMatchLongOption/dmParseRepairCliLongOption to deduplicate CLI parsing
- add tRepairExtractLongOptionValue for "--opt value" and "--opt=value" forms
- replace duplicated parsing branches for node-type/file-type/vnode-id/backup-path/mode/replica-node
- remove hardcoded length literals (12/11/14/7/15) to reduce maintenance risk
- add unit tests for long-option extraction and invalid-arg handling

Test:
- cmake --build debug -j8 --target taosd
- cmake --build debug -j8 --target commonTest
- ASAN_OPTIONS=detect_leaks=0 ./debug/build/bin/commonTest --gtest_filter=RepairOptionParseTest.*
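The `tRepairExtractLongOptionValue` idea — accepting both `--opt value` and `--opt=value` without hardcoded length literals — might be sketched as follows (the name and getopt-style `argv`/index convention are assumptions based on the commit message):

```c
#include <string.h>

// Extract the value of a long option in either "--opt value" or
// "--opt=value" form. `*idx` points at the current argv entry and is
// advanced when the value is consumed from the next entry.
// Returns NULL when the option does not match or the value is missing.
static const char *extractLongOptionValue(const char *name, int argc,
                                          char **argv, int *idx) {
  const char *arg = argv[*idx];
  size_t      nameLen = strlen(name);  // no hardcoded length literal
  if (strncmp(arg, name, nameLen) != 0) return NULL;
  if (arg[nameLen] == '=') return arg + nameLen + 1;  // --opt=value
  if (arg[nameLen] == '\0') {                         // --opt value
    if (*idx + 1 >= argc) return NULL;                // value missing
    *idx += 1;
    return argv[*idx];
  }
  return NULL;  // e.g. "--modex" must not match "--mode"
}
```

A single dispatch loop over an option-name table can then replace the duplicated per-option parsing branches.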
Pull request overview
Copilot reviewed 19 out of 22 changed files in this pull request and generated 2 comments.
```c
if (global.repairCtx.enabled) {
  code = dmRunRepairWorkflow();
  if (code != TSDB_CODE_SUCCESS) {
    taosCleanupCfg();
    taosCloseLog();
    taosCleanupArgs();
    taosConvDestroy();
    return code;
  }
}

osSetProcPath(argc, (char **)argv);
taosCleanupArgs();
```
dmRunRepairWorkflow() returns success and execution continues into the normal dnode startup path. For a repair invocation (taosd -r --node-type ...), this can cause the process to turn into a long-running daemon (or hang under CI unless it fails later), and it can also trigger unrelated side effects (e.g., vnode open) after a potentially destructive repair. Consider exiting the process after the repair summary is written (with a success exit code), or gating the post-repair startup behind an explicit flag.
```c
int32_t           totalVnodes = global.repairCtx.nodeType == REPAIR_NODE_TYPE_VNODE ? global.repairCtx.vnodeIdNum : 0;
int32_t           doneVnodes = 0;
SRepairResumePlan resumePlan = {0};
```
Non-vnode repair targets are accepted by CLI validation, but the workflow currently sets totalVnodes to 0 when nodeType != vnode and then proceeds to report success without performing any repair work. If mnode/dnode/snode repair is not implemented yet, this should fail fast with a clear "not supported" error; otherwise totalVnodes should be computed appropriately and the corresponding handlers added.
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.