Conversation
- enable tsdb repair to enter the real execution path - add vnode/tsdb force repair handling and backup logic - add tsdb force repair pytest coverage and design notes - normalize timezone strings for repair-related tests
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 此拉取请求为 TDengine 引入了关键的数据修复功能,通过对 Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant new feature: data force repair for meta and tsdb data, including backup logic and detailed execution paths. While the implementation is comprehensive, a security audit identified two medium-severity issues related to how backups are handled: the default backup location in the system's temporary directory may expose sensitive data, and user-provided backup paths are not sanitized for path traversal, potentially allowing arbitrary directory creation. Furthermore, two critical race conditions related to tracking repaired vnodes in a multi-threaded context and a bug in a test helper function need to be addressed.
| static void metaMarkForceRepairDone(int32_t vgId) { | ||
| char vnodeText[32] = {0}; | ||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | ||
|
|
||
| if (tsMetaRepairDoneVnodeId[0] == '\0') { | ||
| tstrncpy(tsMetaRepairDoneVnodeId, vnodeText, sizeof(tsMetaRepairDoneVnodeId)); | ||
| return; | ||
| } | ||
|
|
||
| if (metaRepairListContains(tsMetaRepairDoneVnodeId, vgId)) { | ||
| return; | ||
| } | ||
|
|
||
| int32_t offset = (int32_t)strlen(tsMetaRepairDoneVnodeId); | ||
| snprintf(tsMetaRepairDoneVnodeId + offset, sizeof(tsMetaRepairDoneVnodeId) - offset, ",%s", vnodeText); | ||
| } |
There was a problem hiding this comment.
The global static variable tsMetaRepairDoneVnodeId is accessed and modified here without any synchronization. Since metaOpen is called in parallel for different vnodes, this will lead to a race condition when multiple threads try to update this shared string simultaneously. This can cause memory corruption or incorrect behavior.
Please protect all accesses to tsMetaRepairDoneVnodeId (in this function and in metaForceRepairMatchesVnode) with a mutex. A static TdThreadMutex would be appropriate here.
| static void tsdbMarkForceRepairDone(int32_t vgId) { | ||
| char vnodeText[32] = {0}; | ||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | ||
|
|
||
| if (tsTsdbRepairDoneVnodeId[0] == '\0') { | ||
| tstrncpy(tsTsdbRepairDoneVnodeId, vnodeText, sizeof(tsTsdbRepairDoneVnodeId)); | ||
| return; | ||
| } | ||
|
|
||
| if (tsdbRepairListContains(tsTsdbRepairDoneVnodeId, vgId)) { | ||
| return; | ||
| } | ||
|
|
||
| int32_t offset = (int32_t)strlen(tsTsdbRepairDoneVnodeId); | ||
| snprintf(tsTsdbRepairDoneVnodeId + offset, sizeof(tsTsdbRepairDoneVnodeId) - offset, ",%s", vnodeText); | ||
| } |
There was a problem hiding this comment.
Similar to the issue in metaOpen.c, the global static variable tsTsdbRepairDoneVnodeId is modified here without thread synchronization. The tsdbDispatchForceRepair function, which calls this, is part of the parallel vnode startup process. This creates a race condition that can lead to memory corruption or incorrect repair behavior.
Please add a mutex to protect all reads and writes to tsTsdbRepairDoneVnodeId.
| return int(value) | ||
| tdLog.exit(f"failed to resolve vnode id from show {dbname}.vgroups result: {row}") | ||
|
|
||
| def _start_repair_process(self, args): |
There was a problem hiding this comment.
The variable extra_env is used within this function but is not defined as a parameter or a local variable. This will cause a NameError if the function is ever called.
Although this helper function does not seem to be used in the current test suite, this is a latent bug that should be fixed. Please add extra_env=None to the function signature.
| def _start_repair_process(self, args): | |
| def _start_repair_process(self, args, extra_env=None): |
| } | ||
|
|
||
| static int32_t metaBuildRepairBackupDir(SVnode *pVnode, char *buf, int32_t bufLen) { | ||
| const char *root = dmRepairHasBackupPath() ? dmRepairBackupPath() : TD_TMP_DIR_PATH; |
There was a problem hiding this comment.
The data repair process defaults to backing up sensitive database metadata to the system's temporary directory (/tmp) when --backup-path is not specified. On multi-user systems, /tmp is typically world-readable, which could allow unauthorized local users to access sensitive database information. It is recommended to use a more secure default location (e.g., a subdirectory within the TDengine data directory with restricted permissions) or explicitly set restrictive permissions (e.g., 0700) on the created backup directories.
| } | ||
|
|
||
| static int32_t tsdbRepairBuildBackupFSetDir(STFileSystem *fs, int32_t fid, char *buf, int32_t bufLen) { | ||
| const char *root = dmRepairHasBackupPath() ? dmRepairBackupPath() : TD_TMP_DIR_PATH; |
There was a problem hiding this comment.
The data repair process defaults to backing up sensitive database files to the system's temporary directory (/tmp) when --backup-path is not specified. On multi-user systems, /tmp is typically world-readable, which could allow unauthorized local users to access sensitive database information. It is recommended to use a more secure default location or explicitly set restrictive permissions on the created backup directories.
| snprintf(buf, bufLen, "%s%staos_backup_%s%svnode%d%smeta", root, sep, dateBuf, TD_DIRSEP, TD_VID(pVnode), | ||
| TD_DIRSEP); |
There was a problem hiding this comment.
The user-provided --backup-path is used to construct filesystem paths for data backups without sanitization for path traversal sequences (e.g., ..). This allows a user with the ability to run the repair command to create directories and write backup files in arbitrary locations on the filesystem. While the impact is limited by the fixed subdirectory structure appended to the path, it is best practice to sanitize the input or validate that the resolved path remains within an authorized base directory.
| snprintf(buf, bufLen, "%s%staos_backup_%s%svnode%d%stsdb%sfid_%d", root, sep, dateBuf, TD_DIRSEP, | ||
| TD_VID(fs->tsdb->pVnode), TD_DIRSEP, TD_DIRSEP, fid); |
There was a problem hiding this comment.
The user-provided --backup-path is used to construct filesystem paths for data backups without sanitization for path traversal sequences (e.g., ..). This allows a user with the ability to run the repair command to create directories and write backup files in arbitrary locations on the filesystem. It is recommended to sanitize the input or validate that the resolved path remains within an authorized base directory.
There was a problem hiding this comment.
Pull request overview
This PR enables real execution for taosd -r --node-type vnode --file-type tsdb --mode force by dispatching repair logic inside vnode/TSDB open paths, adds backup/manifest/log handling for affected TSDB file-sets, and introduces end-to-end pytest coverage plus supporting docs/design notes. It also normalizes timezone strings (e.g., /UTC → UTC) to stabilize repair-related startup/tests.
Changes:
- Add TSDB force-repair dispatch in
tsdb open fs, including affected file-set backup (manifest + original files) and core-group drop/rebuild logic. - Add META force-repair dispatch in
metaOpen()with external backup support and CLI parameter flow accessors. - Add/extend pytest E2E coverage for repair flows and add a timezone normalization fix + unit test.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/cases/80-Components/01-Taosd/test_tsdb_force_repair.py | New E2E suite for TSDB force repair dispatch, backup/log behavior, crash-safe manifest staging, and rebuild scenarios. |
| test/cases/80-Components/01-Taosd/test_meta_force_repair.py | New E2E coverage for META force repair vnode targeting and backup directory creation. |
| test/cases/80-Components/01-Taosd/test_com_cmdline.py | Adds CLI parameter-layer regression matrix for taosd -r (phase1 behaviors). |
| source/os/src/osTimezone.c | Normalizes timezone strings via truncateTimezoneString() to handle leading /. |
| source/os/test/osTimeTests.cpp | Adds a unit test for the timezone string normalization behavior. |
| source/dnode/mgmt/exe/dmMain.c | Implements repair option parsing/validation and exposes repair accessors for vnode layers. |
| source/dnode/mgmt/node_mgmt/inc/dmMgmt.h | Declares repair option accessor APIs. |
| source/dnode/vnode/src/meta/metaOpen.c | Dispatches meta force repair per vnode and adds external backup copy logic. |
| source/dnode/vnode/src/tsdb/tsdbFS2.c | Adds TSDB force repair dispatch during FS open, backup/manifest/logging, and core repair actions. |
| source/dnode/vnode/src/vnd/vnodeRepair.c | Provides weak fallback stubs for dmRepair* accessors for linkability in libvnode contexts. |
| source/dnode/vnode/CMakeLists.txt | Adds the new vnodeRepair.c compilation unit. |
| docs/plans/2026-03-07-tsdb-force-repair-*.md | Adds TSDB force repair design + implementation plan notes. |
| docs/plans/2026-03-06-meta-force-repair-*.md | Adds META force repair design + implementation plan notes. |
| docs/data_repair/03-TSDB_repair/* | Adds phase3 task plan/findings/progress/handoff documentation for TSDB repair workstream. |
| docs/data_repair/02-META_repair/* | Adds/updates phase2 task plan/findings/progress documentation for META repair. |
| AGENTS.md | Adds repository guidelines for structure/build/test/style. |
| .gitignore | Ignores .agents/ and skills-lock.json. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static void metaMarkForceRepairDone(int32_t vgId) { | ||
| char vnodeText[32] = {0}; | ||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | ||
|
|
||
| if (tsMetaRepairDoneVnodeId[0] == '\0') { | ||
| tstrncpy(tsMetaRepairDoneVnodeId, vnodeText, sizeof(tsMetaRepairDoneVnodeId)); | ||
| return; | ||
| } | ||
|
|
||
| if (metaRepairListContains(tsMetaRepairDoneVnodeId, vgId)) { | ||
| return; | ||
| } | ||
|
|
||
| int32_t offset = (int32_t)strlen(tsMetaRepairDoneVnodeId); | ||
| snprintf(tsMetaRepairDoneVnodeId + offset, sizeof(tsMetaRepairDoneVnodeId) - offset, ",%s", vnodeText); | ||
| } | ||
|
|
||
| static bool metaForceRepairMatchesVnode(int32_t vgId) { | ||
| if (!dmRepairFlowEnabled() || metaRepairListContains(tsMetaRepairDoneVnodeId, vgId)) { | ||
| return false; |
There was a problem hiding this comment.
The force-repair "done" tracking uses a process-global CSV string (tsMetaRepairDoneVnodeId) that is read/modified from metaOpen(). Vnodes are opened concurrently (vmOpenVnodes launches multiple threads), so this introduces a data race and possible string corruption, leading to repeated or skipped repairs. Protect the read/check/append with a mutex (or replace the string with a thread-safe set/bitmap keyed by vgId).
| @@ -32,12 +44,666 @@ typedef struct { | |||
| STFileHashEntry **buckets; | |||
| } STFileHash; | |||
|
|
|||
| typedef enum { | |||
| TSDB_REPAIR_CORE_KEEP = 0, | |||
| TSDB_REPAIR_CORE_DROP, | |||
| TSDB_REPAIR_CORE_REBUILD, | |||
| } ECoreRepairAction; | |||
|
|
|||
| typedef struct { | |||
| int32_t fid; | |||
| bool affected; | |||
| bool dropStt; | |||
| ECoreRepairAction coreAction; | |||
| bool staged; | |||
| int32_t totalBlocks; | |||
| int32_t keptBlocks; | |||
| int32_t droppedBlocks; | |||
| char coreReason[64]; | |||
| } STsdbRepairPlan; | |||
|
|
|||
| static int32_t save_json(const cJSON *json, const char *fname); | |||
|
|
|||
| static int32_t tsdbFSDupState(STFileSystem *fs); | |||
|
|
|||
| static int32_t commit_edit(STFileSystem *fs); | |||
|
|
|||
| static void tsdbRepairPlanInit(const STFileSet *fset, STsdbRepairPlan *plan) { | |||
| memset(plan, 0, sizeof(*plan)); | |||
| plan->fid = fset->fid; | |||
| } | |||
|
|
|||
| static void tsdbRepairPlanSetCore(STsdbRepairPlan *plan, ECoreRepairAction action, const char *reason) { | |||
| if (plan->coreAction == TSDB_REPAIR_CORE_DROP) { | |||
| return; | |||
| } | |||
| if (action == TSDB_REPAIR_CORE_DROP || plan->coreAction == TSDB_REPAIR_CORE_KEEP) { | |||
| plan->coreAction = action; | |||
| if (reason != NULL) { | |||
| tstrncpy(plan->coreReason, reason, sizeof(plan->coreReason)); | |||
| } | |||
| } | |||
| plan->affected = plan->dropStt || (plan->coreAction != TSDB_REPAIR_CORE_KEEP); | |||
| } | |||
|
|
|||
| static const char *gCurrentFname[] = { | |||
| [TSDB_FCURRENT] = "current.json", | |||
| [TSDB_FCURRENT_C] = "current.c.json", | |||
| [TSDB_FCURRENT_M] = "current.m.json", | |||
| }; | |||
|
|
|||
| static bool tsdbRepairListContains(const char *csv, int32_t vgId) { | |||
| if (csv == NULL || csv[0] == '\0') { | |||
| return false; | |||
| } | |||
|
|
|||
| char buf[PATH_MAX] = {0}; | |||
| tstrncpy(buf, csv, sizeof(buf)); | |||
|
|
|||
| char *savePtr = NULL; | |||
| for (char *token = strtok_r(buf, ",", &savePtr); token != NULL; token = strtok_r(NULL, ",", &savePtr)) { | |||
| if (atoi(token) == vgId) { | |||
| return true; | |||
| } | |||
| } | |||
|
|
|||
| return false; | |||
| } | |||
|
|
|||
| static void tsdbMarkForceRepairDone(int32_t vgId) { | |||
| char vnodeText[32] = {0}; | |||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | |||
|
|
|||
| if (tsTsdbRepairDoneVnodeId[0] == '\0') { | |||
| tstrncpy(tsTsdbRepairDoneVnodeId, vnodeText, sizeof(tsTsdbRepairDoneVnodeId)); | |||
| return; | |||
| } | |||
|
|
|||
| if (tsdbRepairListContains(tsTsdbRepairDoneVnodeId, vgId)) { | |||
| return; | |||
| } | |||
|
|
|||
| int32_t offset = (int32_t)strlen(tsTsdbRepairDoneVnodeId); | |||
| snprintf(tsTsdbRepairDoneVnodeId + offset, sizeof(tsTsdbRepairDoneVnodeId) - offset, ",%s", vnodeText); | |||
| } | |||
|
|
|||
| static bool tsdbShouldForceRepair(STFileSystem *fs) { | |||
| int32_t vgId = TD_VID(fs->tsdb->pVnode); | |||
|
|
|||
| if (!dmRepairFlowEnabled() || tsdbRepairListContains(tsTsdbRepairDoneVnodeId, vgId)) { | |||
| return false; | |||
There was a problem hiding this comment.
The TSDB force-repair "done" tracking uses a process-global CSV string (tsTsdbRepairDoneVnodeId) that is read/modified during open_fs(). Vnodes are opened concurrently, so this is not thread-safe and can corrupt the buffer or cause incorrect de-dup decisions. Guard accesses with a mutex (or use a concurrent set/bitmap keyed by vgId).
| def _start_repair_process(self, args): | ||
| bin_path = self._get_taosd_bin() | ||
| cmd = [bin_path, "-c", self._get_cfg_dir()] + shlex.split(args) | ||
| tdLog.info("run repair cmd: %s" % " ".join(cmd)) | ||
| env = os.environ.copy() | ||
| asan_options = env.get("ASAN_OPTIONS", "") | ||
| if "detect_leaks=" not in asan_options: | ||
| env["ASAN_OPTIONS"] = ( | ||
| "detect_leaks=0" if not asan_options else asan_options + ":detect_leaks=0" | ||
| ) | ||
| env.setdefault("LSAN_OPTIONS", "detect_leaks=0") | ||
| if extra_env: | ||
| env.update(extra_env) | ||
| return subprocess.Popen( |
There was a problem hiding this comment.
_start_repair_process() references extra_env but the function has no such parameter/variable in scope, so this will raise a NameError at runtime when the helper is used. Add an extra_env parameter (default None) and apply it to env, or remove the dead code path if not needed.
| def _run_taosd_with_cfg(self, args, timeout_sec=None, extra_env=None): | ||
| bin_path = self._get_taosd_bin() | ||
| cmd = [bin_path, "-c", self._get_cfg_dir()] + shlex.split(args) | ||
| tdLog.info("run cmd: %s" % " ".join(cmd)) | ||
| env = os.environ.copy() | ||
| asan_options = env.get("ASAN_OPTIONS", "") | ||
| if "detect_leaks=" not in asan_options: | ||
| env["ASAN_OPTIONS"] = ( | ||
| "detect_leaks=0" if not asan_options else asan_options + ":detect_leaks=0" | ||
| ) | ||
| env.setdefault("LSAN_OPTIONS", "detect_leaks=0") | ||
| proc = subprocess.Popen( | ||
| cmd, | ||
| stdout=subprocess.PIPE, | ||
| stderr=subprocess.STDOUT, | ||
| text=True, | ||
| encoding="utf-8", | ||
| env=env, | ||
| ) |
There was a problem hiding this comment.
_run_taosd_with_cfg() accepts an extra_env argument but never applies it to the subprocess environment. Either remove the parameter or update the function to merge extra_env into env so callers can reliably inject test-only environment variables.
| def _run_taosd(self, args): | ||
| bin_path = self._get_taosd_bin() | ||
| cmd = [bin_path] + shlex.split(args) | ||
| tdLog.info("run cmd: %s" % " ".join(cmd)) | ||
| env = os.environ.copy() | ||
| asan_options = env.get("ASAN_OPTIONS", "") | ||
| if "detect_leaks=" not in asan_options: | ||
| env["ASAN_OPTIONS"] = "detect_leaks=0" if not asan_options else asan_options + ":detect_leaks=0" | ||
| proc = subprocess.run( | ||
| cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, encoding="utf-8", env=env | ||
| ) | ||
| output = proc.stdout or "" | ||
| tdLog.info("ret=%s output=%s" % (proc.returncode, output[:500].replace("\n", "\\n"))) | ||
| return proc.returncode, output |
There was a problem hiding this comment.
_run_taosd() uses subprocess.run() without a timeout. Several cases invoke taosd -r ... without -V and could hang if the CLI behavior regresses, which would stall CI. Add a reasonable timeout and include the timeout handling in the assertion helper.
…force flag - Simplify repair mode help text and remove deprecated `--force` option validation - Add detailed comments to `SDmRepairOption` structure for better clarity - Remove phase1 execution restriction for non-meta/tsdb file types - Update repair help text to reflect current compatibility rules - Refactor meta repair functions with improved strategy handling
- Split metaForceRepair into strategy-specific functions (metaForceRepairFromUid) - Simplify metaGetRepairStrategy to return default strategy directly - Improve code organization and maintainability for meta repair operations
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| metaError("vgId:%d, %s failed at %s:%d since %s", TD_VID(pVnode), __func__, __FILE__, __LINE__, tstrerror(code)); | ||
| return code; | ||
| } | ||
|
|
There was a problem hiding this comment.
metaMarkForceRepairDone() is never called after a successful meta force repair. Without marking the vnode as repaired, repeated metaOpen() calls in the same process could re-run force repair unexpectedly. After a successful metaForceRepair() (or after metaForceRepairIfShould() returns success), record completion via metaMarkForceRepairDone(TD_VID(pVnode)).
| metaMarkForceRepairDone(TD_VID(pVnode)); |
| code, output = self._run_taosd_with_cfg( | ||
| f"-r --node-type vnode --file-type tsdb --vnode-id {vnode_id} --mode force --log-output /dev/null" | ||
| ) | ||
|
|
||
| tdSql.checkEqual("repair execution is not enabled in this phase" in output, False) | ||
| tdSql.checkEqual(code == 0 and "repair parameter validation succeeded (phase1)" in output, False) |
There was a problem hiding this comment.
The test docstring says taosd is stopped before running tsdb force repair, but the test doesn't actually stop the running dnode before launching taosd -r .... This can make the test flaky (port already in use / data directory lock) and diverges from the intended scenario. Stop taosd (e.g., tdDnodes.stop(1)) before invoking _run_taosd_with_cfg, and restart in a finally block like the other tests in this file.
| code, output = self._run_taosd_with_cfg( | |
| f"-r --node-type vnode --file-type tsdb --vnode-id {vnode_id} --mode force --log-output /dev/null" | |
| ) | |
| tdSql.checkEqual("repair execution is not enabled in this phase" in output, False) | |
| tdSql.checkEqual(code == 0 and "repair parameter validation succeeded (phase1)" in output, False) | |
| tdDnodes.stop(1) | |
| try: | |
| code, output = self._run_taosd_with_cfg( | |
| f"-r --node-type vnode --file-type tsdb --vnode-id {vnode_id} --mode force --log-output /dev/null" | |
| ) | |
| tdSql.checkEqual("repair execution is not enabled in this phase" in output, False) | |
| tdSql.checkEqual(code == 0 and "repair parameter validation succeeded (phase1)" in output, False) | |
| finally: | |
| tdDnodes.start(1) |
| static int32_t dmParseRepairOption(int32_t argc, char const *argv[], int32_t *pIndex, bool *pParsed) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
| int32_t index = *pIndex; | ||
| const char *arg = argv[index]; | ||
| bool matched = false; | ||
| bool optMatched = false; | ||
| SDmRepairOption *pOpt = &global.repairOpt; | ||
|
|
||
| *pParsed = false; |
There was a problem hiding this comment.
dmParseRepairOption() declares const char *arg = argv[index]; but never uses it. This can trigger -Wunused-variable warnings in builds that treat warnings as errors. Remove the unused variable (or use it) to keep the build clean.
| static int32_t metaBackupCurrentMeta(SVnode *pVnode) { | ||
| char metaDir[TSDB_FILENAME_LEN] = {0}; | ||
| char backupDir[TSDB_FILENAME_LEN] = {0}; | ||
|
|
||
| vnodeGetMetaPath(pVnode, VNODE_META_DIR, metaDir); | ||
|
|
||
| int32_t code = metaBuildRepairBackupDir(pVnode, backupDir, sizeof(backupDir)); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
|
|
||
| if (taosCheckExistFile(backupDir)) { | ||
| metaError("vgId:%d repair backup dir already exists: %s", TD_VID(pVnode), backupDir); | ||
| return TSDB_CODE_FS_FILE_ALREADY_EXISTS; | ||
| } | ||
|
|
||
| code = metaCopyDirRecursive(metaDir, backupDir); | ||
| if (code != 0) { | ||
| metaError("vgId:%d failed to back up meta from %s to %s, reason:%s", TD_VID(pVnode), metaDir, backupDir, | ||
| tstrerror(code)); | ||
| return code; | ||
| } | ||
|
|
||
| metaInfo("vgId:%d backed up meta to %s", TD_VID(pVnode), backupDir); | ||
| return TSDB_CODE_SUCCESS; | ||
| } |
There was a problem hiding this comment.
metaBackupCurrentMeta() is defined but never called. As a result, meta force repair won't create the external backup directory that the tests (and design notes) expect. Call metaBackupCurrentMeta(pVnode) when metaShouldForceRepair() is true (ideally before any local rename/switch operations) and handle/propagate its error code.
- Introduce new `--repair-target` parameter to replace single-profile options (`--file-type`, `--vnode-id`) - Support multiple repair targets in a single `taosd -r` invocation - Define grammar: `<file-type>:<key>=<value>[:<key>=<value>]...` with file types: meta, tsdb, wal - Enforce validation: requires `-r`, `--mode force`, `--node-type vnode`, and at least one `--repair-target` - Remove backward compatibility for old CLI options, treat them as invalid - Update error messages and test cases to reflect new interface - Centralize parsing in `dmMain.c` with normalized target structures
…ggregated structures - Replace `SArray<SDmRepairTarget>` runtime model with three aggregated structures: `meta` (vnodeId -> strategy), `wal` (vnodeId -> presence), and `tsdb` (vnodeId -> fileId -> strategy) - Update `dmRepair.h` to expose only leaf configurations and dedicated accessors, hiding underlying containers - Modify `metaOpen.c` and `tsdbFS2.c` to use new accessors instead of scanning generic target lists - Update documentation in `findings.md` and `progress.md` to reflect the refined data model - Ensure parser test cases (`test_com_cmdline.py -k repair_cmdline_repair_target`) continue to pass after refactoring
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (int32_t ftype = TSDB_FTYPE_HEAD; ftype <= TSDB_FTYPE_SMA; ++ftype) { | ||
| if (fset->farr[ftype] != NULL) { | ||
| STFileOp op = {.optype = TSDB_FOP_REMOVE, .fid = fset->fid, .of = fset->farr[ftype]->f[0]}; | ||
| code = TARRAY2_APPEND(fopArr, op); | ||
| if (code != 0) goto _exit; | ||
|
|
||
| TAOS_UNUSED(tsdbTFileObjUnref(fset->farr[ftype])); | ||
| fset->farr[ftype] = NULL; | ||
| } |
There was a problem hiding this comment.
tsdbRepairDropCoreOnTmpFSet() clears/unrefs fset->farr[ftype] before calling tsdbTFileSetEdit() with a TSDB_FOP_REMOVE op. tsdbTFileSetEdit()'s remove path dereferences fset->farr[op->of.type] to unref it, so setting it to NULL here will lead to a NULL dereference (and also double-unref if it were non-NULL). Let tsdbTFileSetEdit() own the unref/NULL assignment (i.e., remove the manual tsdbTFileObjUnref + farr[ftype]=NULL), or apply the edit before mutating fset->farr.
| // Open a new meta for organization | ||
| code = metaOpenImpl(pMeta->pVnode, &pNewMeta, VNODE_META_TMP_DIR, false); | ||
| if (code) { | ||
| return code; | ||
| } | ||
|
|
||
| code = metaBegin(pNewMeta, META_BEGIN_HEAP_NIL); | ||
| if (code) { | ||
| return code; | ||
| } |
There was a problem hiding this comment.
metaForceRepair() has multiple early-return error paths after successfully opening pNewMeta (and/or starting a transaction) that don't close pNewMeta or otherwise clean up the temp meta state. This can leak resources and potentially leave meta_tmp in a partially initialized state. Consider switching to a single goto _exit cleanup path that always metaClose(&pNewMeta) (and aborts/rolls back any started transaction) before returning.
| static void metaMarkForceRepairDone(int32_t vgId) { | ||
| char vnodeText[32] = {0}; | ||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | ||
|
|
||
| if (tsMetaRepairDoneVnodeId[0] == '\0') { | ||
| tstrncpy(tsMetaRepairDoneVnodeId, vnodeText, sizeof(tsMetaRepairDoneVnodeId)); | ||
| return; | ||
| } | ||
|
|
||
| if (metaRepairListContains(tsMetaRepairDoneVnodeId, vgId)) { | ||
| return; | ||
| } | ||
|
|
||
| int32_t offset = (int32_t)strlen(tsMetaRepairDoneVnodeId); | ||
| snprintf(tsMetaRepairDoneVnodeId + offset, sizeof(tsMetaRepairDoneVnodeId) - offset, ",%s", vnodeText); | ||
| } |
There was a problem hiding this comment.
metaMarkForceRepairDone() / tsMetaRepairDoneVnodeId are introduced and used for the “already repaired in this process” check, but the code never calls metaMarkForceRepairDone() after a successful force repair. As a result, the done-list check will never take effect, and the helper itself is currently dead code. Either call metaMarkForceRepairDone(TD_VID(pVnode)) after metaForceRepair() succeeds (so repeated metaOpen() calls don't re-run repair), or remove the unused tracking code.
| static int32_t metaForceRepairIfShould(SVnode *pVnode, SMeta **ppMeta) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
| EDmRepairStrategy strategy = DM_REPAIR_STRATEGY_META_FROM_UID; | ||
| bool shouldForceRepair = metaShouldForceRepair(pVnode, &strategy); | ||
|
|
||
| // Check if meta should repair | ||
| if (!shouldForceRepair) { | ||
| metaDebug("vgId:%d, meta should not repair!", TD_VID(pVnode)); | ||
| return code; | ||
| } | ||
|
|
||
| // Do repair | ||
| code = metaForceRepair(ppMeta, strategy); | ||
| if (code) { | ||
| metaError("vgId:%d, %s failed at %s:%d since %s", TD_VID(pVnode), __func__, __FILE__, __LINE__, tstrerror(code)); | ||
| return code; | ||
| } | ||
|
|
||
| return code; |
There was a problem hiding this comment.
After a successful repair in metaForceRepairIfShould(), the vnode isn't marked as “repaired” for this process, so a later metaOpen() for the same vnode (e.g., reopen paths) could re-run force repair again. This is especially important since metaForceRepairMatchesVnode() checks tsMetaRepairDoneVnodeId but that list is never updated. Mark the vnode as done on success (or remove the done-list gate entirely).
- Introduce SRepairVnodeOpt structure to encapsulate vnode-specific repair options - Update SDmRepairOption to use vnodeOpt and prepare for mnode/snode support - Adjust cleanup and target insertion functions to work with new structure - Clarify node-type options in comments and add new repair modes (copy, replica)
… flag - Remove 'dnode' from the node-type option comment in dmMain.c as it is no longer supported - Eliminate the global variable generateNewMeta and all its references across dmMain.c and metaOpen.c - Simplify repair option handling by removing unnecessary flag resets - Update code to reflect current supported node types: vnode, mnode, snode
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | ||
|
|
There was a problem hiding this comment.
These plan docs include assistant-specific instructions (e.g. “For Claude: REQUIRED SUB-SKILL…”). This is likely to confuse human readers and may become stale quickly. Consider moving AI-runbook content into AGENTS.md (or an internal-only doc) and keep repository plans focused on the technical design/tasks.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. |
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | ||
|
|
There was a problem hiding this comment.
This plan doc contains assistant-specific instructions (“For Claude…Use superpowers…”). If these files are intended for general engineering handoff, consider removing or relocating the AI runbook content to AGENTS.md so the plan remains tool-agnostic and easier to maintain.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. |
| # META Force Repair Implementation Plan | ||
|
|
||
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | ||
|
|
||
| **Goal:** Add real execution for `taosd -r --mode force --file-type meta` by enhancing the existing `metaGenerateNewMeta()` path so each vnode decides during `metaOpen()` whether it should run force repair, with external backup support and crash-safe local directory switching. | ||
|
|
There was a problem hiding this comment.
This plan doc includes assistant-specific directions (“For Claude…Use superpowers…”). If the intent is a reusable engineering plan, consider removing those lines or moving them into an AI-specific runbook (e.g., AGENTS.md) to avoid confusing non-AI readers.
- Remove tsMetaRepairDoneVnodeId static variable and related helper functions - Simplify metaShouldForceRepair by directly checking repair flow and vnode options - Eliminate redundant metaForceRepairMatchesVnode function - Improve code clarity by removing unnecessary indirection in repair condition checks
Disable the meta backup step in metaForceRepair by commenting it out with #if 0. This change is likely a temporary measure to allow the repair process to proceed without performing a backup, possibly for debugging or to avoid backup-related failures during forced repairs. The statistics reset step remains active.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #if 0 | ||
| code = metaBackupCurrentMeta(pVnode); | ||
| if (code != 0) { | ||
| metaError("vgId:%d failed to back up current meta, reason:%s", TD_VID(pVnode), tstrerror(code)); | ||
| return code; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
metaBackupCurrentMeta() is never executed because the call is wrapped in #if 0 inside metaForceRepair(). This makes the external backup feature effectively disabled and will also break the E2E expectation that a backup directory is created during meta force repair. Please remove the compile-time guard (or gate it on dmRepairHasBackupPath() / repair mode) so backups are actually produced in repair runs.
| #if 0 | |
| code = metaBackupCurrentMeta(pVnode); | |
| if (code != 0) { | |
| metaError("vgId:%d failed to back up current meta, reason:%s", TD_VID(pVnode), tstrerror(code)); | |
| return code; | |
| } | |
| #endif | |
| if (dmRepairHasBackupPath()) { | |
| code = metaBackupCurrentMeta(pVnode); | |
| if (code != 0) { | |
| metaError("vgId:%d failed to back up current meta, reason:%s", TD_VID(pVnode), tstrerror(code)); | |
| return code; | |
| } | |
| } |
| @@ -32,12 +36,679 @@ typedef struct { | |||
| STFileHashEntry **buckets; | |||
| } STFileHash; | |||
|
|
|||
| typedef enum { | |||
| TSDB_REPAIR_CORE_KEEP = 0, | |||
| TSDB_REPAIR_CORE_DROP, | |||
| TSDB_REPAIR_CORE_REBUILD, | |||
| } ECoreRepairAction; | |||
|
|
|||
| typedef struct { | |||
| int32_t fid; | |||
| bool affected; | |||
| bool dropStt; | |||
| ECoreRepairAction coreAction; | |||
| bool staged; | |||
| int32_t totalBlocks; | |||
| int32_t keptBlocks; | |||
| int32_t droppedBlocks; | |||
| char coreReason[64]; | |||
| } STsdbRepairPlan; | |||
|
|
|||
| static int32_t save_json(const cJSON *json, const char *fname); | |||
|
|
|||
| static int32_t tsdbFSDupState(STFileSystem *fs); | |||
|
|
|||
| static int32_t commit_edit(STFileSystem *fs); | |||
|
|
|||
| static void tsdbRepairPlanInit(const STFileSet *fset, STsdbRepairPlan *plan) { | |||
| memset(plan, 0, sizeof(*plan)); | |||
| plan->fid = fset->fid; | |||
| } | |||
|
|
|||
| static void tsdbRepairPlanSetCore(STsdbRepairPlan *plan, ECoreRepairAction action, const char *reason) { | |||
| if (plan->coreAction == TSDB_REPAIR_CORE_DROP) { | |||
| return; | |||
| } | |||
| if (action == TSDB_REPAIR_CORE_DROP || plan->coreAction == TSDB_REPAIR_CORE_KEEP) { | |||
| plan->coreAction = action; | |||
| if (reason != NULL) { | |||
| tstrncpy(plan->coreReason, reason, sizeof(plan->coreReason)); | |||
| } | |||
| } | |||
| plan->affected = plan->dropStt || (plan->coreAction != TSDB_REPAIR_CORE_KEEP); | |||
| } | |||
|
|
|||
| static const char *gCurrentFname[] = { | |||
| [TSDB_FCURRENT] = "current.json", | |||
| [TSDB_FCURRENT_C] = "current.c.json", | |||
| [TSDB_FCURRENT_M] = "current.m.json", | |||
| }; | |||
|
|
|||
| static bool tsdbRepairListContains(const char *csv, int32_t vgId) { | |||
| if (csv == NULL || csv[0] == '\0') { | |||
| return false; | |||
| } | |||
|
|
|||
| char buf[PATH_MAX] = {0}; | |||
| tstrncpy(buf, csv, sizeof(buf)); | |||
|
|
|||
| char *savePtr = NULL; | |||
| for (char *token = strtok_r(buf, ",", &savePtr); token != NULL; token = strtok_r(NULL, ",", &savePtr)) { | |||
| if (atoi(token) == vgId) { | |||
| return true; | |||
| } | |||
| } | |||
|
|
|||
| return false; | |||
| } | |||
|
|
|||
| static void tsdbMarkForceRepairDone(int32_t vgId) { | |||
| char vnodeText[32] = {0}; | |||
| snprintf(vnodeText, sizeof(vnodeText), "%d", vgId); | |||
|
|
|||
| if (tsTsdbRepairDoneVnodeId[0] == '\0') { | |||
| tstrncpy(tsTsdbRepairDoneVnodeId, vnodeText, sizeof(tsTsdbRepairDoneVnodeId)); | |||
| return; | |||
| } | |||
|
|
|||
| if (tsdbRepairListContains(tsTsdbRepairDoneVnodeId, vgId)) { | |||
| return; | |||
| } | |||
|
|
|||
| int32_t offset = (int32_t)strlen(tsTsdbRepairDoneVnodeId); | |||
| snprintf(tsTsdbRepairDoneVnodeId + offset, sizeof(tsTsdbRepairDoneVnodeId) - offset, ",%s", vnodeText); | |||
| } | |||
There was a problem hiding this comment.
tsTsdbRepairDoneVnodeId is a process-global mutable buffer used to track which vnodes have been repaired. Vnodes are opened concurrently (see vmOpenVnodes() spawning threads in source/dnode/mgmt/mgmt_vnode/src/vmInt.c), so reads/writes to this global (via tsdbShouldForceRepair() / tsdbMarkForceRepairDone()) can race and corrupt the CSV or cause missed/duplicate repairs. Please make this tracking thread-safe (e.g., a mutex-protected hash/set keyed by vgId, or store the flag on SVnode/STsdb instance) or otherwise ensure the open path is single-threaded in repair mode.
| @@ -0,0 +1,352 @@ | |||
| # META Force Repair Implementation Plan | |||
|
|
|||
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | |||
There was a problem hiding this comment.
This plan document includes assistant-specific meta text ("For Claude" / "superpowers:executing-plans"). Please remove or replace with repo/tool-agnostic guidance to avoid confusion for future maintainers.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | |
| > Note: This implementation plan is intended to be executed task-by-task, either manually or by automation tooling. |
Introduce walShouldDeleteCorruption inline function to conditionally delete corrupted WAL files based on tsWalDeleteOnCorruption flag or dmRepairNeedWalRepair status. This replaces direct checks of tsWalDeleteOnCorruption in walLogEntriesComplete and walCheckAndRepairMeta, enabling more flexible corruption handling that considers both global configuration and vgId-specific repair requirements.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static char tsTsdbRepairDoneVnodeId[PATH_MAX] = {0}; | ||
|
|
There was a problem hiding this comment.
tsTsdbRepairDoneVnodeId is a file-static global that is mutated by tsdbShouldForceRepair() / tsdbMarkForceRepairDone() without synchronization. Vnodes are opened concurrently (multiple threads in vmOpenVnodes), so this is a data race and can corrupt the CSV buffer or lead to repeated/partial dispatch decisions. Please make the “already repaired” state per-vnode/per-STFileSystem (or guard shared state with a mutex/atomic).
| static FORCE_INLINE bool walShouldDeleteCorruption(const SWal* pWal) { | ||
| return tsWalDeleteOnCorruption || dmRepairNeedWalRepair(pWal->cfg.vgId); | ||
| } |
There was a problem hiding this comment.
walShouldDeleteCorruption() calls dmRepairNeedWalRepair(), but in this PR that symbol is only implemented in the taosd executable (dmMain.c) and in vnodeRepair.c (vnode library). The WAL unit test (walTest) links wal without those objects, so this introduces an undefined symbol at link time for test builds (and any other binary that links wal but not vnode/mgmt). Provide weak/default dmRepair implementations from a library that wal always links (e.g., common), or otherwise ensure the symbol is available wherever wal is used.
| code, output = self._run_taosd_with_cfg( | ||
| self._tsdb_repair_args(vnode_id, repair_fid, extra_args="--log-output /dev/null") | ||
| ) |
There was a problem hiding this comment.
This test runs taosd -r ... via _run_taosd_with_cfg() while the dnode is still running (unlike the other tests in this file which stop the dnode first). Starting a second taosd against the same config/data dir is likely to fail (port conflict / file lock / concurrent data access) and makes the test flaky. Stop tdDnodes before invoking the repair process and restart it in a finally block, consistent with test_tsdb_force_repair_dispatches_in_open_fs().
- Add `dmRepairNodeTypeIsVnode()` and `dmRepairModeIsForce()` functions to header and implementation - Introduce `tsdbShouldForceRepair()` and `tsdbForceRepair()` function declarations for TSDB repair logic - Refactor TSDB repair flow to conditionally apply force repair based on node type and mode - Improve repair plan handling with detailed block validation and error reporting
- Implement `tsdbForceRepair` to handle missing or corrupted files in TSDB - Add helper functions for detecting bad files and performing deep scans - Include STT file validation and data part verification mechanisms - Introduce commit change process with proper locking for thread safety - Structure repair operations to maintain data consistency during recovery
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static int32_t tsdbDeepScanAndFixSttFile(STFileset *pFileSet, STFileObj *pStt) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
| SSttFileReader *reader; | ||
| const TSttBlkArray *sttBlkArray = NULL; | ||
|
|
||
| // Open | ||
| SSttFileReaderConfig config = { | ||
| // TODO | ||
|
|
||
| }; | ||
| code = tsdbSttFileReaderOpen(pStt->fname, &config, &reader); | ||
| if (code) { | ||
| // TODO: error handle, need to delete this file | ||
| } | ||
|
|
||
| // read the index part | ||
| code = tsdbSttFileReadSttBlk(reader, &sttBlkArray); | ||
| if (code) { | ||
| // TODO: error handle, need to delete this file | ||
| } | ||
|
|
||
| // Loop to read each data part | ||
| for (int32_t i = 0; i < sttBlkArray->size; i++) { | ||
| SSttBlk *pSttBlk = ; | ||
| code = tsdbReadFile(STsdbFD * pFD, int64_t offset, uint8_t *pBuf, int64_t size, int64_t szHint, | ||
| SEncryptData *encryptData); | ||
| if (code) { | ||
| // TODO: find a bad block, need to eliminate it | ||
| } | ||
| }; |
There was a problem hiding this comment.
tsdbRepair.c currently contains incomplete/invalid C code (e.g., wrong type STFileset, empty initializer SSttBlk *pSttBlk = ;, and a placeholder call to tsdbReadFile(...) with a function signature pasted into the call). This will not compile and needs to be completed or removed (e.g., gate unfinished code behind #if 0 and enable the implemented path).
| static int32_t tsdbForceRepairFileSet(STFileSystem *pFS, STFileSet *pFileSet, TFileOpArray *opArr, bool *hasChange) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
|
|
||
| // TODO: if .head or .data is missing, just delete the data | ||
| code = tsdbForceRepairFileSetBadFiles(pFS); | ||
| if (code) { | ||
| // TODO | ||
| return code; | ||
| } | ||
|
|
||
| // TODO: if deep scan and fix the data, do deep scan and fix | ||
| code = tsdbForceRepairFileSetDeepScanAndFix(pFS); | ||
| if (code) { |
There was a problem hiding this comment.
tsdbForceRepairFileSetBadFiles / tsdbForceRepairFileSetDeepScanAndFix are called with the wrong argument lists here (missing pFileSet, opArr, hasChange, etc.), so this code cannot compile and also can’t apply any edits. Please fix the function calls/signatures consistently and ensure opArr is populated with the intended file operations.
| int32_t tsdbForceRepair(STFileSystem *fs) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
|
|
||
| bool hasChange = false; | ||
| TFileOpArray opArr = {0}; | ||
|
|
||
| // Loop to force repair each file set | ||
| STFileSet *pFileSet = NULL; | ||
| TARRAY2_FOREACH(fs->fSetArr, pFileSet) { | ||
| code = tsdbForceRepairFileSet(fs, pFileSet, &hasChange); | ||
| if (code) { | ||
| tsdbError("vgId:%d %s failed to force repair file set, fid:%d since %s, code:%d", TD_VID(fs->tsdb->pVnode), | ||
| __func__, pFileSet->fid, tstrerror(code), code); | ||
| return code; | ||
| } | ||
| } | ||
|
|
||
| code = tsdbForceRepairCommitChange(fs, &opArr); | ||
| if (code) { |
There was a problem hiding this comment.
In tsdbForceRepair, the loop calls tsdbForceRepairFileSet(fs, pFileSet, &hasChange) but tsdbForceRepairFileSet is declared with four parameters (pFS, pFileSet, opArr, hasChange). This is a compile-time error; please pass the correct arguments and ensure the TFileOpArray opArr is properly initialized (e.g., via the project’s TARRAY2_INIT/append helpers) before using it in tsdbFSEditBegin/commit.
| int32_t tsdbForceRepair(STFileSystem *fs) { | ||
| int32_t code = TSDB_CODE_SUCCESS; | ||
|
|
||
| bool hasChange = false; | ||
| TFileOpArray opArr = {0}; | ||
|
|
||
| // Loop to force repair each file set | ||
| STFileSet *pFileSet = NULL; | ||
| TARRAY2_FOREACH(fs->fSetArr, pFileSet) { | ||
| code = tsdbForceRepairFileSet(fs, pFileSet, &hasChange); | ||
| if (code) { | ||
| tsdbError("vgId:%d %s failed to force repair file set, fid:%d since %s, code:%d", TD_VID(fs->tsdb->pVnode), | ||
| __func__, pFileSet->fid, tstrerror(code), code); | ||
| return code; | ||
| } | ||
| } | ||
|
|
||
| code = tsdbForceRepairCommitChange(fs, &opArr); | ||
| if (code) { | ||
| // TODO: output error log | ||
| return code; | ||
| } | ||
|
|
||
| #if 0 | ||
| int32_t code = tsdbFSDupState(fs); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
|
|
||
| bool changed = false; | ||
| const STFileSet *srcFset = NULL; | ||
| TARRAY2_FOREACH(fs->fSetArr, srcFset) { | ||
| EDmRepairStrategy repairStrategy = DM_REPAIR_STRATEGY_NONE; | ||
| if (!tsdbRepairMatchTargetForFid(TD_VID(fs->tsdb->pVnode), srcFset->fid, &repairStrategy)) { | ||
| continue; | ||
| } | ||
| TAOS_UNUSED(repairStrategy); | ||
|
|
||
| STsdbRepairPlan plan; | ||
| code = tsdbRepairAnalyzeFileSet(fs, srcFset, &plan); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
| if (!plan.affected) { | ||
| continue; | ||
| } | ||
|
|
||
| code = tsdbRepairBackupAffectedFileSet(fs, srcFset, &plan); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
|
|
||
| STFileSet *dstFset = tsdbRepairFindTmpFSet(fs, srcFset->fid); | ||
| if (dstFset == NULL) { | ||
| return TSDB_CODE_FAILED; | ||
| } | ||
|
|
||
| if (plan.dropStt) { | ||
| tsdbRepairDropSttOnTmpFSet(dstFset); | ||
| changed = true; | ||
| } | ||
|
|
||
| if (plan.coreAction == TSDB_REPAIR_CORE_DROP) { | ||
| code = tsdbRepairDropCoreOnTmpFSet(fs, dstFset); | ||
| if (code != 0) return code; | ||
| changed = true; | ||
| } else if (plan.coreAction == TSDB_REPAIR_CORE_REBUILD) { | ||
| code = tsdbRepairRebuildCoreOnTmpFSet(fs, srcFset, dstFset, &plan); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
| changed = true; | ||
| } | ||
| } | ||
|
|
||
| if (!changed) { | ||
| printf("tsdb force repair dispatch: vnode%d\n", TD_VID(fs->tsdb->pVnode)); | ||
| fflush(stdout); | ||
| tsdbMarkForceRepairDone(TD_VID(fs->tsdb->pVnode)); | ||
| return 0; | ||
| } | ||
|
|
||
| code = tsdbRepairCommitStagedCurrent(fs); | ||
| if (code != 0) { | ||
| return code; | ||
| } | ||
|
|
||
| printf("tsdb force repair dispatch: vnode%d\n", TD_VID(fs->tsdb->pVnode)); | ||
| fflush(stdout); | ||
| tsdbMarkForceRepairDone(TD_VID(fs->tsdb->pVnode)); | ||
| #endif | ||
| return code; | ||
| } |
There was a problem hiding this comment.
The current tsdbForceRepair implementation doesn’t print the dispatch marker that the new pytest cases assert on (e.g., "tsdb force repair dispatch"). The only printf/marker emission is in the #if 0 block below, so the tests will fail even if this compiles. Please either re-enable the implemented dispatch/logging path or update the tests to assert on the actual output produced by the enabled repair flow.
| code, output = self._run_taosd_with_cfg( | ||
| self._tsdb_repair_args(vnode_id, repair_fid, extra_args="--log-output /dev/null") | ||
| ) | ||
|
|
There was a problem hiding this comment.
This test runs taosd -r --mode force ... without stopping the existing tdDnodes taosd instance first. Launching a second taosd process against the same data/cfg directories can fail due to locks or (worse) corrupt state. Consider stopping tdDnodes (as done in other repair tests) before invoking the repair-mode taosd, then restarting in finally.
Add deep scanning logic to tsdbRepair.c to detect and fix corrupted data blocks within brin blocks. The new tsdbDeepScanAndFixDataPart function reads brin blocks, validates data blocks, and skips corrupted entries. This enhances data integrity during repair operations by isolating and handling bad data segments without affecting the entire dataset.
- Change default TSDB repair strategy from `shallow_repair` to `drop_invalid_only` - Rename TSDB repair strategies: `shallow_repair` → `drop_invalid_only`, `deep_repair` → `head_only_rebuild` - Add new `full_rebuild` strategy for complete core data reconstruction - Update documentation in both English and Chinese versions with detailed strategy descriptions - Modify internal enum values and code references to reflect new strategy naming - Update command examples to use new strategy names - Add test configurations for new repair functionality The changes provide more granular control over TSDB repair operations with clearer strategy semantics.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ADD_EXECUTABLE(tsdbRepairTest tsdbRepairTest.cpp) | ||
| DEP_ext_gtest(tsdbRepairTest) | ||
| TARGET_LINK_LIBRARIES( | ||
| tsdbRepairTest | ||
| PUBLIC os util common vnode | ||
| ) | ||
|
|
||
| TARGET_INCLUDE_DIRECTORIES( | ||
| tsdbRepairTest | ||
| PUBLIC "${TD_SOURCE_DIR}/include/common" | ||
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/inc" | ||
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/tsdb" | ||
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../inc" | ||
| ) |
There was a problem hiding this comment.
tsdbRepairTest is added unconditionally, but tsdbRepairTest.cpp includes/uses POSIX-only APIs (e.g. sys/syscall.h, mkstemp, SYS_close). This will fail to build on Windows. Suggest guarding the ADD_EXECUTABLE(tsdbRepairTest ...) block with IF(NOT TD_WINDOWS) (similar to tqTest) or providing a Windows-compatible implementation.
| ADD_EXECUTABLE(tsdbRepairTest tsdbRepairTest.cpp) | |
| DEP_ext_gtest(tsdbRepairTest) | |
| TARGET_LINK_LIBRARIES( | |
| tsdbRepairTest | |
| PUBLIC os util common vnode | |
| ) | |
| TARGET_INCLUDE_DIRECTORIES( | |
| tsdbRepairTest | |
| PUBLIC "${TD_SOURCE_DIR}/include/common" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/inc" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/tsdb" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../inc" | |
| ) | |
| IF(NOT TD_WINDOWS) | |
| ADD_EXECUTABLE(tsdbRepairTest tsdbRepairTest.cpp) | |
| DEP_ext_gtest(tsdbRepairTest) | |
| TARGET_LINK_LIBRARIES( | |
| tsdbRepairTest | |
| PUBLIC os util common vnode | |
| ) | |
| TARGET_INCLUDE_DIRECTORIES( | |
| tsdbRepairTest | |
| PUBLIC "${TD_SOURCE_DIR}/include/common" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/inc" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../src/tsdb" | |
| PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../inc" | |
| ) | |
| ENDIF() |
- Add note that default `drop_invalid_only` strategy only handles missing-file damage - Specify that size-mismatch corruption requires explicit deep strategies (`head_only_rebuild` or `full_rebuild`) - Update both English and Chinese documentation consistently - Move repair-related source file from vnode to common directory for better code organization
- Introduce suite groups for metadata, core_e2e, and stt_e2e tests - Add helper methods for running force repair operations and verifying results - Implement test fixtures for core and STT file corruption scenarios - Include assertions for database writability and repair log validation - Refactor existing tests to use new fixture-based approach - Add temporary file handling and improved error recovery mechanisms
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #if 0 | ||
| code = metaBackupCurrentMeta(pVnode); | ||
| if (code != 0) { | ||
| metaError("vgId:%d failed to back up current meta, reason:%s", TD_VID(pVnode), tstrerror(code)); | ||
| return code; | ||
| } | ||
| #endif |
There was a problem hiding this comment.
The metaBackupCurrentMeta function is defined but the call site is wrapped in #if 0 (line 558), so it's dead code. If this is intentional for a future phase, consider adding a comment explaining when it will be enabled; otherwise remove it to avoid confusion.
| #if 0 | |
| code = metaBackupCurrentMeta(pVnode); | |
| if (code != 0) { | |
| metaError("vgId:%d failed to back up current meta, reason:%s", TD_VID(pVnode), tstrerror(code)); | |
| return code; | |
| } | |
| #endif |
| // Open a new meta for organization | ||
| code = metaOpenImpl(pMeta->pVnode, &pNewMeta, VNODE_META_TMP_DIR, false); | ||
| if (code) { | ||
| return code; | ||
| } | ||
|
|
||
| code = metaBegin(pNewMeta, META_BEGIN_HEAP_NIL); | ||
| if (code) { | ||
| return code; | ||
| } | ||
|
|
||
| EMetaRepairStrategy strategy = metaGetRepairStrategy(repairStrategy); | ||
| if (strategy == E_META_REPAIR_FROM_UID) { | ||
| code = metaForceRepairFromUid(pVnode, pMeta, pNewMeta); | ||
| if (code) { | ||
| metaError("vgId:%d, %s failed at %s:%d since %s", TD_VID(pVnode), __func__, __FILE__, __LINE__, tstrerror(code)); | ||
| return code; | ||
| } | ||
| } else if (strategy == E_META_REPAIR_FROM_REDO) { | ||
| code = metaForceRepairFromRedo(pVnode, pMeta, pNewMeta); | ||
| if (code) { | ||
| metaError("vgId:%d, %s failed at %s:%d since %s", TD_VID(pVnode), __func__, __FILE__, __LINE__, tstrerror(code)); | ||
| return code; | ||
| } | ||
| } |
There was a problem hiding this comment.
In metaForceRepair, if metaOpenImpl or metaBegin fails, pNewMeta is leaked (it's opened but never closed on the error path). Similarly, if metaForceRepairFromUid or metaForceRepairFromRedo fails at lines 582-591, pNewMeta is not closed before returning.
| } | ||
|
|
||
| if (code == TSDB_CODE_OPS_NOT_SUPPORT) { | ||
| return 1; |
There was a problem hiding this comment.
dmFinalizeRepairOption returns 1 (line 788) for TSDB_CODE_OPS_NOT_SUPPORT, which is an ad-hoc non-standard error code that gets propagated from dmParseArgs. This breaks the convention where the function otherwise returns TSDB_CODE_* values. The caller in mainWindows checks code != 0 and may misinterpret this as an unrelated error. Consider returning a proper TSDB_CODE_* constant instead.
| return 1; | |
| return TSDB_CODE_OPS_NOT_SUPPORT; |
| def _prepare_stt_fixture(self, total_rows=4000): | ||
| dbname = f"tsdb_repair_stt_fixture_{time.time_ns()}" | ||
| ts0 = 1700000000000 | ||
| table_name = "d0" | ||
|
|
||
| tdSql.execute(f"drop database if exists {dbname}") | ||
| tdSql.execute(f"create database {dbname} vgroups 1 stt_trigger 1 minrows 10 maxrows 200") | ||
| tdSql.execute(f"drop table if exists {dbname}.meters") | ||
| tdSql.execute(f"create table {dbname}.meters (ts timestamp, c1 int, c2 float) tags(t1 int)") | ||
| tdSql.execute(f"create table {dbname}.{table_name} using {dbname}.meters tags(1)") | ||
|
|
||
| sql = f"insert into {dbname}.{table_name} values " | ||
| sql += ",".join(f"({ts0 + i}, 1, 0.1)" for i in range(100)) | ||
| tdSql.execute(sql) | ||
| tdSql.execute(f"flush database {dbname}") | ||
|
|
||
| sql = f"insert into {dbname}.{table_name} values " | ||
| sql += ",".join(f"({ts0 + 99 + i}, 1, 0.1)" for i in range(100)) | ||
| tdSql.execute(sql) | ||
| tdSql.execute(f"flush database {dbname}") | ||
|
|
||
| tdSql.execute(f"insert into {dbname}.{table_name} values({ts0 + 1000}, 2, 1.0)") | ||
| tdSql.execute(f"flush database {dbname}") | ||
| time.sleep(2) | ||
|
|
||
| tdSql.query(f"select count(*) from {dbname}.{table_name}") | ||
| tdSql.checkData(0, 0, 200) | ||
|
|
||
| vnode_id = self._get_vnode_id_for_db(dbname, table_name=table_name) | ||
| stt_path, stt_entries = self._wait_for_stt_file(dbname, vnode_id, timeout_sec=90) | ||
| if stt_path is None or stt_entries <= 0: | ||
| pytest.skip("real stt fixture was not materialized in time") | ||
|
|
||
| fid = self._parse_fid_from_tsdb_path(stt_path) | ||
| tdSql.checkEqual(fid is not None, True) | ||
| return { | ||
| "dbname": dbname, | ||
| "vnode_id": vnode_id, | ||
| "fid": fid, | ||
| "row_count": 200, | ||
| "table_name": table_name, | ||
| "stt_path": stt_path, | ||
| "stt_entries": stt_entries, | ||
| } |
There was a problem hiding this comment.
The _prepare_stt_fixture method accepts a total_rows=4000 parameter but never uses it — the actual row count is hardcoded to 200 (two batches of 100 + 1 extra row, and row_count is returned as 200). The total_rows parameter is misleading and should either be removed or actually used.
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.