taosdata · hzcheng · Mar 6, 2026 · Mar 6, 2026 · Mar 8, 2026 · Mar 9, 2026
@@ -73,3 +73,6 @@ test/screenlog*
 test/output.tmp
 
 CMakeUserPresets.json
+
+.agents/
+skills-lock.json
@@ -98,6 +98,34 @@ restore qnode on dnode <dnode_id>; # Restore qnode on dnode
 - This feature is based on the recovery of existing replication capabilities, not disaster recovery or backup recovery. Therefore, for the mnode and vnode to be recovered, the prerequisite for using this command is that the other two replicas of the mnode or vnode can still function normally.
 - This command cannot repair individual files in the data directory that are damaged or lost. For example, if individual files or data in an mnode or vnode are damaged, it is not possible to recover a specific file or block of data individually. In this case, you can choose to completely clear the data of that mnode/vnode and then perform recovery.
 
+## Local Repair Mode
+
+If the issue is limited to local files on one node and you want TDengine to perform repair checks during startup, you can start `taosd` in local repair mode:
+
+```bash
+taosd -r --mode force --node-type vnode \
+  --repair-target meta:vnode=3
+```
+
+You can also declare multiple repair targets in one startup:
+
+```bash
+taosd -r --mode force --node-type vnode --backup-path /tmp/repair-bak \
+  --repair-target meta:vnode=3 \
+  --repair-target tsdb:vnode=5:fileid=1809 \
+  --repair-target wal:vnode=6
+```
+
+Current limitations:
+
+- Only `--mode force` is supported.
+- Only `--node-type vnode` is supported.
+- `tsdb` repair targets must include `fileid`.
+- `wal` repair targets currently do not support `strategy`.
+- The default TSDB strategy `drop_invalid_only` handles only missing-file damage; recovering from size-mismatch corruption requires an explicit deep strategy such as `head_only_rebuild` or `full_rebuild`.
+
+For the complete CLI grammar, supported keys, default strategies, and more examples, see [taosd Reference](../../tdengine-reference/components/taosd/).
+
 ## Splitting Virtual Groups
 
 When a vgroup is overloaded with CPU or Disk resource usage due to too many subtables, after adding a dnode, you can split the vgroup into two virtual groups using the `split vgroup` command. After the split, the newly created two vgroups will undertake the read and write services originally provided by one vgroup. This command was first released in version 3.0.6.0, and it is recommended to use the latest version whenever possible.
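
For the Local Repair Mode section added in this hunk, a script that launches repair runs needs to assemble the same command line shown in the bash examples. The following is a minimal Python sketch of that assembly. The function name `build_repair_argv` is hypothetical and not part of TDengine; the flags are the ones documented in the hunk.

```python
from typing import List, Optional


def build_repair_argv(targets: List[str], backup_path: Optional[str] = None) -> List[str]:
    """Assemble the argv for one local-repair startup of taosd (hypothetical helper)."""
    if not targets:
        raise ValueError("at least one --repair-target is required")
    # The documentation states that only `--mode force` and
    # `--node-type vnode` are supported at present.
    argv = ["taosd", "-r", "--mode", "force", "--node-type", "vnode"]
    if backup_path is not None:
        # --backup-path applies to the whole startup, not to a single target.
        argv += ["--backup-path", backup_path]
    for target in targets:
        argv += ["--repair-target", target]
    return argv


argv = build_repair_argv(
    ["meta:vnode=3", "tsdb:vnode=5:fileid=1809", "wal:vnode=6"],
    backup_path="/tmp/repair-bak",
)
print(" ".join(argv))
```

Returning a list rather than a shell string lets the caller pass it straight to `subprocess.run` without quoting concerns.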

@@ -15,10 +15,96 @@ The command line parameters for taosd are as follows:
 - -s: Prints SDB information.
 - -C: Prints configuration information.
 - -e: Specifies environment variables, formatted like `-e 'TAOS_FQDN=td1'`.
+- -r: Starts local repair mode. This option must be used together with `--mode force`, `--node-type vnode`, and at least one `--repair-target`.
 - -k: Retrieves the machine code.
 - -dm: Enables memory scheduling.
 - -V: Prints version information.
 
+## Repair Mode
+
+Use `taosd -r` to start local repair mode. In the current phase, repair mode only supports `--mode force` and `--node-type vnode`.
+
+### Syntax
+
+```bash
+taosd -r --mode force --node-type vnode [--backup-path <path>] \
+  --repair-target <target> [--repair-target <target>]...
+```
+
+### Repair Target Grammar
+
+Each `--repair-target` value uses the following grammar:
+
+```text
+<file-type>:<key>=<value>[:<key>=<value>]...
+```
+
+Rules:
+
+- `<file-type>` must be the first segment.
+- Supported file types are `meta`, `tsdb`, and `wal`.
+- Key order is not significant, but examples in this document use a consistent order.
+- Repeating the same key in one target is invalid.
+- Repeating the same repair object across multiple targets is invalid.
+
+### Supported Targets
+
+| File Type | Required Keys | Optional Keys | Default Strategy | Supported Strategies |
+| --- | --- | --- | --- | --- |
+| `meta` | `vnode` | `strategy` | `from_uid` | `from_uid`, `from_redo` |
+| `tsdb` | `vnode`, `fileid` | `strategy` | `drop_invalid_only` | `drop_invalid_only`, `head_only_rebuild`, `full_rebuild` |
+| `wal` | `vnode` | none | none | none |
+
+Additional notes:
+
+- `fileid` is only valid for `tsdb`, and it is required in the current phase.
+- `strategy` is not currently supported for `wal`.
+- `--backup-path` is global for the whole repair startup, not per target.
+- TSDB repair strategies behave as follows:
+  - `drop_invalid_only`: remove only obviously bad missing-file cases, before any deep scan. It does not check for size-mismatch corruption against `current.json`.
+  - `head_only_rebuild`: deep-scan valid core blocks and rebuild `.head` only; keep `.data` unchanged and drop `.sma` if SMA metadata is unusable.
+  - `full_rebuild`: deep-scan valid core blocks and rebuild the full core payload with the existing writer path.
+  - Use `head_only_rebuild` or `full_rebuild` when you need to recover from size-mismatch corruption.
+
+### Limitations
+
+- Only `--mode force` is supported.
+- Only `--node-type vnode` is supported.
+- `taosd -r` is invalid unless `--mode`, `--node-type`, and at least one `--repair-target` are all specified.
+- The older repair parameters `--file-type`, `--vnode-id`, and `--replica-node` have been removed from this interface.
+
+### Examples
+
+Repair meta on one vnode and use the default strategy:
+
+```bash
+taosd -r --mode force --node-type vnode \
+  --repair-target meta:vnode=3
+```
+
+Repair one TSDB file set and use an explicit strategy:
+
+```bash
+taosd -r --mode force --node-type vnode \
+  --repair-target tsdb:vnode=5:fileid=1809:strategy=head_only_rebuild
+```
+
+Repair one TSDB file set and force a full core rebuild:
+
+```bash
+taosd -r --mode force --node-type vnode \
+  --repair-target tsdb:vnode=5:fileid=1809:strategy=full_rebuild
+```
+
+Repair multiple targets in one startup:
+
+```bash
+taosd -r --mode force --node-type vnode --backup-path /tmp/repair-bak \
+  --repair-target meta:vnode=3 \
+  --repair-target tsdb:vnode=5:fileid=1809 \
+  --repair-target wal:vnode=6
+```
+
 ## Configuration Parameters
 
 Configuration parameters are divided into two categories:
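
The Repair Target Grammar rules and the Supported Targets table added in this hunk can be checked mechanically. The sketch below is one possible Python validator for those documented rules. It is not taosd's actual parser, and the names `SPEC`, `parse_repair_target`, and `parse_all` are hypothetical. It also assumes that a "repair object" is identified by file type, `vnode`, and `fileid`, which the documentation does not spell out.

```python
from typing import Dict, List, Tuple

# Mirrors the "Supported Targets" table:
# file type -> (required keys, optional keys, supported strategies).
SPEC = {
    "meta": ({"vnode"}, {"strategy"}, {"from_uid", "from_redo"}),
    "tsdb": ({"vnode", "fileid"}, {"strategy"},
             {"drop_invalid_only", "head_only_rebuild", "full_rebuild"}),
    "wal": ({"vnode"}, set(), set()),
}


def parse_repair_target(value: str) -> Tuple[str, Dict[str, str]]:
    """Parse `<file-type>:<key>=<value>[:<key>=<value>]...` into its parts."""
    # The file type must be the first segment.
    file_type, *segments = value.split(":")
    if file_type not in SPEC:
        raise ValueError(f"unsupported file type: {file_type!r}")
    required, optional, strategies = SPEC[file_type]
    keys: Dict[str, str] = {}
    for segment in segments:
        key, sep, val = segment.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"malformed segment: {segment!r}")
        if key in keys:
            raise ValueError(f"repeated key: {key!r}")
        if key not in required | optional:
            raise ValueError(f"key {key!r} is not valid for {file_type}")
        keys[key] = val
    missing = required - set(keys)
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if "strategy" in keys and keys["strategy"] not in strategies:
        raise ValueError(f"unsupported strategy: {keys['strategy']!r}")
    return file_type, keys


def parse_all(values: List[str]) -> List[Tuple[str, Dict[str, str]]]:
    """Parse several targets and reject a repeated repair object."""
    seen = set()
    parsed = []
    for value in values:
        file_type, keys = parse_repair_target(value)
        # Assumption: identity is (file type, vnode, fileid).
        identity = (file_type, keys["vnode"], keys.get("fileid"))
        if identity in seen:
            raise ValueError(f"repeated repair object: {value!r}")
        seen.add(identity)
        parsed.append((file_type, keys))
    return parsed
```

Because key order is not significant, `tsdb:fileid=1809:vnode=5` and `tsdb:vnode=5:fileid=1809` parse to the same result and count as the same repair object under the assumption above.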

@@ -0,0 +1,203 @@
+# TSDB Force Repair Test Redesign: Design
+
+**Date:** 2026-03-11
+
+**Scope:** `test/cases/80-Components/01-Taosd/test_tsdb_force_repair.py`
+
+## Goal
+
+Redesign TSDB force-repair tests so the suite validates real repair outcomes instead of mostly checking log text or direct `current.json` edits. The first phase should establish layered coverage with real fileset end-to-end recovery for the main repair paths. The second phase can expand that structure into a fuller on-disk corruption matrix.
+
+## Why The Current Tests Are No Longer Enough
+
+The repair implementation in [`tsdbRepair.c`](/Projects/work/TDengine/source/dnode/vnode/src/tsdb/tsdbRepair.c) now exposes materially different behaviors:
+
+- `drop_invalid_only`
+- `head_only_rebuild`
+- `full_rebuild`
+- independent `stt` drop vs rebuild decisions
+- concrete `action` and `reason` reporting in backup logs
+
+The current Python suite still over-relies on:
+
+- synthetic `current.json` injection
+- fake filesets that do not exercise the real writer/reader paths
+- weak success criteria such as "string exists in output"
+
+Those checks are still useful for a few metadata-transaction cases, but they no longer provide enough confidence for the actual repair behavior.
+
+## Design Principles
+
+1. Real fileset end-to-end tests become the default.
+2. Synthetic metadata tests remain only where they are the best way to validate transactional behavior.
+3. Every meaningful repair case must verify service recovery, not just file deletion.
+4. The helper layer must be designed for a second-phase expansion into a larger corruption matrix.
+5. The suite should favor a small number of high-value cases over many weakly asserted cases.
+
+## Two-Phase Strategy
+
+### Phase 1: Layered Redesign
+
+Use a mixed strategy:
+
+- keep a few metadata-oriented tests
+- convert the main core and `stt` cases to real fileset end-to-end scenarios
+- introduce helpers for fixture preparation, corruption injection, repair execution, restart, and post-repair assertions
+
+This phase is the immediate implementation target.
+
+### Phase 2: Full End-To-End Matrix
+
+Build on the Phase 1 helper layer to reduce or eliminate remaining synthetic cases and expand the matrix across:
+
+- corruption type
+- target file kind
+- strategy
+- expected repair action
+- post-repair data availability
+
+Phase 2 is explicitly planned but must not block Phase 1.
+
+## Test Suite Structure
+
+The test file should be reorganized around repair semantics instead of incremental historical additions.
+
+Recommended grouping:
+
+- helper methods shared by the suite
+- metadata and transaction behavior
+- core end-to-end repair
+- `stt` end-to-end repair
+- later: full corruption matrix expansion
+
+The implementation may keep one class for framework compatibility, but helper names and test ordering should clearly reflect these groups.
+
+## Phase 1 Coverage Matrix
+
+### Baseline and Strategy Routing
+
+- healthy fileset repair is a no-op
+- default `drop_invalid_only` does not repair size-mismatch core corruption
+- `head_only_rebuild` and `full_rebuild` both recover size-mismatch core corruption
+
+### Core Repair
+
+- missing `.head` drops the core group and leaves the database restartable
+- missing `.data` drops the core group and leaves the database restartable
+- damaged `.head` content triggers a rebuild or a salvaging drop, depending on which valid blocks survive
+- damaged `.data` content triggers a rebuild or a drop, depending on which valid blocks survive
+
+### STT Repair
+
+- missing `.stt` is removed and the database remains queryable
+- corrupted `stt` data content causes only the affected `stt` file to be rebuilt or dropped
+- tomb or index corruption should be covered when a stable data fixture is available; if not stable enough for Phase 1, the helper API must still reserve the hook for Phase 2
+
+### Audit and Transaction Safety
+
+- a real repair writes a backup manifest and `repair.log`
+- staged `current.c.json` recovery on restart remains covered
+
+## Uniform Success Criteria
+
+Except for explicit metadata-only tests, each repair case should verify four layers:
+
+1. File-level state
+   - target files changed as expected
+   - `current.json` reflects the repaired state
+   - backup manifest and `repair.log` exist when backup is enabled
+2. Process-level recovery
+   - repair command completes without crashing
+   - normal `taosd` startup succeeds afterward
+3. SQL read validation
+   - `count(*)` succeeds
+   - at least one bounded read query succeeds
+4. SQL write validation
+   - new inserts succeed after repair
+   - `flush` succeeds
+   - newly inserted rows are queryable
+
+For destructive repair cases, the row count is allowed to decrease, but it must still satisfy a deliberate range assertion such as `0 <= repaired_rows <= original_rows`.
+
+## Helper Architecture
+
+### Real Fixture Builders
+
+Provide dedicated builders for:
+
+- core-repair fixtures
+- `stt`-repair fixtures
+- later: tomb-heavy fixtures
+
+Each builder should return structured context including:
+
+- `dbname`
+- `vnode_id`
+- `fid`
+- baseline row count
+- resolved file paths
+- backup root when applicable
+
+### Fileset Discovery
+
+Upgrade the current ad hoc path lookups into structured discovery helpers that can resolve a real fileset and report:
+
+- head/data/sma paths
+- `stt` file list
+- current manifest entry
+- file sizes
+- whether tomb data appears to exist
+
+### Corruption Injection
+
+Model corruption types explicitly:
+
+- missing file
+- size mismatch by truncate or extend
+- in-place byte overwrite
+- later: `stt` tomb or index region corruption
+
+Each injector should return a structured record describing what changed so failures are diagnosable.
+
+### Repair Lifecycle
+
+Centralize:
+
+- force-repair command execution
+- controlled `taosd` stop/start
+- readiness checks
+- post-repair SQL assertions
+
+This avoids test-local drift in recovery semantics.
+
+## Existing Test Triage
+
+### Keep And Refactor
+
+- dispatch smoke coverage
+- backup manifest coverage
+- backup log coverage
+- staged manifest crash-safe recovery
+- one or more core rebuild smoke cases
+
+### Replace With Real End-To-End Cases
+
+- synthetic missing core fileset tests
+- synthetic size-mismatch core tests
+- synthetic repair-log action/reason tests tied to fake file groups
+- synthetic `current.json`-only checks for core strategy behavior
+
+## Risks And Constraints
+
+- Real `stt` and tomb fixtures can be timing-sensitive because file materialization is asynchronous.
+- Corrupting the wrong real fileset can produce unstable results, so helpers must select only fully materialized files whose sizes match the manifest before injecting corruption.
+- End-to-end coverage will make the suite slower; that is an accepted tradeoff for this redesign.
+
+## Verification Strategy
+
+Phase 1 completion should verify:
+
+- the updated Python test file is runnable in isolation
+- the selected end-to-end cases pass against a real `taosd`
+- metadata-only cases still validate backup and crash-safe behavior
+- the helper layer is generic enough that Phase 2 can add more corruption modes without another large refactor
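
The Corruption Injection and Uniform Success Criteria sections of the design doc describe injectors that return a structured record, and a range assertion on row counts. A minimal Python sketch of both follows. The names `CorruptionRecord`, `inject_corruption`, and `assert_row_count_in_range` are hypothetical and are not taken from the existing test file. The sketch only manipulates a file on disk; selecting a fully materialized fileset and running `taosd` are out of scope here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorruptionRecord:
    """Describes what an injector changed, so failures are diagnosable."""
    path: str
    kind: str
    size_before: int
    size_after: Optional[int]  # None when the file was removed
    detail: str


def inject_corruption(path: str, kind: str, offset: int = 0,
                      length: int = 16) -> CorruptionRecord:
    """Apply one corruption type to `path` and report what changed."""
    size_before = os.path.getsize(path)
    if kind == "missing":
        os.remove(path)
        return CorruptionRecord(path, kind, size_before, None, "file removed")
    if kind == "truncate":
        # Size mismatch by cutting bytes off the end of the file.
        new_size = max(size_before - length, 0)
        os.truncate(path, new_size)
        detail = f"truncated by {size_before - new_size} bytes"
    elif kind == "extend":
        # Size mismatch by appending zero bytes.
        with open(path, "ab") as f:
            f.write(b"\0" * length)
        detail = f"appended {length} zero bytes"
    elif kind == "overwrite":
        # In-place overwrite; the file size stays the same.
        if offset < 0 or offset + length > size_before:
            raise ValueError("overwrite range falls outside the file")
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(b"\xff" * length)
        detail = f"overwrote {length} bytes at offset {offset}"
    else:
        raise ValueError(f"unknown corruption kind: {kind!r}")
    return CorruptionRecord(path, kind, size_before, os.path.getsize(path), detail)


def assert_row_count_in_range(repaired_rows: int, original_rows: int) -> None:
    """Destructive repairs may lose rows but must never add any."""
    assert 0 <= repaired_rows <= original_rows, (
        f"repaired row count {repaired_rows} outside [0, {original_rows}]"
    )
```

A test could log the returned record next to the repair output, so that a failing case shows exactly which file was changed and how.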