Closed
20 commits
caf0dee
feat(taosd): implement phase1 repair CLI parsing and validation
hzcheng Mar 6, 2026
555e6d6
feat(repair): add meta force repair flow and coverage
hzcheng Mar 6, 2026
2ac1025
feat(repair): add tsdb force repair flow and coverage
hzcheng Mar 8, 2026
03eae54
feat(repair): refactor repair option structure and remove deprecated …
hzcheng Mar 9, 2026
c7b7f49
feat(meta): refactor meta repair to separate strategy functions
hzcheng Mar 9, 2026
9006f50
feat(repair): redesign CLI for multi-target data repair
hzcheng Mar 9, 2026
47ef776
refactor(dmRepair): replace generic target array with type-specific a…
hzcheng Mar 9, 2026
0277b27
feat(dnode): refactor repair options to support multiple node types
hzcheng Mar 10, 2026
c69c223
feat: remove dnode from node-type option and clean up generateNewMeta…
hzcheng Mar 10, 2026
14cca94
feat(meta): refactor meta repair logic and remove unused static variable
hzcheng Mar 10, 2026
03f767f
feat(meta): temporarily disable meta backup during forced repair
hzcheng Mar 10, 2026
a8080fe
feat(wal): add dynamic corruption handling with dmRepair integration
hzcheng Mar 10, 2026
bdc47aa
remove useless files
hzcheng Mar 10, 2026
452be5f
feat(repair): add vnode type and force mode detection functions
hzcheng Mar 10, 2026
a434412
feat(tsdb): add force repair functionality for file system integrity
hzcheng Mar 10, 2026
fdb9965
feat(tsdb): implement deep scan and fix for data file repair
hzcheng Mar 10, 2026
44fd1e3
docs: update TSDB repair strategies and documentation
hzcheng Mar 11, 2026
88df66f
docs: clarify TSDB repair strategy behavior and limitations
hzcheng Mar 11, 2026
25e38ba
docs: plan tsdb force repair test redesign
hzcheng Mar 11, 2026
0480ebc
feat(test): add comprehensive force repair test suite for TSDB
hzcheng Mar 12, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -73,3 +73,6 @@ test/screenlog*
test/output.tmp

CMakeUserPresets.json

.agents/
skills-lock.json
28 changes: 28 additions & 0 deletions docs/en/08-operation/04-maintenance.md
@@ -98,6 +98,34 @@ restore qnode on dnode <dnode_id>; # Restore qnode on dnode
- This feature is based on the recovery of existing replication capabilities, not disaster recovery or backup recovery. Therefore, for the mnode and vnode to be recovered, the prerequisite for using this command is that the other two replicas of the mnode or vnode can still function normally.
- This command cannot repair individual files in the data directory that are damaged or lost. For example, if individual files or data in an mnode or vnode are damaged, it is not possible to recover a specific file or block of data individually. In this case, you can choose to completely clear the data of that mnode/vnode and then perform recovery.

## Local Repair Mode

If the issue is limited to local files on one node and you want TDengine to perform repair checks during startup, you can start `taosd` in local repair mode:

```bash
taosd -r --mode force --node-type vnode \
--repair-target meta:vnode=3
```

You can also declare multiple repair targets in one startup:

```bash
taosd -r --mode force --node-type vnode --backup-path /tmp/repair-bak \
--repair-target meta:vnode=3 \
--repair-target tsdb:vnode=5:fileid=1809 \
--repair-target wal:vnode=6
```

Current limitations:

- Only `--mode force` is supported.
- Only `--node-type vnode` is supported.
- `tsdb` repair targets must include `fileid`.
- `wal` repair targets currently do not support `strategy`.
- The default TSDB strategy `drop_invalid_only` only handles missing-file style damage; size-mismatch recovery requires an explicit deep strategy such as `head_only_rebuild` or `full_rebuild`.

For the complete CLI grammar, supported keys, default strategies, and more examples, see [taosd Reference](../../tdengine-reference/components/taosd/).

## Splitting Virtual Groups

When a vgroup is overloaded in CPU or disk usage because it holds too many subtables, you can add a dnode and then split the vgroup into two virtual groups using the `split vgroup` command. After the split, the two new vgroups take over the read and write services previously provided by the original vgroup. This command was first released in version 3.0.6.0; using the latest version is recommended whenever possible.
86 changes: 86 additions & 0 deletions docs/en/14-reference/01-components/01-taosd.md
@@ -15,10 +15,96 @@ The command line parameters for taosd are as follows:
- -s: Prints SDB information.
- -C: Prints configuration information.
- -e: Specifies environment variables, formatted like `-e 'TAOS_FQDN=td1'`.
- -r: Starts local repair mode. This option must be used together with `--mode force`, `--node-type vnode`, and at least one `--repair-target`.
- -k: Retrieves the machine code.
- -dm: Enables memory scheduling.
- -V: Prints version information.

## Repair Mode

Use `taosd -r` to start local repair mode. In the current phase, repair mode only supports `--mode force` and `--node-type vnode`.

### Syntax

```bash
taosd -r --mode force --node-type vnode [--backup-path <path>] \
--repair-target <target> [--repair-target <target>]...
```

### Repair Target Grammar

Each `--repair-target` value uses the following grammar:

```text
<file-type>:<key>=<value>[:<key>=<value>]...
```

Rules:

- `<file-type>` must be the first segment.
- Supported file types are `meta`, `tsdb`, and `wal`.
- Key order is not significant, but examples in this document use a consistent order.
- Repeating the same key in one target is invalid.
- Repeating the same repair object across multiple targets is invalid.
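
The grammar and the per-target rules above can be sketched as a small parser. This is an illustrative Python sketch only, not the parser that `taosd` uses; the function name `parse_repair_target` and the error messages are assumptions made for this example. It covers the single-target rules (file type first, supported file types, no repeated keys) and leaves cross-target duplicate detection out.

```python
from typing import Dict, Tuple

# File types listed in the rules above.
SUPPORTED_FILE_TYPES = {"meta", "tsdb", "wal"}


def parse_repair_target(value: str) -> Tuple[str, Dict[str, str]]:
    """Split '<file-type>:<key>=<value>[:<key>=<value>]...' into its parts.

    Hypothetical helper for illustration; taosd's real parser may differ.
    """
    segments = value.split(":")
    file_type = segments[0]
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    if len(segments) < 2:
        raise ValueError("a repair target needs at least one key=value pair")

    keys: Dict[str, str] = {}
    for segment in segments[1:]:
        key, sep, val = segment.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"malformed segment: {segment!r}")
        if key in keys:
            # Repeating the same key in one target is invalid.
            raise ValueError(f"duplicate key: {key!r}")
        keys[key] = val
    return file_type, keys
```

Because the keys are collected into a dictionary, key order does not affect the result, which matches the rule that key order is not significant.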

### Supported Targets

| File Type | Required Keys | Optional Keys | Default Strategy | Supported Strategies |
| --- | --- | --- | --- | --- |
| `meta` | `vnode` | `strategy` | `from_uid` | `from_uid`, `from_redo` |
| `tsdb` | `vnode`, `fileid` | `strategy` | `drop_invalid_only` | `drop_invalid_only`, `head_only_rebuild`, `full_rebuild` |
| `wal` | `vnode` | none | none | none |

Additional notes:

- `fileid` is only valid for `tsdb`, and it is required in the current phase.
- `strategy` is not currently supported for `wal`.
- `--backup-path` is global for the whole repair startup, not per target.
- TSDB repair strategies behave as follows:
  - `drop_invalid_only`: only remove obviously bad missing-file cases before any deep scan. It does not inspect size-mismatch corruption against `current.json`.
  - `head_only_rebuild`: deep-scan valid core blocks and rebuild `.head` only; keep `.data` unchanged and drop `.sma` if SMA metadata is unusable.
  - `full_rebuild`: deep-scan valid core blocks and rebuild the full core payload with the existing writer path.
- Use `head_only_rebuild` or `full_rebuild` when you need recovery behavior for size-mismatch corruption.
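
The table and notes above can be read as a lookup from file type to its allowed keys and strategies. The following Python sketch encodes the table as data and resolves the effective strategy for an already-parsed target. The names `TARGET_SPECS` and `resolve_strategy` are assumptions made for this example and do not reflect `taosd` internals.

```python
from typing import Dict, Optional

# Mirrors the "Supported Targets" table: required keys, optional keys,
# default strategy, and accepted strategies per file type.
TARGET_SPECS = {
    "meta": {
        "required": {"vnode"},
        "optional": {"strategy"},
        "default": "from_uid",
        "strategies": {"from_uid", "from_redo"},
    },
    "tsdb": {
        "required": {"vnode", "fileid"},
        "optional": {"strategy"},
        "default": "drop_invalid_only",
        "strategies": {"drop_invalid_only", "head_only_rebuild", "full_rebuild"},
    },
    "wal": {
        "required": {"vnode"},
        "optional": set(),
        "default": None,
        "strategies": set(),
    },
}


def resolve_strategy(file_type: str, keys: Dict[str, str]) -> Optional[str]:
    """Validate keys against the table and return the effective strategy.

    Hypothetical helper; returns None for file types without strategies.
    """
    spec = TARGET_SPECS[file_type]
    present = set(keys)
    missing = spec["required"] - present
    if missing:
        raise ValueError(f"{file_type}: missing required keys {sorted(missing)}")
    unknown = present - spec["required"] - spec["optional"]
    if unknown:
        raise ValueError(f"{file_type}: unsupported keys {sorted(unknown)}")
    strategy = keys.get("strategy", spec["default"])
    if strategy is not None and strategy not in spec["strategies"]:
        raise ValueError(f"{file_type}: unsupported strategy {strategy!r}")
    return strategy
```

With this layout, a `wal` target that carries `strategy` is rejected as an unsupported key, and a `tsdb` target without `fileid` is rejected as incomplete, matching the notes above.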

### Limitations

- Only `--mode force` is supported.
- Only `--node-type vnode` is supported.
- `taosd -r` is invalid unless `--mode`, `--node-type`, and at least one `--repair-target` are all provided.
- The older repair parameters `--file-type`, `--vnode-id`, and `--replica-node` have been removed from this interface.

### Examples

Repair meta on one vnode and use the default strategy:

```bash
taosd -r --mode force --node-type vnode \
--repair-target meta:vnode=3
```

Repair one TSDB file set and use an explicit strategy:

```bash
taosd -r --mode force --node-type vnode \
--repair-target tsdb:vnode=5:fileid=1809:strategy=head_only_rebuild
```

Repair one TSDB file set and force a full core rebuild:

```bash
taosd -r --mode force --node-type vnode \
--repair-target tsdb:vnode=5:fileid=1809:strategy=full_rebuild
```

Repair multiple targets in one startup:

```bash
taosd -r --mode force --node-type vnode --backup-path /tmp/repair-bak \
--repair-target meta:vnode=3 \
--repair-target tsdb:vnode=5:fileid=1809 \
--repair-target wal:vnode=6
```
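
When repair commands are generated by a script, the same command line can be assembled programmatically. The Python sketch below only builds the argument list shown in the examples above and does not launch `taosd`. The function name `build_repair_argv` is an assumption made for this example.

```python
from typing import List, Optional, Sequence


def build_repair_argv(targets: Sequence[str],
                      backup_path: Optional[str] = None) -> List[str]:
    """Assemble a `taosd -r` argument list (hypothetical helper).

    Uses only the options documented above; --mode and --node-type are
    fixed because force/vnode are the only supported values.
    """
    if not targets:
        raise ValueError("at least one --repair-target is required")
    argv = ["taosd", "-r", "--mode", "force", "--node-type", "vnode"]
    if backup_path is not None:
        # --backup-path is global for the whole startup, so it appears once.
        argv += ["--backup-path", backup_path]
    for target in targets:
        argv += ["--repair-target", target]
    return argv
```

The returned list can be passed to `subprocess.run` as is, or rendered for display with `shlex.join`.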

## Configuration Parameters

Configuration parameters are divided into two categories:
203 changes: 203 additions & 0 deletions docs/plans/2026-03-11-tsdb-force-repair-test-redesign-design.md
@@ -0,0 +1,203 @@
# TSDB Force Repair Test Redesign Design

**Date:** 2026-03-11

**Scope:** `test/cases/80-Components/01-Taosd/test_tsdb_force_repair.py`

## Goal

Redesign TSDB force-repair tests so the suite validates real repair outcomes instead of mostly checking log text or direct `current.json` edits. The first phase should establish layered coverage with real fileset end-to-end recovery for the main repair paths. The second phase can expand that structure into a fuller on-disk corruption matrix.

## Why The Current Tests Are No Longer Enough

The repair implementation in [`tsdbRepair.c`](/Projects/work/TDengine/source/dnode/vnode/src/tsdb/tsdbRepair.c) now exposes materially different behaviors:

- `drop_invalid_only`
- `head_only_rebuild`
- `full_rebuild`
- independent `stt` drop vs rebuild decisions
- concrete `action` and `reason` reporting in backup logs

The current Python suite still over-relies on:

- synthetic `current.json` injection
- fake filesets that do not exercise the real writer/reader paths
- weak success criteria such as "string exists in output"

Those checks are still useful for a few metadata-transaction cases, but they no longer provide enough confidence for the actual repair behavior.

## Design Principles

1. Real fileset end-to-end tests become the default.
2. Synthetic metadata tests remain only where they are the best way to validate transactional behavior.
3. Every meaningful repair case must verify service recovery, not just file deletion.
4. The helper layer must be designed for a second-phase expansion into a larger corruption matrix.
5. The suite should favor a small number of high-value cases over many weakly asserted cases.

## Two-Phase Strategy

### Phase 1: Layered Redesign

Use a mixed strategy:

- keep a few metadata-oriented tests
- convert the main core and `stt` cases to real fileset end-to-end scenarios
- introduce helpers for fixture preparation, corruption injection, repair execution, restart, and post-repair assertions

This phase is the immediate implementation target.

### Phase 2: Full End-To-End Matrix

Build on the Phase 1 helper layer to reduce or eliminate remaining synthetic cases and expand the matrix across:

- corruption type
- target file kind
- strategy
- expected repair action
- post-repair data availability

Phase 2 is explicitly planned but not required to block Phase 1.
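
As a rough illustration of how the Phase 2 matrix could be enumerated, the sketch below takes the cross product of three of the listed dimensions. The concrete value lists and the name `build_matrix` are assumptions for this example. The expected repair action and post-repair data availability are outcomes rather than inputs, so this sketch leaves them to be filled in per case.

```python
from itertools import product
from typing import Dict, List

# Illustrative dimension values; the real suite may choose different ones.
CORRUPTIONS = ["missing_file", "size_mismatch", "byte_overwrite"]
FILE_KINDS = ["head", "data", "stt"]
STRATEGIES = ["drop_invalid_only", "head_only_rebuild", "full_rebuild"]


def build_matrix() -> List[Dict[str, str]]:
    """Enumerate every (corruption, file kind, strategy) combination."""
    return [
        {"corruption": corruption, "file_kind": file_kind, "strategy": strategy}
        for corruption, file_kind, strategy in product(
            CORRUPTIONS, FILE_KINDS, STRATEGIES
        )
    ]
```

A list of dictionaries like this could feed a parametrized test, with each entry extended by the expected action and data-availability assertion for that combination.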

## Test Suite Structure

The test file should be reorganized around repair semantics instead of incremental historical additions.

Recommended grouping:

- helper methods shared by the suite
- metadata and transaction behavior
- core end-to-end repair
- `stt` end-to-end repair
- later: full corruption matrix expansion

The implementation may keep one class for framework compatibility, but helper names and test ordering should clearly reflect these groups.

## Phase 1 Coverage Matrix

### Baseline and Strategy Routing

- healthy fileset repair is a no-op
- default `drop_invalid_only` does not repair size-mismatch core corruption
- `head_only_rebuild` and `full_rebuild` both recover size-mismatch core corruption

### Core Repair

- missing `.head` drops the core group and leaves the database restartable
- missing `.data` drops the core group and leaves the database restartable
- damaged `.head` content triggers rebuild or drop according to surviving valid blocks
- damaged `.data` content triggers rebuild or drop according to surviving valid blocks

### STT Repair

- missing `.stt` is removed and the database remains queryable
- corrupted `stt` data content causes only the affected `stt` file to be rebuilt or dropped
- tomb or index corruption should be covered when a stable data fixture is available; if not stable enough for Phase 1, the helper API must still reserve the hook for Phase 2

### Audit and Transaction Safety

- real repair writes backup manifest and `repair.log`
- staged `current.c.json` recovery on restart remains covered

## Uniform Success Criteria

Except for explicit metadata-only tests, each repair case should verify four layers:

1. File-level state
   - target files changed as expected
   - `current.json` reflects the repaired state
   - backup manifest and `repair.log` exist when backup is enabled
2. Process-level recovery
   - repair command completes without crashing
   - normal `taosd` startup succeeds afterward
3. SQL read validation
   - `count(*)` succeeds
   - at least one bounded read query succeeds
4. SQL write validation
   - new inserts succeed after repair
   - `flush` succeeds
   - newly inserted rows are queryable

For destructive repair cases the row count is allowed to decrease, but it must still satisfy a deliberate range assertion such as `0 <= repaired_rows <= original_rows`.
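
The range assertion could be wrapped in a small helper so every case states its expectation explicitly. This is a minimal sketch; the name `assert_repaired_row_count` and the `destructive` flag are assumptions for this example, as is the choice to require an unchanged count for non-destructive cases.

```python
def assert_repaired_row_count(original_rows: int,
                              repaired_rows: int,
                              destructive: bool) -> None:
    """Check the post-repair row count against the pre-repair baseline.

    Hypothetical helper: destructive repairs may lose rows but must stay
    within [0, original_rows]; other repairs are assumed to keep all rows.
    """
    if destructive:
        assert 0 <= repaired_rows <= original_rows, (
            f"row count {repaired_rows} outside [0, {original_rows}]"
        )
    else:
        assert repaired_rows == original_rows, (
            f"expected {original_rows} rows, found {repaired_rows}"
        )
```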

## Helper Architecture

### Real Fixture Builders

Provide dedicated builders for:

- core-repair fixtures
- `stt`-repair fixtures
- later: tomb-heavy fixtures

Each builder should return structured context including:

- `dbname`
- `vnode_id`
- `fid`
- baseline row count
- resolved file paths
- backup root when applicable
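
One possible shape for that structured context is a small dataclass. The class name `RepairFixture`, the field names beyond those listed above, and the `repair_target` convenience method are assumptions for this sketch, not part of the existing test file.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class RepairFixture:
    """Context returned by a fixture builder (hypothetical layout)."""

    dbname: str
    vnode_id: int
    fid: int
    baseline_rows: int
    # Resolved file paths keyed by kind, for example "head" or "data".
    file_paths: Dict[str, Path] = field(default_factory=dict)
    backup_root: Optional[Path] = None

    def repair_target(self, strategy: Optional[str] = None) -> str:
        """Render the tsdb --repair-target value for this fixture."""
        target = f"tsdb:vnode={self.vnode_id}:fileid={self.fid}"
        if strategy is not None:
            target += f":strategy={strategy}"
        return target
```

Keeping the context immutable makes it harder for one test step to silently change the baseline that later assertions compare against.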

### Fileset Discovery

Upgrade the current ad hoc path lookups into structured discovery helpers that can resolve a real fileset and report:

- head/data/sma paths
- `stt` file list
- current manifest entry
- file sizes
- whether tomb data appears to exist
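
A discovery helper along these lines might classify candidate files by suffix and record their sizes. The sketch below assumes that the file kind can be read from the suffix (`.head`, `.data`, `.sma`, `.stt`), which is a simplification for illustration; it does not read the manifest entry or detect tomb data. The name `describe_fileset` is hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def describe_fileset(paths: Iterable[Path]) -> Dict[str, object]:
    """Group candidate files by kind and record their on-disk sizes.

    Hypothetical helper; assumes the suffix identifies the file kind.
    """
    report: Dict[str, object] = {
        "head": None, "data": None, "sma": None, "stt": [], "sizes": {},
    }
    for path in sorted(paths):
        kind = path.suffix.lstrip(".")
        if kind == "stt":
            report["stt"].append(path)
        elif kind in ("head", "data", "sma"):
            report[kind] = path
        else:
            continue  # Ignore files that are not part of the fileset.
        report["sizes"][path] = path.stat().st_size
    return report
```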

### Corruption Injection

Model corruption types explicitly:

- missing file
- size mismatch by truncate or extend
- in-place byte overwrite
- later: `stt` tomb or index region corruption

Each injector should return a structured record describing what changed so failures are diagnosable.
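
The first three corruption types can be sketched with standard-library file operations. The names `CorruptionRecord`, `remove_file`, `resize_file`, and `overwrite_bytes` are assumptions for this example; the real helpers may record more detail.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CorruptionRecord:
    """What an injector changed, kept for failure diagnosis."""

    path: Path
    kind: str
    size_before: int
    size_after: Optional[int]  # None when the file no longer exists.


def remove_file(path: Path) -> CorruptionRecord:
    """Missing-file corruption: delete the file."""
    size = path.stat().st_size
    path.unlink()
    return CorruptionRecord(path, "missing_file", size, None)


def resize_file(path: Path, delta: int) -> CorruptionRecord:
    """Size-mismatch corruption: truncate (delta < 0) or extend (delta > 0)."""
    size = path.stat().st_size
    new_size = max(0, size + delta)
    os.truncate(path, new_size)
    return CorruptionRecord(path, "size_mismatch", size, new_size)


def overwrite_bytes(path: Path, offset: int, payload: bytes) -> CorruptionRecord:
    """In-place overwrite that leaves the file size unchanged."""
    size = path.stat().st_size
    if offset < 0 or offset + len(payload) > size:
        raise ValueError("overwrite must stay inside the existing file")
    with open(path, "r+b") as handle:
        handle.seek(offset)
        handle.write(payload)
    return CorruptionRecord(path, "byte_overwrite", size, size)
```

Rejecting overwrites that run past the end of the file keeps the byte-overwrite case distinct from the size-mismatch case, so a failing test points at one corruption type only.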

### Repair Lifecycle

Centralize:

- force-repair command execution
- controlled `taosd` stop/start
- readiness checks
- post-repair SQL assertions

This avoids test-local drift in recovery semantics.
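
Readiness checks are one piece of this lifecycle that benefits from a single shared implementation. The sketch below is a generic polling helper; the name `wait_until` and its default timings are assumptions, and the actual readiness probe (for example, a successful connection attempt) is supplied by the caller.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. `check` is always
    called at least once, even when `timeout` is zero.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic` keeps the deadline stable even if the system clock is adjusted while the test is waiting.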

## Existing Test Triage

### Keep And Refactor

- dispatch smoke coverage
- backup manifest coverage
- backup log coverage
- staged manifest crash-safe recovery
- one or more core rebuild smoke cases

### Replace With Real End-To-End Cases

- synthetic missing core fileset tests
- synthetic size-mismatch core tests
- synthetic repair-log action/reason tests tied to fake file groups
- synthetic `current.json` only checks for core strategy behavior

## Risks And Constraints

- Real `stt` and tomb fixtures can be timing-sensitive because file materialization is asynchronous.
- Corrupting the wrong real fileset can produce unstable results, so helper selection must only target fully materialized files with manifest-consistent size before injection.
- End-to-end coverage will make the suite slower; that is an accepted tradeoff for this redesign.

## Verification Strategy

Phase 1 completion should verify:

- the updated Python test file is runnable in isolation
- the selected end-to-end cases pass against a real `taosd`
- metadata-only cases still validate backup and crash-safe behavior
- the helper layer is generic enough that Phase 2 can add more corruption modes without another large refactor