Date: 2026-03-11
Scope: test/cases/80-Components/01-Taosd/test_tsdb_force_repair.py
Redesign TSDB force-repair tests so the suite validates real repair outcomes instead of mostly checking log text or direct current.json edits. The first phase should establish layered coverage with real fileset end-to-end recovery for the main repair paths. The second phase can expand that structure into a fuller on-disk corruption matrix.
The repair implementation in tsdbRepair.c now exposes materially different behaviors:
- drop_invalid_only
- head_only_rebuild
- full_rebuild
- independent stt drop vs rebuild decisions
- concrete action and reason reporting in backup logs
The current Python suite still over-relies on:
- synthetic current.json injection
- fake filesets that do not exercise the real writer/reader paths
- weak success criteria such as "string exists in output"
Those checks are still useful for a few metadata-transaction cases, but they no longer provide enough confidence for the actual repair behavior.
- Real fileset end-to-end tests become the default.
- Synthetic metadata tests remain only where they are the best way to validate transactional behavior.
- Every meaningful repair case must verify service recovery, not just file deletion.
- The helper layer must be designed for a second-phase expansion into a larger corruption matrix.
- The suite should favor a small number of high-value cases over many weakly asserted cases.
Use a mixed strategy:
- keep a few metadata-oriented tests
- convert the main core and stt cases to real fileset end-to-end scenarios
- introduce helpers for fixture preparation, corruption injection, repair execution, restart, and post-repair assertions
This phase is the immediate implementation target.
Build on the Phase 1 helper layer to reduce or eliminate remaining synthetic cases and expand the matrix across:
- corruption type
- target file kind
- strategy
- expected repair action
- post-repair data availability
Phase 2 is explicitly planned but not required to block Phase 1.
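The matrix dimensions listed above could be enumerated along these lines. This is only a sketch: the strategy names come from this document, while the corruption and file-kind labels, the `build_matrix` name, and the dict layout are hypothetical placeholders. Expected actions are deliberately left unset because they must be confirmed per case.

```python
from itertools import product

# Hypothetical labels; only the strategy names are taken from this document.
CORRUPTIONS = ("missing", "size_mismatch", "byte_overwrite")
FILE_KINDS = ("head", "data", "stt")
STRATEGIES = ("drop_invalid_only", "head_only_rebuild", "full_rebuild")


def build_matrix():
    """Return one case dict per (corruption, file kind, strategy) combination."""
    return [
        {
            "corruption": corruption,
            "file_kind": kind,
            "strategy": strategy,
            # Left unset on purpose: fill in once the behavior is confirmed.
            "expected_action": None,
            "expect_data_available": None,
        }
        for corruption, kind, strategy in product(CORRUPTIONS, FILE_KINDS, STRATEGIES)
    ]
```

Keeping the matrix as plain data lets Phase 2 prune combinations that do not apply to a given file kind without touching the test bodies.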
The test file should be reorganized around repair semantics instead of incremental historical additions.
Recommended grouping:
- helper methods shared by the suite
- metadata and transaction behavior
- core end-to-end repair
- stt end-to-end repair
- later: full corruption matrix expansion
The implementation may keep one class for framework compatibility, but helper names and test ordering should clearly reflect these groups.
- healthy fileset repair is a no-op
- default drop_invalid_only does not repair size-mismatch core corruption
- head_only_rebuild and full_rebuild both recover size-mismatch core corruption
- missing .head drops the core group and leaves the database restartable
- missing .data drops the core group and leaves the database restartable
- damaged .head content triggers rebuild or salvageable drop behavior according to surviving valid blocks
- damaged .data content triggers rebuild or drop according to surviving valid blocks
- missing .stt is removed and the database remains queryable
- corrupted stt data content causes only the affected stt file to be rebuilt or dropped
- tomb or index corruption should be covered when a stable data fixture is available; if not stable enough for Phase 1, the helper API must still reserve the hook for Phase 2
- real repair writes backup manifest and repair.log
- staged current.c.json recovery on restart remains covered
Except for explicit metadata-only tests, each repair case should verify four layers:
- File-level state
  - target files changed as expected
  - current.json reflects the repaired state
  - backup manifest and repair.log exist when backup is enabled
- Process-level recovery
  - repair command completes without crashing
  - normal taosd startup succeeds afterward
- SQL read validation
  - count(*) succeeds
  - at least one bounded read query succeeds
- SQL write validation
  - new inserts succeed after repair
  - flush succeeds
  - newly inserted rows are queryable
For destructive repair cases the row count is allowed to decrease, but it must still satisfy a deliberate range assertion such as 0 <= repaired_rows <= original_rows.
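The range assertion can live in one shared helper so that every case states its intent explicitly. The sketch below uses a hypothetical name, `check_repaired_row_count`, and assumes that non-destructive cases must keep the row count unchanged; this document only states the rule for destructive cases.

```python
def check_repaired_row_count(original_rows, repaired_rows, destructive):
    """Raise AssertionError if the post-repair row count is out of bounds."""
    if destructive:
        # Rows may be lost, but never invented.
        assert 0 <= repaired_rows <= original_rows, (
            f"repaired_rows={repaired_rows} outside [0, {original_rows}]"
        )
    else:
        # Assumption: a non-destructive repair preserves every row.
        assert repaired_rows == original_rows, (
            f"expected {original_rows} rows, got {repaired_rows}"
        )
```

Passing `destructive` explicitly forces each test to declare whether data loss is acceptable, rather than silently tolerating it.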
Provide dedicated builders for:
- core-repair fixtures
- stt-repair fixtures
- later: tomb-heavy fixtures
Each builder should return structured context including:
- dbname
- vnode_id
- fid
- baseline row count
- resolved file paths
- backup root when applicable
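One possible shape for that structured context is a small frozen dataclass. The class name `RepairFixture` and the field names `baseline_rows`, `paths`, and `backup_root` are illustrative assumptions; only dbname, vnode_id, and fid are named in this document.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class RepairFixture:
    """Context returned by a fixture builder (hypothetical layout)."""

    dbname: str
    vnode_id: int
    fid: int
    baseline_rows: int
    # Resolved file paths keyed by kind, for example "head" or "data".
    paths: Dict[str, Path] = field(default_factory=dict)
    # Set only when the case runs with backup enabled.
    backup_root: Optional[Path] = None
```

A frozen dataclass keeps a test from rebinding the baseline fields after the fixture has been built.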
Upgrade the current ad hoc path lookups into structured discovery helpers that can resolve a real fileset and report:
- head/data/sma paths
- stt file list
- current manifest entry
- file sizes
- whether tomb data appears to exist
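A discovery helper along these lines could cover the path and size parts of that report. It is a sketch under stated assumptions: files of one fileset are taken to share a name token (`fid_token`, hypothetical) and to use the .head, .data, .sma, and .stt extensions. Reading the manifest entry and detecting tomb data are left out because they depend on details not given here.

```python
from pathlib import Path

CORE_KINDS = ("head", "data", "sma")


def discover_fileset(directory, fid_token):
    """Group files whose name contains `fid_token` by extension."""
    report = {"core": {}, "stt": [], "sizes": {}}
    for path in sorted(Path(directory).iterdir()):
        # Substring match is a simplification; a real helper needs exact parsing.
        if not path.is_file() or fid_token not in path.name:
            continue
        kind = path.suffix.lstrip(".")
        if kind in CORE_KINDS:
            report["core"][kind] = path
        elif kind == "stt":
            report["stt"].append(path)
        else:
            continue  # Ignore unrelated files in the same directory.
        report["sizes"][path.name] = path.stat().st_size
    return report
```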
Model corruption types explicitly:
- missing file
- size mismatch by truncate or extend
- in-place byte overwrite
- later: stt tomb or index region corruption
Each injector should return a structured record describing what changed so failures are diagnosable.
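The three Phase 1 corruption types map onto plain file operations. The sketch below is one way to write them so that each returns a record of what changed; `InjectionRecord` and the `inject_*` function names are hypothetical.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class InjectionRecord:
    kind: str
    path: Path
    size_before: int
    size_after: int
    detail: str = ""


def inject_missing(path):
    """Delete the file and record its previous size."""
    path = Path(path)
    before = path.stat().st_size
    path.unlink()
    return InjectionRecord("missing", path, before, 0)


def inject_size_mismatch(path, delta):
    """Truncate (delta < 0) or extend (delta > 0) the file by `delta` bytes."""
    if delta == 0:
        raise ValueError("delta must be non-zero to create a mismatch")
    path = Path(path)
    before = path.stat().st_size
    after = max(0, before + delta)
    os.truncate(path, after)
    return InjectionRecord("size_mismatch", path, before, after, f"delta={delta}")


def inject_overwrite(path, offset, payload):
    """Overwrite bytes in place without changing the file size."""
    path = Path(path)
    before = path.stat().st_size
    if offset < 0 or offset + len(payload) > before:
        raise ValueError("overwrite must stay inside the existing file")
    with open(path, "r+b") as fh:
        fh.seek(offset)
        fh.write(payload)
    detail = f"offset={offset} len={len(payload)}"
    return InjectionRecord("byte_overwrite", path, before, before, detail)
```

Including the returned record in an assertion message shows which file, offset, and size change produced a failure.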
Centralize:
- force-repair command execution
- controlled taosd stop/start
- readiness checks
- post-repair SQL assertions
This avoids test-local drift in recovery semantics.
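For the readiness checks, a single polling helper keeps timeout behavior identical across cases. The sketch below is generic: `wait_until` is a hypothetical name, and the predicate (for example, a trivial query against the restarted server) is supplied by the caller.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if predicate():
                return True
        except Exception:
            # Treat errors (for example a refused connection) as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Returning a boolean instead of raising lets the caller attach case-specific context, such as the injection record, to the failure message.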
- dispatch smoke coverage
- backup manifest coverage
- backup log coverage
- staged manifest crash-safe recovery
- one or more core rebuild smoke cases
- synthetic missing core fileset tests
- synthetic size-mismatch core tests
- synthetic repair-log action/reason tests tied to fake file groups
- synthetic current.json-only checks for core strategy behavior
- Real stt and tomb fixtures can be timing-sensitive because file materialization is asynchronous.
- Corrupting the wrong real fileset can produce unstable results, so helper selection must only target fully materialized files with manifest-consistent size before injection.
- End-to-end coverage will make the suite slower; that is an accepted tradeoff for this redesign.
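The "fully materialized, manifest-consistent" guard mentioned above could be as small as the following sketch. The function name is hypothetical, and `manifest_size` is assumed to have been read from current.json by the caller; the manifest schema itself is not assumed here.

```python
import os


def is_safe_injection_target(path, manifest_size):
    """True only if `path` exists, is non-empty, and matches the manifest size."""
    try:
        actual = os.stat(path).st_size
    except FileNotFoundError:
        return False
    return actual > 0 and actual == manifest_size
```

Combining this check with a polling loop gives a fixture builder a deterministic point at which injection is safe.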
Phase 1 completion should verify:
- the updated Python test file is runnable in isolation
- the selected end-to-end cases pass against a real taosd
- metadata-only cases still validate backup and crash-safe behavior
- the helper layer is generic enough that Phase 2 can add more corruption modes without another large refactor