Labels: enhancement (New feature or request)
Motivation
v0.5.x introduced major architectural changes (LTX format, compaction, VFS). We've been moving fast on features and have good test coverage, but community feedback shows regressions still slip through under real-world production load.
The existing test infrastructure (soak, chaos, fuzz, integration) is strong — the gap is in how often it runs, what it asserts, and how realistically it simulates production conditions.
Proposal: Pause feature work and focus on hardening releases through better CI, more realistic load tests, and release gating.
1. Nightly Stability CI
- Create a `nightly-stability.yml` workflow (runs daily on `main`, plus manual `workflow_dispatch`)
- Race-detector sweep (`go test -race -count=3`)
- Comprehensive soak test (short mode daily, full 2h run on weekends)
- MinIO soak test (S3-compatible backend)
- VFS chaos test
- Upload artifacts and notify on failure
2. PR CI Gate Improvements
- Re-enable the commented-out long-running test in `commit.yml`
- Add the short-mode soak test to the integration-test PR gate (~2 min)
3. Realistic Load Testing
- New `TestStabilityUnderFlakyStorage`: busy DB + MinIO behind Toxiproxy with cyclic fault injection (TCP resets, latency, bandwidth throttling, timeouts)
- Extract and expand the Toxiproxy helpers for reuse across tests (currently only TCP reset is supported)
- New network fault variant tests: high latency, bandwidth throttle, partial write interruption
4. Behavioral Assertions in Soak Tests
- Assert snapshot cadence (catch excessive/runaway snapshots)
- Track WAL size over time, assert it stays bounded
- Assert checkpoint timing under sustained write load
- Validate that no write blocks longer than `BusyTimeout`
5. LTX Consistency Validation
- Extend `Store.Validate()` with LTX content checks (checksums, page size consistency, TXID coverage)
- Add a `litestream validate` CLI command so users can check replica health
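The TXID-coverage and page-size checks reduce to a sort and a linear scan over file headers. A minimal sketch, assuming a hypothetical `ltxFileInfo` summary (not Litestream's actual types) for the files at one level; checksum verification is left out because it requires decoding file contents. Valid SQLite page sizes are powers of two from 512 to 65536:

```go
package main

import (
	"fmt"
	"sort"
)

// ltxFileInfo is a hypothetical summary of one LTX file's header.
type ltxFileInfo struct {
	Name     string
	MinTXID  uint64
	MaxTXID  uint64
	PageSize uint32
}

// validPageSize reports whether ps is a legal SQLite page size.
func validPageSize(ps uint32) bool {
	return ps >= 512 && ps <= 65536 && ps&(ps-1) == 0
}

// validateCoverage checks one level for a contiguous TXID sequence (no gaps,
// no overlaps) and a single valid page size, collecting every problem found.
func validateCoverage(files []ltxFileInfo) []error {
	if len(files) == 0 {
		return nil
	}
	var errs []error
	sorted := append([]ltxFileInfo(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MinTXID < sorted[j].MinTXID })

	pageSize := sorted[0].PageSize
	for i, f := range sorted {
		if !validPageSize(f.PageSize) {
			errs = append(errs, fmt.Errorf("%s: invalid page size %d", f.Name, f.PageSize))
		} else if f.PageSize != pageSize {
			errs = append(errs, fmt.Errorf("%s: page size %d, expected %d", f.Name, f.PageSize, pageSize))
		}
		if f.MinTXID > f.MaxTXID {
			errs = append(errs, fmt.Errorf("%s: min TXID %d > max TXID %d", f.Name, f.MinTXID, f.MaxTXID))
		}
		if i == 0 {
			continue
		}
		prev := sorted[i-1]
		switch {
		case f.MinTXID > prev.MaxTXID+1:
			errs = append(errs, fmt.Errorf("gap: TXIDs %d-%d missing between %s and %s",
				prev.MaxTXID+1, f.MinTXID-1, prev.Name, f.Name))
		case f.MinTXID <= prev.MaxTXID:
			errs = append(errs, fmt.Errorf("overlap: %s starts at TXID %d, inside %s",
				f.Name, f.MinTXID, prev.Name))
		}
	}
	return errs
}

func main() {
	files := []ltxFileInfo{
		{Name: "a.ltx", MinTXID: 1, MaxTXID: 10, PageSize: 4096},
		{Name: "b.ltx", MinTXID: 12, MaxTXID: 20, PageSize: 4096},
	}
	for _, err := range validateCoverage(files) {
		fmt.Println(err) // gap: TXIDs 11-11 missing between a.ltx and b.ltx
	}
}
```

Returning all errors rather than the first suits a user-facing `validate` command, which should report every problem in one pass.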
6. Release Gating Process
- Add a `stability` label for lockup/corruption/regression issues
- Gate releases: zero open `stability` issues before tagging
- Document in CONTRIBUTING.md: PRs touching core replication/compaction must pass the soak test
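The release gate could be automated as a pre-tag check. A minimal sketch, assuming a hypothetical `Issue` type holding only the fields the gate needs; in practice the list would come from GitHub's issues API filtered by label and state, which is outside this sketch. The sample issues in `main` are made up:

```go
package main

import "fmt"

// Issue is a minimal, hypothetical view of a GitHub issue.
type Issue struct {
	Number int
	Title  string
	State  string // "open" or "closed"
	Labels []string
}

// releaseBlockers returns the open issues that carry the given label.
func releaseBlockers(issues []Issue, label string) []Issue {
	var blockers []Issue
	for _, is := range issues {
		if is.State != "open" {
			continue
		}
		for _, l := range is.Labels {
			if l == label {
				blockers = append(blockers, is)
				break
			}
		}
	}
	return blockers
}

func main() {
	// Made-up issues for illustration only.
	issues := []Issue{
		{Number: 1, Title: "example: lockup under load", State: "closed", Labels: []string{"stability"}},
		{Number: 2, Title: "example: new flag", State: "open", Labels: []string{"enhancement"}},
	}
	blockers := releaseBlockers(issues, "stability")
	if len(blockers) == 0 {
		fmt.Println("no open stability issues; OK to tag")
		return
	}
	for _, b := range blockers {
		fmt.Printf("release blocked by #%d: %s\n", b.Number, b.Title)
	}
}
```

A release workflow could run this before tagging and fail the job (non-zero exit) whenever `releaseBlockers` returns anything.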
7. Community Load Profiles
- Provide a way for users to contribute representative production workload profiles
- Turn profiles into named test configs run in nightly CI