Stability hardening: nightly CI, load testing, and release gating for v0.5.x #1183

@corylanou

Motivation

v0.5.x introduced major architectural changes (LTX format, compaction, VFS). We've been moving fast on features and have good test coverage, but community feedback shows regressions still slip through under real-world production load.

The existing test infrastructure (soak, chaos, fuzz, integration) is strong — the gap is in how often it runs, what it asserts, and how realistically it simulates production conditions.

Proposal: Pause feature work and focus on hardening releases through better CI, more realistic load tests, and release gating.


1. Nightly Stability CI

  • Create nightly-stability.yml workflow (scheduled daily on main, plus a manual workflow_dispatch trigger)
  • Race-detector sweep with -count=3
  • Comprehensive soak test (short mode daily, full 2h on weekends)
  • MinIO soak test (S3-compatible backend)
  • VFS chaos test
  • Upload artifacts and notify on failure

2. PR CI Gate Improvements

  • Re-enable the commented-out long-running test in commit.yml
  • Add short-mode soak test to integration test PR gate (~2 min)
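
One way the short/full split from sections 1 and 2 could be wired up is through Go's standard -short flag. The sketch below is only an illustration: soakDuration is a hypothetical helper, not existing Litestream code, and it assumes the ~2 min and 2h figures above are the two durations wanted.

```go
package main

import (
	"fmt"
	"time"
)

// soakDuration is a hypothetical helper (not part of Litestream).
// In a real test the argument would come from testing.Short(),
// which is true when the suite runs with `go test -short`.
func soakDuration(short bool) time.Duration {
	if short {
		// Assumed budget for the PR gate and weekday nightly runs.
		return 2 * time.Minute
	}
	// Assumed budget for the full weekend run.
	return 2 * time.Hour
}

func main() {
	fmt.Println("short:", soakDuration(true))
	fmt.Println("full: ", soakDuration(false))
}
```

With this shape the same test function could serve the PR gate, the weekday nightly, and the weekend run; only the flag passed by the workflow would differ.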

3. Realistic Load Testing

  • New TestStabilityUnderFlakyStorage: busy DB + MinIO behind Toxiproxy with cyclic fault injection (TCP resets, latency, bandwidth throttling, timeouts)
  • Extract and expand Toxiproxy helpers for reuse across tests (currently only TCP reset is supported)
  • New network fault variant tests: high latency, bandwidth throttle, partial write interruption
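
The "cyclic fault injection" idea could be modeled as a repeating schedule that the test consults to decide which fault to apply. This is a minimal sketch under stated assumptions: the Fault, Phase, and Schedule types and the 30s/10s timings are hypothetical, and the actual Toxiproxy calls are deliberately left out.

```go
package main

import (
	"fmt"
	"time"
)

// Fault names one kind of injected network fault. The values mirror the
// faults listed in the proposal; the type itself is hypothetical.
type Fault string

const (
	FaultNone      Fault = "none"
	FaultTCPReset  Fault = "tcp_reset"
	FaultLatency   Fault = "latency"
	FaultBandwidth Fault = "bandwidth"
	FaultTimeout   Fault = "timeout"
)

// Phase is one step of the cycle: a fault held for a fixed duration.
type Phase struct {
	Fault    Fault
	Duration time.Duration
}

// Schedule is a repeating sequence of phases.
type Schedule []Phase

// At returns the fault active at the given elapsed time, wrapping
// around once the whole cycle has completed.
func (s Schedule) At(elapsed time.Duration) Fault {
	var total time.Duration
	for _, p := range s {
		total += p.Duration
	}
	if total <= 0 {
		return FaultNone
	}
	offset := elapsed % total
	for _, p := range s {
		if offset < p.Duration {
			return p.Fault
		}
		offset -= p.Duration
	}
	return FaultNone
}

// seconds converts a whole number of seconds to a time.Duration.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

// exampleSchedule alternates 30s of healthy network with 10s of each
// fault. The timings are illustrative assumptions only.
func exampleSchedule() Schedule {
	var s Schedule
	for _, f := range []Fault{FaultTCPReset, FaultLatency, FaultBandwidth, FaultTimeout} {
		s = append(s, Phase{FaultNone, seconds(30)}, Phase{f, seconds(10)})
	}
	return s
}

func main() {
	s := exampleSchedule()
	for _, sec := range []int{0, 35, 75, 155, 195} {
		fmt.Printf("t=%3ds fault=%s\n", sec, s.At(seconds(sec)))
	}
}
```

Keeping the schedule as pure data would let the extracted Toxiproxy helpers stay small: a test loop asks the schedule which fault is due and applies or removes the matching toxic.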

4. Behavioral Assertions in Soak Tests

  • Assert snapshot cadence (catch excessive/runaway snapshots)
  • Track WAL size over time, assert it stays bounded
  • Assert checkpoint timing under sustained write load
  • Validate that no write blocks for longer than BusyTimeout
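
As a rough illustration of what such assertions might look like, the sketch below checks periodic samples taken during a soak run. Everything here is an assumption for the sake of the example: the Sample type, the helper names, and the thresholds used in main are hypothetical, not values from Litestream.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a hypothetical periodic measurement taken during a soak run.
type Sample struct {
	Elapsed   time.Duration // time since the run started
	WALBytes  int64         // WAL file size at this moment
	Snapshots int           // cumulative snapshot count so far
}

// checkWALBounded fails if any sampled WAL size exceeds maxBytes.
func checkWALBounded(samples []Sample, maxBytes int64) error {
	for _, s := range samples {
		if s.WALBytes > maxBytes {
			return fmt.Errorf("WAL was %d bytes at %s, limit is %d", s.WALBytes, s.Elapsed, maxBytes)
		}
	}
	return nil
}

// checkSnapshotRate fails if the average snapshot rate between the first
// and last sample exceeds maxPerHour (catches runaway snapshots).
func checkSnapshotRate(samples []Sample, maxPerHour float64) error {
	if len(samples) < 2 {
		return nil
	}
	first, last := samples[0], samples[len(samples)-1]
	hours := (last.Elapsed - first.Elapsed).Hours()
	if hours <= 0 {
		return nil
	}
	rate := float64(last.Snapshots-first.Snapshots) / hours
	if rate > maxPerHour {
		return fmt.Errorf("snapshot rate %.1f/h exceeds limit %.1f/h", rate, maxPerHour)
	}
	return nil
}

// minutes converts a whole number of minutes to a time.Duration.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

func main() {
	runaway := []Sample{
		{minutes(0), 1 << 20, 0},
		{minutes(60), 64 << 20, 50},
	}
	// Thresholds below are illustrative, not recommended values.
	fmt.Println(checkWALBounded(runaway, 8<<20))
	fmt.Println(checkSnapshotRate(runaway, 4))
}
```

The point of returning errors rather than logging is that the soak test can fail the run outright when a bound is violated, instead of relying on someone reading the output.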

5. LTX Consistency Validation

  • Extend Store.Validate() with LTX content checks (checksums, page size consistency, TXID coverage)
  • Add litestream validate CLI command for users to check replica health
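
The TXID coverage check could, for example, verify that the transaction ranges of a set of LTX files form one contiguous sequence with no gaps or overlaps. The sketch below is not the Store.Validate() implementation: FileRange and checkTXIDCoverage are hypothetical names, and it assumes the files being compared belong to a single compaction level.

```go
package main

import (
	"fmt"
	"sort"
)

// FileRange is a hypothetical stand-in for the TXID range covered by
// one LTX file.
type FileRange struct {
	MinTXID uint64
	MaxTXID uint64
}

// checkTXIDCoverage reports an error if the ranges contain an inverted
// range, an overlap, or a gap. It assumes all files come from the same
// compaction level.
func checkTXIDCoverage(files []FileRange) error {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]FileRange(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].MinTXID < sorted[j].MinTXID })

	for i, f := range sorted {
		if f.MinTXID > f.MaxTXID {
			return fmt.Errorf("inverted range %d-%d", f.MinTXID, f.MaxTXID)
		}
		if i == 0 {
			continue
		}
		prev := sorted[i-1]
		switch {
		case f.MinTXID <= prev.MaxTXID:
			return fmt.Errorf("overlap: %d-%d and %d-%d", prev.MinTXID, prev.MaxTXID, f.MinTXID, f.MaxTXID)
		case f.MinTXID > prev.MaxTXID+1:
			return fmt.Errorf("gap: TXIDs %d-%d are missing", prev.MaxTXID+1, f.MinTXID-1)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkTXIDCoverage([]FileRange{{1, 3}, {4, 4}, {5, 9}}))
	fmt.Println(checkTXIDCoverage([]FileRange{{1, 3}, {5, 9}}))
}
```

A litestream validate command could surface the returned error text directly, so users see which TXID range is missing rather than a bare pass/fail.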

6. Release Gating Process

  • Add stability label for lockup/corruption/regression issues
  • Gate releases: zero open stability issues before tagging
  • Document in CONTRIBUTING.md: PRs touching core replication/compaction must pass soak test

7. Community Load Profiles

  • Create a way for users to contribute representative production workload profiles
  • Turn profiles into named test configs run in nightly CI
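
To make the profile idea concrete, one possible shape is a small JSON document that users submit and that nightly CI loads as a named config. The format below is purely an assumption: the LoadProfile fields and parseProfile helper are hypothetical and only meant to show how a contributed profile might be validated before use.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// LoadProfile is a hypothetical description of a production workload.
// None of these field names come from Litestream.
type LoadProfile struct {
	Name            string `json:"name"`
	WritesPerSecond int    `json:"writes_per_second"`
	PayloadBytes    int    `json:"payload_bytes"`
	DurationSeconds int    `json:"duration_seconds"`
}

// parseProfile decodes and validates a contributed profile. Unknown
// fields are rejected so typos in submissions are caught early.
func parseProfile(data []byte) (LoadProfile, error) {
	var p LoadProfile
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return LoadProfile{}, fmt.Errorf("decode profile: %w", err)
	}
	if p.Name == "" {
		return LoadProfile{}, fmt.Errorf("profile name is required")
	}
	if p.WritesPerSecond <= 0 || p.PayloadBytes <= 0 || p.DurationSeconds <= 0 {
		return LoadProfile{}, fmt.Errorf("profile %q: all numeric fields must be positive", p.Name)
	}
	return p, nil
}

func main() {
	raw := []byte(`{"name":"steady-writes","writes_per_second":50,"payload_bytes":512,"duration_seconds":120}`)
	p, err := parseProfile(raw)
	if err != nil {
		fmt.Println("invalid profile:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```

Each validated profile could then become a named subtest (for example via t.Run(p.Name, ...)), which keeps nightly failures attributable to a specific contributed workload.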

Labels: enhancement
