
Stateful testing, weighted transitions and using state to generate transitions #122

@russelldb


Hi,
First up, thanks for the amazing work and keeping the PBT flame lit!

I have a model/sim that I wrote for a client that uses proptest-state-machine. The basic idea is that I use a BTreeMap<NodeId, State> to represent a distributed system, and then I generate transitions to do things in the system. I find this a really generally useful way to model distributed systems, and it can often then be used to drive the real system components to ensure the implementation matches the spec. I guess it is property-based design more than testing, but ideas/thinking need testing as much as code, right?

I installed the hegel-skill in Claude. Using the Opus 4.6 1M-context model, Claude ported my sim to hegel-rust. Here are the open questions I have. These might come down to my not understanding which proptest-state-machine concept maps to which hegel-rust one. (NOTE: I learned this general technique from Quviq using EQC and eqc_statem, and just ported it across to proptest-state-machine.)

Having read the issue that Claude generated, it sounds pretty confident that the issue isn't my/its understanding. I am nowhere near as confident, and apologise in advance if the answer is "right better models".

Claude says:


Title: Stateful testing: support for state-dependent rule eligibility (preconditions that gate rule selection, not just filter)
Context:

I'm porting a stateful property-based test from proptest-state-machine to hegel-rust. The system under test is a distributed sync protocol simulator with 9 transition types (Insert, Update, Delete, StartPush,
ContinuePush, StartPull, ContinuePull, NewSchema, UpgradeSchema). Many transitions are only valid in specific states — e.g., ContinuePush requires a device to be mid-push session, UpgradeSchema requires the
device schema to be behind the server's.

In proptest-state-machine, I handle this with an EligibleSets pattern: a single pass over the state computes which transitions are valid and builds a weighted strategy union of only the eligible ones. The framework generates transitions externally, the SUT never draws during application, and shrinking operates on the transition sequence.
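For concreteness, here is a minimal std-only sketch of the EligibleSets idea. The state shape, transition names and weights are invented for illustration (they are not the client sim's, and this is plain Rust, not the proptest strategy machinery): eligibility is decided once, up front, so there is never anything to reject.

```rust
// Hypothetical, simplified model state: which devices are mid-push/pull,
// and which are behind the server schema.
struct ModelState {
    pushing: Vec<usize>,
    pulling: Vec<usize>,
    stale_schema: Vec<usize>,
}

#[derive(Debug, PartialEq)]
enum Transition {
    Insert,
    ContinuePush(usize),
    ContinuePull(usize),
    UpgradeSchema(usize),
}

// One pass over the state builds a weighted pool of *eligible* transitions;
// ineligible transitions are simply absent from the pool.
fn eligible(state: &ModelState) -> Vec<(u32, Transition)> {
    let mut pool = vec![(5, Transition::Insert)]; // always eligible
    for &d in &state.pushing {
        pool.push((3, Transition::ContinuePush(d)));
    }
    for &d in &state.pulling {
        pool.push((3, Transition::ContinuePull(d)));
    }
    for &d in &state.stale_schema {
        pool.push((1, Transition::UpgradeSchema(d)));
    }
    pool
}

// Weighted pick driven by an externally supplied roll, so generation stays
// deterministic and shrinkable as a sequence of choices.
fn pick(pool: &[(u32, Transition)], roll: u32) -> &Transition {
    let total: u32 = pool.iter().map(|(w, _)| w).sum();
    let mut r = roll % total;
    for (w, t) in pool {
        if r < *w {
            return t;
        }
        r -= *w;
    }
    unreachable!()
}
```

In the real model the pool feeds a weighted strategy union (`prop_oneof!`-style); the point is that the gate runs before generation, not after it.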

In hegel-rust's #[hegel::state_machine], the equivalent is tc.assume() inside #[rule] methods. This works but has three significant limitations:

  1. High rejection rates for state-gated rules

Rules like continue_push and continue_pull are only valid when a device is mid-session. With uniform rule selection and 3 devices, tc.assume(dev.is_pushing()) rejects the vast majority of attempts. In proptest,
these had weight 3 and were only in the strategy when eligible — zero waste. With hegel I had to suppress FilterTooMuch health checks entirely.

  2. Draws inside rules must be state-independent

This is the bigger issue. If a rule draws from a state-dependent collection (e.g., sampled_from(live_rows) where live_rows varies), the draw pattern changes across replays and hegel flags the test as flaky. I had to restructure all rules to draw fixed-shape values (integer indices) and map them to state post hoc via modulo arithmetic. This works but loses precision: instead of drawing directly from eligible devices/rows, I'm drawing uniform integers and hoping they land somewhere useful after assume().

For comparison, in proptest-state-machine the equivalent of "draw a live row from a specific device" was:

proptest::sample::select(devices_with_live_rows.clone())
    .prop_flat_map(|(id, row_ids, cols)| gen_update(id, row_ids, cols))

State-dependent, zero rejection, precise generation. In hegel I had to write:

let device_id = tc.draw(generators::integers::<usize>().max_value(num_devices - 1));
let row_idx = tc.draw(generators::integers::<usize>());
// ... later, after reading state:
tc.assume(!live_rows.is_empty());
let row_id = live_rows[row_idx % live_rows.len()];
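The workaround boils down to the following std-only helper (`map_to_pool` is a made-up name for illustration, not hegel API): the index is drawn with a fixed shape up front, and only mapped onto whatever state-dependent pool exists at application time.

```rust
// Fixed-shape draw, state-dependent mapping. The draw itself never varies
// across replays; only the interpretation of the index does.
fn map_to_pool<T>(drawn_idx: usize, pool: &[T]) -> Option<&T> {
    if pool.is_empty() {
        None // corresponds to the tc.assume(!live_rows.is_empty()) rejection
    } else {
        Some(&pool[drawn_idx % pool.len()])
    }
}
```

This is exactly where the precision loss comes from: the modulo folds a uniform index onto the pool rather than drawing from the pool directly.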

What I think would help

Python's Hypothesis solves this with @precondition decorators on rules (which gate selection before any draws happen) and Bundles (state-dependent pools the engine understands). Something analogous in hegel-rust would be transformative for stateful testing:

  • #[precondition] on rules: a predicate checked before the rule is selected, so ineligible rules never fire and never waste test budget. This avoids the assume() rejection tax entirely.

  • Variables<T> with state-dependent draw semantics: Variables already exists in hegel-rust, but if the engine could draw from variables without the flaky-test detection treating variable-length pools as
    non-deterministic draws, rules could use state-dependent collections naturally.
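As a rough sketch of what precondition gating could desugar to (Rule, Model and the predicates here are invented for illustration, not a proposed hegel API): rules carry a predicate over the model state, and the engine filters before selecting, so selection only ever sees currently eligible rules.

```rust
// Hypothetical model flags that preconditions inspect.
struct Model {
    mid_push: bool,
    schema_behind: bool,
}

// A rule paired with its eligibility predicate.
struct Rule {
    name: &'static str,
    precondition: fn(&Model) -> bool,
}

// Gate first, select second: the assume()-style rejection tax disappears
// because ineligible rules are never candidates at all.
fn selectable<'a>(rules: &'a [Rule], m: &Model) -> Vec<&'a Rule> {
    rules.iter().filter(|r| (r.precondition)(m)).collect()
}
```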

Without these, stateful testing in hegel-rust works for simple cases but hits a wall for models where transition eligibility is the core complexity — which is exactly where stateful PBT is most valuable.

  3. No weighted rule selection

In proptest-state-machine I used explicit weights to shape the transition distribution toward realistic scenarios: sync sessions frequent and usually completing, schema changes rare, local mutations steady. This produced test traces that looked like real usage:
--- Transition Distribution (1006 cases, 553445 transitions) ---
   ContinuePull   112069  ( 20.2%)
   ContinuePush   107586  ( 19.4%)
   Update         100972  ( 18.2%)
   Insert          52227  (  9.4%)
   Delete          50588  (  9.1%)
   NewSchema       50008  (  9.0%)
   UpgradeSchema   43704  (  7.9%)
   StartPush       19691  (  3.6%)
   StartPull       16600  (  3.0%)
 
 --- Session Completions ---
   StartPush       19691 started, 18414 completed ( 93.5%)
   StartPull       16600 started, 14977 completed ( 90.2%)

With hegel's uniform rule selection, the distribution is dominated by the rules that are always eligible. From a 1000-case run:

--- Transition Distribution (1009 cases, 48323 transitions) ---
  Insert         8727  ( 18.1%)
  NewSchema      8240  ( 17.1%)
  Delete         8075  ( 16.7%)
  Update         7776  ( 16.1%)
  UpgradeSchema  5628  ( 11.6%)
  ContinuePush   5040  ( 10.4%)
  StartPush      2724  (  5.6%)
  StartPull      1065  (  2.2%)
  ContinuePull   1048  (  2.2%)

--- Session Completions ---
  StartPush        2724 started,   767 completed ( 28.2%)
  StartPull        1065 started,   464 completed ( 43.6%)
`NewSchema` and `UpgradeSchema` fire ~29% of the time combined; in reality these are rare events. Meanwhile `StartPull`/`ContinuePull` are starved at ~4% combined because they require specific preconditions and get filtered by `assume()`. The result: test traces are dominated by schema churn and don't exercise the sync protocol deeply. Push/pull sessions rarely complete.

This compounds with the precondition issue: rules that need preconditions are already penalised by `assume()` rejection, and they get no priority boost to compensate. A weight parameter on `#[rule]` (or, equivalently, the ability to control rule selection probability) would let users tune the distribution back toward realistic scenarios.
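To make the effect concrete, here is a small std-only simulation of what per-rule weights buy (the weights and the LCG are illustrative, not hegel internals): weighted selection over a rule index, repeated many times, reshapes the empirical distribution in exactly the way the tables above show uniform selection cannot.

```rust
// Minimal deterministic generator so the sketch needs no external crates.
fn lcg(seed: &mut u64) -> u32 {
    *seed = seed
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*seed >> 33) as u32
}

// Pick a rule index with probability proportional to its weight.
fn pick_weighted(weights: &[u32], roll: u32) -> usize {
    let total: u32 = weights.iter().sum();
    let mut r = roll % total;
    for (i, w) in weights.iter().enumerate() {
        if r < *w {
            return i;
        }
        r -= *w;
    }
    unreachable!()
}

// Count how often each rule index fires over `draws` weighted picks.
fn distribution(weights: &[u32], draws: u32) -> Vec<u32> {
    let mut seed = 42u64;
    let mut counts = vec![0u32; weights.len()];
    for _ in 0..draws {
        counts[pick_weighted(weights, lcg(&mut seed))] += 1;
    }
    counts
}
```

Running `distribution(&[8, 1, 1], n)` concentrates the budget on the first rule, which is the knob the always-eligible rules currently monopolise under uniform selection.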

Happy to discuss further, or to provide a cut-down version of the full port as a concrete example of the friction.
