A Bayesian A/B testing system that treats experimentation as a sequential decision-making problem, rather than a one-time statistical inference task.
v1.0 implemented a Bayesian A/B testing baseline using a Beta–Binomial model. It focused on posterior inference for conversion rates and provided quantities such as the probability that variant B outperforms A, expected lift, and credible intervals.
This established a solid inferential foundation, but it remained a static analysis: it did not address how experiments should be evaluated sequentially or when a decision should be made in practice.
Most A/B testing implementations focus on static inference: they estimate conversion rates after collecting a fixed amount of data.
This approach has two major issues in practice:
- Experiments often run longer than necessary, wasting traffic and time.
- Statistical significance does not guarantee practical or business relevance.
v1.0 of this project addressed Bayesian inference, but it did not answer the key operational question:
“When should we stop the experiment, and what should we do next?”
v2.0 upgrades the system from a static Bayesian analysis to a sequential decision engine.
Key changes:
- The experiment is evaluated day by day, not only at a fixed horizon.
- Decisions are made explicitly using Bayesian decision principles.
- The system can stop early when further data collection is no longer useful.
Instead of reporting only probabilities or intervals, v2.0 outputs one of three actions:
- `SHIP_B` — roll out the treatment
- `STOP` — stop the experiment and keep the control
- `CONTINUE` — collect more data
- Variant A (control) is assumed to be the default.
- The system only ships B if evidence is strong and risk is acceptable.
- Otherwise, the experiment is stopped and A is kept by default.
This reflects common industry experimentation practice.
The decision engine combines three ideas:
Conversion rates are modeled using a Beta–Binomial model, allowing exact and fast posterior updates as new data arrives.
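As a concrete illustration, a conjugate update of this kind might look like the following minimal sketch. The `BetaPosterior` class and the uniform Beta(1, 1) prior are assumptions made for illustration, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class BetaPosterior:
    """Beta posterior over a conversion rate (hypothetical helper)."""
    alpha: float = 1.0  # assumes a uniform Beta(1, 1) prior
    beta: float = 1.0

    def update(self, conversions: int, trials: int) -> "BetaPosterior":
        # Conjugacy: Beta(a, b) prior + Binomial data (k of n) -> Beta(a + k, b + n - k)
        return BetaPosterior(self.alpha + conversions,
                             self.beta + trials - conversions)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)
```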
Small effects may be statistically real but operationally irrelevant. A Region of Practical Equivalence (ROPE) is used to distinguish meaningful improvements from noise.
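One way to operationalize a ROPE is to estimate, from posterior samples, how much of the lift distribution lies beyond it. A minimal sketch, assuming the hypothetical `BetaPosterior` helper above and an illustrative ROPE half-width of 0.5 percentage points:

```python
import numpy as np

def rope_probabilities(post_a, post_b, rope_width=0.005,
                       n_samples=100_000, seed=0):
    """Monte Carlo estimate of where the lift falls relative to the ROPE."""
    rng = np.random.default_rng(seed)
    samples_a = rng.beta(post_a.alpha, post_a.beta, size=n_samples)
    samples_b = rng.beta(post_b.alpha, post_b.beta, size=n_samples)
    lift = samples_b - samples_a
    return {
        "p_meaningful_win": float(np.mean(lift > rope_width)),       # B wins by more than the ROPE
        "p_equivalent": float(np.mean(np.abs(lift) <= rope_width)),  # difference too small to matter
    }
```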
Before shipping a variant, the system estimates the expected loss if the decision were wrong. This prevents premature rollouts when uncertainty is still costly.
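Expected loss is typically computed from the same posterior samples: it is the average conversion-rate shortfall if B is shipped but is actually worse. A sketch under the same assumptions as above:

```python
import numpy as np

def expected_loss_of_shipping_b(post_a, post_b, n_samples=100_000, seed=0):
    """Expected shortfall if B is shipped but A is truly better."""
    rng = np.random.default_rng(seed)
    samples_a = rng.beta(post_a.alpha, post_a.beta, size=n_samples)
    samples_b = rng.beta(post_b.alpha, post_b.beta, size=n_samples)
    # Loss is zero whenever B wins; otherwise it is the gap A - B.
    return float(np.mean(np.maximum(samples_a - samples_b, 0.0)))
```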
A decision is made only when confidence is high and downside risk is acceptable.
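Putting the three pieces together, a decision rule of this kind could look like the sketch below. The threshold values are illustrative placeholders, not the project's tuned settings:

```python
def decide(post_a, post_b, rope_width=0.005, p_win_threshold=0.95,
           p_equiv_threshold=0.90, max_expected_loss=0.001):
    """Map posterior evidence to one of the three actions (illustrative thresholds)."""
    probs = rope_probabilities(post_a, post_b, rope_width)
    loss = expected_loss_of_shipping_b(post_a, post_b)
    if probs["p_meaningful_win"] >= p_win_threshold and loss <= max_expected_loss:
        return "SHIP_B"    # strong evidence of a meaningful win at acceptable risk
    if probs["p_equivalent"] >= p_equiv_threshold:
        return "STOP"      # variants practically equivalent; keep the control
    return "CONTINUE"      # evidence inconclusive; collect more data
```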
The notebook `v2.0_demo.ipynb` demonstrates the system in action.
It simulates an A/B experiment where data arrives sequentially and shows (a sketch of this loop follows the list):
- how evidence accumulates over time
- when the system decides to stop
- which action is taken
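The core of that sequential loop can be pictured roughly as follows. The true rates, daily traffic, and horizon are made-up simulation settings, and the helpers are the hypothetical ones sketched above:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate_a, true_rate_b, daily_traffic = 0.10, 0.11, 500  # assumed settings

post_a, post_b = BetaPosterior(), BetaPosterior()
for day in range(1, 31):
    # Simulate one day of traffic for each variant.
    conv_a = rng.binomial(daily_traffic, true_rate_a)
    conv_b = rng.binomial(daily_traffic, true_rate_b)
    post_a = post_a.update(conv_a, daily_traffic)
    post_b = post_b.update(conv_b, daily_traffic)

    action = decide(post_a, post_b)
    print(f"day {day:2d}: est. lift={post_b.mean - post_a.mean:+.4f}  action={action}")
    if action != "CONTINUE":
        break  # stop early once the evidence supports a decision
```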
To run the demo:
- Open `v2.0_demo.ipynb` in Google Colab
- Run all cells from top to bottom
- Observe the printed decisions and the decision-over-time plot
Potential extensions include:
- Threshold tuning via large-scale simulation
- Hierarchical models for segment-level experiments
- Support for non-binary metrics using PyMC
- Explicit rollback and multi-variant decision logic
These are intentionally left out of v2.0 to keep the decision behavior transparent and well-understood.