Commit da0e609

docs: Add documentation for sequential testing in experiments (#17226)
* Add sequential testing documentation to frequentist statistics page
* docs(experiments): mention sequential testing as peeking problem solution in best practices

---------

Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
1 parent 98fc6fd commit da0e609

2 files changed

Lines changed: 60 additions & 18 deletions

contents/docs/experiments/best-practices.mdx

Lines changed: 19 additions & 16 deletions
@@ -3,6 +3,7 @@ title: Experiments & A/B testing best practices
 sidebar: Docs
 showTitle: true
 ---
+
 ## 1. Establish a clear hypothesis

 A good hypothesis focuses your testing and guides your decision-making. It should include your goal metric, how you think your change will improve it, and any other important context. For example:
@@ -28,33 +29,32 @@ To avoid including unaffected users, make sure to first filter out ineligible us
 ```js
 // ✅ Correct. Will exclude unaffected users
 function showNewChanges(user) {
-
   if (user.hasCompletedAction) {
-    return false
+    return false;
   }

   // other checks

-  if (posthog.getFeatureFlag('experiment-key') === 'control') {
+  if (posthog.getFeatureFlag("experiment-key") === "control") {
     return false;
   }

-  return true
+  return true;
 }

 // ❌ Incorrect. Will include unaffected users
 function showNewChanges(user) {
-  if (posthog.getFeatureFlag('experiment-key') === 'control') {
+  if (posthog.getFeatureFlag("experiment-key") === "control") {
     return false;
   }

   if (user.hasCompletedAction) {
-    return false
+    return false;
   }

   // other checks

-  return true
+  return true;
 }
 ```

@@ -77,10 +77,10 @@ To avoid this problem, you should first test your experiment with a [small rollo
 To do this in PostHog, you can edit the rollout percentage of the experiment feature flag:

 <ProductVideo
-videoLight= "https://res.cloudinary.com/dmukukwp6/video/upload/light_rollout_45df5dd6f6.mp4"
-videoDark= "https://res.cloudinary.com/dmukukwp6/video/upload/rollot_d275817347.mp4"
-alt="How to edit the rollout percentage of an experiment feature flag"
-classes="rounded"
+videoLight="https://res.cloudinary.com/dmukukwp6/video/upload/light_rollout_45df5dd6f6.mp4"
+videoDark="https://res.cloudinary.com/dmukukwp6/video/upload/rollot_d275817347.mp4"
+alt="How to edit the rollout percentage of an experiment feature flag"
+classes="rounded"
 />

 Here's a list of what to check during your test rollout:
@@ -99,15 +99,18 @@ Alternatively, if you don't have enough statistical power (i.e., not enough user
 For these reasons, PostHog includes a recommended running time calculator in the experiment setup flow. This calculates the minimum sample size required to run your experiment and the duration you should run your experiment for.

 <ProductVideo
-videoLight= "https://res.cloudinary.com/dmukukwp6/video/upload/experiment_light_7070cd279b.mp4"
-videoDark= "https://res.cloudinary.com/dmukukwp6/video/upload/experiment0dark_61a5bcbb63.mp4"
-alt="How to use the A/B test duration calculator"
-classes="rounded"
+videoLight="https://res.cloudinary.com/dmukukwp6/video/upload/experiment_light_7070cd279b.mp4"
+videoDark="https://res.cloudinary.com/dmukukwp6/video/upload/experiment0dark_61a5bcbb63.mp4"
+alt="How to use the A/B test duration calculator"
+classes="rounded"
 />

+Alternatively, if you enable [sequential testing](/docs/experiments/statistics-frequentist#sequential-testing-beta), PostHog produces always-valid p-values that stay bounded by alpha no matter how often you check results. This lets you monitor an experiment continuously without inflating false positives. You should still plan your expected sample size upfront so you have enough statistical power to detect the effect you care about.
+
 ## 6. Use a launch checklist

 Launching an A/B test requires careful planning to ensure accurate results and meaningful insights. To help you navigate this, we've put together this launch checklist:
+
 ```
 **Before launch**

@@ -130,4 +133,4 @@ Launching an A/B test requires careful planning to ensure accurate results and m
 - [ ] Validate or invalidate your hypothesis based on experiment data.
 - [ ] Document your results and share with your team for feedback.
 - [ ] End experiment and deploy winning variant code. Delete code for losing variant.
-```
+```

contents/docs/experiments/statistics-frequentist.mdx

Lines changed: 41 additions & 2 deletions
@@ -44,7 +44,7 @@ The raw event data is aggregated into sufficient statistics for each variant:
   number_of_samples: 1523, // Users exposed to this variant
   sum: 98.5, // Total metric value across all users
   sum_squares: 142.3, // Sum of squared values (for variance)
-
+
   // Additional fields for ratio metrics:
   denominator_sum: 205, // Total denominator value
   denominator_sum_squares: 89.2, // Sum of squared denominator values
@@ -65,13 +65,16 @@ You can also enable [CUPED variance reduction](/docs/experiments/cuped) to reduc
 Before analysis proceeds, the engine validates the data quality based on metric type:

 **All metrics:**
+
 - **Minimum sample size**: Each variant needs at least 50 exposures

 **Funnel metrics only:**
+
 - **Minimum conversions**: At least 5 conversions per variant
 - **Normal approximation validity**: For proportions, both `np ≥ 5` and `n(1-p) ≥ 5` are required for the t-test to be valid

 **Mean and ratio metrics only:**
+
 - **Non-zero baseline**: The control variant must have a non-zero mean (needed for relative difference calculations like "20% increase")

 If any validation fails, the analysis stops and returns appropriate error messages instead of potentially misleading results.
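
The validation rules in this hunk can be sketched as a small check over a variant's sufficient statistics. This is only an illustration under stated assumptions: `validateVariant`, its arguments, and the error strings are hypothetical names, not PostHog's API, and for funnel metrics `sum` is assumed to hold the conversion count.

```javascript
// Hypothetical helper: function name, arguments, and error codes are
// illustrative only and are not taken from PostHog's codebase.
function validateVariant(stats, metricType, isControl) {
  const errors = [];
  const n = stats.number_of_samples;

  // All metrics: each variant needs at least 50 exposures.
  if (n < 50) {
    errors.push("not_enough_exposures");
  }

  if (metricType === "funnel") {
    // Assumption: for funnel metrics, `sum` is the number of conversions.
    const conversions = stats.sum;
    const nonConversions = n - conversions;

    // At least 5 conversions per variant.
    if (conversions < 5) {
      errors.push("not_enough_conversions");
    }
    // Normal approximation: np >= 5 and n(1-p) >= 5, where np is the
    // conversion count and n(1-p) is the non-conversion count.
    if (conversions < 5 || nonConversions < 5) {
      errors.push("normal_approximation_invalid");
    }
  } else if (isControl) {
    // Mean and ratio metrics: the control mean must be non-zero so that
    // a relative difference can be computed.
    const mean = n > 0 ? stats.sum / n : 0;
    if (mean === 0) {
      errors.push("zero_baseline");
    }
  }

  return errors;
}

console.log(validateVariant({ number_of_samples: 1523, sum: 98 }, "funnel", false));
console.log(validateVariant({ number_of_samples: 40, sum: 3 }, "funnel", false));
```

An empty array means every check passed. A real engine would presumably stop at the first failure and return a user-facing message instead.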
@@ -95,31 +98,39 @@ Variance measures the spread or uncertainty in our data. Think of it as quantify
 Variance is calculated from the actual experiment data. When users in your experiment convert at different rates or have wildly different revenue values, that creates variance. The formulas differ by metric type:

 **For funnel metrics**:
+
 ```
 variance = p(1-p)/n
 ```
+
 Where p is the conversion rate and n is the sample size. The variance is highest when p=0.5 (50% conversion rate) and lowest when p is close to 0 or 1.

 **For mean metrics**:
+
 ```
 sample_variance = (sum_squares - sum²/n) / (n-1)
 ```
+
 This measures how spread out individual user values are from the average.

 **For ratio metrics**:
+
 ```
 Var(M/D) ≈ Var(M)/D² + M²×Var(D)/D⁴ - 2M×Cov(M,D)/D³
 ```
+
 This complex formula accounts for uncertainty in both the numerator and denominator, plus how they vary together.

 The pooled variance (comparing treatment vs control) combines the variances from both groups:

 **For absolute differences**:
+
 ```
 pooled_variance = treatment_variance/n_treatment + control_variance/n_control
 ```

 **For relative differences** (using delta method):
+
 ```
 pooled_variance = treatment_variance/(n_treatment × control_mean²) + (treatment_mean² × control_variance)/(n_control × control_mean⁴)
 ```
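
As a rough sketch, the funnel, mean, and pooled variance formulas in this hunk translate directly into code. The function names and the `{ variance, n, mean }` object shape are assumptions made for illustration. In the pooled functions, `variance` is taken to be each group's per-user sample variance.

```javascript
// Illustrative transcriptions of the formulas quoted above.
// All function names and object shapes are hypothetical.

// Funnel metrics: variance = p(1-p)/n
function funnelVariance(p, n) {
  return (p * (1 - p)) / n;
}

// Mean metrics: sample_variance = (sum_squares - sum^2/n) / (n-1)
function sampleVariance(sum, sumSquares, n) {
  return (sumSquares - (sum * sum) / n) / (n - 1);
}

// Absolute differences:
// pooled_variance = treatment_variance/n_treatment + control_variance/n_control
function pooledVarianceAbsolute(treatment, control) {
  return treatment.variance / treatment.n + control.variance / control.n;
}

// Relative differences (delta method):
// treatment_variance/(n_treatment * control_mean^2)
//   + (treatment_mean^2 * control_variance)/(n_control * control_mean^4)
function pooledVarianceRelative(treatment, control) {
  return (
    treatment.variance / (treatment.n * control.mean ** 2) +
    (treatment.mean ** 2 * control.variance) / (control.n * control.mean ** 4)
  );
}

// Example using the sufficient statistics shown earlier on the page.
console.log(sampleVariance(98.5, 142.3, 1523));
```

The relative form divides by powers of the control mean, which is why the validation step requires a non-zero baseline.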
@@ -133,6 +144,7 @@ The frequentist approach tests a specific hypothesis about the difference betwee
 #### The null hypothesis

 The null hypothesis (H₀) states that there is no difference between the treatment and control:
+
 - H₀: effect_size = 0 (no difference)
 - H₁: effect_size ≠ 0 (there is a difference)

@@ -141,11 +153,13 @@ The null hypothesis (H₀) states that there is no difference between the treatm
 PostHog uses Welch's t-test, which handles unequal variances between groups:

 **T-statistic calculation**:
+
 ```
 t = (observed_effect - 0) / √(pooled_variance)
 ```

 **Degrees of freedom** (Welch-Satterthwaite approximation):
+
 ```
 df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁-1) + (s₂²/n₂)²/(n₂-1)]
 ```
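
The two formulas in this hunk can be combined into one function, sketched below for an absolute difference of means. The name `welchTTest` and the `{ mean, variance, n }` input shape are assumptions for illustration. Turning `t` and `df` into a p-value needs a Student's t CDF, which is left out here.

```javascript
// Illustrative sketch of Welch's t-statistic and the Welch-Satterthwaite
// degrees of freedom. Names and input shape are hypothetical.
function welchTTest(treatment, control) {
  // Per-group variance of the mean: s^2 / n
  const vTreatment = treatment.variance / treatment.n;
  const vControl = control.variance / control.n;

  // pooled_variance for absolute differences
  const pooledVariance = vTreatment + vControl;

  // t = (observed_effect - 0) / sqrt(pooled_variance)
  const observedEffect = treatment.mean - control.mean;
  const t = observedEffect / Math.sqrt(pooledVariance);

  // df = (s1^2/n1 + s2^2/n2)^2 / [(s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)]
  const df =
    pooledVariance ** 2 /
    (vTreatment ** 2 / (treatment.n - 1) + vControl ** 2 / (control.n - 1));

  return { t, df };
}

const result = welchTTest(
  { mean: 1.2, variance: 1, n: 50 },
  { mean: 1.0, variance: 1, n: 50 }
);
console.log(result);
```

When both groups have the same variance and sample size, `df` reduces to `n₁ + n₂ - 2`, matching the classic pooled t-test. With unequal variances it comes out smaller.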
@@ -203,6 +217,30 @@ You can choose between 90%, 95%, or 99% confidence levels, which correspond to a

 To set the default confidence level for all experiments, go to **Experiments > Settings**. You can also override this for individual experiments in the **Statistics** section of the experiment.

+## Sequential testing (beta)
+
+The standard fixed-horizon t-test assumes you check results once at a predetermined sample size. In practice, experimenters check dashboards continuously, which inflates the false positive rate – this is known as the "peeking problem." Sequential testing solves this by producing always-valid p-values and confidence sequences that remain valid no matter how many times you check.
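
A small simulation illustrates the peeking problem. The sketch below runs A/A tests, where there is no true difference, and compares two rules: testing once at the end, versus stopping as soon as any interim look is significant. It rests on simplifying assumptions that do not come from the documentation: unit-variance normal data, a z-test with known variance instead of Welch's t-test, a seeded `mulberry32` generator for reproducibility, and hypothetical function and parameter names. It does not implement the sequential test itself.

```javascript
// Seeded PRNG (mulberry32) so repeated runs give the same numbers.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function standardNormal(rng) {
  const u1 = 1 - rng(); // in (0, 1], avoids log(0)
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Hypothetical simulation: both variants draw from the same distribution,
// so every "significant" result is a false positive.
function simulateAATests({ simulations, samplesPerVariant, peekEvery, seed }) {
  const rng = mulberry32(seed);
  const zCritical = 1.96; // two-sided alpha = 0.05
  let fixedHits = 0;
  let peekingHits = 0;

  for (let s = 0; s < simulations; s++) {
    let sumTreatment = 0;
    let sumControl = 0;
    let significantAtAnyPeek = false;

    for (let n = 1; n <= samplesPerVariant; n++) {
      sumTreatment += standardNormal(rng);
      sumControl += standardNormal(rng);
      if (n % peekEvery === 0) {
        // z = (mean difference) / sqrt(2/n) = (sum difference) / sqrt(2n)
        const z = (sumTreatment - sumControl) / Math.sqrt(2 * n);
        if (Math.abs(z) > zCritical) significantAtAnyPeek = true;
      }
    }

    const finalZ = (sumTreatment - sumControl) / Math.sqrt(2 * samplesPerVariant);
    const significantAtEnd = Math.abs(finalZ) > zCritical;
    if (significantAtEnd) fixedHits++;
    // The final look counts as a peek too.
    if (significantAtAnyPeek || significantAtEnd) peekingHits++;
  }

  return {
    fixedRate: fixedHits / simulations,
    peekingRate: peekingHits / simulations,
  };
}

console.log(
  simulateAATests({ simulations: 1000, samplesPerVariant: 1000, peekEvery: 50, seed: 42 })
);
```

With 20 looks per experiment, the peeking rate should come out several times higher than the single-look rate, which stays near the nominal 5%.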
+
+When sequential testing is enabled, PostHog replaces the fixed-horizon t-test with a sequential test based on [Waudby-Smith et al. 2023](https://arxiv.org/abs/2103.06476). Instead of traditional confidence intervals, it produces confidence sequences that are valid at any stopping point. The false positive rate stays bounded by your chosen alpha throughout the entire experiment.
+
+### Trade-offs
+
+The cost of always-valid guarantees is slightly wider confidence intervals compared to the fixed-horizon test. The confidence sequence is narrowest near the tuning parameter sample size and wider at smaller or much larger sample sizes.
+
+Use sequential testing when you plan to monitor results continuously and want to make decisions before a predetermined end date. If you strictly wait for the predetermined duration before checking, the standard fixed-horizon test is more efficient.
+
+### How to enable sequential testing
+
+**Per experiment:** In the experiment's **Statistics** settings, set sequential testing to **Enabled**.
+
+**As a project default:** Go to **Experiments > Settings** and check **Apply sequential testing by default**. Individual experiments can override this default.
+
+### Tuning parameter
+
+The tuning parameter (default: 5,000) controls where along the data-collection timeline the confidence sequence is tightest. Set it close to your expected total sample size for the experiment. A larger value shifts the tightest point to larger sample sizes, while a smaller value optimizes for earlier reads.
+
+You can configure the tuning parameter per experiment when sequential testing is enabled, or set a project-wide default in **Experiments > Settings**.
+
 ## Mathematical formulas reference

 ### T-test with Welch-Satterthwaite degrees of freedom
@@ -239,6 +277,7 @@ Since treatment and control are independent, the covariance term equals zero, si
 ### What about multiple variants?

 When testing multiple variants (A/B/C/D tests):
+
 - Each variant is compared to control independently
 - The p-value and confidence interval are calculated for each variant vs. control
 - No correction for multiple comparisons is applied by default
@@ -248,7 +287,7 @@ When testing multiple variants (A/B/C/D tests):

 At the default significance level (α = 0.05), there's a 5% chance that any single metric shows a significant result when nothing actually changed. The statistical engine itself can produce a false signal from random noise. Think of it like rolling a die: each metric is an independent roll, and each one has a small chance of landing on "significant" by chance.

-With 5 metrics, the chance that *at least one* produces a false positive is about 23%. With 10 metrics, it's 40%. PostHog tests each metric independently and doesn't adjust for this, so each result stands on its own.
+With 5 metrics, the chance that _at least one_ produces a false positive is about 23%. With 10 metrics, it's 40%. PostHog tests each metric independently and doesn't adjust for this, so each result stands on its own.
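
The 23% and 40% figures follow from the complement rule, assuming the metrics are independent as the paragraph describes. The chance of at least one false positive among k metrics is 1 - (1 - α)^k. A minimal sketch, with a hypothetical function name:

```javascript
// Probability that at least one of `metricCount` independent metrics
// shows a false positive at significance level `alpha`.
// The function name is illustrative only.
function familyWiseFalsePositiveRate(alpha, metricCount) {
  return 1 - Math.pow(1 - alpha, metricCount);
}

for (const k of [1, 5, 10]) {
  const rate = familyWiseFalsePositiveRate(0.05, k);
  console.log(`${k} metric(s): ${(rate * 100).toFixed(1)}%`);
}
```

Correlated metrics, which are common in practice, would not follow this formula exactly, so treat the result as a rough guide.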

 ### More metrics can help, if they're planned
