docs: Add documentation for sequential testing in experiments (#17226)
* Add sequential testing documentation to frequentist statistics page
* docs(experiments): mention sequential testing as peeking problem solution in best practices
---------
Co-authored-by: inkeep[bot] <257615677+inkeep[bot]@users.noreply.github.com>
A good hypothesis focuses your testing and guides your decision-making. It should include your goal metric, how you think your change will improve it, and any other important context. For example:
@@ -28,33 +29,32 @@ To avoid including unaffected users, make sure to first filter out ineligible us
 ```js
 // ✅ Correct. Will exclude unaffected users
 function showNewChanges(user) {
-
     if (user.hasCompletedAction) {
-        return false
+        return false;
     }

     // other checks

-    if (posthog.getFeatureFlag('experiment-key') === 'control') {
+    if (posthog.getFeatureFlag("experiment-key") === "control") {
         return false;
     }

-    return true
+    return true;
 }

 // ❌ Incorrect. Will include unaffected users
 function showNewChanges(user) {
-    if (posthog.getFeatureFlag('experiment-key') === 'control') {
+    if (posthog.getFeatureFlag("experiment-key") === "control") {
         return false;
     }

     if (user.hasCompletedAction) {
-        return false
+        return false;
     }

     // other checks

-    return true
+    return true;
 }
 ```

@@ -77,10 +77,10 @@ To avoid this problem, you should first test your experiment with a [small rollo
 To do this in PostHog, you can edit the rollout percentage of the experiment feature flag:
 alt="How to edit the rollout percentage of an experiment feature flag"
+classes="rounded"
 />

 Here's a list of what to check during your test rollout:
@@ -99,15 +99,18 @@ Alternatively, if you don't have enough statistical power (i.e., not enough user
 For these reasons, PostHog includes a recommended running time calculator in the experiment setup flow. This calculates the minimum sample size required to run your experiment and the duration you should run your experiment for.
+Alternatively, if you enable [sequential testing](/docs/experiments/statistics-frequentist#sequential-testing-beta), PostHog produces always-valid p-values that stay bounded by alpha no matter how often you check results. This lets you monitor an experiment continuously without inflating false positives. You should still plan your expected sample size upfront so you have enough statistical power to detect the effect you care about.
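For context on the added paragraph, here is a rough simulation sketch of the peeking problem that sequential testing is meant to address. It runs A/A experiments (no true difference between arms) and compares one fixed-horizon look against acting on any interim look. Everything in it is an assumption made for illustration: `simulateAA`, `isSignificant`, the seeded `mulberry32` generator, and the sample sizes are hypothetical, and the plain two-proportion z-test is a stand-in rather than PostHog's statistical engine.

```javascript
// Illustrative simulation only; not PostHog's engine.
// Tiny seeded PRNG (mulberry32) so the run is repeatable.
function mulberry32(seed) {
    let a = seed | 0;
    return function () {
        a = (a + 0x6d2b79f5) | 0;
        let t = a;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Two-sided two-proportion z-test at alpha = 0.05, equal arm sizes.
function isSignificant(convA, convB, n) {
    const pooled = (convA + convB) / (2 * n);
    const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
    if (se === 0) return false; // no variation yet, nothing to test
    const z = (convB / n - convA / n) / se;
    return Math.abs(z) > 1.96;
}

// Both arms share the same true conversion rate, so every
// "significant" result counted here is a false positive.
function simulateAA({ runs, usersPerArm, peekEvery, rate, seed }) {
    const rand = mulberry32(seed);
    let fixedHits = 0;
    let peekingHits = 0;
    for (let r = 0; r < runs; r++) {
        let convA = 0;
        let convB = 0;
        let flaggedAtAnyPeek = false;
        for (let n = 1; n <= usersPerArm; n++) {
            if (rand() < rate) convA++;
            if (rand() < rate) convB++;
            // Interim look: would we have stopped here?
            if (n % peekEvery === 0 && isSignificant(convA, convB, n)) {
                flaggedAtAnyPeek = true;
            }
        }
        const significantAtEnd = isSignificant(convA, convB, usersPerArm);
        if (significantAtEnd) fixedHits++;
        if (flaggedAtAnyPeek || significantAtEnd) peekingHits++;
    }
    return { fixedRate: fixedHits / runs, peekingRate: peekingHits / runs };
}

const result = simulateAA({ runs: 500, usersPerArm: 2000, peekEvery: 100, rate: 0.1, seed: 42 });
console.log(result);
```

With no true effect, the fixed-horizon rate should land near alpha, while the any-peek rate is usually noticeably higher. That gap is what always-valid p-values are designed to close.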
+
 ## 6. Use a launch checklist

 Launching an A/B test requires careful planning to ensure accurate results and meaningful insights. To help you navigate this, we've put together this launch checklist:
+
 ```
 **Before launch**

@@ -130,4 +133,4 @@ Launching an A/B test requires careful planning to ensure accurate results and m
 - [ ] Validate or invalidate your hypothesis based on experiment data.
 - [ ] Document your results and share with your team for feedback.
 - [ ] End experiment and deploy winning variant code. Delete code for losing variant.
contents/docs/experiments/statistics-frequentist.mdx
41 additions & 2 deletions
@@ -44,7 +44,7 @@ The raw event data is aggregated into sufficient statistics for each variant:
     number_of_samples: 1523, // Users exposed to this variant
     sum: 98.5, // Total metric value across all users
     sum_squares: 142.3, // Sum of squared values (for variance)
-
+
     // Additional fields for ratio metrics:
     denominator_sum: 205, // Total denominator value
     denominator_sum_squares: 89.2, // Sum of squared denominator values
@@ -65,13 +65,16 @@ You can also enable [CUPED variance reduction](/docs/experiments/cuped) to reduc
 Before analysis proceeds, the engine validates the data quality based on metric type:

 **All metrics:**
+
 - **Minimum sample size**: Each variant needs at least 50 exposures

 **Funnel metrics only:**
+
 - **Minimum conversions**: At least 5 conversions per variant
 - **Normal approximation validity**: For proportions, both `np ≥ 5` and `n(1-p) ≥ 5` are required for the t-test to be valid

 **Mean and ratio metrics only:**
+
 - **Non-zero baseline**: The control variant must have a non-zero mean (needed for relative difference calculations like "20% increase")

 If any validation fails, the analysis stops and returns appropriate error messages instead of potentially misleading results.
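The validation rules in this hunk can be sketched as a single check. This is only an illustration of the thresholds listed above: `validateVariant`, its arguments, the error strings, and the assumption that `sum` holds the conversion count for funnel metrics are all hypothetical and not taken from PostHog's code.

```javascript
// Hypothetical helper; names, return shape, and messages are illustrative.
function validateVariant(stats, metricType, isControl = false) {
    const errors = [];
    const n = stats.number_of_samples;

    // All metrics: minimum sample size.
    if (n < 50) {
        errors.push("Each variant needs at least 50 exposures");
    }

    if (metricType === "funnel") {
        // Assumption: for funnels, `sum` is the number of conversions.
        const conversions = stats.sum;
        if (conversions < 5) {
            errors.push("At least 5 conversions per variant are required");
        }
        // np equals the conversion count and n(1-p) the non-conversion count,
        // so compare the counts directly to avoid floating-point rounding.
        if (conversions < 5 || n - conversions < 5) {
            errors.push("Normal approximation needs np >= 5 and n(1-p) >= 5");
        }
    } else if (isControl && stats.sum === 0) {
        // Mean and ratio metrics: a zero baseline makes relative differences undefined.
        errors.push("Control variant must have a non-zero mean");
    }

    return errors; // an empty array means the variant passed every check
}

console.log(validateVariant({ number_of_samples: 1523, sum: 98.5 }, "mean", true)); // []
console.log(validateVariant({ number_of_samples: 40, sum: 3 }, "funnel").length); // 3
```

Returning every failed rule, rather than stopping at the first, is one way to surface the "appropriate error messages" the text mentions; the real engine may report them differently.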
@@ -95,31 +98,39 @@ Variance measures the spread or uncertainty in our data. Think of it as quantify
 Variance is calculated from the actual experiment data. When users in your experiment convert at different rates or have wildly different revenue values, that creates variance. The formulas differ by metric type:

 **For funnel metrics**:
+
 ```
 variance = p(1-p)/n
 ```
+
 Where p is the conversion rate and n is the sample size. The variance is highest when p=0.5 (50% conversion rate) and lowest when p is close to 0 or 1.

 **For mean metrics**:
+
 ```
 sample_variance = (sum_squares - sum²/n) / (n-1)
 ```
+
 This measures how spread out individual user values are from the average.
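As a worked illustration of the two formulas, the sketch below computes both variances. The function names `funnelVariance` and `sampleVariance` are made up for this example; the field names and the numbers 1523, 98.5, and 142.3 are the sample values shown in the sufficient-statistics hunk of this file.

```javascript
// Illustrative helpers; the function names are hypothetical.

// Funnel metrics: variance = p(1-p)/n
function funnelVariance(conversions, n) {
    const p = conversions / n;
    return (p * (1 - p)) / n;
}

// Mean metrics: sample_variance = (sum_squares - sum²/n) / (n-1)
function sampleVariance({ number_of_samples: n, sum, sum_squares }) {
    return (sum_squares - (sum * sum) / n) / (n - 1);
}

// Variance peaks at a 50% conversion rate and shrinks toward 0 or 1.
console.log(funnelVariance(50, 100)); // p = 0.5
console.log(funnelVariance(10, 100)); // p = 0.1, smaller

// Using the sample sufficient statistics from the docs.
const variant = { number_of_samples: 1523, sum: 98.5, sum_squares: 142.3 };
console.log(sampleVariance(variant).toFixed(4)); // "0.0893"
```

Working from `sum` and `sum_squares` means the variance can be computed from the aggregates alone, without revisiting individual user values.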
@@ -203,6 +217,30 @@ You can choose between 90%, 95%, or 99% confidence levels, which correspond to a

 To set the default confidence level for all experiments, go to **Experiments > Settings**. You can also override this for individual experiments in the **Statistics** section of the experiment.

+## Sequential testing (beta)
+
+The standard fixed-horizon t-test assumes you check results once at a predetermined sample size. In practice, experimenters check dashboards continuously, which inflates the false positive rate – this is known as the "peeking problem." Sequential testing solves this by producing always-valid p-values and confidence sequences that remain valid no matter how many times you check.
+
+When sequential testing is enabled, PostHog replaces the fixed-horizon t-test with a sequential test based on [Waudby-Smith et al. 2023](https://arxiv.org/abs/2103.06476). Instead of traditional confidence intervals, it produces confidence sequences that are valid at any stopping point. The false positive rate stays bounded by your chosen alpha throughout the entire experiment.
+
+### Trade-offs
+
+The cost of always-valid guarantees is slightly wider confidence intervals compared to the fixed-horizon test. The confidence sequence is narrowest near the tuning parameter sample size and wider at smaller or much larger sample sizes.
+
+Use sequential testing when you plan to monitor results continuously and want to make decisions before a predetermined end date. If you strictly wait for the predetermined duration before checking, the standard fixed-horizon test is more efficient.
+
+### How to enable sequential testing
+
+**Per experiment:** In the experiment's **Statistics** settings, set sequential testing to **Enabled**.
+
+**As a project default:** Go to **Experiments > Settings** and check **Apply sequential testing by default**. Individual experiments can override this default.
+
+### Tuning parameter
+
+The tuning parameter (default: 5,000) controls where along the data-collection timeline the confidence sequence is tightest. Set it close to your expected total sample size for the experiment. A larger value shifts the tightest point to larger sample sizes, while a smaller value optimizes for earlier reads.
+
+You can configure the tuning parameter per experiment when sequential testing is enabled, or set a project-wide default in **Experiments > Settings**.
+
 ## Mathematical formulas reference

 ### T-test with Welch-Satterthwaite degrees of freedom
@@ -239,6 +277,7 @@ Since treatment and control are independent, the covariance term equals zero, si
 ### What about multiple variants?

 When testing multiple variants (A/B/C/D tests):
+
 - Each variant is compared to control independently
 - The p-value and confidence interval are calculated for each variant vs. control
 - No correction for multiple comparisons is applied by default
@@ -248,7 +287,7 @@ When testing multiple variants (A/B/C/D tests):

 At the default significance level (α = 0.05), there's a 5% chance that any single metric shows a significant result when nothing actually changed. The statistical engine itself can produce a false signal from random noise. Think of it like rolling a die: each metric is an independent roll, and each one has a small chance of landing on "significant" by chance.

-With 5 metrics, the chance that *at least one* produces a false positive is about 23%. With 10 metrics, it's 40%. PostHog tests each metric independently and doesn't adjust for this, so each result stands on its own.
+With 5 metrics, the chance that _at least one_ produces a false positive is about 23%. With 10 metrics, it's 40%. PostHog tests each metric independently and doesn't adjust for this, so each result stands on its own.
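The 23% and 40% figures follow from the formula 1 - (1 - α)^k for k independent metrics. A minimal sketch, assuming the metrics are independent as the text describes (the function name is hypothetical):

```javascript
// Chance that at least one of `metricCount` independent metrics
// shows a false positive at significance level `alpha`.
function familyWiseFalsePositiveRate(metricCount, alpha = 0.05) {
    return 1 - Math.pow(1 - alpha, metricCount);
}

console.log(familyWiseFalsePositiveRate(5).toFixed(3)); // "0.226"
console.log(familyWiseFalsePositiveRate(10).toFixed(3)); // "0.401"
```

Real metrics are often correlated, in which case the true rate tends to be lower than this independent-case figure.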