What Is the Peeking Problem?
In a properly designed fixed-horizon A/B test, you calculate the required sample size before launch, run the test until that sample size is reached, and then analyze the results exactly once. The entire statistical framework — your p-value, confidence interval, and significance threshold — assumes this single-analysis structure. Peeking violates that assumption by turning one test into many.
Every time you open the dashboard, check the p-value, and ask yourself “is this significant yet?” — you are conducting an implicit hypothesis test. If the answer is yes and you stop, you have engaged in optional stopping: a decision to halt data collection based on the data itself. This is the statistical equivalent of flipping a coin until you get the outcome you want and then declaring the coin biased.
The Math: How Peeking Inflates False Positives
| Number of Peeks | Nominal α | Actual Type I Error Rate |
|---|---|---|
| 1 | 5% | 5% |
| 2 | 5% | ~8% |
| 5 | 5% | ~14% |
| 10 | 5% | ~19% |
| 20 | 5% | ~25% |
| 30 | 5% | ~30% |
The intuition is straightforward. Early in the test, your sample size is small and your estimates are noisy. The observed conversion rate difference between control and variant swings wildly. These random fluctuations can easily cross the significance threshold — not because there is a real effect, but because there is not yet enough data to dampen the noise. As you accumulate more observations, the signal-to-noise ratio improves and estimates stabilize. But if you already stopped on day 3 because the p-value briefly dipped below 0.05, the damage is done.
The actual inflation is milder than the naive complement formula 1 − (1 − α)^k would suggest, because sequential peeks are correlated — they share overlapping data, so each look is far from an independent test. It is still severe: simulation studies by Johari et al. (2017) showed that continuous monitoring of a fixed-horizon test can inflate the Type I error rate to 5× the nominal level. The degradation is fast early on and then roughly logarithmic: the first few peeks do the most damage.
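The table's figures can be reproduced with a short Monte Carlo sketch (illustrative, not from any cited study; the function name and parameters are ours): simulate a test with no true effect, compute a z-statistic at evenly spaced interim looks, and stop at the first look that crosses the 5% critical value.

```python
import math
import random

random.seed(7)

def peeking_false_positive_rate(n_peeks, n_max=400, n_sims=2000, z_crit=1.96):
    """Fraction of null simulations that ever cross the significance
    threshold when checked at `n_peeks` evenly spaced interim looks."""
    looks = [n_max * (i + 1) // n_peeks for i in range(n_peeks)]
    false_positives = 0
    for _ in range(n_sims):
        s, prev = 0.0, 0
        for look in looks:
            # accumulate the next batch of null observations (true effect = 0)
            s += sum(random.gauss(0.0, 1.0) for _ in range(look - prev))
            prev = look
            if abs(s) / math.sqrt(look) > z_crit:  # "is it significant yet?"
                false_positives += 1
                break  # optional stopping: declare a winner and quit
    return false_positives / n_sims

for k in (1, 5, 10):
    print(k, "peeks:", round(peeking_false_positive_rate(k), 3))
```

With one look the rate stays near the nominal 5%; with ten looks it climbs toward the ~19% shown in the table, despite every individual look using the "correct" 5% threshold.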
Why Everyone Peeks (And Why It’s Rational)
Stakeholders want results. Your head of e-commerce didn’t approve a 4-week test to wait patiently. They want to know if the new checkout flow is working after the first weekend of traffic. Revenue is on the line every day the test runs, and if the variant looks like it’s losing, the pressure to kill it is enormous. Not peeking requires extraordinary discipline from people whose incentives are entirely misaligned with statistical rigor.
The real problem isn’t human weakness — it’s that fixed-horizon testing was designed for clinical trials and agriculture, not for high-velocity digital experimentation. It doesn’t accommodate the legitimate business need for interim decisions. Telling a VP “we can’t look at the data for three more weeks” is technically correct and organizationally untenable. The framework needs to adapt to the environment, not the other way around.
Sequential testing methods — covered in depth in our guide to sequential testing — were designed specifically for this scenario. They let you monitor accumulating data with mathematical guarantees that the overall false positive rate stays at the level you specified. The trade-off is larger required sample sizes, typically 20–30% more than the equivalent fixed-horizon test.
Real-World Consequences of Peeking
Consider a team running 20 tests per quarter — a reasonable cadence for a mid-size e-commerce operation. If they routinely peek and stop early on ‘winners,’ their actual Type I error rate is north of 20%: among tests where the variant truly does nothing, roughly one in five will still be declared a win by chance. On a slate of 20 mostly-null tests, that is about 4 false positives. Four changes that look like wins on paper but deliver zero — or negative — real impact. Those changes get shipped, celebrated in the quarterly review, and baked into revenue projections that will never materialize.
Solutions: How to Monitor Tests Without Peeking
Option 1: Pre-Committed Analysis Schedule
The simplest adjustment: before the test starts, decide exactly when you will analyze the data — for example, at 50% and 100% of the planned sample size. Then apply a Bonferroni correction or similar multiplicity adjustment to maintain the overall α. With two planned analyses, each check uses α/2 = 0.025 instead of 0.05. It’s conservative but simple, and it removes the ambiguity of “how many times did we actually peek?”
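A minimal sketch of that policy (the function name and return convention are ours): with k pre-committed looks, each interim p-value is compared against α/k, and the test stops at the first look that clears the stricter bar.

```python
def precommitted_decision(p_values, alpha=0.05):
    """Apply a Bonferroni-corrected, pre-committed analysis schedule.

    `p_values` holds one p-value per planned look, in order; each is
    tested at alpha / k so the overall Type I error stays <= alpha.
    Returns the 1-based look at which the test stopped, or None.
    """
    k = len(p_values)
    per_look_alpha = alpha / k
    for look, p in enumerate(p_values, start=1):
        if p < per_look_alpha:
            return look  # stop early: significant at the corrected level
    return None  # ran to completion with no significant result

# Two planned looks, so each is tested at 0.05 / 2 = 0.025:
# 0.04 at the midpoint does NOT stop the test; 0.01 at the end does.
print(precommitted_decision([0.04, 0.01]))  # → 2
```

Note the first look's p = 0.04 would have been called "significant" under naive peeking; the corrected schedule correctly waits.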
Option 2: Group Sequential Testing
Group sequential methods — specifically O’Brien-Fleming and Pocock boundaries — use alpha spending functions to distribute the total significance level across pre-planned interim analyses. O’Brien-Fleming boundaries are very stringent early (requiring extreme effects to stop) and converge toward the original α at the final analysis, preserving nearly all statistical power. This is the gold standard for balancing early stopping with validity.
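The contrast between the two boundary families shows up directly in their alpha spending functions. The sketch below uses the commonly quoted Lan-DeMets approximations (treat the exact functional forms as assumptions to check against a reference implementation such as R's gsDesign): O'Brien-Fleming spends almost none of its budget at the halfway point, while Pocock has already spent more than half.

```python
import math
from statistics import NormalDist

_N = NormalDist()

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (0 < t <= 1)
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    z = _N.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _N.cdf(z / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent under the Lan-DeMets Pocock-type function."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)

# Halfway through the test, O'Brien-Fleming has spent well under 1% of
# its 5% budget, Pocock about 3%; both spend the full 5% by t = 1.
print(round(obrien_fleming_spend(0.5), 4))  # ~0.0056
print(round(pocock_spend(0.5), 4))          # ~0.031
```

This is why O'Brien-Fleming preserves nearly all power for the final analysis: an early stop requires an effect extreme enough to clear a boundary funded by a tiny sliver of α.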
Option 3: Always-Valid Confidence Sequences
A newer approach from the sequential analysis literature, always-valid confidence sequences allow continuous monitoring at any point without pre-specifying the number or timing of analyses. The confidence intervals are wider than fixed-horizon equivalents — you pay a premium for the flexibility — but they maintain guaranteed coverage at every point in time. This is particularly useful for organizations that cannot commit to a rigid analysis schedule.
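One standard construction from that literature is a confidence sequence for a normal mean built from a normal-mixture (mSPRT-style) boundary. The formula below and the choice of mixture variance τ² are assumptions for this sketch — a production metric needs the two-sample, unknown-variance version — but it shows the characteristic behavior: wider than the fixed-horizon z-interval at every n, in exchange for validity under continuous monitoring.

```python
import math

def always_valid_halfwidth(n, sigma=1.0, tau=1.0, alpha=0.05):
    """Half-width of a normal-mixture confidence sequence for a mean:
    xbar +/- halfwidth covers the true mean with probability >= 1 - alpha
    simultaneously at *every* sample size n, not just one fixed n."""
    v = sigma ** 2
    return math.sqrt(
        (v * (v + n * tau ** 2)) / (n ** 2 * tau ** 2)
        * math.log((v + n * tau ** 2) / (v * alpha ** 2))
    )

def fixed_horizon_halfwidth(n, sigma=1.0, z=1.959964):
    """Classical z-interval half-width, valid only for a single look at n."""
    return z * sigma / math.sqrt(n)

for n in (100, 1000, 10000):
    print(n, round(always_valid_halfwidth(n), 4),
          round(fixed_horizon_halfwidth(n), 4))
```

The always-valid interval still shrinks toward zero as data accumulates; the premium over the fixed-horizon width is the price of being allowed to look whenever you like.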
Option 4: Lock the Dashboard
The simplest solution requires no statistical sophistication: remove access to real-time results for stakeholders until the test reaches its planned sample size. If nobody can see the data, nobody can make premature stopping decisions. This sounds draconian, but it is the single most effective intervention for teams that lack the statistical infrastructure for sequential methods. Automated email notifications at test completion replace the compulsive dashboard-checking behavior.
How to Detect if Your Program Has a Peeking Problem
- Tests consistently end before planned duration. If your median test length is significantly shorter than the pre-calculated runtime, someone is stopping tests early based on intermediate results.
- Win rate is suspiciously high (>50%). In a rigorous program, most ideas do not produce statistically significant improvements. A win rate above 40–50% often signals that marginal results are being called ‘winners’ prematurely.
- Effect sizes shrink when validated. If you re-run ‘winning’ tests and the uplift is substantially smaller, the original result was likely inflated by early stopping during a favorable fluctuation.
- Shipped changes have no measurable impact. Post-implementation holdback tests or interrupted time-series analysis show that supposedly winning changes made no difference to site-wide metrics.
The most powerful diagnostic is the A/A test. Run a test where control and variant are identical. In a properly calibrated system, you should see significance at the 5% level no more than 5% of the time. If your A/A tests show significance substantially more often, you have a systematic bias — whether from peeking, instrumentation errors, or flawed randomization.
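To judge whether an observed A/A failure rate is actually alarming rather than bad luck, an exact binomial tail probability suffices (a sketch; the helper name is ours): with 60 A/A tests at a well-calibrated 5% level, about 3 significant results are expected, so 10 would be strong evidence of a systematic problem.

```python
import math

def aa_excess_probability(n_tests, n_significant, alpha=0.05):
    """Probability of seeing at least `n_significant` significant results
    across `n_tests` A/A tests if the system is well calibrated, i.e. each
    test independently comes up significant with probability `alpha`."""
    return sum(
        math.comb(n_tests, k) * alpha ** k * (1.0 - alpha) ** (n_tests - k)
        for k in range(n_significant, n_tests + 1)
    )

# 3 significant out of 60 is what a calibrated system routinely produces;
# 10 out of 60 is very unlikely without peeking or a broken pipeline.
print(round(aa_excess_probability(60, 3), 3))
print(round(aa_excess_probability(60, 10), 5))
```

If the tail probability is small, investigate before trusting any other result: the same bias that breaks A/A tests is contaminating every A/B readout.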
