What Is Sequential Testing?
In a standard fixed-horizon A/B test, you calculate the required sample size upfront, run the test until you reach that number, and then analyze the results exactly once. This is clean and statistically valid, but it creates a practical problem: you cannot look at the data before the test is complete without inflating your Type I error rate. Sequential testing solves this by building the possibility of early analysis directly into the statistical framework.
The core idea is that you can analyze data at multiple points — called interim analyses or 'looks' — as long as you adjust your significance thresholds to account for the multiple comparisons. The total probability of a false positive across all analyses remains at your target level (typically 5%), but the threshold at any single analysis is stricter than the standard α = 0.05.
| Property | Fixed-Horizon | Group Sequential | Continuous Sequential |
|---|---|---|---|
| When to analyze | Once, at end | At pre-planned interim points (e.g., 3-5 looks) | At any time during the experiment |
| False positive control | Exact at α (one test) | Exact at α across all looks (alpha spending) | Exact at α continuously (always-valid bounds) |
| Efficiency tradeoff | Most efficient at planned sample size | 5-15% larger max sample size; lower expected sample size for clear effects | Widest confidence intervals; most flexibility |
| Complexity | Low — standard test | Moderate — requires pre-specified analysis schedule | High — requires confidence sequences or mixture boundaries |
| Best for | Tests with no time pressure | Most e-commerce A/B tests | Tests requiring maximum monitoring flexibility |
For the majority of e-commerce A/B tests, group sequential testing with 3-5 planned analyses strikes the best balance between monitoring flexibility and statistical efficiency. Continuous sequential methods exist and are theoretically appealing, but they come with wider confidence intervals that most teams find impractical.
The Problem Sequential Testing Solves
When a test runs for 42 days — the median duration across thousands of experiments in our database — stakeholders inevitably want to know what is happening before the end date. Product managers have roadmap deadlines. Marketing teams have campaigns to launch. Executives want to see progress. The pressure to peek is not irrational; it is a natural consequence of running experiments in a business context where time has real cost.
The problem is that checking a standard fixed-horizon test before it reaches full sample size — and making decisions based on what you see — inflates the false positive rate dramatically. Every time you look at the data and ask 'is this significant yet?', you are performing another statistical test. Without adjustment, checking once per day for 30 days at α = 0.05 can push the real false positive rate above 20%. This is the peeking problem, and it is one of the most common sources of invalid test results in e-commerce optimization.
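The inflation from repeated looks is easy to demonstrate by simulation. The sketch below is a simplified model (it assumes normally distributed, unit-variance daily metric increments and reuses the unadjusted two-sided 1.96 threshold at every look): it runs A/A tests — where no true effect exists — with 30 daily looks and counts how often at least one look appears "significant."

```python
import numpy as np

rng = np.random.default_rng(42)

N_SIMS = 4000   # simulated A/A tests (no true effect exists)
LOOKS = 30      # one uncorrected look per day for 30 days
Z_CRIT = 1.96   # standard two-sided 5% threshold, reused at every look

false_positives = 0
for _ in range(N_SIMS):
    # Each day's metric increment, standardized to unit variance.
    daily = rng.normal(0.0, 1.0, LOOKS)
    # z-statistic at each daily look: cumulative sum scaled by sqrt(n)
    z = np.cumsum(daily) / np.sqrt(np.arange(1, LOOKS + 1))
    if np.any(np.abs(z) > Z_CRIT):  # "significant" at any look -> false positive
        false_positives += 1

rate = false_positives / N_SIMS
print(f"False positive rate with {LOOKS} uncorrected looks: {rate:.3f}")
```

With a single look the rate would be 5%; with 30 daily looks it climbs well past 20%, consistent with the figure cited above.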
Sequential testing reframes the problem. Instead of pretending that teams will not look at data early — which is unrealistic — it builds the monitoring into the statistical design. The result is a test that can be checked at pre-specified points (or continuously, depending on the method) without any inflation of the false positive rate. For a deeper analysis of why uncontrolled peeking is so dangerous, see our article on the peeking problem in A/B testing.
How Alpha Spending Works
The concept behind alpha spending is straightforward: you have a total error budget of 5% (or whatever your target α is), and you decide in advance how to distribute that budget across your planned analyses. At each interim look, you use only a fraction of the total alpha. The spending function determines how much alpha is available at each look, and the cumulative spend never exceeds the original 5%.
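A minimal sketch of how a spending function allocates the budget, using the standard Lan-DeMets closed-form approximations for O'Brien-Fleming-style and Pocock-style spending. Note that the spending function only fixes how much alpha is available at each look; converting the incremental spend into an exact stopping threshold additionally requires the joint distribution of the test statistics, which dedicated software (for example the R packages gsDesign or rpact) handles.

```python
from math import sqrt, log, e
from statistics import NormalDist

ALPHA = 0.05
Z = NormalDist().inv_cdf(1 - ALPHA / 2)   # ~1.96 for a two-sided 5% test

def obf_spend(t):
    """Lan-DeMets O'Brien-Fleming-type spending function (two-sided).

    t is the information fraction (share of planned data observed)."""
    return 2 * (1 - NormalDist().cdf(Z / sqrt(t)))

def pocock_spend(t):
    """Lan-DeMets Pocock-type spending function."""
    return ALPHA * log(1 + (e - 1) * t)

looks = [1 / 3, 2 / 3, 1.0]
for name, fn in [("O'Brien-Fleming", obf_spend), ("Pocock", pocock_spend)]:
    prev = 0.0
    for t in looks:
        cum = fn(t)
        print(f"{name:16s} t={t:.2f}  cumulative={cum:.4f}  incremental={cum - prev:.4f}")
        prev = cum
```

Both functions spend exactly 0.05 by the final look; the O'Brien-Fleming-type function spends almost nothing early, while the Pocock-type function spends much more of the budget at the first look.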
O'Brien-Fleming Boundaries
O'Brien-Fleming boundaries are the most widely used alpha spending approach in practice. They are extremely conservative at early analyses — requiring very strong evidence to stop early — and become progressively more permissive as the test accumulates data. At the final analysis, the boundary is nearly identical to a standard fixed-horizon test. This makes them ideal for e-commerce A/B testing because early data is noisy and unreliable, and premature stopping based on early noise is exactly the failure mode you want to prevent.
In practical terms, an O'Brien-Fleming design with 3 interim analyses might require a p-value below 0.0005 to stop at the first look (after 33% of data), below 0.014 at the second look (after 67% of data), and below 0.045 at the final look (after 100% of data). The cumulative alpha spent across the three looks equals 0.05; the nominal thresholds do not simply add up to α because the test statistics at successive looks are correlated. The early looks are deliberately hard to pass because the estimates at that stage have wide confidence intervals.
Pocock Boundaries
Pocock boundaries take a different approach: they use the same adjusted significance level at every interim analysis. This is conceptually simpler — every look has the same stopping threshold — but it results in a more aggressive early stopping boundary and a less efficient final analysis. Because alpha is distributed equally, each individual look gets less budget than the O'Brien-Fleming final look, meaning the Pocock design requires a larger maximum sample size to achieve the same power.
| Analysis Point | O'Brien-Fleming α | Pocock α | Data Fraction |
|---|---|---|---|
| Look 1 | 0.0005 | 0.022 | 33% |
| Look 2 | 0.014 | 0.022 | 67% |
| Look 3 (Final) | 0.045 | 0.022 | 100% |
The practical difference between the two is meaningful. O'Brien-Fleming designs have a final-analysis boundary close to 0.05, meaning that if the test runs to completion, you lose almost nothing in statistical power compared to a standard fixed-horizon test. Pocock designs sacrifice final-analysis power (the boundary at the final look is 0.022, not 0.05) in exchange for a higher probability of early stopping. For most e-commerce tests, the O'Brien-Fleming tradeoff is superior because the majority of tests will run to completion — and you want full power when they do.
Sequential Testing vs. Always-Valid Inference
Group sequential testing requires you to pre-specify when you will analyze the data — typically 3-5 evenly spaced looks. But what if you want to monitor continuously, checking results after every batch of visitors without any pre-specified schedule? This is the domain of always-valid inference, a newer framework built on confidence sequences and anytime-valid p-values.
Confidence sequences are the sequential analogue of confidence intervals. A standard 95% confidence interval guarantees that the true parameter falls within the interval 95% of the time at one fixed sample size. A 95% confidence sequence guarantees that the true parameter falls within the sequence at every sample size simultaneously. This is a much stronger guarantee, and it is what makes continuous monitoring valid. Foundational work on time-uniform confidence sequences by Howard, Ramdas, and colleagues, together with the always-valid p-value framework developed by Johari and coauthors at Optimizely, has been particularly influential in making these methods accessible to practitioners.
The tradeoff is real and quantifiable. At any given sample size, a confidence sequence is wider than the corresponding group sequential boundary, which is wider than a fixed-horizon confidence interval. You pay for monitoring flexibility with precision. For a test that runs to completion, the always-valid approach will produce a wider confidence interval — and therefore require more data to detect the same effect — than an equivalent group sequential design.
This does not mean always-valid methods are impractical. For high-traffic tests where data accumulates quickly and the cost of waiting for a pre-planned analysis is high, continuous monitoring can deliver decisions days earlier than a group sequential design. The key is understanding the tradeoff: you gain monitoring flexibility but lose some statistical efficiency. For most e-commerce A/B tests running 3-6 weeks, weekly planned analyses provide enough monitoring frequency without the efficiency penalty of continuous monitoring.
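One concrete way to build an anytime-valid test is the mixture sequential probability ratio test (mSPRT) popularized by Johari and coauthors. The sketch below assumes normally distributed observations with known variance and a normal mixing distribution over the effect size (the variance TAU2 is a tuning choice). By Ville's inequality, the probability that the mixture likelihood ratio ever exceeds 1/α under the null is at most α, no matter how often you look.

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA2 = 1.0    # known observation variance
TAU2 = 1.0      # mixing variance over the effect size (a tuning choice)
ALPHA = 0.05
N_SIMS = 5000
N_MAX = 2000    # check the test after every single observation

false_positives = 0
for _ in range(N_SIMS):
    x = rng.normal(0.0, 1.0, N_MAX)   # H0 is true: the mean is exactly 0
    n = np.arange(1, N_MAX + 1)
    s = np.cumsum(x)                  # running sum of observations
    # Mixture likelihood ratio against H0: mu = 0, mixing over N(0, TAU2)
    lam = np.sqrt(SIGMA2 / (SIGMA2 + n * TAU2)) * np.exp(
        TAU2 * s**2 / (2 * SIGMA2 * (SIGMA2 + n * TAU2))
    )
    if np.any(lam >= 1 / ALPHA):      # by Ville's inequality, prob <= alpha
        false_positives += 1

rate = false_positives / N_SIMS
print(f"False positive rate under continuous monitoring: {rate:.4f}")
```

Even with 2,000 looks per test, the false positive rate stays below 5%. The conservatism visible here (the realized rate is typically well under the nominal α) is the same precision cost discussed above: always-valid guarantees are paid for with wider bounds.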
When to Use Sequential Testing
Sequential testing is a tool, not a default. It adds value in specific situations and adds unnecessary complexity in others. The decision to use it should be driven by the operational context of the experiment, not by a blanket policy.
Ideal scenarios for sequential testing
- High-traffic tests: When sample size accumulates quickly, the efficiency penalty of sequential design is small in absolute terms, and early stopping on clear winners saves meaningful time.
- Revenue-critical experiments: When a losing variant is actively costing revenue, the ability to stop early and remove the loser has direct financial value that outweighs the marginal efficiency loss.
- Promotional tests with hard deadlines: Seasonal campaigns, flash sales, and product launches have fixed end dates. Sequential designs let you reach a valid conclusion even if the test is shorter than originally planned.
- Any test where stakeholders will peek anyway: If you know the team will look at results before the test completes — and they will — sequential testing formalizes that behavior and makes it statistically valid rather than pretending it doesn't happen.
When NOT to use sequential testing
- Low-traffic tests: When sample size is already a constraint, the 5-15% increase in maximum sample size required by sequential designs can extend test durations by weeks. The efficiency penalty hurts most when you can least afford it.
- Long-running holdout experiments: Holdout tests measuring the cumulative impact of a program over months are not designed for early stopping. The question they answer — 'what is the long-term effect?' — requires the full observation period by definition.
- Tests where precise effect estimates matter: If the goal is to estimate the exact size of an effect (not just whether it exists), sequential designs produce wider confidence intervals that reduce estimation precision. Fixed-horizon designs are more efficient for point estimation.
How DRIP Implements Sequential Monitoring
Every experiment we run at DRIP includes a sequential monitoring plan designed before the test launches. We use O'Brien-Fleming alpha spending with weekly interim analyses — typically 3-5 looks depending on the expected test duration. The monitoring thresholds are pre-specified in the test plan, and the analysis at each interim look follows the same rigorous protocol as the final analysis. There is no ad hoc peeking; every look is planned and accounted for in the error budget.
The time savings compound in the same way that conversion improvements compound. Every test that stops 12 days early is 12 days that can be allocated to the next experiment in the pipeline. Across a year of testing, this adds up to multiple additional experiments that would not have been possible under a strict fixed-horizon approach. Sequential monitoring does not replace good test design — it supplements it. The foundation is still proper power analysis, clear hypothesis specification, and a pre-registered analysis plan. Sequential methods sit on top of that foundation, adding monitoring flexibility without compromising the rigor underneath.
The critical point is that sequential testing is not a substitute for discipline. It does not excuse launching tests without adequate power calculations, and it does not mean every test should be stopped early. Most of our tests — roughly 75% — run to their full planned duration because the effect is ambiguous enough that early stopping criteria are not met. Sequential monitoring is a safety valve and an efficiency tool, not a philosophy of impatience.
Learn how DRIP's testing methodology works →