Methodology · 13 min read

How to Design Holdout Groups for Experimentation Programs

Individual A/B tests prove whether a single change works. A holdout group proves whether your entire experimentation program is worth the investment. Without one, the cumulative value of your optimization work remains an unverifiable claim.

Fabian Gmeindl · Co-Founder, DRIP Agency · March 13, 2026
This article is part of our series The Complete Guide to A/B Testing for E-Commerce.

A holdout group is a persistent segment of traffic that never receives any winning experiment treatments. By comparing the holdout to the exposed population over months, you can measure the true cumulative impact of your experimentation program — not just the sum of individual test results, which overestimates reality due to interaction effects and decay.

Contents
  1. What Is a Holdout Group and Why Does It Matter?
  2. How Do You Size a Holdout Group Correctly?
  3. How Do You Measure Cumulative Program Impact with a Holdout?
  4. What Are the Most Common Holdout Group Mistakes?
  5. When Should You Refresh or Reset a Holdout Group?
  6. How Does DRIP Use Holdout Groups to Prove Program ROI?

What Is a Holdout Group and Why Does It Matter?

A holdout group is a randomly assigned, persistent segment of users excluded from all winning experiment treatments. It serves as a long-running control for your entire experimentation program, allowing you to measure cumulative impact rather than relying on the sum of individual test results.

Every A/B test produces a point estimate: this change improved conversion by X%. Over the course of a year, a mature program may ship dozens of winning treatments. The intuitive approach — summing the individual uplifts — consistently overestimates the real cumulative impact. Effects interact, some gains decay over time, and external factors shift baselines. The holdout group eliminates this problem.

The concept is simple. At the start of a measurement period, you randomly assign a small percentage of your traffic to a holdout group. These users never receive any winning treatments from your experimentation program. Everyone else — the exposed group — receives all deployed winners as normal. After months of accumulated changes, you compare the holdout against the exposed group. The difference is your program's true incremental contribution.

15-30% | Typical overestimate | Summing individual test results vs. holdout-measured cumulative impact
5-10% | Recommended holdout size | Balances measurement precision with revenue opportunity cost
6-12 months | Minimum holdout duration | Required to capture meaningful cumulative divergence
DRIP Insight
Across DRIP's client programs, the sum of individual A/B test results typically overestimates cumulative impact by 15-30%. Holdout measurement closes the gap between what your testing dashboard reports and what actually hits the P&L.

This matters for governance. When the C-suite asks whether the experimentation budget is generating returns, individual test results are unconvincing because they cannot account for interaction effects between simultaneously deployed changes. A holdout-measured cumulative impact figure is the only honest answer. It is the difference between saying 'we shipped 40 winners that individually summed to +12%' and saying 'our holdout confirms the program delivered +8.4% incremental revenue.'

Holdout groups are not the same as the control group in an individual test. A test control sees no change for the duration of one experiment. A holdout sees no changes for the duration of the entire program — potentially spanning months and dozens of shipped treatments. This distinction is critical and frequently confused.

How Do You Size a Holdout Group Correctly?

Size your holdout between 5% and 10% of traffic. Smaller holdouts reduce revenue opportunity cost but require longer observation periods for statistical significance. The right size depends on your site's traffic volume and the precision you need for cumulative impact estimates.

Holdout sizing is a trade-off between measurement precision and opportunity cost. Every user in the holdout group is a user who does not benefit from your winning changes. Size the holdout too large and you leave significant revenue on the table. Size it too small and the cumulative impact estimate will have confidence intervals so wide they are useless for decision-making.

Holdout size trade-offs by monthly unique visitors
Monthly uniques | Recommended holdout | Detectable cumulative lift (95% CI, 6 months) | Annual revenue opportunity cost at 10% program lift
100K-500K | 10% | ~2-3% absolute | ~1% of revenue
500K-2M | 5-8% | ~1-2% absolute | ~0.5-0.8% of revenue
2M-10M | 5% | ~0.5-1% absolute | ~0.5% of revenue
10M+ | 3-5% | ~0.3-0.5% absolute | ~0.3-0.5% of revenue

The detectable cumulative lift column is what matters most. If your experimentation program is generating less than 2% cumulative lift over six months, a 10% holdout on a 200K-visitor site will not produce a statistically significant result. You either need more time, more traffic, or a larger holdout — each with its own cost.
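As a rough sanity check on the table above, the detectable difference for a given traffic level can be approximated with the standard two-sample formula. This is a sketch under assumed parameters — a 3% baseline conversion rate, two-sided alpha of 0.05, and 80% power — and the function name and defaults are illustrative; real planning should use your own baseline and primary metric.

```python
import math

def detectable_lift(monthly_uniques: int, holdout_share: float,
                    months: int = 6, base_cr: float = 0.03) -> float:
    """Approximate minimum detectable absolute difference in conversion rate
    for a holdout-vs-exposed comparison (two-sided alpha=0.05, 80% power)."""
    z_alpha, z_power = 1.96, 0.84          # standard normal quantiles
    n = monthly_uniques * months
    n_holdout = n * holdout_share
    n_exposed = n * (1 - holdout_share)
    se = math.sqrt(base_cr * (1 - base_cr) * (1 / n_holdout + 1 / n_exposed))
    return (z_alpha + z_power) * se
```

Under these assumptions, a 200K-visitor site with a 10% holdout over six months can detect roughly a 0.15-percentage-point difference on a 3% baseline, i.e. about a 5% relative lift; the exact numbers depend heavily on the metric and baseline you choose.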

Common Mistake
Never size your holdout below 3%. The resulting confidence intervals will be so wide that even a strong experimentation program cannot produce a statistically significant cumulative result within a reasonable timeframe. You will have paid the cost of withholding treatments without getting actionable measurement in return.

The Revenue Opportunity Cost Calculation

To quantify opportunity cost: if your experimentation program delivers 10% incremental lift and you hold out 5% of traffic, that 5% misses the 10% improvement. The cost is 0.5% of total revenue — typically a fraction of a single developer's salary and well worth the measurement certainty. Frame the holdout cost this way when seeking stakeholder buy-in: the holdout costs less than one headcount and produces the only trustworthy measure of the entire team's contribution.
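That arithmetic can be captured in a one-line helper; the function name is illustrative, and the calculation mirrors the simple approximation used above (lift forgone times share of traffic withheld):

```python
def holdout_opportunity_cost(program_lift: float, holdout_share: float) -> float:
    """Approximate fraction of total revenue forgone by the holdout.

    The holdout misses the program's incremental lift, so the cost is
    roughly (lift forgone) x (share of traffic withheld).
    """
    return program_lift * holdout_share

# Example from the text: 10% program lift, 5% holdout -> about 0.5% of revenue.
cost = holdout_opportunity_cost(0.10, 0.05)
```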

Randomization and Assignment Persistence

Holdout assignment must be persistent at the user level, not the session level. Use a deterministic hash of the user identifier (cookie ID, logged-in user ID, or device fingerprint) modulo 100. A user assigned to the holdout at the start of the period stays in the holdout for the entire measurement window. Session-level randomization introduces contamination: the same user sees winning treatments in some sessions and the unmodified experience in others, destroying the comparison.

  1. Generate a deterministic hash from the user's persistent identifier.
  2. Map the hash to a value between 0 and 99.
  3. Assign values 0 through N-1 to the holdout (where N is your holdout percentage).
  4. Store nothing — the assignment is recomputed on every request, ensuring consistency without state management.
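The four steps above can be sketched in Python. SHA-256 and the 5% constant are illustrative choices; any stable hash of a persistent identifier works, as long as the same input always lands in the same bucket.

```python
import hashlib

HOLDOUT_PCT = 5  # assumed holdout size in percent

def holdout_bucket(user_id: str) -> int:
    """Map a persistent user identifier to a stable bucket in [0, 99]."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def in_holdout(user_id: str) -> bool:
    """Buckets 0..HOLDOUT_PCT-1 form the holdout.

    Nothing is stored: the assignment is recomputed on every request,
    so the same user is consistently held out with no state management.
    """
    return holdout_bucket(user_id) < HOLDOUT_PCT
```

Because the hash is deterministic, a user's assignment survives server restarts and works identically across services that share the same identifier.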

How Do You Measure Cumulative Program Impact with a Holdout?

Compare the primary metric (revenue per visitor, conversion rate) between the holdout and exposed groups over the full measurement period. Use a two-sample z-test or t-test with the full observation window. The difference is your program's true incremental contribution, net of interaction effects and decay.

The measurement framework is a two-sample comparison, no different from a standard A/B test — except the 'treatment' is months of accumulated experimentation work. Compute revenue per visitor (or your primary success metric) for the holdout and the exposed group across the entire holdout period. Run a standard two-sample test for statistical significance.
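A minimal sketch of that comparison, assuming you can export per-visitor revenue for both groups (function and variable names are illustrative; with holdout-scale samples the normal approximation is standard):

```python
import math
import statistics

def cumulative_impact(holdout_rpv: list[float], exposed_rpv: list[float]):
    """Two-sample z-test on revenue per visitor for holdout vs. exposed.

    Returns the absolute difference in means, its standard error, and the
    z statistic. A 95% confidence interval is diff +/- 1.96 * se.
    """
    m_h, m_e = statistics.fmean(holdout_rpv), statistics.fmean(exposed_rpv)
    v_h, v_e = statistics.variance(holdout_rpv), statistics.variance(exposed_rpv)
    se = math.sqrt(v_h / len(holdout_rpv) + v_e / len(exposed_rpv))
    diff = m_e - m_h  # positive means the program added revenue per visitor
    return diff, se, diff / se
```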

Choosing the Right Primary Metric

Revenue per visitor is almost always the right primary metric for holdout analysis in e-commerce. Conversion rate alone misses AOV effects. Gross margin per visitor is ideal if the data is clean. Whichever metric you choose, define it before the holdout begins — not after you see the data. Post-hoc metric selection introduces bias that invalidates the analysis.

Pro Tip
Define your primary holdout metric and success criteria before the measurement period begins. Pre-registering the analysis plan prevents data dredging and gives the results credibility with finance stakeholders who are rightly skeptical of post-hoc metric shopping.

Handling Seasonality and External Factors

Because the holdout and exposed groups are measured over identical time periods, seasonality cancels out. Both groups experience the same Black Friday, the same promotional calendars, the same traffic fluctuations. This is the core advantage of a concurrent holdout over historical comparisons — confounding factors affect both groups equally, leaving only the treatment effect.

The one exception is novelty and learning effects. If a winning treatment is particularly novel (e.g., a dramatically different checkout flow), the exposed group may show an initial spike that fades. The holdout comparison captures this decay naturally — it measures the sustained effect, not the launch-week spike. This is a feature, not a bug.

Holdout analysis vs. summing individual test results
Approach | Accounts for interaction effects | Accounts for effect decay | Accounts for external factors | Credibility for C-suite reporting
Sum of individual tests | No | No | No | Low
Holdout group comparison | Yes | Yes | Yes (concurrent measurement) | High
Pre/post historical comparison | Yes | Yes | No (confounded by external changes) | Medium
Revenue/Visitor | Best primary metric | Captures both CR and AOV effects of cumulative changes
Pre-register | Define metrics before measurement | Prevents post-hoc metric shopping that erodes credibility

What Are the Most Common Holdout Group Mistakes?

The most damaging mistakes are contamination (holdout users inadvertently receiving treatments), insufficient duration (releasing the holdout before cumulative divergence reaches significance), and applying holdout logic to individual tests instead of the program as a whole.

Holdout groups are conceptually simple but operationally fragile. A single implementation error can silently invalidate months of measurement. Below are the failure modes we encounter most frequently when auditing client programs.

Mistake 1: Contamination Through Leaky Implementation

Contamination occurs when holdout users receive one or more winning treatments. This typically happens when a new feature deployment bypasses the holdout logic, or when server-side holdout assignment is not integrated with the experimentation platform. Even 5% contamination of the holdout group meaningfully attenuates the measured cumulative effect, biasing the result toward zero.

  • Audit every deployment pipeline to confirm holdout exclusion is enforced at the code level, not just the testing tool level.
  • Log holdout assignment alongside treatment exposure to detect contamination retroactively.
  • Run weekly contamination checks: for each holdout user, verify zero treatment exposures in your event logs.
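The weekly contamination check in the last bullet reduces to a set intersection between holdout membership and treatment-exposure events. This is a sketch; the log format (user ID, treatment ID pairs pulled from your analytics warehouse) is an assumption.

```python
def contamination_rate(holdout_users: set[str],
                       exposure_log: list[tuple[str, str]]) -> float:
    """Share of holdout users with at least one treatment exposure.

    Any value meaningfully above zero indicates a leaky implementation
    that attenuates the measured cumulative effect toward zero.
    """
    exposed = {user for user, _ in exposure_log}
    contaminated = holdout_users & exposed
    return len(contaminated) / len(holdout_users) if holdout_users else 0.0
```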

Mistake 2: Releasing the Holdout Too Early

Stakeholders often push to release the holdout once a 'sufficient' number of tests have been shipped. But the holdout's value comes from measuring the cumulative, compounding effect over time. Releasing after three months when the program has only shipped five treatments rarely produces a significant result — and a non-significant holdout result is frequently misinterpreted as evidence that the program is not working.

Counterintuitive Finding
A non-significant holdout result does not mean your experimentation program has no impact. It may mean the holdout was too small, the duration too short, or the program has not yet shipped enough cumulative changes. Do not confuse absence of evidence with evidence of absence — this is a textbook Type II error.

Mistake 3: Confusing Holdouts with Test-Level Controls

A holdout group measures program-level impact. It is not a replacement for proper control groups in individual A/B tests. Every experiment should still have its own control group for that specific treatment. The holdout sits above the individual test layer — it withholds all winning treatments, not just one.

Mistake 4: Ignoring Sample Ratio Mismatch

Just like individual tests, holdout groups are vulnerable to sample ratio mismatch (SRM). If your holdout is configured at 5% but the actual allocation drifts to 4.2%, the assignment mechanism is broken and the comparison is unreliable. Monitor the holdout/exposed split weekly and investigate any deviation greater than 0.5 percentage points from the expected ratio.
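The fixed 0.5-percentage-point rule can be complemented with a one-proportion z-test, so the alert threshold scales with traffic volume. This is a sketch: the function name and the z = 3 alert threshold are assumptions, chosen high to avoid false alarms on routine weekly checks.

```python
import math

def srm_check(holdout_n: int, total_n: int, expected_share: float,
              z_threshold: float = 3.0) -> tuple[float, bool]:
    """Compare the observed holdout share to the configured share.

    Returns (observed share, flagged). A flagged result means the
    assignment mechanism is likely broken and the comparison unreliable.
    """
    observed = holdout_n / total_n
    se = math.sqrt(expected_share * (1 - expected_share) / total_n)
    z = (observed - expected_share) / se
    return observed, abs(z) > z_threshold
```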

When Should You Refresh or Reset a Holdout Group?

Refresh holdout groups every 6 to 12 months, or when the cumulative impact is clearly significant and stakeholders have the data they need. Refreshing rotates which users are in the holdout, preventing a permanent 'have-not' segment while restarting the measurement clock for the next cycle.

A holdout is not permanent. It is a measurement instrument with a defined lifecycle. Keeping the same holdout indefinitely creates two problems: the holdout users accumulate a growing disadvantage (ethically questionable in some contexts), and the holdout measurement eventually saturates — once cumulative divergence is clearly significant, additional observation adds diminishing precision.

The Refresh Cycle

We recommend a 6-to-12-month holdout cycle for most e-commerce programs. At the end of each cycle: record the cumulative impact, release all holdout users (deploying all accumulated winners to them), randomly assign a new holdout group, and begin the next measurement period from a fresh baseline.

  1. Record cumulative impact from the completing cycle (confidence interval, statistical significance, primary metric delta).
  2. Deploy all accumulated winning treatments to the outgoing holdout users.
  3. Randomly reassign a new holdout group of the same size from the full user population.
  4. Reset the measurement clock — the new cycle starts with zero cumulative divergence.
  5. Document the cycle boundary for future analysis so that inter-cycle trends can be tracked.
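One way to implement step 3 without storing assignments is to salt the deterministic hash with a cycle identifier, so each refresh draws a fresh random holdout from the full population. The salt format and function names here are illustrative.

```python
import hashlib

def cycle_bucket(user_id: str, cycle_id: str) -> int:
    """Salting the hash with a cycle identifier reshuffles the buckets,
    so each refresh is an independent random draw."""
    digest = hashlib.sha256(f"{cycle_id}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def in_cycle_holdout(user_id: str, cycle_id: str, pct: int = 5) -> bool:
    """Deterministic within a cycle; membership rotates across cycles."""
    return cycle_bucket(user_id, cycle_id) < pct
```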
Pro Tip
Align holdout refresh cycles with your business planning calendar. If leadership reviews experimentation ROI quarterly, use a 6-month holdout that produces results midway through the fiscal year — giving you a clean number for the annual review and time to course-correct if needed.

Can You Run Overlapping Holdout Cycles?

Yes, but the complexity is rarely justified. Staggered holdouts (starting a new holdout before the previous one ends) provide more frequent measurement points but double the traffic withheld from treatments. For most programs, sequential non-overlapping cycles provide sufficient measurement cadence without the implementation overhead.

Holdout refresh strategies compared
Strategy | Traffic withheld | Measurement frequency | Implementation complexity | Recommended for
Single sequential cycle (6-12 months) | 5-10% | Every 6-12 months | Low | Most programs
Overlapping staggered cycles | 10-15% | Every 3-6 months | High | High-traffic enterprise programs
Rolling quarterly refresh | 5-10% | Quarterly | Medium | Programs needing frequent reporting

How Does DRIP Use Holdout Groups to Prove Program ROI?

DRIP implements holdout groups for every client engagement as a standard practice. The holdout-measured cumulative impact — not the sum of individual test results — is the number we report to leadership teams, because it is the only figure that survives scrutiny.

We treat holdout measurement as non-negotiable infrastructure for any program at scale. When a client asks whether their experimentation investment is generating returns, the holdout gives us a single, defensible number: the real difference in revenue per visitor between users who received all accumulated improvements and users who received none.

4,000+ | Experiments across DRIP programs | Long-running holdout data across 90+ e-commerce brands
5-10% | Standard DRIP holdout size | Adjusted per client based on traffic volume
6 months | Default measurement cycle | Extended to 12 months for lower-traffic brands

Our Standard Holdout Protocol

  1. At engagement kickoff, configure a 5-10% persistent holdout group using deterministic hashing on the client's user identifier.
  2. Integrate holdout exclusion into the deployment pipeline — not just the testing tool — to prevent contamination from direct code deployments.
  3. Run automated weekly contamination and SRM checks; alert the team if either exceeds tolerance thresholds.
  4. After each 6-month cycle, compute cumulative impact with 95% confidence intervals and present the result to the client's leadership team.
  5. Refresh the holdout group by deploying accumulated winners to outgoing holdout users and randomly assigning a new holdout from the full population.
DRIP Insight
The holdout result is frequently the single most important number in a client QBR. It answers the only question leadership actually cares about: is this program generating more money than it costs? Individual test results are interesting. The holdout result is conclusive.

This approach also disciplines the program itself. When you know the holdout will be evaluated, you resist the temptation to ship marginal winners that individually reached significance but are unlikely to contribute meaningfully to cumulative impact. The holdout creates accountability — and accountability improves experimentation ROI over time.

For teams building their measurement practice, we recommend starting with the fundamentals: ensure your individual tests are properly powered with correct sample sizes. Once the test-level methodology is sound, the holdout layer becomes the capstone that ties everything together into a provable business case.

See how DRIP measures experimentation ROI for your brand →


Frequently Asked Questions

What is a holdout group?

A holdout group is a persistent segment of users (typically 5-10% of traffic) who are excluded from all winning experiment treatments. By comparing them to users who receive all deployed winners, you can measure the true cumulative impact of your entire experimentation program — not just individual test results.

How large should a holdout group be?

Between 5% and 10% of traffic for most e-commerce sites. Higher-traffic sites can use smaller holdouts (3-5%) while maintaining statistical precision. The key trade-off is measurement accuracy versus the revenue opportunity cost of withholding treatments from holdout users.

How long does a holdout group need to run?

A minimum of 6 months for most programs, extending to 12 months for lower-traffic sites. The holdout needs sufficient time for cumulative divergence between the holdout and exposed groups to reach statistical significance. Releasing the holdout too early often produces inconclusive results.

How is a holdout group different from a control group?

A control group exists within a single A/B test and sees no change for that specific experiment. A holdout group spans the entire experimentation program and receives none of the winning treatments shipped over months. The holdout measures program-level cumulative impact; the control measures a single treatment's effect.

