What Is CUPED?
CUPED was developed at Microsoft by Deng et al. in 2013 and has since become the default variance reduction method at Netflix, Booking.com, Airbnb, and most major experimentation platforms. The acronym stands for Controlled-experiment Using Pre-Experiment Data, which describes exactly what it does: it uses what you already know about each user to sharpen your measurement of what happens during the experiment.
The core idea is intuitive. If a user spent €100 on your site last month, their expected spend this month is meaningfully higher than someone who spent €10. That difference in baseline behavior is predictable variation -- and predictable variation is noise you can remove. CUPED subtracts this predictable component from each user's outcome, leaving only the variation that could plausibly be caused by your treatment.
How CUPED Works (The Math, Simplified)
The CUPED adjustment is a single formula: Y_adjusted = Y - θ × (X - E[X]), where Y is the observed outcome during the experiment and X is the pre-experiment covariate (e.g., the same metric measured before the test started). E[X] is the population mean of the covariate, and θ is a coefficient that controls how much adjustment to apply.
The coefficient θ is chosen to minimize the variance of Y_adjusted. The optimal value is Cov(X, Y) / Var(X) -- the familiar regression coefficient from ordinary least squares. This is not a coincidence. CUPED is mathematically equivalent to regressing the outcome on the covariate and analyzing the residuals. The adjustment removes exactly the portion of outcome variance that is linearly predictable from pre-experiment behavior.
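The whole adjustment fits in a few lines of NumPy. This is an illustrative sketch with simulated data (the function name, distributions, and coefficients are invented for the example), not a reference implementation:

```python
import numpy as np

def cuped_adjust(y, x):
    """Apply the CUPED adjustment Y - theta * (X - mean(X)).

    y : outcome measured during the experiment
    x : pre-experiment covariate for the same users
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(X,Y)/Var(X)
    return y - theta * (x - x.mean())

# Simulated users whose test-period spend correlates with pre-period spend.
rng = np.random.default_rng(42)
x = rng.gamma(shape=2.0, scale=50.0, size=10_000)   # pre-period revenue
y = 0.8 * x + rng.normal(0, 40, size=10_000)        # test-period revenue

y_adj = cuped_adjust(y, x)
print(y.mean(), y_adj.mean())   # means agree: the adjustment is unbiased
print(y.var(), y_adj.var())     # adjusted variance is substantially smaller
```

Because X is centered by its mean, the adjustment shifts no user's expected value on average; it only cancels the predictable spread around the mean.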
| Covariate R² with Outcome | Variance Reduction | Equivalent Sample Size Increase |
|---|---|---|
| R² = 0.1 | 10% | ~11% larger |
| R² = 0.2 | 20% | ~25% larger |
| R² = 0.3 | 30% | ~43% larger |
| R² = 0.4 | 40% | ~67% larger |
| R² = 0.5 | 50% | ~100% larger (2x) |
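The last column follows from a standard identity: cutting variance by a fraction r shrinks standard errors as if the sample were 1 / (1 - r) times larger. A quick check (the function name is illustrative):

```python
# A variance reduction of r is equivalent to multiplying the sample size
# by 1 / (1 - r), which is where the table's figures come from.
def equivalent_sample_multiplier(variance_reduction: float) -> float:
    return 1.0 / (1.0 - variance_reduction)

for r2 in (0.1, 0.2, 0.3, 0.4, 0.5):
    m = equivalent_sample_multiplier(r2)
    print(f"R² = {r2:.1f} -> {m - 1:+.0%} effective sample size")
```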
Choosing the Right Covariate
Same-Metric Covariates
The strongest predictor of how a user will behave during an experiment is how they behaved before it. Pre-period revenue predicts test-period revenue. Pre-period visit frequency predicts test-period visit frequency. Pre-period conversion rate predicts test-period conversion rate. This consistency is what makes CUPED effective -- user behavior is sticky, and that stickiness is exploitable signal.
In practice, using the same metric as both covariate and outcome consistently yields the highest R² values. A user who converted twice in the past 14 days is far more likely to convert during your experiment than a user who visited once and bounced. By accounting for this difference, you remove noise without introducing bias.
Cross-Metric Covariates
When the same metric is unavailable or has low variance in the pre-period, cross-metric covariates can fill the gap. Page views can predict conversion (more engaged users convert more). Historical average order value can predict revenue per visitor. Session depth can predict add-to-cart rate. These cross-metric covariates are typically weaker predictors but still provide meaningful variance reduction.
- Same metric, pre-period (best) -- e.g., pre-experiment RPV to predict test-period RPV. Typically R² = 0.3-0.5.
- Related metric, pre-period -- e.g., pre-experiment page views to predict test-period conversion. R² = 0.1-0.3.
- Session count -- number of pre-period sessions captures engagement level. R² = 0.05-0.2.
- User tenure -- days since first visit. Weakest standalone predictor but can supplement other covariates. R² = 0.02-0.1.
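Several weak covariates can also be combined. Since CUPED is equivalent to regressing the outcome on the covariate, the multi-covariate version is just a regression on all centered pre-period features, keeping the residuals. A minimal NumPy sketch with simulated data (all names, distributions, and coefficients are invented for the example):

```python
import numpy as np

def cuped_multi(y, covariates):
    """Multi-covariate CUPED: regress Y on centered pre-period covariates
    and subtract the fitted component, leaving the residual variation.

    covariates : 2-D array, one column per pre-period metric
    """
    X = covariates - covariates.mean(axis=0)            # center each covariate
    theta, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return y - X @ theta                                # adjusted outcome

rng = np.random.default_rng(0)
n = 5_000
sessions = rng.poisson(5, n).astype(float)              # pre-period sessions
tenure = rng.exponential(200.0, n)                      # days since first visit
y = 2.0 * sessions + 0.01 * tenure + rng.normal(0, 3, n)

y_adj = cuped_multi(y, np.column_stack([sessions, tenure]))
print(y.var(), y_adj.var())   # variance drops; the mean is unchanged
```

Because each covariate column is centered, the adjusted metric keeps the same mean as the raw one, so treatment-effect estimates remain unbiased.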
CUPED in E-Commerce: Practical Impact
E-commerce is one of the best domains for CUPED because purchase behavior is highly repetitive. A user's spending pattern over the past two weeks is a strong predictor of their spending in the next two weeks. This gives CUPED a high-quality covariate to work with, and the resulting variance reduction directly translates to shorter experiments.
| Metric | Typical R² with Pre-Period | Variance Reduction | Test Duration Impact |
|---|---|---|---|
| Revenue per visitor | 0.30 - 0.50 | 30-50% | Tests run 30-50% shorter |
| Conversion rate | 0.15 - 0.30 | 15-30% | Tests run 15-30% shorter |
| Average order value | 0.20 - 0.40 | 20-40% | Tests run 20-40% shorter |
| Pages per session | 0.40 - 0.60 | 40-60% | Tests run 40-60% shorter |
Limitations and Pitfalls
- No benefit for new visitors -- users with no browsing or purchase history have no covariate to adjust on. CUPED simply cannot reduce variance for these users.
- Requires a pre-experiment data window -- you need at least 1-2 weeks of pre-experiment behavior logged per user. If your analytics pipeline doesn't track user-level metrics, CUPED is not implementable.
- Assumes a linear relationship -- the standard CUPED adjustment is a linear regression. If the relationship between pre-period and test-period behavior is nonlinear, you leave variance reduction on the table.
- Less effective for binary metrics -- conversion rate (0 or 1) has inherently less variance to exploit than continuous metrics like revenue. Pre-period conversion is a weaker predictor of test-period conversion than pre-period revenue is of test-period revenue.
The new visitor problem deserves special attention. CUPED can only adjust outcomes for users who have pre-experiment data. In a typical e-commerce context, 30-60% of traffic comes from new visitors who have never been to the site before. For these users there is no historical behavior to leverage: new visitors keep their raw metric, while returning visitors get the adjusted one.
This means the effective variance reduction for your overall test population is lower than the theoretical maximum. If 50% of your traffic is new and CUPED achieves 40% variance reduction for returning visitors, the population-level reduction is roughly 20%. Still meaningful, but less dramatic than the headline numbers suggest.
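The blended figure is just a traffic-weighted average, under the simplifying assumption that per-user variance is similar for new and returning visitors (the function name is illustrative):

```python
def population_variance_reduction(share_new: float,
                                  returning_reduction: float) -> float:
    """New visitors contribute zero reduction (no covariate to adjust on);
    returning visitors contribute the full reduction. Assumes roughly equal
    per-user variance in both groups, so a traffic-weighted average applies."""
    return (1.0 - share_new) * returning_reduction

# 50% new traffic, 40% reduction for returning users -> the ~20% figure above
print(population_variance_reduction(0.5, 0.4))
```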
Implementing CUPED in Your Testing Stack
Major experimentation platforms have adopted variance reduction as a built-in feature. Statsig and Eppo implement CUPED by default for all experiments with sufficient pre-experiment data. (Optimizely's Stats Accelerator shares the goal of shortening tests, but it is a sequential-testing and bandit feature rather than CUPED.) If your platform has built-in CUPED, it is already working in the background.
Most Shopify-focused A/B testing tools -- including Shoplift, Intelligems, and ABlyft -- do not support CUPED natively. This is a meaningful gap for e-commerce teams running revenue-based experiments. Without variance reduction, these tools require longer test durations to achieve the same statistical power, particularly for revenue per visitor metrics where variance is highest.
