What Is CUPED?
CUPED was developed at Microsoft by Deng et al. in 2013 and has since become the default variance reduction method at Netflix, Booking.com, Airbnb, and most major experimentation platforms. The acronym stands for Controlled-experiment Using Pre-Experiment Data, which describes exactly what it does: it uses what you already know about each user to sharpen your measurement of what happens during the experiment.
The core idea is intuitive. If a user spent €100 on your site last month, their expected spend this month is meaningfully higher than someone who spent €10. That difference in baseline behavior is predictable variation -- and predictable variation is noise you can remove. CUPED subtracts this predictable component from each user's outcome, leaving only the variation that could plausibly be caused by your treatment.
How CUPED Works (The Math, Simplified)
The CUPED adjustment is a single formula: Y_adjusted = Y - θ × (X - E[X]), where Y is the observed outcome during the experiment and X is the pre-experiment covariate (e.g., the same metric measured before the test started). E[X] is the population mean of the covariate, and θ is a coefficient that controls how much adjustment to apply.
The coefficient θ is chosen to minimize the variance of Y_adjusted. The optimal value is Cov(X, Y) / Var(X) -- the familiar regression coefficient from ordinary least squares. This is not a coincidence. CUPED is mathematically equivalent to regressing the outcome on the covariate and analyzing the residuals. The adjustment removes exactly the portion of outcome variance that is linearly predictable from pre-experiment behavior.
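The whole adjustment fits in a few lines of NumPy. This is an illustrative sketch with simulated data (the function name, distributions, and coefficients are invented for the example), not a reference implementation:

```python
import numpy as np

def cuped_adjust(y, x):
    """Apply the CUPED adjustment Y - theta * (X - mean(X)).

    y : outcome measured during the experiment
    x : pre-experiment covariate for the same users
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(X,Y)/Var(X)
    return y - theta * (x - x.mean())

# Simulated users whose test-period spend correlates with pre-period spend.
rng = np.random.default_rng(42)
x = rng.gamma(shape=2.0, scale=50.0, size=10_000)   # pre-period revenue
y = 0.8 * x + rng.normal(0, 40, size=10_000)        # test-period revenue

y_adj = cuped_adjust(y, x)
print(y.mean(), y_adj.mean())   # means agree: the adjustment is unbiased
print(y.var(), y_adj.var())     # adjusted variance is substantially smaller
```

Because X is centered by its mean, the adjustment shifts no user's expected value on average; it only cancels the predictable spread around the mean.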
| Covariate R² with Outcome | Variance Reduction | Equivalent Sample Size Increase |
|---|---|---|
| R² = 0.1 | 10% | ~11% larger |
| R² = 0.2 | 20% | ~25% larger |
| R² = 0.3 | 30% | ~43% larger |
| R² = 0.4 | 40% | ~67% larger |
| R² = 0.5 | 50% | ~100% larger (2x) |
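The last column follows from a standard identity: cutting variance by a fraction r shrinks standard errors as if the sample were 1 / (1 - r) times larger. A quick check (the function name is illustrative):

```python
# A variance reduction of r is equivalent to multiplying the sample size
# by 1 / (1 - r), which is where the table's figures come from.
def equivalent_sample_multiplier(variance_reduction: float) -> float:
    return 1.0 / (1.0 - variance_reduction)

for r2 in (0.1, 0.2, 0.3, 0.4, 0.5):
    m = equivalent_sample_multiplier(r2)
    print(f"R² = {r2:.1f} -> {m - 1:+.0%} effective sample size")
```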
Choosing the Right Covariate
Same-Metric Covariates
The strongest predictor of how a user will behave during an experiment is how they behaved before it. Pre-period revenue predicts test-period revenue. Pre-period visit frequency predicts test-period visit frequency. Pre-period conversion rate predicts test-period conversion rate. This consistency is what makes CUPED effective -- user behavior is sticky, and that stickiness is exploitable signal.
In practice, using the same metric as both covariate and outcome consistently yields the highest R² values. A user who converted twice in the past 14 days is far more likely to convert during your experiment than a user who visited once and bounced. By accounting for this difference, you remove noise without introducing bias.
Cross-Metric Covariates
When the same metric is unavailable or has low variance in the pre-period, cross-metric covariates can fill the gap. Page views can predict conversion (more engaged users convert more). Historical average order value can predict revenue per visitor. Session depth can predict add-to-cart rate. These cross-metric covariates are typically weaker predictors but still provide meaningful variance reduction.
- Same metric, pre-period (best) -- e.g., pre-experiment RPV to predict test-period RPV. Typically R² = 0.3-0.5.
- Related metric, pre-period -- e.g., pre-experiment page views to predict test-period conversion. R² = 0.1-0.3.
- Session count -- number of pre-period sessions captures engagement level. R² = 0.05-0.2.
- User tenure -- days since first visit. Weakest standalone predictor but can supplement other covariates. R² = 0.02-0.1.
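Several weak covariates can also be combined. Since CUPED is equivalent to regressing the outcome on the covariate, the multi-covariate version is just a regression on all centered pre-period features, keeping the residuals. A minimal NumPy sketch with simulated data (all names, distributions, and coefficients are invented for the example):

```python
import numpy as np

def cuped_multi(y, covariates):
    """Multi-covariate CUPED: regress Y on centered pre-period covariates
    and subtract the fitted component, leaving the residual variation.

    covariates : 2-D array, one column per pre-period metric
    """
    X = covariates - covariates.mean(axis=0)            # center each covariate
    theta, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return y - X @ theta                                # adjusted outcome

rng = np.random.default_rng(0)
n = 5_000
sessions = rng.poisson(5, n).astype(float)              # pre-period sessions
tenure = rng.exponential(200.0, n)                      # days since first visit
y = 2.0 * sessions + 0.01 * tenure + rng.normal(0, 3, n)

y_adj = cuped_multi(y, np.column_stack([sessions, tenure]))
print(y.var(), y_adj.var())   # variance drops; the mean is unchanged
```

Because each covariate column is centered, the adjusted metric keeps the same mean as the raw one, so treatment-effect estimates remain unbiased.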
CUPED in E-Commerce: Practical Impact
E-commerce is one of the best domains for CUPED because purchase behavior is highly repetitive. A user's spending pattern over the past two weeks is a strong predictor of their spending in the next two weeks. This gives CUPED a high-quality covariate to work with, and the resulting variance reduction directly translates to shorter experiments.
| Metric | Typical R² with Pre-Period | Variance Reduction | Test Duration Impact |
|---|---|---|---|
| Revenue per visitor | 0.30 - 0.50 | 30-50% | Tests run 30-50% shorter |
| Conversion rate | 0.15 - 0.30 | 15-30% | Tests run 15-30% shorter |
| Average order value | 0.20 - 0.40 | 20-40% | Tests run 20-40% shorter |
| Pages per session | 0.40 - 0.60 | 40-60% | Tests run 40-60% shorter |
Limitations and Pitfalls
- No benefit for new visitors -- users with no browsing or purchase history have no covariate to adjust on. CUPED simply cannot reduce variance for these users.
- Requires a pre-experiment data window -- you need at least 1-2 weeks of pre-experiment behavior logged per user. If your analytics pipeline doesn't track user-level metrics, CUPED is not implementable.
- Assumes a linear relationship -- the standard CUPED adjustment is a linear regression. If the relationship between pre-period and test-period behavior is nonlinear, you leave variance reduction on the table.
- Less effective for binary metrics -- conversion rate (0 or 1) has inherently less variance to exploit than continuous metrics like revenue. Pre-period conversion is a weaker predictor of test-period conversion than pre-period revenue is of test-period revenue.
The new visitor problem deserves special attention. CUPED can only adjust outcomes for users who have pre-experiment data. In a typical e-commerce context, 30-60% of traffic comes from new visitors who have never been to the site before. For these users there is no historical behavior to leverage: new visitors keep their raw metric, while returning visitors get the adjusted one.
This means the effective variance reduction for your overall test population is lower than the theoretical maximum. If 50% of your traffic is new and CUPED achieves 40% variance reduction for returning visitors, the population-level reduction is roughly 20%. Still meaningful, but less dramatic than the headline numbers suggest.
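The blended figure is just a traffic-weighted average, under the simplifying assumption that per-user variance is similar for new and returning visitors (the function name is illustrative):

```python
def population_variance_reduction(share_new: float,
                                  returning_reduction: float) -> float:
    """New visitors contribute zero reduction (no covariate to adjust on);
    returning visitors contribute the full reduction. Assumes roughly equal
    per-user variance in both groups, so a traffic-weighted average applies."""
    return (1.0 - share_new) * returning_reduction

# 50% new traffic, 40% reduction for returning users -> the ~20% figure above
print(population_variance_reduction(0.5, 0.4))
```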
Implementing CUPED in Your Testing Stack
Major experimentation platforms have adopted variance reduction as a built-in feature. Statsig and Eppo implement CUPED by default for all experiments with sufficient pre-experiment data. (Optimizely's Stats Accelerator shares the goal of shortening tests, but it is a sequential-testing and bandit feature rather than CUPED.) If your platform has built-in CUPED, it is already working in the background.
Most Shopify-focused A/B testing tools -- including Shoplift, Intelligems, and ABlyft -- do not support CUPED natively. This is a meaningful gap for e-commerce teams running revenue-based experiments. Without variance reduction, these tools require longer test durations to achieve the same statistical power, particularly for revenue per visitor metrics where variance is highest.
