What Is Statistical Power in A/B Testing?
Every A/B test is a bet against noise. You are trying to determine whether an observed difference between variants reflects a genuine improvement or is simply random fluctuation. Statistical power quantifies how good that bet is. A properly powered test gives you a high probability of detecting the effect you are looking for — if it exists.
The formal definition is straightforward: power equals one minus the Type II error rate (β). If your test has 80% power, β = 0.20, meaning there is a 20% probability that a real effect of the specified size will go undetected. The test will return a non-significant result, and you will conclude — incorrectly — that the variant had no impact.
The asymmetry between Type I and Type II errors is important to understand. A false positive is visible — you deploy a change, monitor it, and eventually notice it is not performing. A false negative is invisible. You tested a genuinely good idea, the test returned no significant result, and you moved on. The winning idea sits in your archive of 'failed' experiments, and you never revisit it.
| Power level | Type II error rate (β) | Tests needed to find a winner (if 50% of ideas are truly positive) |
|---|---|---|
| 70% | 30% | ~2.9 |
| 80% | 20% | ~2.5 |
| 90% | 10% | ~2.2 |
| 95% | 5% | ~2.1 |
The rightmost column illustrates a counter-intuitive point: higher power means you need fewer tests to identify each winner, because fewer real effects slip through undetected. Programs running at 70% power waste roughly 30% more experiments to achieve the same number of validated wins as programs running at 90%.
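The rightmost column is a one-line expected-value calculation: if a given fraction of ideas are truly positive and power is the probability of detecting each one, the expected number of tests per validated winner is 1 divided by their product. A minimal sketch:

```python
def expected_tests_per_winner(true_positive_rate, power):
    """Each test yields a validated winner with probability
    true_positive_rate * power, so the expected count is its reciprocal."""
    return 1 / (true_positive_rate * power)

for power in (0.70, 0.80, 0.90, 0.95):
    print(f"{power:.0%} power: ~{expected_tests_per_winner(0.5, power):.1f} tests per winner")
```

At 70% power you expect about 2.9 tests per winner versus about 2.2 at 90%, roughly 30% more experiments for the same number of validated wins.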
Why 80% Power Fails for E-Commerce
The 80% power convention was established in clinical research where effect sizes are often large and well-characterized. In e-commerce, the reality is different. Most valid optimizations produce small, incremental improvements — a checkout flow tweak that lifts conversion by 2%, a product page redesign that improves add-to-cart rate by 3%. These effects are real and valuable at scale, but they are difficult to detect statistically.
Here is the core problem: to detect a 2.91% relative improvement on a 2% baseline conversion rate at 80% power and 95% confidence, you need roughly 920,000 visitors per variant. That is nearly two million total visitors for a simple A/B test. Most mid-market e-commerce brands do not generate that volume on a single page within a reasonable testing window.
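That figure follows from the standard normal-approximation sample-size formula for a two-proportion test. A stdlib-only sketch (different calculators will land within a few percent of this):

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """n per variant for a two-tailed two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance / (p2 - p1) ** 2

n = sample_size_per_variant(0.02, 0.0291)
print(f"~{n:,.0f} visitors per variant")  # roughly 920,000
```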
The result is a systematic blind spot. Teams run tests that are technically under-powered for the effects they are trying to detect. The test concludes with a non-significant result, and the team marks it as a failed experiment. But the experiment did not fail — the test design failed. The idea may have been genuinely good, but the test lacked the sensitivity to confirm it.
This creates a vicious cycle in experimentation programs. Under-powered tests produce a high rate of inconclusive results. Teams interpret these as evidence that optimization does not work for their brand. They reduce investment in testing, which reduces the number of experiments, which further limits their ability to find winners. The problem is not the ideas — it is the statistical machinery used to evaluate them.
How to Calculate Statistical Power
Power analysis is not optional — it is the single most important step in test design. Running a test without a power analysis is equivalent to launching a clinical trial without knowing how many patients you need. You might get a result, but you have no idea whether that result is reliable.
The Four Variables
- Sample size (n): The number of visitors per variant. This is usually the variable you solve for. More visitors means more power, but with diminishing returns: raising power from 80% to 90% requires roughly a third more traffic, and each further gain costs more than the last.
- Minimum detectable effect (MDE): The smallest effect size you want to reliably detect. Smaller MDEs require quadratically more traffic. Choose this based on business impact: what is the smallest improvement that justifies the cost of implementation?
- Significance level (α): The probability of a false positive, typically set at 0.05 (5%). Lowering α (e.g., to 0.01) reduces false positives but requires larger samples to maintain the same power.
- Baseline conversion rate: The current performance of the control. Higher baselines are easier to test because the signal-to-noise ratio is more favorable. A page converting at 8% needs roughly a quarter of the sample of a page converting at 2% to detect the same relative MDE.
These four variables are locked in a mathematical relationship. You cannot improve one without worsening another, unless you increase sample size. This is why power analysis always begins with a business question — what effect size matters? — and ends with a logistical question — can we get enough traffic in a reasonable timeframe?
The relationship between MDE and sample size is not linear — it is quadratic. Halving your MDE (trying to detect effects half as large) requires four times the sample size. This means the jump from detecting a 10% relative lift to a 5% relative lift is not twice as hard. It is four times as hard. This mathematical reality is why most e-commerce tests cannot reliably detect effects below 5-10% relative lift.
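The quadratic relationship is easy to check numerically. Using the standard two-proportion sample-size approximation on an assumed 2.5% baseline:

```python
from statistics import NormalDist

def n_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample for a two-tailed two-proportion z-test."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

n_10 = n_per_variant(0.025, 0.10)   # detect a 10% relative lift
n_05 = n_per_variant(0.025, 0.05)   # detect a 5% relative lift
print(round(n_05 / n_10, 2))        # just under 4: halving the MDE ~quadruples n
```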
Power Analysis in Practice: E-Commerce Examples
Theory is useful, but e-commerce teams need concrete guidance. The table below maps three common traffic scenarios to the minimum detectable effect achievable at 80% power within a 4-week test window, assuming a 2.5% baseline conversion rate and 95% confidence.
| Store tier | Monthly visitors | Visitors per variant (4 weeks) | Smallest detectable relative MDE | Practical implication |
|---|---|---|---|---|
| Low-traffic | 10,000 | ~5,000 | ~30% relative (+0.75pp) | Only radical redesigns are testable |
| Mid-tier | 100,000 | ~50,000 | ~10% relative (+0.25pp) | Meaningful optimizations are detectable |
| Enterprise | 1,000,000+ | ~500,000 | ~3% relative (+0.075pp) | Subtle, high-value improvements are testable |
The implications are stark. If the median real effect of a well-designed e-commerce experiment is around +2.91% relative, then low-traffic stores cannot detect the typical winning experiment at all within a reasonable timeframe. Mid-tier stores can detect only the larger wins, roughly 10% relative and above, several times the size of the median effect. Only enterprise-level traffic supports the sensitivity needed to capture the full distribution of positive effects.
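The detectable-MDE column can be reproduced by inverting the sample-size formula under the stated assumptions (2.5% baseline, 80% power, 95% confidence, normal approximation); the closed-form values land slightly above the table's rounded figures:

```python
import math
from statistics import NormalDist

def detectable_relative_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift detectable, pooled-variance normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute = z * math.sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return absolute / baseline

for tier, n in [("low-traffic", 5_000), ("mid-tier", 50_000), ("enterprise", 500_000)]:
    print(f"{tier}: ~{detectable_relative_mde(0.025, n):.0%} relative MDE")
```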
This does not mean low-traffic stores should abandon testing. It means they must be strategic about what they test. A low-traffic store should focus on high-impact hypotheses — complete checkout redesigns, major navigation changes, fundamentally different value propositions — where the expected effect size is large enough to detect. Incremental refinements like button color changes or minor copy tweaks are statistically untestable at low volumes.
Common Power Mistakes in Experimentation Programs
1. Skipping power analysis entirely
The most widespread mistake is also the most basic: launching tests without any power analysis. Teams pick an arbitrary duration ('let's run it for two weeks'), launch the test, and evaluate whatever data they have at the end. This approach means the test's power is whatever it happens to be — often 40-60% for realistic effect sizes. At 50% power, you are essentially flipping a coin on whether you detect a real winner.
2. Peeking at results and stopping early
Peeking destroys power by turning a single statistical test into multiple tests. Each peek is an opportunity to stop the experiment when random noise happens to favor one variant. The result is inflated false positive rates and, paradoxically, reduced effective power — because tests stopped early have neither the sample size nor the duration to produce reliable conclusions. For a detailed treatment, see our article on the peeking problem in A/B testing.
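The inflation is easy to demonstrate with a simulated A/A test, where the true effect is zero by construction and any 'winner' is a false positive. This sketch checks significance after every batch of simulated visitors:

```python
import random
from statistics import NormalDist

random.seed(7)
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value

def run_aa_test(n_peeks=10, batch=100):
    """A/A test: per-visitor outcome difference ~ N(0, 1), true effect zero.
    Returns whether ANY interim look was 'significant', and whether the
    single final look was."""
    total, n, ever_significant = 0.0, 0, False
    for _ in range(n_peeks):
        for _ in range(batch):
            total += random.gauss(0, 1)
            n += 1
        if abs(total / n ** 0.5) > Z_CRIT:
            ever_significant = True  # a peeker would stop and ship here
    final_significant = abs(total / n ** 0.5) > Z_CRIT
    return ever_significant, final_significant

runs = 1000
results = [run_aa_test() for _ in range(runs)]
peek_fp = sum(e for e, _ in results) / runs   # well above the nominal 0.05
final_fp = sum(f for _, f in results) / runs  # close to the nominal 0.05
```

A single look at the end holds the false positive rate near the nominal 5%; ten interim looks inflate it several-fold.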
3. Using one-tailed tests to gain power
A one-tailed test requires roughly 20% less sample than a two-tailed test at the same significance level and power, because it concentrates the rejection region in one direction. Some practitioners use this as a shortcut to reduce sample size requirements. The problem is that a one-tailed test cannot detect negative effects. If variant B is actually hurting conversion, a one-tailed test will never flag it: you will simply get a non-significant result and potentially deploy a harmful change.
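The saving comes directly from the critical values: required sample scales with the squared sum of z-scores, so swapping the two-tailed 1.96 for the one-tailed 1.645 shrinks it by about a fifth:

```python
from statistics import NormalDist

z_beta = NormalDist().inv_cdf(0.80)                       # 80% power term
two_tailed = (NormalDist().inv_cdf(0.975) + z_beta) ** 2  # alpha = 0.05, two-sided
one_tailed = (NormalDist().inv_cdf(0.95) + z_beta) ** 2   # alpha = 0.05, one-sided
savings = 1 - one_tailed / two_tailed
print(f"sample saved by going one-tailed: {savings:.0%}")  # about 21%
```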
4. Conflating statistical and practical significance
A test can be statistically significant but practically meaningless. If an enterprise store detects a +0.02% absolute improvement with high confidence, the effect is real — but it may generate less incremental revenue than the engineering cost of implementation. Power analysis should begin with practical significance: what is the smallest effect worth acting on? Then design the test to detect that effect. Detecting smaller effects is a waste of statistical resources.
5. Running too many variants simultaneously
Every additional variant splits your traffic and reduces the effective sample size per comparison. An A/B/C/D test with four variants needs roughly twice the total traffic of a simple A/B test to maintain the same power for each against-control comparison, because four arms each need the full per-comparison sample instead of two. If you also apply multiple comparison corrections (which you should), the required sample grows further, to roughly 2.7 times with a Bonferroni correction. For most e-commerce brands, simple A/B tests with one control and one variant are the most efficient use of traffic. Reserve multi-variant tests for high-traffic pages where sample size is not a constraint.
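The traffic arithmetic, with and without a Bonferroni correction for the three against-control comparisons, can be sketched as:

```python
from statistics import NormalDist

def z_factor(alpha, power=0.80):
    """Sample size scales with this squared sum of critical values."""
    return (NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)) ** 2

base = z_factor(0.05)        # per-arm factor, plain A/B test
bonf = z_factor(0.05 / 3)    # A/B/C/D: Bonferroni across 3 comparisons

ratio_plain = (4 * base) / (2 * base)  # four arms vs two, no correction
ratio_bonf = (4 * bonf) / (2 * base)   # with Bonferroni-adjusted alpha
print(ratio_plain, round(ratio_bonf, 2))  # 2.0 and roughly 2.7
```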
How DRIP Approaches Power Analysis
Power analysis is the first gate in our experiment pipeline, not the last. Before any hypothesis is promoted from ideation to test design, we run a power calculation using the target page's historical traffic, the baseline conversion rate, and the minimum detectable effect implied by the hypothesis. If the numbers do not support a properly powered test within our maximum duration window, the experiment does not proceed to implementation.
This is a deliberate constraint, and it is one of the most valuable parts of our methodology. A rejected experiment is not a failed idea — it is a recognition that the available data cannot support a reliable evaluation. We document these ideas and revisit them when traffic conditions change, when the page is redesigned (creating a new baseline), or when the idea can be tested at a broader scope (across multiple pages rather than one).
Our median test duration of 42 days reflects this discipline. We do not run tests for 42 days because we are slow — we run them for 42 days because that is what the math requires to produce reliable results at the effect sizes we typically observe. Shorter tests would be faster but less reliable, and unreliable results are worse than no results at all.
For brands that want to apply this same discipline, the starting point is honest assessment: how much traffic do your key pages actually receive? What effect sizes can you realistically detect? And are your current testing practices producing results you can trust? If you are unsure, our CRO audit includes a full power analysis of your testing program — identifying which past results were reliable and which were under-powered.
Want to run experiments at proper statistical power? Talk to DRIP about a testing program built on methodological rigor. →