
Statistical Power in A/B Testing: Why Most Tests Are Under-Powered

Your A/B test might be missing real winners. Across thousands of e-commerce experiments, under-powered tests are the leading cause of false negatives — and the most expensive mistake in experimentation.

Fabian Gmeindl, Co-Founder, DRIP Agency · March 13, 2026
This article is part of our Complete Guide to A/B Testing for E-Commerce.

Statistical power is the probability that an A/B test will detect a real effect when one exists. The industry standard of 80% means 1 in 5 real winners go undetected. For e-commerce tests where median uplift is small (+2.91% CR uplift across DRIP's experiment database), under-powered tests are the most common — and most expensive — reason teams abandon winning ideas.

Contents
  1. What Is Statistical Power in A/B Testing?
  2. Why 80% Power Fails for E-Commerce
  3. How to Calculate Statistical Power
  4. Power Analysis in Practice: E-Commerce Examples
  5. Common Power Mistakes in Experimentation Programs
  6. How DRIP Approaches Power Analysis

What Is Statistical Power in A/B Testing?

Statistical power is the probability of correctly rejecting a false null hypothesis — P(reject H0 | H0 is false). At 80% power, your test has a 20% chance of missing a real effect. That missed effect is a Type II error, and in e-commerce it translates directly to abandoned revenue.

Every A/B test is a bet against noise. You are trying to determine whether an observed difference between variants reflects a genuine improvement or is simply random fluctuation. Statistical power quantifies how good that bet is. A properly powered test gives you a high probability of detecting the effect you are looking for — if it exists.

The formal definition is straightforward: power equals one minus the Type II error rate (β). If your test has 80% power, β = 0.20, meaning there is a 20% probability that a real effect of the specified size will go undetected. The test will return a non-significant result, and you will conclude — incorrectly — that the variant had no impact.
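This definition translates directly into code. The function below is a minimal sketch (not DRIP's internal tooling) that estimates the power of a standard two-sided, two-proportion z-test using only the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportion(n_per_variant: int, baseline_cr: float,
                         relative_lift: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-proportion z-test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)
    # Standard error of the difference in proportions under the alternative
    se = sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the observed z-statistic clears the critical value
    return NormalDist().cdf(abs(p2 - p1) / se - z_crit)

# 78,000 visitors per variant, 2% baseline, 10% relative lift: roughly 80% power
print(round(power_two_proportion(78_000, 0.02, 0.10), 2))
```

Everything else in this article follows from this calculation: fix the sample size and baseline, and power is simply a function of the effect you are trying to detect.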

  - 80%: standard power threshold (industry convention, but not always sufficient)
  - 20%: Type II error rate at 80% power (1 in 5 real effects missed)
  - 1 in 5: real winners missed at the standard 80% power level
DRIP Insight
Power is the complement of the Type II error rate (β): power = 1 − β. At 80% power, β = 20% — meaning 1 in 5 real effects will be missed. Unlike Type I errors (false positives), which are controlled by your significance level, Type II errors are invisible: you never know which inconclusive tests were actually winners.

The asymmetry between Type I and Type II errors is important to understand. A false positive is visible — you deploy a change, monitor it, and eventually notice it is not performing. A false negative is invisible. You tested a genuinely good idea, the test returned no significant result, and you moved on. The winning idea sits in your archive of 'failed' experiments, and you never revisit it.

Power level, Type II error rate, and detection reliability

| Power level | Type II error rate (β) | Real effects tested before one is missed (≈1/β) |
|-------------|------------------------|--------------------------------------------------|
| 70%         | 30%                    | ~3.3                                             |
| 80%         | 20%                    | ~5                                               |
| 90%         | 10%                    | ~10                                              |
| 95%         | 5%                     | ~20                                              |

The rightmost column quantifies the cost of low power: at 70% power, roughly one real effect in every 3.3 slips through undetected, versus one in 20 at 95%. Put differently, a program running at 70% power must test roughly 30% more genuinely good ideas than a program running at 90% to bank the same number of validated wins.

Why 80% Power Fails for E-Commerce

E-commerce A/B tests typically produce small effect sizes — the median CR uplift across DRIP's experiment database is +2.91%. Detecting effects this small requires massive sample sizes that most stores cannot reach within a reasonable test window, making 80% power a theoretical floor that most tests fail to meet in practice.

The 80% power convention was established in clinical research where effect sizes are often large and well-characterized. In e-commerce, the reality is different. Most valid optimizations produce small, incremental improvements — a checkout flow tweak that lifts conversion by 2%, a product page redesign that improves add-to-cart rate by 3%. These effects are real and valuable at scale, but they are difficult to detect statistically.

  - +2.91%: median CR uplift across DRIP's experiment database
  - 42 days: median test duration required to reach proper power at typical traffic levels
  - ~78,000: visitors per variant needed to detect a 10% relative MDE at a 2% baseline CR (80% power)

Here is the core problem: to detect a 2.91% relative improvement on a 2% baseline conversion rate at 80% power and 95% confidence, the standard two-proportion formula calls for roughly 900,000 visitors per variant, or nearly two million total visitors for a simple A/B test. Most mid-market e-commerce brands do not generate that volume on a single page within a reasonable testing window.

The result is a systematic blind spot. Teams run tests that are technically under-powered for the effects they are trying to detect. The test concludes with a non-significant result, and the team marks it as a failed experiment. But the experiment did not fail — the test design failed. The idea may have been genuinely good, but the test lacked the sensitivity to confirm it.

Counterintuitive Finding
Running more under-powered tests doesn't compensate for low power. The math doesn't average out — you systematically miss your best ideas. Ten tests at 60% power will miss approximately 4 real winners. Those same 10 ideas tested at 90% power would miss only 1.

This creates a vicious cycle in experimentation programs. Under-powered tests produce a high rate of inconclusive results. Teams interpret these as evidence that optimization does not work for their brand. They reduce investment in testing, which reduces the number of experiments, which further limits their ability to find winners. The problem is not the ideas — it is the statistical machinery used to evaluate them.

How to Calculate Statistical Power

Statistical power is determined by four interdependent variables: sample size, minimum detectable effect (MDE), significance level (α), and baseline conversion rate. Fix any three, and the fourth is determined. A proper power analysis calculates the sample size needed to achieve your target power for a given MDE.

Power analysis is not optional — it is the single most important step in test design. Running a test without a power analysis is equivalent to launching a clinical trial without knowing how many patients you need. You might get a result, but you have no idea whether that result is reliable.

The Four Variables

  1. Sample size (n): The number of visitors per variant. This is usually the variable you solve for. More visitors means more power, but with diminishing returns: raising power from 80% to 90% requires roughly a third more sample, and each further gain costs more.
  2. Minimum detectable effect (MDE): The smallest effect size you want to reliably detect. Smaller MDE requires exponentially more traffic. Choose this based on business impact: what is the smallest improvement that justifies the cost of implementation?
  3. Significance level (α): The probability of a false positive, typically set at 0.05 (5%). Lowering α (e.g., to 0.01) reduces false positives but requires larger samples to maintain the same power.
  4. Baseline conversion rate: The current performance of the control. Higher baselines are easier to test because the signal-to-noise ratio is more favorable. A page converting at 8% needs roughly a quarter of the sample of a page converting at 2% to detect the same relative MDE.

These four variables are locked in a mathematical relationship. You cannot improve one without worsening another, unless you increase sample size. This is why power analysis always begins with a business question — what effect size matters? — and ends with a logistical question — can we get enough traffic in a reasonable timeframe?

The relationship between MDE and sample size is not linear — it is quadratic. Halving your MDE (trying to detect effects half as large) requires four times the sample size. This means the jump from detecting a 10% relative lift to a 5% relative lift is not twice as hard. It is four times as hard. This mathematical reality is why most e-commerce tests cannot reliably detect effects below 5-10% relative lift.
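To make the quadratic relationship concrete, here is a minimal Python sketch of the standard two-proportion sample-size formula (the same arithmetic behind most online calculators; it assumes a two-sided test and the normal approximation):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_cr: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors per variant for a two-sided, two-proportion z-test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_10pct = sample_size_per_variant(0.02, 0.10)  # ~80k per variant
n_5pct = sample_size_per_variant(0.02, 0.05)   # roughly 4x larger
print(n_10pct, n_5pct, round(n_5pct / n_10pct, 1))
```

Halving the MDE from 10% to 5% relative lift roughly quadruples the required sample, which is the quadratic penalty described above.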

Pro Tip
Before starting any test, run a power analysis. If your traffic can't support 80% power at your target MDE within 4-6 weeks, reconsider the test. Either increase the MDE (target a bolder change), test on a higher-traffic page, or invest in variance reduction techniques like CUPED.

Power Analysis in Practice: E-Commerce Examples

The practical impact of power analysis becomes clear when you map real traffic volumes to detectable effect sizes. A store with 10,000 monthly visitors can only detect very large effects (20%+ relative lift), while enterprise stores with 1M+ visitors can detect subtle 3-5% improvements — which is where most real gains live.

Theory is useful, but e-commerce teams need concrete guidance. The table below maps three common traffic scenarios to the minimum detectable effect achievable at 80% power within a 4-week test window, assuming a 2.5% baseline conversion rate and 95% confidence.

Detectable effect sizes by traffic volume (2.5% baseline CR, 80% power, 95% confidence, 4-week test)

| Store tier  | Monthly visitors | Visitors per variant (4 weeks) | Smallest detectable relative MDE | Practical implication |
|-------------|------------------|--------------------------------|----------------------------------|-----------------------|
| Low-traffic | 10,000           | ~5,000                         | ~30% relative (+0.75pp)          | Only radical redesigns are testable |
| Mid-tier    | 100,000          | ~50,000                        | ~10% relative (+0.25pp)          | Meaningful optimizations are detectable |
| Enterprise  | 1,000,000+       | ~500,000                       | ~3% relative (+0.075pp)          | Subtle, high-value improvements are testable |

The implications are stark. If the median real effect of a well-designed e-commerce experiment is around +2.91% relative, then low-traffic stores cannot detect the average winning experiment at all within a reasonable timeframe. Mid-tier stores can detect effects slightly above the median, but miss the smaller wins. Only enterprise-level traffic supports the sensitivity needed to capture the full distribution of positive effects.
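The MDE column in the table can be approximated by inverting the sample-size formula. The sketch below uses a first-order approximation (it evaluates variance at the baseline rate only), so its outputs run slightly higher than the table's rounded figures:

```python
from math import sqrt
from statistics import NormalDist

def approx_relative_mde(visitors_per_variant: int, baseline_cr: float,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """First-order approximation of the smallest detectable relative lift."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Approximate both variants' variance by the baseline's (slightly conservative)
    abs_mde = z * sqrt(2 * baseline_cr * (1 - baseline_cr) / visitors_per_variant)
    return abs_mde / baseline_cr

for n in (5_000, 50_000, 500_000):
    print(n, f"{approx_relative_mde(n, 0.025):.0%}")
```

Because the required sample grows with the square of 1/MDE, each 10x increase in traffic only improves the detectable relative lift by a factor of about 3.2 (the square root of 10).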

This does not mean low-traffic stores should abandon testing. It means they must be strategic about what they test. A low-traffic store should focus on high-impact hypotheses — complete checkout redesigns, major navigation changes, fundamentally different value propositions — where the expected effect size is large enough to detect. Incremental refinements like button color changes or minor copy tweaks are statistically untestable at low volumes.

Common Mistake
A test that needs 6 months to reach power is not a valid test. Seasonal shifts, site changes, and cookie churn invalidate results long before you reach significance. If your power analysis returns a duration beyond 8 weeks, the test is not viable in its current form. Redesign the hypothesis, increase the scope of the change, or find a higher-traffic page.

Common Power Mistakes in Experimentation Programs

The five most common power-related mistakes are: skipping power analysis entirely, peeking at results and stopping early, using one-tailed tests to artificially inflate power, conflating statistical significance with practical significance, and diluting power by running too many variants simultaneously.

1. Skipping power analysis entirely

The most widespread mistake is also the most basic: launching tests without any power analysis. Teams pick an arbitrary duration ('let's run it for two weeks'), launch the test, and evaluate whatever data they have at the end. This approach means the test's power is whatever it happens to be — often 40-60% for realistic effect sizes. At 50% power, you are essentially flipping a coin on whether you detect a real winner.

2. Peeking at results and stopping early

Peeking destroys power by turning a single statistical test into multiple tests. Each peek is an opportunity to stop the experiment when random noise happens to favor one variant. The result is inflated false positive rates and, paradoxically, reduced effective power — because tests stopped early have neither the sample size nor the duration to produce reliable conclusions. For a detailed treatment, see our article on the peeking problem in A/B testing.
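The inflation is easy to demonstrate. The toy A/A simulation below (normally distributed data with known variance, illustrative parameters, no real effect present) runs a z-test at every peek and stops at the first "significant" result; far more than 5% of null experiments end in a false positive:

```python
import random
from statistics import NormalDist

def simulate_peeking(n_sims: int = 1_000, n_peeks: int = 10,
                     samples_per_peek: int = 100, alpha: float = 0.05,
                     seed: int = 7) -> float:
    """A/A simulation: fraction of null experiments declared 'significant'
    when a z-test is run at every peek and stopped at the first hit."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(n_peeks):
            for _ in range(samples_per_peek):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += samples_per_peek
            # z-statistic for the difference in means (known unit variance)
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            if abs(z) >= z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

print(f"{simulate_peeking():.1%}")  # well above the nominal 5%
```

With ten peeks, the realized false positive rate lands near 20% in this setup, even though every individual test was run at α = 0.05.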

3. Using one-tailed tests to gain power

A one-tailed test requires roughly 20% less sample than a two-tailed test at the same significance level and power, because it concentrates the rejection region in one direction. Some practitioners use this as a shortcut to reduce sample size requirements. The problem is that a one-tailed test cannot detect negative effects. If variant B is actually hurting conversion, a one-tailed test will never flag it — you will simply get a non-significant result and potentially deploy a harmful change.
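The size of the shortcut can be quantified with the standard two-proportion formula, here sketched with the tail choice as a parameter:

```python
from math import ceil
from statistics import NormalDist

def n_per_variant(baseline_cr: float, relative_mde: float,
                  alpha: float = 0.05, power: float = 0.80,
                  two_tailed: bool = True) -> int:
    """Two-proportion sample size; a one-tailed test spends the full alpha
    in a single direction, shrinking the critical value."""
    p1, p2 = baseline_cr, baseline_cr * (1 + relative_mde)
    tail = 2 if two_tailed else 1
    z = NormalDist().inv_cdf(1 - alpha / tail) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

two = n_per_variant(0.02, 0.10)
one = n_per_variant(0.02, 0.10, two_tailed=False)
print(round(one / two, 2))  # roughly 0.79: about a 21% smaller sample
```

The ~20% saving is real, but it is bought by giving up the ability to detect harm, which is rarely a trade worth making in production experiments.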

4. Conflating statistical and practical significance

A test can be statistically significant but practically meaningless. If an enterprise store detects a +0.02% absolute improvement with high confidence, the effect is real — but it may generate less incremental revenue than the engineering cost of implementation. Power analysis should begin with practical significance: what is the smallest effect worth acting on? Then design the test to detect that effect. Detecting smaller effects is a waste of statistical resources.

5. Running too many variants simultaneously

Every additional variant splits your traffic and reduces the effective sample size per comparison. An A/B/C/D test with four variants needs roughly twice the total traffic of a simple A/B test to maintain the same power for each comparison; once you apply multiple comparison corrections (which you should), the requirement approaches three times. For most e-commerce brands, simple A/B tests with one control and one variant are the most efficient use of traffic. Reserve multi-variant tests for high-traffic pages where sample size is not a constraint.
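A rough sketch of the traffic arithmetic, using the standard two-proportion formula and a Bonferroni correction as an illustrative adjustment (other corrections give similar orders of magnitude):

```python
from math import ceil
from statistics import NormalDist

def total_traffic(n_variants: int, baseline_cr: float, relative_mde: float,
                  alpha: float = 0.05, power: float = 0.80,
                  bonferroni: bool = False) -> int:
    """Total visitors for equal per-comparison power across all variants.
    Bonferroni splits alpha across the (n_variants - 1) control comparisons."""
    comparisons = n_variants - 1
    a = alpha / comparisons if bonferroni else alpha
    p1, p2 = baseline_cr, baseline_cr * (1 + relative_mde)
    z = NormalDist().inv_cdf(1 - a / 2) + NormalDist().inv_cdf(power)
    per_variant = ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return n_variants * per_variant

ab = total_traffic(2, 0.02, 0.10)
abcd = total_traffic(4, 0.02, 0.10, bonferroni=True)
print(round(abcd / ab, 2))  # close to 3x once corrections are applied
```

Without any correction the four-variant test needs exactly twice the total traffic; with Bonferroni it climbs to roughly 2.7 times.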

Pro Tip
Audit your last 10 experiments. How many had a pre-test power analysis? How many reached the calculated sample size before a decision was made? If the answer is fewer than 8 out of 10, your experimentation program has a power discipline problem — and it is likely costing you winners.

How DRIP Approaches Power Analysis

Every experiment in DRIP's program goes through a mandatory power analysis before launch. We reject approximately 30% of proposed experiments because traffic cannot support a properly powered test — not because the ideas are bad, but because running an under-powered test would waste both time and traffic.

Power analysis is the first gate in our experiment pipeline, not the last. Before any hypothesis is promoted from ideation to test design, we run a power calculation using the target page's historical traffic, the baseline conversion rate, and the minimum detectable effect implied by the hypothesis. If the numbers do not support a properly powered test within our maximum duration window, the experiment does not proceed to implementation.

This is a deliberate constraint, and it is one of the most valuable parts of our methodology. A rejected experiment is not a failed idea — it is a recognition that the available data cannot support a reliable evaluation. We document these ideas and revisit them when traffic conditions change, when the page is redesigned (creating a new baseline), or when the idea can be tested at a broader scope (across multiple pages rather than one).

  - 42 days: median test duration, accounting for proper power at realistic effect sizes
  - ~30%: experiments rejected pre-launch due to insufficient traffic for proper power
  - 90%+: target power level for experiments on high-traffic pages

Our median test duration of 42 days reflects this discipline. We do not run tests for 42 days because we are slow — we run them for 42 days because that is what the math requires to produce reliable results at the effect sizes we typically observe. Shorter tests would be faster but less reliable, and unreliable results are worse than no results at all.

DRIP Insight
We reject approximately 30% of proposed experiments before they start — not because the ideas are bad, but because the traffic can't support a properly powered test. This discipline is what allows us to trust the results of the experiments we do run.

For brands that want to apply this same discipline, the starting point is honest assessment: how much traffic do your key pages actually receive? What effect sizes can you realistically detect? And are your current testing practices producing results you can trust? If you are unsure, our CRO audit includes a full power analysis of your testing program — identifying which past results were reliable and which were under-powered.

Want to run experiments at proper statistical power? Talk to DRIP about a testing program built on methodological rigor. →


Frequently Asked Questions

Is 80% power enough for e-commerce tests?
80% is the minimum acceptable standard. For e-commerce, where effect sizes are small, 90% power is preferable if traffic allows. Below 80%, your false negative rate makes the test program unreliable.

How does power relate to sample size?
Power and sample size are directly related: raising power from 80% to 90% requires roughly a third more sample, and halving the minimum detectable effect quadruples it. This is why low-traffic sites must target larger minimum detectable effects.

Can you increase power without more traffic?
Yes, through variance reduction techniques like CUPED, or by targeting higher-impact changes with larger expected effects. You can also use metrics with lower variance (e.g., revenue per visitor vs. conversion rate) to improve power.

What happens if a test is under-powered?
You risk Type II errors — missing real winners. Across DRIP's experiment database, we estimate that programs running at 60% power miss approximately 40% of their winning ideas, making each experiment significantly less cost-effective.

