Methodology · 13 min read

How to Control False Discovery Rate When Running Multiple A/B Tests

A mature experimentation program runs dozens of tests per quarter. Without multiple-testing correction, your reported winners accumulate false positives at an alarming rate. Here is how FDR control keeps your decision pipeline honest — without crippling your velocity.

Fabian Gmeindl, Co-Founder, DRIP Agency · March 13, 2026
📖 This article is part of our series The Complete Guide to A/B Testing for E-Commerce

False Discovery Rate (FDR) is the expected proportion of declared winners that are actually false positives. When an experimentation program runs many simultaneous or sequential tests at α = 0.05, the probability that at least some winners are noise grows rapidly. The Benjamini-Hochberg procedure controls FDR by ranking p-values and adjusting significance thresholds, keeping the proportion of false discoveries below a chosen level (typically 5–10%) without the extreme conservatism of familywise error rate control.

Contents
  1. What Is False Discovery Rate?
  2. FDR vs. Familywise Error Rate: Why the Distinction Matters
  3. The Benjamini-Hochberg Procedure: Step by Step
  4. When Should You Apply FDR Correction?
  5. How DRIP Implements FDR Control Across 90+ Brands
  6. Building FDR Control Into Your Experimentation Program

What Is False Discovery Rate?

False Discovery Rate is the expected proportion of rejected null hypotheses (declared winners) that are actually false positives. It answers: of all the tests we called significant, what fraction are wrong?

Every A/B test has a chance of producing a false positive — a result that looks like a win but is really noise. When you run a single test at α = 0.05, you accept a 5% chance of this happening if the change has no real effect. The problem is arithmetic: across 20, 40, or 100 tests per quarter, the expected number of false positives scales linearly with the number of tests of ineffective variants. At 20 such tests with α = 0.05, you expect 1 false positive; at 60, you expect 3.
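That scaling is easy to verify, as the short Python sketch below shows (it assumes the worst case where every tested variant is truly neutral, i.e. all null hypotheses are true):

```python
# Expected false positives, and the chance of at least one, when every
# null hypothesis is true (no tested variant has a real effect).
def false_positive_stats(num_tests, alpha=0.05):
    expected_fp = num_tests * alpha                 # grows linearly
    p_at_least_one = 1 - (1 - alpha) ** num_tests   # compounds quickly
    return expected_fp, p_at_least_one

for n in (20, 60, 100):
    exp_fp, p_any = false_positive_stats(n)
    print(f"{n} tests: expect {exp_fp:.0f} false positives, "
          f"P(at least one) = {p_any:.0%}")
```

At 20 tests this gives an expectation of 1 false positive and roughly a 64% chance of at least one; by 100 tests a spurious winner is all but guaranteed.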

False Discovery Rate formalizes this concern. Formally, FDR = E[V/R], where V is the number of false positives among your rejections and R is the total number of rejections. If you declare 10 tests significant and 2 of them are actually false positives, your FDR is 20%. The concept was introduced by Benjamini and Hochberg (1995) and has become the standard framework for multiple-testing correction in high-throughput experimentation — from genomics to digital experimentation.

  • 1 in 20: expected false positives per 20 tests, at α = 0.05 with no correction
  • 3–4: expected false wins in a 60-test quarter, enough to materially distort reported program value
  • 5–10%: typical FDR target for CRO programs, balancing discovery rate against decision quality
Counterintuitive Finding
A 5% significance level does not mean only 5% of your declared winners are false. It means each individual test has a 5% false positive probability. When you aggregate many tests, the proportion of false discoveries among your winners depends on how many of your tested hypotheses actually have real effects — the base rate of true positives matters enormously.
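The base-rate dependence described above can be made concrete with a small sketch (the 10% base rate and 80% statistical power below are illustrative assumptions, not figures from this article):

```python
# Expected share of declared winners that are false, as a function of
# the base rate of true effects among the hypotheses you test.
def expected_fdr(base_rate, alpha=0.05, power=0.80):
    false_wins = (1 - base_rate) * alpha   # true nulls wrongly rejected
    true_wins = base_rate * power          # real effects correctly detected
    return false_wins / (false_wins + true_wins)

# If only 10% of tested hypotheses have real effects, 36% of
# uncorrected "winners" are expected to be noise, despite alpha = 0.05.
print(f"{expected_fdr(0.10):.0%}")
```

The better your hypothesis pipeline (higher base rate of true effects), the smaller the fraction of winners that are noise; with weak hypotheses, uncorrected results can be mostly false.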

FDR vs. Familywise Error Rate: Why the Distinction Matters

Familywise Error Rate (FWER) controls the probability of making even one false discovery. FDR controls the proportion of false discoveries among rejections. FWER is appropriate for safety-critical decisions; FDR is appropriate for experimentation programs where some false positives are tolerable.

The oldest approach to multiple testing is controlling the Familywise Error Rate — the probability of making at least one Type I error across all tests. The Bonferroni correction is the canonical example: divide your significance level by the number of tests. Running 20 tests? Each one needs p < 0.0025 to be significant. This is extremely conservative. It works well when any single false positive is catastrophic — clinical drug trials, for instance. It works poorly for experimentation programs.
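Bonferroni's arithmetic can be checked in a few lines (a sketch, using the 20-test family from the paragraph above):

```python
# Bonferroni: divide the significance level by the number of tests so
# the probability of even one false positive stays at or below alpha.
def bonferroni_threshold(num_tests, alpha=0.05):
    return alpha / num_tests

per_test = bonferroni_threshold(20)        # 0.0025, as in the text
# Resulting familywise error rate if all 20 nulls are true:
fwer = 1 - (1 - per_test) ** 20
print(per_test, round(fwer, 4))
```

The familywise error rate comes out just under 0.05 (about 0.049), at the cost of demanding p < 0.0025 from every individual test.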

FWER vs. FDR control: practical comparison for CRO programs

| Property | FWER (e.g. Bonferroni) | FDR (e.g. Benjamini-Hochberg) |
| --- | --- | --- |
| Controls | P(at least 1 false positive) | Expected proportion of false positives among winners |
| Threshold with 20 tests | p < 0.0025 | Varies; typically p < 0.01–0.04 |
| Statistical power | Very low — many real effects missed | Moderate — better at detecting true effects |
| Best for | Safety-critical decisions, few tests | High-throughput experimentation, discovery |
| Practical consequence | Declares almost nothing significant | Maintains a controlled rate of false discoveries |

For a CRO program running dozens of experiments per quarter, FWER control is a velocity killer. If you apply Bonferroni across 40 concurrent tests, the effective significance level is 0.00125 per test. You will detect only the very largest effects. Most real but moderate improvements — the 3–5% uplifts that compound into meaningful annual revenue gains — will be missed. FDR control trades a small, controlled proportion of false discoveries for dramatically higher power to detect real effects.

DRIP Insight
The pragmatic question is not “can we afford any false positives?” but “what proportion of false positives can we tolerate?” For most e-commerce experimentation programs, the answer is 5–10%. That is what FDR control delivers — and why it is the right framework for CRO at scale.

The Benjamini-Hochberg Procedure: Step by Step

The Benjamini-Hochberg (BH) procedure ranks p-values from smallest to largest, then finds the largest p-value that falls below a linearly increasing threshold. All tests with p-values at or below that rank are declared significant, controlling FDR at the specified level.

The Benjamini-Hochberg procedure is the most widely used FDR control method. It requires only the p-values from your tests, makes minimal assumptions, and can be computed in seconds. Here is the exact procedure.

  1. Collect all p-values. Gather the p-values from every test in your analysis family — this could be all tests in a quarter, a sprint, or a related experiment cluster.
  2. Rank them from smallest to largest. Label them p(1) ≤ p(2) ≤ … ≤ p(m), where m is the total number of tests.
  3. Calculate the BH threshold for each rank. For rank i, the threshold is (i/m) × q, where q is your target FDR level (e.g. 0.05 or 0.10).
  4. Find the largest significant rank. Starting from the largest rank, find the first p-value where p(i) ≤ (i/m) × q. Call this rank k.
  5. Reject all hypotheses with rank ≤ k. All tests with p-values at or below p(k) are declared significant. The rest are not.
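The five steps can be sketched in a few lines of pure Python (a minimal illustration; library routines such as statsmodels' `multipletests(..., method='fdr_bh')` implement the same procedure):

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return a reject/keep flag per test, controlling FDR at level q."""
    m = len(p_values)
    # Steps 1-2: rank the p-values, remembering each test's original position.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 3-4: scan from the largest rank down to find the cutoff rank k,
    # the largest i with p(i) <= (i/m) * q.
    k = 0
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= rank / m * q:
            k = rank
            break
    # Step 5: reject every hypothesis at rank <= k.
    reject = [False] * m
    for rank in range(k):
        reject[order[rank]] = True
    return reject

# Four tests from a hypothetical sprint, in arbitrary order:
print(benjamini_hochberg([0.003, 0.20, 0.011, 0.024], q=0.10))
# -> [True, False, True, True]
```

Note that the input order does not matter: the function ranks internally and maps the decisions back to the original positions.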

Worked Example: 10 Tests in a Sprint

Benjamini-Hochberg procedure applied to 10 tests (q = 0.10)

| Rank (i) | Test | p-value | BH threshold (i/10 × 0.10) | Significant? |
| --- | --- | --- | --- | --- |
| 1 | Checkout CTA color | 0.003 | 0.010 | Yes |
| 2 | PDP image size | 0.011 | 0.020 | Yes |
| 3 | Cart urgency timer | 0.024 | 0.030 | Yes |
| 4 | Nav menu layout | 0.045 | 0.040 | No † |
| 5 | Homepage hero copy | 0.055 | 0.050 | No |
| 6 | Search bar placement | 0.061 | 0.060 | No |
| 7 | Filter UX redesign | 0.110 | 0.070 | No |
| 8 | Footer CTA | 0.230 | 0.080 | No |
| 9 | Category page sort | 0.410 | 0.090 | No |
| 10 | Wishlist button | 0.780 | 0.100 | No |

In this example, the largest rank where p(i) ≤ threshold is rank 3 (0.024 ≤ 0.030). Therefore, tests at ranks 1 through 3 are declared significant. Note that test 4 (p = 0.045) would be significant at α = 0.05 in isolation — but after BH correction, it is not. The † marker highlights this: naive analysis would have called it a winner; corrected analysis does not.

Pro Tip
The BH procedure is a step-up method: scan the ranked list from the largest p-value downward and stop at the first rank whose p-value falls at or below its threshold. Every test at or below that rank is rejected, including any whose p-values exceed their own individual thresholds. This differs from Bonferroni, which evaluates each test against a single fixed threshold independently.

When Should You Apply FDR Correction?

Apply FDR correction whenever you are making simultaneous or sequential decisions across multiple tests and the cost of a false positive is significant but not catastrophic — which describes most e-commerce experimentation programs.

Not every situation requires multiple-testing correction. The decision depends on how many tests you are evaluating together, what the cost of a false positive is, and whether you are treating each test as an independent business decision or part of a portfolio.

Scenarios Where FDR Control Is Essential

  • Multi-variant tests. Testing 4+ variants against a control means 4+ simultaneous comparisons. Without correction, the probability of at least one false positive exceeds 18%.
  • Metric families. Evaluating a single test against multiple KPIs (conversion rate, AOV, revenue per visitor, bounce rate) is multiple testing in disguise. Each metric is a separate hypothesis.
  • Segment analyses. Slicing results by device, traffic source, or customer cohort multiplies the number of comparisons. Post-hoc segment fishing is one of the most common sources of false discoveries.
  • Quarterly or sprint-level reporting. When you report “12 out of 30 tests were significant this quarter,” you are making a portfolio-level claim that should account for multiplicity.

Scenarios Where Correction Is Optional

  • Single, pre-registered A/B tests. One variant, one primary metric, one pre-specified analysis. This is the textbook setting where α = 0.05 means exactly what it says.
  • Independent business decisions with separate budgets. If two tests are run by different teams with separate decision authority and no shared reporting, they can be analyzed independently.
Common Mistake
Segment analysis after a test completes is the most dangerous form of uncorrected multiple testing. Cutting results by 5 segments on 3 metrics produces 15 comparisons. At α = 0.05, the probability of at least one spurious finding is 54%. If you report that a test “worked for mobile users on iOS,” that claim must survive correction.
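The 54% figure is straightforward to reproduce (a quick sketch):

```python
# Chance of at least one spurious "winner" when one test is sliced into
# many uncorrected segment-by-metric comparisons.
def p_spurious_finding(segments, metrics, alpha=0.05):
    comparisons = segments * metrics
    return 1 - (1 - alpha) ** comparisons

# 5 segments x 3 metrics = 15 comparisons, as in the example above
print(f"{p_spurious_finding(5, 3):.0%}")  # -> 54%
```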

How DRIP Implements FDR Control Across 90+ Brands

DRIP applies Benjamini-Hochberg correction at the experiment-family level, enforces pre-registration of primary metrics, and flags post-hoc segment analyses with separate, more stringent FDR thresholds.

Managing thousands of experiments across 90+ e-commerce brands means multiple-testing correction is not optional — it is a structural requirement. Without it, our reported results would be unreliable and our clients would ship changes that do not actually work. Here is the framework we use.

Pre-Registration and Metric Hierarchy

Every experiment at DRIP has a single pre-registered primary metric — typically conversion rate or revenue per visitor. Secondary metrics are declared before launch and analyzed with FDR correction. Post-hoc metrics are treated as exploratory and never reported as confirmed findings. This three-tier hierarchy reduces the multiplicity problem at source: fewer comparisons means less correction needed.

Family-Level BH Correction

We define “experiment families” as groups of tests that share a common decision boundary — for example, all tests in a quarterly optimization sprint for a single brand, or all variants in a multi-cell test. BH correction is applied within each family at q = 0.10 for primary metrics and q = 0.15 for secondary metrics. This means we accept that up to 10% of our declared primary-metric winners may be false discoveries — a calibrated trade-off between velocity and accuracy.

  • q = 0.10: FDR threshold for primary metrics; at most 10% of declared winners are expected to be false
  • q = 0.15: FDR threshold for secondary metrics; slightly more permissive for directional learning
  • 3-tier metric hierarchy: Primary → Secondary → Exploratory, each with its own evidence standard

Segment Analysis Guardrails

Post-hoc segment analyses — slicing by device, geography, or customer cohort — are treated as a separate multiplicity family with FDR correction at q = 0.05. This is deliberately more conservative than our primary-metric threshold because segment analyses are inherently exploratory and prone to overfitting. Any segment-level finding that survives BH correction is flagged as a candidate hypothesis to be validated in a targeted follow-up test, not a confirmed result.

DRIP Insight
The combination of pre-registration, metric hierarchy, and family-level BH correction reduces false discoveries without meaningfully slowing velocity. Across our portfolio of thousands of experiments, the correction typically reclassifies 2–4 tests per quarter per brand from “significant” to “inconclusive.” That is a small price for a trustworthy decision pipeline.

Building FDR Control Into Your Experimentation Program

Implementing FDR control requires defining experiment families, choosing a target q-level, applying BH correction at reporting time, and training stakeholders to interpret corrected results.

You do not need specialized software to implement FDR control. The Benjamini-Hochberg procedure can be applied in a spreadsheet. What you need is process discipline: clear definitions of what counts as an experiment family, a pre-committed q-level, and stakeholder buy-in that corrected results may reclassify some winners as inconclusive.

Step 1: Define Your Experiment Families

An experiment family is a group of tests whose results will be evaluated together. Common groupings include all tests in a quarterly sprint, all variants in a single multi-cell test, or all tests targeting a specific funnel stage. The choice matters: larger families require more correction, but under-grouping defeats the purpose. A reasonable default is to group by brand and reporting period.

Step 2: Set Your q-Level

The q-level is the maximum proportion of false discoveries you are willing to accept. For most CRO programs, q = 0.10 is a good starting point: you accept that up to 1 in 10 declared winners may be false. More risk-averse programs (e.g. those with high implementation costs per test) should use q = 0.05. Never set q higher than 0.15 — beyond that, the signal-to-noise ratio degrades to the point where corrected results provide little practical value.

Step 3: Apply BH at Reporting Time

Run BH correction at the end of each reporting period — typically monthly or quarterly. Collect all p-values from the family, apply the step-up procedure, and report corrected results. Individual tests can still be monitored in real time using sequential methods; it is the portfolio-level claims that get corrected.
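A reporting-time pass might look like the following sketch (the family names, the `results` structure, and the compact `bh_reject` helper are all illustrative, not any specific tool's API):

```python
# End-of-quarter pass: apply BH within each experiment family and
# report which tests survive correction.
def bh_reject(p_values, q):
    """Compact Benjamini-Hochberg step-up; returns indices of rejections."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank whose p-value falls at or below its BH threshold.
    k = max((r for r in range(1, m + 1)
             if p_values[order[r - 1]] <= r / m * q), default=0)
    return {order[r] for r in range(k)}

# One family per brand and reporting period; values are (test, p-value).
results = {
    "brand_a_q1": [("hero copy", 0.004), ("cart timer", 0.031),
                   ("nav layout", 0.049), ("footer CTA", 0.41)],
    "brand_b_q1": [("PDP images", 0.018), ("search bar", 0.22)],
}

for family, tests in results.items():
    winners = bh_reject([p for _, p in tests], q=0.10)
    for i, (name, p) in enumerate(tests):
        verdict = "significant" if i in winners else "inconclusive"
        print(f"{family}: {name} (p={p}) -> {verdict}")
```

Because each family is corrected independently, a p-value that survives in a small family can be reclassified in a larger one, which is exactly the intended behavior of family-level control.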

Step 4: Train Stakeholders

The hardest part is organizational, not statistical. Stakeholders need to understand that a test which was significant at p = 0.04 may become non-significant after BH correction — and that this is the system working correctly, not a failure. Frame it as quality control: FDR correction catches low-confidence results before they consume engineering resources.

Pro Tip
Start with retrospective analysis. Apply BH correction to your last quarter of results and show stakeholders how many “winners” would have been reclassified. This makes the abstract concept concrete. In our experience, 10–20% of uncorrected winners get reclassified — enough to be material, not enough to be demoralizing.
Want a multiplicity audit of your experimentation program? Our CRO audit includes FDR analysis. →


Frequently Asked Questions

What FDR level should an experimentation program target?

For most e-commerce experimentation programs, a target FDR (q-level) of 5–10% provides a good balance between discovery rate and decision quality. This means you accept that up to 1 in 10 declared winners may be false positives. For high-stakes decisions with large implementation costs, use q = 0.05. For rapid learning sprints where some false positives are tolerable, q = 0.10 is reasonable.

Can FDR correction be combined with sequential testing?

Yes. Sequential testing and FDR correction operate at different levels. Sequential testing controls the error rate of an individual test over time (addressing the peeking problem). FDR correction controls the proportion of false discoveries across multiple tests. You can — and should — use both: sequential monitoring for each test, then BH correction when evaluating the batch of results.

What does BH correction actually do to reported results?

It reclassifies marginal winners — tests that barely crossed the 0.05 threshold — as inconclusive. Strong results (p < 0.01) are rarely affected. In practice, BH correction at q = 0.10 typically reclassifies 10–20% of uncorrected winners. Those reclassified tests are disproportionately likely to be false positives, so you are losing noise, not signal.

How is the FDR problem different from the peeking problem?

The peeking problem is about repeated analysis of a single test over time — checking results before the sample size is reached. FDR is about the accumulation of false positives across multiple separate tests. Peeking inflates the error rate of individual tests; uncorrected multiplicity inflates the proportion of false discoveries across your portfolio. They are distinct problems that require distinct solutions, though both stem from uncorrected multiple testing.
