What Is False Discovery Rate?
Every A/B test has a chance of producing a false positive — a result that looks like a win but is really noise. When you run a single test at α = 0.05, you accept a 5% chance of this happening. The problem is arithmetic: when you run 20 tests, 40 tests, or 100 tests per quarter, the expected number of false positives scales linearly. Among 20 tests of changes with no real effect, α = 0.05 implies 1 expected false positive; among 60 such tests, it implies 3.
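The arithmetic is easy to verify. A minimal sketch, assuming independent tests that are all true nulls at level α: the expected count is m × α, and the chance of at least one false positive is 1 − (1 − α)^m.

```python
# Expected false positives and the chance of at least one,
# assuming m independent tests of true nulls at level alpha.
alpha = 0.05

for m in (1, 20, 60):
    expected = m * alpha                  # E[number of false positives]
    at_least_one = 1 - (1 - alpha) ** m   # P(at least one false positive)
    print(f"{m:>3} tests: expect {expected:.2f} false positives, "
          f"P(>=1) = {at_least_one:.0%}")
```

At 20 tests the chance of at least one false positive is already about 64%; at 60 tests it is above 95%.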
False Discovery Rate formalizes this concern. Formally, FDR = E[V/R], where V is the number of false positives among your rejections and R is the total number of rejections (with V/R defined as 0 when R = 0). If you declare 10 tests significant and 2 of them are actually false positives, your realized false-discovery proportion is 20%; FDR is the expected value of that proportion. The concept was introduced by Benjamini and Hochberg (1995) and has become the standard framework for multiple-testing correction in high-throughput settings — from genomics to digital experimentation.
FDR vs. Familywise Error Rate: Why the Distinction Matters
The oldest approach to multiple testing is controlling the Familywise Error Rate — the probability of making at least one Type I error across all tests. The Bonferroni correction is the canonical example: divide your significance level by the number of tests. Running 20 tests? Each one needs p < 0.0025 to be significant. This is extremely conservative. It works well when any single false positive is catastrophic — clinical drug trials, for instance. It works poorly for experimentation programs.
| Property | FWER (e.g. Bonferroni) | FDR (e.g. Benjamini-Hochberg) |
|---|---|---|
| Controls | P(at least 1 false positive) | Expected proportion of false positives among winners |
| Threshold with 20 tests | p < 0.0025 | Varies; typically p < 0.01–0.04 |
| Statistical power | Very low — many real effects missed | Moderate — better at detecting true effects |
| Best for | Safety-critical decisions, few tests | High-throughput experimentation, discovery |
| Practical consequence | Declares almost nothing significant | Maintains controlled rate of false discoveries |
For a CRO program running dozens of experiments per quarter, FWER control is a velocity killer. If you apply Bonferroni across 40 concurrent tests, the effective significance level is 0.00125 per test. You will detect only the very largest effects. Most real but moderate improvements — the 3–5% uplifts that compound into meaningful annual revenue gains — will be missed. FDR control trades a small, controlled proportion of false discoveries for dramatically higher power to detect real effects.
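The power gap is visible in the thresholds themselves. A quick sketch (illustrative numbers only) comparing the per-test cutoffs the two approaches impose on 40 tests at an overall level of 0.05:

```python
m, alpha = 40, 0.05

# Bonferroni: one fixed threshold shared by every test.
bonferroni = alpha / m
print(f"Bonferroni threshold: {bonferroni}")  # 0.00125 for 40 tests

# Benjamini-Hochberg: the threshold grows with rank i as (i/m) * alpha.
bh_thresholds = [(i / m) * alpha for i in range(1, m + 1)]
print(f"BH threshold at rank 1:  {bh_thresholds[0]}")
print(f"BH threshold at rank 20: {bh_thresholds[19]}")
print(f"BH threshold at rank 40: {bh_thresholds[39]}")
```

Note that at rank 1 the BH threshold coincides with Bonferroni (α/m); every later rank is more permissive, which is exactly where the extra power to detect moderate effects comes from.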
The Benjamini-Hochberg Procedure: Step by Step
The Benjamini-Hochberg procedure is the most widely used FDR control method. It requires only the p-values from your tests, makes minimal assumptions, and can be computed in seconds. Here is the exact procedure.
- Collect all p-values. Gather the p-values from every test in your analysis family — this could be all tests in a quarter, a sprint, or a related experiment cluster.
- Rank them from smallest to largest. Label them p(1) ≤ p(2) ≤ … ≤ p(m), where m is the total number of tests.
- Calculate the BH threshold for each rank. For rank i, the threshold is (i/m) × q, where q is your target FDR level (e.g. 0.05 or 0.10).
- Find the largest significant rank. Starting from the largest rank, find the first p-value where p(i) ≤ (i/m) × q. Call this rank k.
- Reject all hypotheses with rank ≤ k. All tests with p-values at or below p(k) are declared significant. The rest are not.
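The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a library API; the function and variable names are my own.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return a list of booleans: True where the hypothesis is rejected.

    BH step-up procedure: rank p-values ascending, find the largest
    rank k with p(k) <= (k/m) * q, and reject ranks 1..k.
    """
    m = len(p_values)
    # Pair each p-value with its original position, then sort ascending.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step down from the largest rank to find the cutoff rank k.
    k = 0
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= (rank / m) * q:
            k = rank
            break

    # Reject every hypothesis whose rank is <= k.
    rejected = [False] * m
    for rank in range(k):
        rejected[order[rank]] = True
    return rejected


# Hypothetical p-values from 10 tests, corrected at q = 0.10.
pvals = [0.003, 0.011, 0.024, 0.045, 0.052, 0.061, 0.110, 0.230, 0.410, 0.780]
print(benjamini_hochberg(pvals))  # first three True, the rest False
```

The function accepts p-values in any order and maps the decisions back to their original positions, so you can feed it a column straight from a results spreadsheet.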
Worked Example: 10 Tests in a Sprint
| Rank (i) | Test | p-value | BH Threshold (i/10 × 0.10) | Significant? |
|---|---|---|---|---|
| 1 | Checkout CTA color | 0.003 | 0.010 | Yes |
| 2 | PDP image size | 0.011 | 0.020 | Yes |
| 3 | Cart urgency timer | 0.024 | 0.030 | Yes |
| 4 | Nav menu layout | 0.045 | 0.040 | No † |
| 5 | Homepage hero copy | 0.052 | 0.050 | No |
| 6 | Search bar placement | 0.061 | 0.060 | No |
| 7 | Filter UX redesign | 0.110 | 0.070 | No |
| 8 | Footer CTA | 0.230 | 0.080 | No |
| 9 | Category page sort | 0.410 | 0.090 | No |
| 10 | Wishlist button | 0.780 | 0.100 | No |
In this example, the largest rank where p(i) ≤ threshold is rank 3 (0.024 ≤ 0.030); every higher rank exceeds its threshold. Therefore, tests at ranks 1 through 3 are declared significant. Note that test 4 (p = 0.045) would be significant at α = 0.05 in isolation, but after BH correction it is not. The † marker highlights this: naive analysis would have called it a winner; corrected analysis does not.
When Should You Apply FDR Correction?
Not every situation requires multiple-testing correction. The decision depends on how many tests you are evaluating together, what the cost of a false positive is, and whether you are treating each test as an independent business decision or part of a portfolio.
Scenarios Where FDR Control Is Essential
- Multi-variant tests. Testing 4+ variants against a control means 4+ simultaneous comparisons. Without correction, the probability of at least one false positive exceeds 18%.
- Metric families. Evaluating a single test against multiple KPIs (conversion rate, AOV, revenue per visitor, bounce rate) is multiple testing in disguise. Each metric is a separate hypothesis.
- Segment analyses. Slicing results by device, traffic source, or customer cohort multiplies the number of comparisons. Post-hoc segment fishing is one of the most common sources of false discoveries.
- Quarterly or sprint-level reporting. When you report “12 out of 30 tests were significant this quarter,” you are making a portfolio-level claim that should account for multiplicity.
Scenarios Where Correction Is Optional
- Single, pre-registered A/B tests. One variant, one primary metric, one pre-specified analysis. This is the textbook setting where α = 0.05 means exactly what it says.
- Independent business decisions with separate budgets. If two tests are run by different teams with separate decision authority and no shared reporting, they can be analyzed independently.
How DRIP Implements FDR Control Across 90+ Brands
Managing thousands of experiments across 90+ e-commerce brands means multiple-testing correction is not optional — it is a structural requirement. Without it, our reported results would be unreliable and our clients would ship changes that do not actually work. Here is the framework we use.
Pre-Registration and Metric Hierarchy
Every experiment at DRIP has a single pre-registered primary metric — typically conversion rate or revenue per visitor. Secondary metrics are declared before launch and analyzed with FDR correction. Post-hoc metrics are treated as exploratory and never reported as confirmed findings. This three-tier hierarchy reduces the multiplicity problem at source: fewer comparisons means less correction needed.
Family-Level BH Correction
We define “experiment families” as groups of tests that share a common decision boundary — for example, all tests in a quarterly optimization sprint for a single brand, or all variants in a multi-cell test. BH correction is applied within each family at q = 0.10 for primary metrics and q = 0.15 for secondary metrics. This means we accept that up to 10% of our declared primary-metric winners may be false discoveries — a calibrated trade-off between velocity and accuracy.
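In code, family-level correction simply means running BH once per family at that family's q-level. A sketch of the idea, with hypothetical family names and p-values, and a compact inline BH rather than any internal DRIP tooling:

```python
def bh_reject(p_values, q):
    """BH step-up: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value clears its threshold (0 if none do).
    k = max((r for r in range(1, m + 1)
             if p_values[order[r - 1]] <= (r / m) * q), default=0)
    rejected = [False] * m
    for r in range(k):
        rejected[order[r]] = True
    return rejected


# Hypothetical quarter: p-values grouped by (family, metric tier),
# with a stricter q for primary metrics than for secondary ones.
families = {
    ("brand_a_q3", "primary"):   [0.004, 0.031, 0.046, 0.210],
    ("brand_a_q3", "secondary"): [0.018, 0.090, 0.350],
}
q_by_tier = {"primary": 0.10, "secondary": 0.15}

for (family, tier), pvals in families.items():
    decisions = bh_reject(pvals, q_by_tier[tier])
    print(family, tier, decisions)
```

The key design point is that each family is corrected independently: adding tests to one brand's sprint never changes the thresholds applied to another brand's results.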
Segment Analysis Guardrails
Post-hoc segment analyses — slicing by device, geography, or customer cohort — are treated as a separate multiplicity family with FDR correction at q = 0.05. This is deliberately more conservative than our primary-metric threshold because segment analyses are inherently exploratory and prone to overfitting. Any segment-level finding that survives BH correction is flagged as a candidate hypothesis to be validated in a targeted follow-up test, not a confirmed result.
Building FDR Control Into Your Experimentation Program
You do not need specialized software to implement FDR control. The Benjamini-Hochberg procedure can be applied in a spreadsheet. What you need is process discipline: clear definitions of what counts as an experiment family, a pre-committed q-level, and stakeholder buy-in that corrected results may reclassify some winners as inconclusive.
Step 1: Define Your Experiment Families
An experiment family is a group of tests whose results will be evaluated together. Common groupings include all tests in a quarterly sprint, all variants in a single multi-cell test, or all tests targeting a specific funnel stage. The choice matters: larger families require more correction, but under-grouping defeats the purpose. A reasonable default is to group by brand and reporting period.
Step 2: Set Your q-Level
The q-level is the maximum proportion of false discoveries you are willing to accept. For most CRO programs, q = 0.10 is a good starting point: you accept that up to 1 in 10 declared winners may be false. More risk-averse programs (e.g. those with high implementation costs per test) should use q = 0.05. Never set q higher than 0.15 — beyond that, the signal-to-noise ratio degrades to the point where corrected results provide little practical value.
Step 3: Apply BH at Reporting Time
Run BH correction at the end of each reporting period — typically monthly or quarterly. Collect all p-values from the family, apply the step-up procedure, and report corrected results. Individual tests can still be monitored in real time using sequential methods, but portfolio-level claims are corrected.
Step 4: Train Stakeholders
The hardest part is organizational, not statistical. Stakeholders need to understand that a test which was significant at p = 0.04 may become non-significant after BH correction — and that this is the system working correctly, not a failure. Frame it as quality control: FDR correction catches low-confidence results before they consume engineering resources.
