Methodology · 13 min read

How to Control False Discovery Rate When Running Multiple A/B Tests

A mature experimentation program runs dozens of tests per quarter. Without multiple-testing correction, your reported winners accumulate false positives at an alarming rate. Here is how FDR control keeps your decision pipeline honest — without crippling your velocity.

Fabian Gmeindl, Co-Founder, DRIP Agency · March 13, 2026
📖 This article is part of our series The Complete Guide to A/B Testing for E-Commerce

False Discovery Rate (FDR) is the expected proportion of declared winners that are actually false positives. When an experimentation program runs many simultaneous or sequential tests at α = 0.05, the probability that at least some winners are noise grows rapidly. The Benjamini-Hochberg procedure controls FDR by ranking p-values and adjusting significance thresholds, keeping the proportion of false discoveries below a chosen level (typically 5–10%) without the extreme conservatism of familywise error rate control.

Contents
  1. What Is False Discovery Rate?
  2. FDR vs. Familywise Error Rate: Why the Distinction Matters
  3. The Benjamini-Hochberg Procedure: Step by Step
  4. When Should You Apply FDR Correction?
  5. How DRIP Implements FDR Control Across 90+ Brands
  6. Building FDR Control Into Your Experimentation Program

What Is False Discovery Rate?

False Discovery Rate is the expected proportion of rejected null hypotheses (declared winners) that are actually false positives. It answers: of all the tests we called significant, what fraction are wrong?

Every A/B test has a chance of producing a false positive — a result that looks like a win but is really noise. When you run a single test at α = 0.05, you accept a 5% chance of this happening if the change has no real effect. The problem is arithmetic: across 20, 40, or 100 tests per quarter, the expected number of false positives scales linearly with the number of tests of ineffective variants. At 20 such tests with α = 0.05, you expect 1 false positive; at 60, you expect 3.
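That scaling is easy to verify, as the short Python sketch below shows (it assumes the worst case where every tested variant is truly neutral, i.e. all null hypotheses are true):

```python
# Expected false positives, and the chance of at least one, when every
# null hypothesis is true (no tested variant has a real effect).
def false_positive_stats(num_tests, alpha=0.05):
    expected_fp = num_tests * alpha                 # grows linearly
    p_at_least_one = 1 - (1 - alpha) ** num_tests   # compounds quickly
    return expected_fp, p_at_least_one

for n in (20, 60, 100):
    exp_fp, p_any = false_positive_stats(n)
    print(f"{n} tests: expect {exp_fp:.0f} false positives, "
          f"P(at least one) = {p_any:.0%}")
```

At 20 tests this gives an expectation of 1 false positive and roughly a 64% chance of at least one; by 100 tests a spurious winner is all but guaranteed.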

False Discovery Rate formalizes this concern. Formally, FDR = E[V/R], where V is the number of false positives among your rejections and R is the total number of rejections. If you declare 10 tests significant and 2 of them are actually false positives, your FDR is 20%. The concept was introduced by Benjamini and Hochberg (1995) and has become the standard framework for multiple-testing correction in high-throughput experimentation — from genomics to digital experimentation.

  • 1 in 20: expected false positives per 20 tests, at α = 0.05 with no correction
  • 3–4: expected false wins in a 60-test quarter, enough to materially distort reported program value
  • 5–10%: typical FDR target for CRO programs, balancing discovery rate against decision quality
Counterintuitive Finding
A 5% significance level does not mean only 5% of your declared winners are false. It means each individual test has a 5% false positive probability. When you aggregate many tests, the proportion of false discoveries among your winners depends on how many of your tested hypotheses actually have real effects — the base rate of true positives matters enormously.
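The base-rate dependence described above can be made concrete with a small sketch (the 10% base rate and 80% statistical power below are illustrative assumptions, not figures from this article):

```python
# Expected share of declared winners that are false, as a function of
# the base rate of true effects among the hypotheses you test.
def expected_fdr(base_rate, alpha=0.05, power=0.80):
    false_wins = (1 - base_rate) * alpha   # true nulls wrongly rejected
    true_wins = base_rate * power          # real effects correctly detected
    return false_wins / (false_wins + true_wins)

# If only 10% of tested hypotheses have real effects, 36% of
# uncorrected "winners" are expected to be noise, despite alpha = 0.05.
print(f"{expected_fdr(0.10):.0%}")
```

The better your hypothesis pipeline (higher base rate of true effects), the smaller the fraction of winners that are noise; with weak hypotheses, uncorrected results can be mostly false.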

FDR vs. Familywise Error Rate: Why the Distinction Matters

Familywise Error Rate (FWER) controls the probability of making even one false discovery. FDR controls the proportion of false discoveries among rejections. FWER is appropriate for safety-critical decisions; FDR is appropriate for experimentation programs where some false positives are tolerable.

The oldest approach to multiple testing is controlling the Familywise Error Rate — the probability of making at least one Type I error across all tests. The Bonferroni correction is the canonical example: divide your significance level by the number of tests. Running 20 tests? Each one needs p < 0.0025 to be significant. This is extremely conservative. It works well when any single false positive is catastrophic — clinical drug trials, for instance. It works poorly for experimentation programs.
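Bonferroni's arithmetic can be checked in a few lines (a sketch, using the 20-test family from the paragraph above):

```python
# Bonferroni: divide the significance level by the number of tests so
# the probability of even one false positive stays at or below alpha.
def bonferroni_threshold(num_tests, alpha=0.05):
    return alpha / num_tests

per_test = bonferroni_threshold(20)        # 0.0025, as in the text
# Resulting familywise error rate if all 20 nulls are true:
fwer = 1 - (1 - per_test) ** 20
print(per_test, round(fwer, 4))
```

The familywise error rate comes out just under 0.05 (about 0.049), at the cost of demanding p < 0.0025 from every individual test.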

FWER vs. FDR control: practical comparison for CRO programs

| Property | FWER (e.g. Bonferroni) | FDR (e.g. Benjamini-Hochberg) |
| --- | --- | --- |
| Controls | P(at least 1 false positive) | Expected proportion of false positives among winners |
| Threshold with 20 tests | p < 0.0025 | Varies; typically p < 0.01–0.04 |
| Statistical power | Very low — many real effects missed | Moderate — better at detecting true effects |
| Best for | Safety-critical decisions, few tests | High-throughput experimentation, discovery |
| Practical consequence | Declares almost nothing significant | Maintains a controlled rate of false discoveries |

For a CRO program running dozens of experiments per quarter, FWER control is a velocity killer. If you apply Bonferroni across 40 concurrent tests, the effective significance level is 0.00125 per test. You will detect only the very largest effects. Most real but moderate improvements — the 3–5% uplifts that compound into meaningful annual revenue gains — will be missed. FDR control trades a small, controlled proportion of false discoveries for dramatically higher power to detect real effects.

DRIP Insight
The pragmatic question is not “can we afford any false positives?” but “what proportion of false positives can we tolerate?” For most e-commerce experimentation programs, the answer is 5–10%. That is what FDR control delivers — and why it is the right framework for CRO at scale.

The Benjamini-Hochberg Procedure: Step by Step

The Benjamini-Hochberg (BH) procedure ranks p-values from smallest to largest, then finds the largest p-value that falls below a linearly increasing threshold. All tests with p-values at or below that rank are declared significant, controlling FDR at the specified level.

The Benjamini-Hochberg procedure is the most widely used FDR control method. It requires only the p-values from your tests, makes minimal assumptions, and can be computed in seconds. Here is the exact procedure.

  1. Collect all p-values. Gather the p-values from every test in your analysis family — this could be all tests in a quarter, a sprint, or a related experiment cluster.
  2. Rank them from smallest to largest. Label them p(1) ≤ p(2) ≤ … ≤ p(m), where m is the total number of tests.
  3. Calculate the BH threshold for each rank. For rank i, the threshold is (i/m) × q, where q is your target FDR level (e.g. 0.05 or 0.10).
  4. Find the largest significant rank. Starting from the largest rank, find the first p-value where p(i) ≤ (i/m) × q. Call this rank k.
  5. Reject all hypotheses with rank ≤ k. All tests with p-values at or below p(k) are declared significant. The rest are not.
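The five steps can be sketched in a few lines of pure Python (a minimal illustration; library routines such as statsmodels' `multipletests(..., method='fdr_bh')` implement the same procedure):

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return a reject/keep flag per test, controlling FDR at level q."""
    m = len(p_values)
    # Steps 1-2: rank the p-values, remembering each test's original position.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 3-4: scan from the largest rank down to find the cutoff rank k,
    # the largest i with p(i) <= (i/m) * q.
    k = 0
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= rank / m * q:
            k = rank
            break
    # Step 5: reject every hypothesis at rank <= k.
    reject = [False] * m
    for rank in range(k):
        reject[order[rank]] = True
    return reject

# Four tests from a hypothetical sprint, in arbitrary order:
print(benjamini_hochberg([0.003, 0.20, 0.011, 0.024], q=0.10))
# -> [True, False, True, True]
```

Note that the input order does not matter: the function ranks internally and maps the decisions back to the original positions.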

Worked Example: 10 Tests in a Sprint

Benjamini-Hochberg procedure applied to 10 tests (q = 0.10)

| Rank (i) | Test | p-value | BH threshold (i/10 × 0.10) | Significant? |
| --- | --- | --- | --- | --- |
| 1 | Checkout CTA color | 0.003 | 0.010 | Yes |
| 2 | PDP image size | 0.011 | 0.020 | Yes |
| 3 | Cart urgency timer | 0.024 | 0.030 | Yes |
| 4 | Nav menu layout | 0.045 | 0.040 | No † |
| 5 | Homepage hero copy | 0.055 | 0.050 | No |
| 6 | Search bar placement | 0.061 | 0.060 | No |
| 7 | Filter UX redesign | 0.110 | 0.070 | No |
| 8 | Footer CTA | 0.230 | 0.080 | No |
| 9 | Category page sort | 0.410 | 0.090 | No |
| 10 | Wishlist button | 0.780 | 0.100 | No |

In this example, the largest rank where p(i) ≤ threshold is rank 3 (0.024 ≤ 0.030). Therefore, tests at ranks 1 through 3 are declared significant. Note that test 4 (p = 0.045) would be significant at α = 0.05 in isolation — but after BH correction, it is not. The † marker highlights this: naive analysis would have called it a winner; corrected analysis does not.

Pro Tip
The BH procedure is a step-up method: scan the ranked list from the largest p-value downward and stop at the first rank whose p-value falls at or below its threshold. Every test at or below that rank is rejected, including any whose p-values exceed their own individual thresholds. This differs from Bonferroni, which evaluates each test against a single fixed threshold independently.

When Should You Apply FDR Correction?

Apply FDR correction whenever you are making simultaneous or sequential decisions across multiple tests and the cost of a false positive is significant but not catastrophic — which describes most e-commerce experimentation programs.

Not every situation requires multiple-testing correction. The decision depends on how many tests you are evaluating together, what the cost of a false positive is, and whether you are treating each test as an independent business decision or part of a portfolio.

Scenarios Where FDR Control Is Essential

  • Multi-variant tests. Testing 4+ variants against a control means 4+ simultaneous comparisons. Without correction, the probability of at least one false positive exceeds 18%.
  • Metric families. Evaluating a single test against multiple KPIs (conversion rate, AOV, revenue per visitor, bounce rate) is multiple testing in disguise. Each metric is a separate hypothesis.
  • Segment analyses. Slicing results by device, traffic source, or customer cohort multiplies the number of comparisons. Post-hoc segment fishing is one of the most common sources of false discoveries.
  • Quarterly or sprint-level reporting. When you report “12 out of 30 tests were significant this quarter,” you are making a portfolio-level claim that should account for multiplicity.

Scenarios Where Correction Is Optional

  • Single, pre-registered A/B tests. One variant, one primary metric, one pre-specified analysis. This is the textbook setting where α = 0.05 means exactly what it says.
  • Independent business decisions with separate budgets. If two tests are run by different teams with separate decision authority and no shared reporting, they can be analyzed independently.
Common Mistake
Segment analysis after a test completes is the most dangerous form of uncorrected multiple testing. Cutting results by 5 segments on 3 metrics produces 15 comparisons. At α = 0.05, the probability of at least one spurious finding is 54%. If you report that a test “worked for mobile users on iOS,” that claim must survive correction.
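The 54% figure is straightforward to reproduce (a quick sketch):

```python
# Chance of at least one spurious "winner" when one test is sliced into
# many uncorrected segment-by-metric comparisons.
def p_spurious_finding(segments, metrics, alpha=0.05):
    comparisons = segments * metrics
    return 1 - (1 - alpha) ** comparisons

# 5 segments x 3 metrics = 15 comparisons, as in the example above
print(f"{p_spurious_finding(5, 3):.0%}")  # -> 54%
```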

How DRIP Implements FDR Control Across 90+ Brands

DRIP applies Benjamini-Hochberg correction at the experiment-family level, enforces pre-registration of primary metrics, and flags post-hoc segment analyses with separate, more stringent FDR thresholds.

Managing thousands of experiments across 90+ e-commerce brands means multiple-testing correction is not optional — it is a structural requirement. Without it, our reported results would be unreliable and our clients would ship changes that do not actually work. Here is the framework we use.

Pre-Registration and Metric Hierarchy

Every experiment at DRIP has a single pre-registered primary metric — typically conversion rate or revenue per visitor. Secondary metrics are declared before launch and analyzed with FDR correction. Post-hoc metrics are treated as exploratory and never reported as confirmed findings. This three-tier hierarchy reduces the multiplicity problem at source: fewer comparisons means less correction needed.

Family-Level BH Correction

We define “experiment families” as groups of tests that share a common decision boundary — for example, all tests in a quarterly optimization sprint for a single brand, or all variants in a multi-cell test. BH correction is applied within each family at q = 0.10 for primary metrics and q = 0.15 for secondary metrics. This means we accept that up to 10% of our declared primary-metric winners may be false discoveries — a calibrated trade-off between velocity and accuracy.

  • q = 0.10: FDR threshold for primary metrics; at most 10% of declared winners are expected to be false
  • q = 0.15: FDR threshold for secondary metrics; slightly more permissive for directional learning
  • 3-tier metric hierarchy: Primary → Secondary → Exploratory, each with its own evidence standard

Segment Analysis Guardrails

Post-hoc segment analyses — slicing by device, geography, or customer cohort — are treated as a separate multiplicity family with FDR correction at q = 0.05. This is deliberately more conservative than our primary-metric threshold because segment analyses are inherently exploratory and prone to overfitting. Any segment-level finding that survives BH correction is flagged as a candidate hypothesis to be validated in a targeted follow-up test, not a confirmed result.

DRIP Insight
The combination of pre-registration, metric hierarchy, and family-level BH correction reduces false discoveries without meaningfully slowing velocity. Across our portfolio of thousands of experiments, the correction typically reclassifies 2–4 tests per quarter per brand from “significant” to “inconclusive.” That is a small price for a trustworthy decision pipeline.

Building FDR Control Into Your Experimentation Program

Implementing FDR control requires defining experiment families, choosing a target q-level, applying BH correction at reporting time, and training stakeholders to interpret corrected results.

You do not need specialized software to implement FDR control. The Benjamini-Hochberg procedure can be applied in a spreadsheet. What you need is process discipline: clear definitions of what counts as an experiment family, a pre-committed q-level, and stakeholder buy-in that corrected results may reclassify some winners as inconclusive.

Step 1: Define Your Experiment Families

An experiment family is a group of tests whose results will be evaluated together. Common groupings include all tests in a quarterly sprint, all variants in a single multi-cell test, or all tests targeting a specific funnel stage. The choice matters: larger families require more correction, but under-grouping defeats the purpose. A reasonable default is to group by brand and reporting period.

Step 2: Set Your q-Level

The q-level is the maximum proportion of false discoveries you are willing to accept. For most CRO programs, q = 0.10 is a good starting point: you accept that up to 1 in 10 declared winners may be false. More risk-averse programs (e.g. those with high implementation costs per test) should use q = 0.05. Never set q higher than 0.15 — beyond that, the signal-to-noise ratio degrades to the point where corrected results provide little practical value.

Step 3: Apply BH at Reporting Time

Run BH correction at the end of each reporting period — typically monthly or quarterly. Collect all p-values from the family, apply the step-up procedure, and report corrected results. Individual tests can still be monitored in real time using sequential methods; it is the portfolio-level claims that get corrected.
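A reporting-time pass might look like the following sketch (the family names, the `results` structure, and the compact `bh_reject` helper are all illustrative, not any specific tool's API):

```python
# End-of-quarter pass: apply BH within each experiment family and
# report which tests survive correction.
def bh_reject(p_values, q):
    """Compact Benjamini-Hochberg step-up; returns indices of rejections."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank whose p-value falls at or below its BH threshold.
    k = max((r for r in range(1, m + 1)
             if p_values[order[r - 1]] <= r / m * q), default=0)
    return {order[r] for r in range(k)}

# One family per brand and reporting period; values are (test, p-value).
results = {
    "brand_a_q1": [("hero copy", 0.004), ("cart timer", 0.031),
                   ("nav layout", 0.049), ("footer CTA", 0.41)],
    "brand_b_q1": [("PDP images", 0.018), ("search bar", 0.22)],
}

for family, tests in results.items():
    winners = bh_reject([p for _, p in tests], q=0.10)
    for i, (name, p) in enumerate(tests):
        verdict = "significant" if i in winners else "inconclusive"
        print(f"{family}: {name} (p={p}) -> {verdict}")
```

Because each family is corrected independently, a p-value that survives in a small family can be reclassified in a larger one, which is exactly the intended behavior of family-level control.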

Step 4: Train Stakeholders

The hardest part is organizational, not statistical. Stakeholders need to understand that a test which was significant at p = 0.04 may become non-significant after BH correction — and that this is the system working correctly, not a failure. Frame it as quality control: FDR correction catches low-confidence results before they consume engineering resources.

Pro Tip
Start with retrospective analysis. Apply BH correction to your last quarter of results and show stakeholders how many “winners” would have been reclassified. This makes the abstract concept concrete. In our experience, 10–20% of uncorrected winners get reclassified — enough to be material, not enough to be demoralizing.
Want a multiplicity audit of your experimentation program? Our CRO audit includes FDR analysis. →


Frequently Asked Questions

What FDR level should an experimentation program target?

For most e-commerce experimentation programs, a target FDR (q-level) of 5–10% provides a good balance between discovery rate and decision quality. This means you accept that up to 1 in 10 declared winners may be false positives. For high-stakes decisions with large implementation costs, use q = 0.05. For rapid learning sprints where some false positives are tolerable, q = 0.10 is reasonable.

Can FDR correction be combined with sequential testing?

Yes. Sequential testing and FDR correction operate at different levels. Sequential testing controls the error rate of an individual test over time (addressing the peeking problem). FDR correction controls the proportion of false discoveries across multiple tests. You can — and should — use both: sequential monitoring for each test, then BH correction when evaluating the batch of results.

What does BH correction actually do to reported results?

It reclassifies marginal winners — tests that barely crossed the 0.05 threshold — as inconclusive. Strong results (p < 0.01) are rarely affected. In practice, BH correction at q = 0.10 typically reclassifies 10–20% of uncorrected winners. Those reclassified tests are disproportionately likely to be false positives, so you are losing noise, not signal.

How is the FDR problem different from the peeking problem?

The peeking problem is about repeated analysis of a single test over time — checking results before the sample size is reached. FDR is about the accumulation of false positives across multiple separate tests. Peeking inflates the error rate of individual tests; uncorrected multiplicity inflates the proportion of false discoveries across your portfolio. They are distinct problems that require distinct solutions, though both stem from uncorrected multiple testing.
