What Are Frequentist and Bayesian A/B Testing?
The frequentist approach to A/B testing follows a straightforward protocol. Before launching the test, you specify a significance level (typically α = 0.05), a minimum detectable effect, and the required statistical power (typically 80-90%). From these parameters, you calculate the exact sample size needed. You run the test to completion, analyze the results once, and either reject or fail to reject the null hypothesis. The p-value tells you the probability of observing data at least as extreme as yours if the null hypothesis were true — nothing more.
The Bayesian approach works differently. You start with a prior distribution that encodes your beliefs about the likely effect size before seeing any data. As data accumulates, Bayes' theorem updates this prior into a posterior distribution — a probability distribution over all possible effect sizes given the data. From this posterior, you can compute statements like 'there is a 94% probability that the treatment is better than the control' or 'the expected lift is 3.2% with a 95% credible interval of [1.1%, 5.4%].'
| Property | Frequentist | Bayesian |
|---|---|---|
| Core question | How surprising is this data under the null hypothesis? | What is the probability that the treatment is better? |
| Key output | p-value, confidence interval | Posterior distribution, credible interval |
| Prior information | Not used in the test itself | Required — explicitly specified |
| Error control | Fixed Type I and Type II error rates | No direct frequentist error rate guarantee |
| Sample size | Pre-determined via power analysis | Can be flexible (with caveats) |
| Interpretation | 'We reject the null at 5% significance' | 'There's a 94% probability treatment is better' |
Both frameworks, when applied correctly, can produce valid inferences. The question is not which framework is 'right' in some absolute sense — it's which framework provides better guardrails, transparency, and decision quality for the specific context of commercial experimentation.
What Are the Advantages of Frequentist Testing?
Transparent Error Rate Control
The single most important advantage of frequentist testing in a commercial setting is that you know exactly how often your process will produce wrong decisions. If you set α = 0.05 and run your tests properly, then in the long run no more than 5% of tests where the treatment truly has no effect will be declared winners. If you design for 80% power, you will detect 80% of real effects at the minimum detectable size (and a larger share of bigger effects). These are not aspirational targets — they are mathematical guarantees conditional on the test being properly executed.
This matters because experimentation programs operate at scale. If you run 100 tests per year on changes that mostly have no real effect, a genuinely controlled 5% false positive rate means roughly 5 of those null tests will still cross the threshold and show up among your 'winners.' You can budget for that. If your false positive rate is uncontrolled — as it is in many Bayesian implementations without careful calibration — you have no idea whether 5% or 25% of your null tests will surface as winners. You lose the ability to reason about the reliability of your entire program.
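The long-run guarantee is easy to check by simulation. The sketch below runs many A/A tests (both arms share the same true rate, so every significant result is noise) through a pooled two-proportion z-test; the traffic and conversion figures are illustrative:

```python
import random
from statistics import NormalDist

random.seed(42)

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A/A tests: both arms share the same 5% true rate, so any 'winner' is noise
sims, n, p_true, hits = 2000, 1000, 0.05, 0
for _ in range(sims):
    ca = sum(random.random() < p_true for _ in range(n))
    cb = sum(random.random() < p_true for _ in range(n))
    if z_test_p(ca, n, cb, n) < 0.05:
        hits += 1
false_positive_rate = hits / sims   # hovers near the nominal 5%
```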
Pre-Registration Discipline
Frequentist testing forces you to make decisions before you see the data: what is the hypothesis, what is the primary metric, what is the minimum detectable effect, and how large is the sample? This is not bureaucratic overhead — it's a feature. Pre-registration prevents the most damaging forms of analytical flexibility: choosing the metric that looks best after the test, changing the hypothesis to match the data, or redefining 'success' to match the outcome.
No Prior Specification Problem
Frequentist methods do not require you to specify a prior distribution over effect sizes. This eliminates an entire category of subjectivity from the analysis. In practice, most Bayesian A/B testing tools use 'uninformative' or 'weakly informative' priors — but the choice of prior always influences the posterior, especially with small sample sizes. Two analysts with different priors looking at the same data will reach different conclusions. With frequentist methods, the same data and the same pre-registered analysis plan produce the same result, regardless of the analyst's beliefs.
Reproducibility and Auditability
When a stakeholder asks 'why did we ship this change?' the answer in a frequentist framework is fully traceable: 'we pre-registered a test with 80% power to detect a 2% lift at α = 0.05, ran it for 4 weeks to reach the required sample size, and observed a p-value of 0.018 with a confidence interval of [0.8%, 3.4%].' Every number is transparent. Every decision point is auditable. There is no prior to debate, no subjective calibration to question.
What Are the Arguments for Bayesian Testing?
Intuitive Probability Statements
The most commonly cited advantage of Bayesian A/B testing is the intuitiveness of its output. Instead of a p-value (which is notoriously misinterpreted), you get direct probability statements: 'there is a 96% probability that variation B is better than the control.' For stakeholders without statistical training, this is easier to understand and act on. The appeal is real — p-values do require careful interpretation, and the frequentist framing is genuinely counterintuitive for most people.
Flexible Stopping Rules
In theory, Bayesian methods allow you to stop a test at any point and interpret the current posterior. There's no formal requirement for a pre-determined sample size — you can accumulate data, check the posterior, and stop when you're sufficiently confident. This flexibility is attractive for teams under time pressure or for tests with limited traffic.
Incorporation of Prior Knowledge
If you have genuine, well-calibrated prior information about likely effect sizes — for example, from thousands of previous experiments in the same domain — Bayesian methods provide a formal mechanism to incorporate that knowledge. This can improve estimation efficiency, particularly for small samples. A well-specified informative prior can effectively 'shrink' extreme estimates toward more plausible values.
Why Do the Bayesian Advantages Break Down in Practice?
Flat Priors Make Bayesian Tests Frequentist in Disguise
Here is the inconvenient truth about most commercial Bayesian A/B testing tools: they use uninformative (flat or very weakly informative) priors. When the prior is uninformative, the posterior is dominated entirely by the likelihood — which means the Bayesian credible interval is numerically identical to the frequentist confidence interval. The '96% probability that B is better' is, in these cases, mathematically the same statement as 'the two-sided p-value is 0.08.' The intuitive framing is a relabeling, not a different analysis.
This is not a minor technical point. If your Bayesian tool uses a flat prior — and most do, because specifying an informative prior requires expertise that most users lack — then you are getting a frequentist analysis with a Bayesian label. The 'advantage' of intuitive probability statements evaporates because those statements don't mean what users think they mean without a properly specified prior.
'Probability of Being Best' Is Misleading Without Loss Functions
Bayesian A/B testing tools frequently report a 'probability of being best' metric — say, 92%. Teams use this as a decision rule: 'if the probability of being best exceeds 90%, ship it.' The problem is that this metric ignores the magnitude of the difference. A variant could have a 92% probability of being better than control, but if the expected improvement is 0.02% with a wide credible interval spanning [-1%, +1%], shipping it is a bad decision. Without a loss function that penalizes wrong decisions proportionally to their cost, 'probability of being best' is an incomplete decision criterion.
Flexible Stopping Isn't Free
The claim that Bayesian methods allow 'stopping whenever you want' is technically true but practically misleading. If you stop a Bayesian test early whenever the posterior probability exceeds a threshold, you inflate the rate at which you ship false positives — just as with frequentist peeking. The posterior probability at any given moment is a valid summary of current evidence, but using it as a stopping rule without calibration produces the same operational problems that sequential testing was designed to solve.
Properly calibrated Bayesian stopping rules — those that control the long-run false positive rate — end up requiring sample sizes comparable to or larger than frequentist sequential testing designs. The flexibility is not a free parameter; it's paid for with either larger samples or higher error rates. As Georgi Georgiev has documented extensively, when you hold Bayesian and frequentist methods to the same operational standard — the same false positive rate, the same power — the sample size requirements converge.
Prior Sensitivity Is a Real Problem
The prior distribution is the Achilles heel of applied Bayesian testing. If you use an uninformative prior, your analysis is effectively frequentist (see above). If you use an informative prior, your results depend on that choice — and different reasonable priors can lead to materially different conclusions, especially with small samples. In a commercial setting where multiple stakeholders need to trust the result, introducing a subjective prior creates an attack surface for disagreement: 'I don't agree with your prior, so I don't trust the result.'
| Claimed Advantage | Reality Check |
|---|---|
| Intuitive probability statements | Only meaningful with a proper prior. With flat priors, numerically identical to frequentist results with a different label. |
| Flexible stopping rules | Uncalibrated stopping inflates false positives. Calibrated stopping matches frequentist sequential sample sizes. |
| Works with small samples | Only if the prior is informative — which means results depend on prior choice, not just data. |
| No need for p-values | 'Probability of being best' without a loss function is equally prone to misinterpretation. |
| Incorporates prior knowledge | Genuine advantage when priors are well-calibrated. Rare in practice — most teams use flat priors. |
When Does Each Framework Make Sense?
This is not a blanket dismissal of Bayesian statistics — it's a claim about defaults. The frequentist framework provides stronger default guardrails for the failure modes most common in commercial experimentation: underdisciplined stopping, post-hoc metric selection, and overconfident shipping decisions. Bayesian methods can be rigorous too, but their rigor requires more expertise and more deliberate calibration.
Use Frequentist Methods When…
- You need explicit error rate control: Any program where you need to state 'our false positive rate is controlled at 5%' — board reports, executive summaries, regulatory contexts.
- You run a high-volume program: At scale, the long-run properties of frequentist error control compound. Across hundreds of tests, controlled error rates protect the integrity of the entire program.
- Multiple stakeholders need to trust results: Frequentist results don't depend on analyst-chosen priors. The same data and analysis plan produce the same conclusion — full stop.
- You want auditable decision trails: Every decision can be traced back to a pre-registered plan, an observed test statistic, and a pre-specified threshold. No subjective components.
- You use sequential testing: Frequentist sequential methods (group sequential, alpha spending) are mature, well-understood, and provide the same early-stopping benefits that Bayesian proponents claim — with explicit error guarantees.
Consider Bayesian Methods When…
- You have genuine, well-calibrated prior information: If you've run thousands of similar experiments and can specify an empirical prior on effect sizes, Bayesian shrinkage can improve estimation accuracy.
- You're doing multi-armed bandit allocation: Thompson sampling and other bandit algorithms are naturally Bayesian and are appropriate for continuous optimization problems (not hypothesis testing).
- You're in an exploratory phase: When you're screening many ideas quickly and formal error control is less important than directional learning, Bayesian methods can provide useful summaries.
- Sample sizes are genuinely tiny: For B2B contexts with very few conversions per week, informative priors can stabilize otherwise useless estimates — but only if the priors are genuinely well-calibrated.
Why DRIP Uses Frequentist Methods as the Default
Our decision to default to frequentist methods was not ideological — it was empirical. After running thousands of experiments across 90+ e-commerce brands, we've found that the failure modes of commercial experimentation are overwhelmingly failures of discipline, not failures of statistical sophistication. Teams stop tests early. They change the primary metric after seeing results. They ship 'winners' that were never properly powered. Frequentist methods, with their rigid pre-registration requirements and fixed error budgets, directly address these failure modes.
Every experiment in our program follows the same protocol: hypothesis specification before launch, power analysis determining the sample size, a fixed significance threshold of α = 0.05, and group sequential monitoring with O'Brien-Fleming alpha spending boundaries for optional interim analyses. The analysis plan is documented before data collection begins. There is no room for analytical flexibility, motivated reasoning, or post-hoc rationalization.
This does not mean we ignore Bayesian ideas entirely. We use empirical Bayes shrinkage when estimating effect sizes across experiments for portfolio-level analysis. We use Bayesian thinking when setting priors on likely effect sizes during the planning phase — not to influence the test analysis, but to calibrate our expectations about what sample size we'll need. The key distinction is that the hypothesis test itself — the mechanism by which we decide to ship or not ship — remains frequentist. The error control stays explicit. The decision trail stays auditable.
The question we always return to is simple: if a stakeholder challenges a result six months from now, can we point to a pre-registered plan, a controlled error rate, and a transparent analysis? With frequentist methods, the answer is always yes. That auditability is worth more than any amount of intuitive probability framing.
See how DRIP's methodology drives reliable results →