What Are Multi-Armed Bandits?
The name comes from a gambling analogy: imagine a row of slot machines (one-armed bandits), each with an unknown payout rate. You want to maximize your total winnings across a fixed number of pulls. Pulling the same machine every time risks missing a better one. Pulling randomly wastes pulls on bad machines. Bandit algorithms try to find the optimal balance — exploring enough to identify the best machine, then exploiting that knowledge to maximize payout.
In the context of website optimization, each 'arm' is a variant (control, variant A, variant B), and each 'pull' is a visitor. The bandit observes conversion outcomes and adjusts traffic allocation dynamically, sending more traffic to variants that appear to be performing well.
The Major Bandit Algorithms
| Algorithm | How It Works | Key Tradeoff |
|---|---|---|
| Epsilon-Greedy | Sends (1-ε) traffic to current best, ε traffic randomly to explore | Simple but crude — exploration rate is fixed, not adaptive |
| Upper Confidence Bound (UCB) | Selects the arm with the highest optimistic estimate (mean + uncertainty bonus) | Explores uncertain arms automatically, but assumes stationary rewards |
| Thompson Sampling | Samples from posterior distributions of each arm's conversion rate, selects the arm with the highest sample | Elegant and efficient, but the choice of prior affects convergence speed |
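The selection rules in the first two rows can be sketched in a few lines of Python. The observed rates, pull counts, and ε value below are illustrative, not from any particular platform:

```python
import math
import random

def epsilon_greedy(means, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit the best."""
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def ucb1(means, pulls, total_pulls):
    """UCB1: play the arm with the highest mean plus uncertainty bonus."""
    def score(i):
        if pulls[i] == 0:
            return float("inf")  # unexplored arms get pulled first
        return means[i] + math.sqrt(2 * math.log(total_pulls) / pulls[i])
    return max(range(len(means)), key=score)

# Three arms with observed conversion rates and pull counts (illustrative)
means, pulls = [0.030, 0.034, 0.028], [400, 380, 50]
choice = ucb1(means, pulls, sum(pulls))  # picks the under-explored third arm
```

Note how UCB1's uncertainty bonus dominates for the barely-explored third arm: the algorithm explores it not because it looks good, but because too little is known about it.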
Thompson Sampling has become the most popular bandit approach in testing platforms because it naturally balances exploration and exploitation without a manually tuned parameter. It maintains a probability distribution over each variant's true conversion rate and selects variants in proportion to the probability that they are the best. As data accumulates, the distributions narrow and traffic concentrates on the leading variant.
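A minimal sketch of that loop, assuming binary conversion outcomes and a uniform Beta(1, 1) prior on each variant's rate (the true rates below are invented for illustration):

```python
import random

def thompson_select(successes, failures):
    """Sample each arm's Beta posterior; play the arm with the highest sample."""
    samples = [random.betavariate(s + 1, f + 1)  # uniform Beta(1, 1) prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Toy loop with assumed true rates (3.0% vs 3.5%), not real experiment data
random.seed(0)
true_rates = [0.030, 0.035]
successes, failures = [0, 0], [0, 0]
for _ in range(20_000):
    arm = thompson_select(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1  # conversion observed
    else:
        failures[arm] += 1
pulls = [successes[i] + failures[i] for i in range(2)]
```

Because the Beta distribution is conjugate to binary outcomes, each posterior update is a single counter increment, which is one reason Thompson Sampling is cheap to run at production traffic volumes.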
Regret Minimization vs. Learning: The Core Distinction
This is the most important conceptual distinction in the entire bandits-vs-A/B-tests debate, and it is frequently glossed over. The two methods optimize for fundamentally different objectives, and which objective matters more depends on the ratio between your test duration and your deployment duration.
Consider the math. An A/B test runs for 4 weeks and identifies the correct winner with 95% confidence. A bandit runs for 4 weeks and sends more traffic to the apparent winner during the test, saving some conversions — but its final conclusion about which variant is actually better carries less statistical certainty. If the winning variant will be deployed for 12 months, the value of being correct about the winner dwarfs the value of marginal conversion gains during the 4-week test. A 1% improvement deployed for 12 months is worth roughly 13x as much as the same improvement during a 4-week test (52 weeks versus 4).
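The arithmetic is easy to verify with illustrative numbers (assumed here: 100,000 visitors per week, a 3% baseline conversion rate, a 1% relative lift):

```python
# Illustrative assumptions, not figures from the article
visitors_per_week = 100_000
baseline_rate = 0.03
relative_lift = 0.01  # a 1% relative improvement

extra_per_week = visitors_per_week * baseline_rate * relative_lift  # ~30 conversions

test_weeks, deploy_weeks = 4, 52
test_value = extra_per_week * test_weeks      # ~120 conversions during the test
deploy_value = extra_per_week * deploy_weeks  # ~1,560 conversions once deployed

# Even if a bandit captured the entire lift for the whole test (it cannot,
# since it still has to explore), the deployment period is worth ~13x more
ratio = deploy_value / test_value
```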
The regret minimization framing makes bandits look attractive because it focuses on the test period in isolation. But experimentation programs do not exist in isolation. The entire point of running a test is to make a decision that will affect millions of future visitors. When you frame it that way — total value across both the test period and the deployment period — the learning advantage of A/B tests dominates in nearly every realistic e-commerce scenario.
This is not a theoretical argument. Across thousands of experiments in our database, the median test duration is roughly 6 weeks — and the median deployment period for a winning variant is well over a year. The test period accounts for less than 10% of the total value window. Optimizing the test period at the cost of decision quality is, in the language of bandits themselves, maximizing short-term reward at the expense of long-term regret.
Why Bandits Fail in Practice for Most E-Commerce Tests
Even if you accept the regret minimization framing, bandit algorithms face several practical challenges in e-commerce environments that their theoretical foundations do not account for. These are not edge cases — they are the norm.
Delayed conversions corrupt reward signals
Bandit algorithms update their allocation based on observed rewards. In e-commerce, the primary reward — a purchase — often occurs hours or days after the initial visit. A visitor who sees a product page on Monday may convert on Thursday. During those intervening days, the bandit has been making allocation decisions based on incomplete data. Variants that drive longer consideration cycles will appear to underperform in the short term, and the bandit will route traffic away from them — potentially suppressing the very variant that produces the highest total conversion rate.
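A toy simulation makes the failure mode concrete. All numbers below are assumptions for illustration: variant B truly converts better, but its purchases report three days late, and a naive greedy allocator reallocates daily on whatever data it can see:

```python
import random

random.seed(42)

# Illustrative setup: B converts better but its conversions report 3 days late
true_rate = {"A": 0.030, "B": 0.035}
report_delay = {"A": 0, "B": 3}

visits = {"A": 0, "B": 0}
reported = {"A": 0, "B": 0}
pending = []  # (reporting_day, variant) for conversions not yet observed

def greedy_pick():
    """Allocate to whichever variant looks best on *observed* data."""
    def observed_rate(v):
        return reported[v] / visits[v] if visits[v] else 0.0
    return max(("A", "B"), key=observed_rate)

for day in range(14):
    # Conversions whose reporting day has arrived become visible
    for d, v in pending:
        if d <= day:
            reported[v] += 1
    pending = [(d, v) for d, v in pending if d > day]

    for _ in range(1000):  # 1,000 visitors per day
        v = random.choice(("A", "B")) if day == 0 else greedy_pick()
        visits[v] += 1
        if random.random() < true_rate[v]:
            pending.append((day + report_delay[v], v))

# B's delayed conversions make it look worse than it is, so the greedy
# allocator routes most of the traffic to A, the inferior variant
```

In this toy run, B only earns traffic on the days its backlog of delayed conversions lands, then gets starved again while the next batch is in flight, exactly the suppression pattern described above.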
Non-stationary environments violate core assumptions
Bandit algorithms assume that the true conversion rate of each variant is fixed (stationary). In e-commerce, this assumption is routinely violated. Conversion rates fluctuate by day of week, time of day, marketing campaign cycles, inventory changes, and seasonal patterns. A variant that outperforms on weekdays may underperform on weekends. The bandit adapts to this noise as if it were signal, constantly adjusting allocation in response to fluctuations that have nothing to do with variant quality.
Sample ratio mismatch undermines validity
When a bandit shifts traffic allocation, it creates different sample sizes across variants. This introduces sample ratio mismatch — a condition that in standard A/B testing is treated as a serious diagnostic red flag. Unequal sample sizes reduce statistical power for comparing the underexplored variants, make it harder to detect bugs in implementation, and introduce compositional bias if the traffic allocation mechanism interacts with user segments in unexpected ways.
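The standard SRM diagnostic is a chi-square goodness-of-fit test of observed counts against the intended split. A minimal sketch, with illustrative counts:

```python
def srm_chi_square(observed_counts, intended_ratios):
    """Chi-square goodness-of-fit statistic against the intended traffic split."""
    total = sum(observed_counts)
    expected = [total * r for r in intended_ratios]
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))

# Intended 50/50 split; observed counts after allocation has drifted (illustrative)
stat = srm_chi_square([61_200, 38_800], [0.5, 0.5])
# With df = 1, the critical value at p = 0.001 is about 10.83;
# a statistic this large is a serious red flag in a standard A/B test
srm_detected = stat > 10.83
```

Under a bandit, a check like this fires by design, which is precisely the point: the diagnostic that normally catches implementation bugs can no longer distinguish a bug from the algorithm doing its job.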
| Challenge | Impact on Bandits | Impact on A/B Tests |
|---|---|---|
| Delayed conversions | Allocation decisions based on incomplete data; biases against high-consideration variants | No impact — equal allocation is independent of conversion timing |
| Day-of-week effects | Algorithm chases noise; over-allocates to variants that happen to match recent traffic patterns | Balanced across all days by design; no allocation bias |
| Multiple metrics | Must choose one reward signal; can miss regressions on secondary metrics | All metrics measured with equal statistical rigor |
| Segment-level effects | Allocation optimizes for aggregate; can mask harmful effects on subgroups | Equal allocation enables clean segment analysis post-test |
When Do Bandits Genuinely Outperform A/B Tests?
It would be intellectually dishonest to dismiss bandits entirely. There are legitimate use cases where the explore-exploit tradeoff favors bandits over standard A/B tests. The common thread is that in these scenarios, the test period is the deployment period — eliminating the learning advantage of A/B tests.
Short-lived promotions and time-limited offers
A Black Friday promotion runs for 4 days. There is no 'deployment period' — the campaign ends on Monday. In this context, every conversion saved during the test is final value, not a means to future value. Bandit algorithms can identify the better creative or offer copy within hours and shift traffic accordingly, capturing more total conversions than an A/B test that would still be collecting data when the promotion ends.
Content personalization at scale
Recommendation engines, content feeds, and homepage personalization are continuous optimization problems, not one-time tests. The 'right' content changes constantly based on inventory, user context, and trending behavior. Bandits are well-suited here because the objective is ongoing optimization, not a definitive answer about which variant is better. The contextual bandit — a variant that conditions allocation on user features — is the standard approach in recommendation systems for good reason.
Many-arm problems with rapid turnover
Testing 50 email subject lines or 30 ad creatives simultaneously is infeasible with standard A/B tests — the sample size required per arm is prohibitive. Bandits excel here because they can quickly identify the bottom performers and stop wasting traffic on them, concentrating exploration on the promising candidates. The statistical rigor on any single comparison is lower, but the practical outcome — finding a good option fast — is what matters.
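One simple screening scheme in this spirit is successive halving: send a batch on every surviving arm, then eliminate the bottom half each round. A toy sketch with assumed (randomly generated) open rates:

```python
import random

random.seed(7)

# Assumed setup for illustration: 20 subject lines with unknown open rates
true_rates = [random.uniform(0.01, 0.06) for _ in range(20)]
surviving = list(range(20))
sends = {i: 0 for i in surviving}
opens = {i: 0 for i in surviving}

while len(surviving) > 3:
    for arm in surviving:
        for _ in range(500):  # batch of sends per surviving arm
            sends[arm] += 1
            if random.random() < true_rates[arm]:
                opens[arm] += 1
    # Rank by observed open rate and drop the bottom half
    surviving.sort(key=lambda i: opens[i] / sends[i], reverse=True)
    surviving = surviving[: max(3, len(surviving) // 2)]

# 'surviving' now holds the finalists to carry into a proper A/B test
```

Each elimination round is noisy, so a true top performer can occasionally be cut early; that is the rigor being traded away for speed.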
- Use bandits when: the test IS the deployment (no post-test rollout), you have many variants with rapid turnover, or the optimization is continuous rather than a one-time decision.
- Use A/B tests when: the winner will be deployed for months or years, you need reliable effect estimates for business cases, you care about secondary metrics, or you need to understand why something works (not just that it does).
The Platform Problem: How Bandit Implementations Mislead Teams
The theoretical debate about bandits vs. A/B tests is nuanced. The way testing platforms market bandit features is not. Phrases like 'automatically optimize,' 'no wasted traffic,' and 'smarter than A/B testing' are common, and they create the impression that bandits are a free upgrade — all of the benefits of A/B testing with none of the costs.
In practice, platform implementations often have significant limitations that are buried in documentation or not disclosed at all. The most common issues include opaque allocation logic (you cannot verify what the algorithm is actually doing), no correction for delayed conversions, no formal significance testing on the final results, and no ability to analyze secondary metrics or segments with the same rigor as a properly controlled A/B test.
This is not a hypothetical concern. Teams that rely on bandit-selected 'winners' without proper validation frequently discover that the selected variant does not outperform the control when tested in a clean A/B test. The bandit identified a variant that appeared best during a noisy test period — and the unequal allocation prevented the data from revealing that the difference was not statistically reliable. For an introduction to the statistical foundations that bandits typically bypass, see our explanation of sample size requirements in A/B testing.
Why DRIP Uses Standard A/B Tests as the Default
Our position is not anti-bandit. It is pro-learning. Across thousands of experiments for 90+ e-commerce brands, we have consistently found that the decisions that drive the most long-term revenue are the ones backed by the most reliable evidence. And reliable evidence requires equal allocation, proper statistical controls, and clean measurement of all metrics — exactly the properties that bandit allocation compromises.
We have adopted a clear decision framework. For standard conversion optimization experiments — where the winning variant will be deployed for months or longer — we run properly powered A/B tests with sequential monitoring. Sequential monitoring gives us the ability to stop early on clear winners (saving time without sacrificing validity), while the equal allocation ensures every conclusion is backed by rigorous statistical evidence.
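The reason early stopping needs a formal correction can be shown with a toy A/A simulation (all numbers illustrative): peeking repeatedly at a fixed 1.96 threshold, with no true difference between arms, declares far more false winners than the nominal 5%:

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic with a pooled variance estimate."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (conv_a / n_a - conv_b / n_b) / se if se else 0.0

def aa_test_with_peeking(rate=0.03, looks=10, per_look=1000, z_crit=1.96):
    """A/A test (no real difference); peek after every batch. True = false 'winner'."""
    conv, n = [0, 0], [0, 0]
    for _ in range(looks):
        for arm in (0, 1):
            n[arm] += per_look
            conv[arm] += sum(random.random() < rate for _ in range(per_look))
        if abs(z_stat(conv[0], n[0], conv[1], n[1])) > z_crit:
            return True  # stopped early on pure noise
    return False

random.seed(3)
false_positive_rate = sum(aa_test_with_peeking() for _ in range(200)) / 200
# With 10 unadjusted looks, the false positive rate lands well above the
# nominal 5%; sequential monitoring widens the stopping boundary to prevent this
```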
| Scenario | Method | Rationale |
|---|---|---|
| Standard CRO experiment (winner deploys for months) | A/B test with sequential monitoring | Learning value dominates; need reliable effect estimates |
| Short-lived campaign (days, not weeks) | Bandit (Thompson Sampling) | No post-test deployment; regret minimization is the correct objective |
| Content personalization / recommendations | Contextual bandit | Continuous optimization problem; no single 'winner' to deploy |
| Many creative variants (10+ arms) | Bandit for screening → A/B test for validation | Bandits efficiently prune; A/B tests validate the finalists |
The screening-then-validation approach for many-arm problems deserves emphasis. When a client wants to test 20 product page layouts, we use a bandit phase to quickly identify the top 2-3 candidates, then run a proper A/B test between the finalists. This captures the efficiency advantage of bandits in exploration while maintaining the rigor of A/B tests for the final decision. It is the best of both worlds — but only because each method is used for what it is actually good at.
