What Is Sample Ratio Mismatch?
In a 50/50 A/B test with 100,000 visitors, you would expect roughly 50,000 in each group. Due to random variation, a split of 50,200 vs 49,800 is perfectly normal — the binomial standard deviation at this sample size is about 158 visitors, so a deviation of 200 is well within ordinary fluctuation. But a split of 51,500 vs 48,500 is a different story entirely. That deviation of 1,500 is roughly nine and a half standard deviations from the mean, astronomically unlikely under random assignment, and it means something systematic is pushing users disproportionately into one group.
SRM is detected using a chi-squared goodness-of-fit test that compares observed frequencies to expected frequencies. The test is simple: compute the chi-squared statistic from the deviation between observed and expected counts, then check the resulting p-value. A p-value below 0.001 is strong evidence that the split is not the result of chance — it is the result of a bug.
Why SRM Invalidates Your Results
A/B testing rests on a single foundational assumption: random assignment creates statistically equivalent groups. When randomization works, the only systematic difference between control and variant is the treatment itself. Every other characteristic — purchase intent, device type, time of day, prior behavior — is balanced across groups by the law of large numbers. SRM breaks this assumption at the root.
Consider a concrete example: your variant includes a new hero image that is 400KB larger than the control. The variant page loads 200ms slower, causing some users to bounce before the tracking pixel fires. Those bounced users are never counted in the variant group. The result: the variant group is smaller than expected (SRM detected) and is systematically enriched with more patient, higher-intent users — who would have converted at higher rates regardless of the treatment. Your test shows a 'win,' but the lift is an artifact of survivorship bias, not a genuine treatment effect.
Common Causes of SRM
| Cause | Mechanism | Direction of Bias |
|---|---|---|
| Slow variant loading | Users bounce before tracking fires | Variant group smaller, biased toward patient users |
| JavaScript errors in variant | Tracking code fails to execute | Variant group smaller, missing error-affected users |
| Bot filtering differences | Bots blocked differently per variant | Unpredictable direction |
| Redirect tests | Server-side redirects lose users in transit | Variant group smaller |
| Cookie-based assignment with deletion | Users re-randomized on return visits | Groups drift over time |
| Cache differences | CDN serves stale pages to some users | Depends on implementation |
How to Detect SRM
The Chi-Squared Test
The chi-squared goodness-of-fit test is the standard method for SRM detection. You take the observed visitor counts per group, compute the expected counts from the intended allocation ratio and total sample, then calculate the chi-squared statistic: X² = Σ (observed - expected)² / expected. With one degree of freedom (two groups), a chi-squared value above 10.83 corresponds to p < 0.001 — definitive evidence of SRM.
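The same calculation extends to unequal intended splits (say, a 90/10 rollout). A minimal two-group sketch, where `ratio_a` is the intended fraction of traffic in the first group (the helper name and signature are illustrative assumptions):

```python
from math import erfc, sqrt

def srm_check(observed_a: int, observed_b: int, ratio_a: float = 0.5):
    """Return (chi-squared statistic, p-value) for a two-group test.

    Expected counts come from the intended allocation ratio and the
    total sample; with two groups (one degree of freedom) the p-value
    is erfc(sqrt(X^2 / 2)).
    """
    total = observed_a + observed_b
    expected_a = total * ratio_a
    expected_b = total * (1 - ratio_a)
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    return chi2, erfc(sqrt(chi2 / 2))

chi2, p = srm_check(51_500, 48_500)
print(f"X² = {chi2:.1f}, p = {p:.2e}")  # X² = 90.0, well past the 10.83 cutoff
```

For the 51,500 vs 48,500 split, X² = 90, far beyond the 10.83 threshold for p < 0.001.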
When to Check
Check SRM continuously throughout the test. This is one area where peeking is not only acceptable but encouraged. Unlike peeking at metric results, checking SRM does not inflate false positive rates — it is a data quality diagnostic, not a hypothesis test. An SRM check asks whether the experiment is running correctly, not whether the treatment is working.
- p < 0.001 — Definite SRM. Investigate immediately. Do not report results until root cause is identified and resolved.
- p < 0.01 — Likely SRM. Monitor closely over the next 24-48 hours. If the p-value continues to drop, treat as confirmed SRM.
- p < 0.05 — Possible SRM. Flag for review at the end of the test. May resolve as sample size grows, but warrants scrutiny.
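The three tiers above can be encoded directly in a monitoring job; `classify_srm` is a hypothetical helper for illustration, not a standard API:

```python
def classify_srm(p_value: float) -> str:
    """Map an SRM chi-squared p-value to a monitoring tier."""
    if p_value < 0.001:
        return "definite: pause and investigate before reporting"
    if p_value < 0.01:
        return "likely: monitor closely over the next 24-48 hours"
    if p_value < 0.05:
        return "possible: flag for review at the end of the test"
    return "no evidence of SRM"

print(classify_srm(2.4e-21))  # definite: pause and investigate before reporting
```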
What to Do When You Find SRM
- Pause the test immediately. Stop accumulating corrupted data. Every additional day of a broken experiment is wasted traffic.
- Check variant code for JavaScript errors. Open both control and variant in an incognito browser, inspect the console, and look for errors that fire only in one group.
- Compare page load times between control and variant. Use your analytics or a tool like WebPageTest to measure real-user performance for both experiences.
- Check for bot traffic differences. Segment traffic by user agent and verify that bot filtering is consistent across groups.
- Verify tracking implementation. Confirm that the experiment tracking fires at the same point in the page lifecycle for both groups, and that no race conditions exist.
- Fix the root cause. Address the specific issue — whether it is a slow asset, a broken script, or a misconfigured redirect.
- Restart with a clean population. Do not resume the existing test. Clear the experiment state and begin fresh so that historical bias does not carry over.
The temptation to salvage weeks of test data is understandable — no team wants to restart an experiment that has been running for two weeks. But corrupted data produces corrupted decisions. Shipping a false positive (or false negative) based on biased data will cost far more in lost revenue and misallocated engineering effort than the cost of restarting a test. The math is unambiguous: restart.
SRM in Practice: How Often Does It Happen?
Published research from Microsoft and other large-scale experimentation platforms indicates that 5-10% of all A/B tests exhibit detectable SRM. That figure reflects well-instrumented platforms with dedicated engineering teams. In less mature setups — particularly e-commerce teams using client-side testing tools without server-side validation — the rate can climb above 20%. Redirect tests are especially prone to SRM because every redirect introduces an opportunity for user loss.
At DRIP, every experiment is automatically checked for SRM at multiple checkpoints throughout its lifecycle. Tests with confirmed SRM are flagged, investigated, and resolved before any results are reported to the client. This is not optional — it is a non-negotiable quality gate in our experimentation process. We treat SRM the same way a lab would treat a contaminated sample: the data is discarded and the experiment is rerun under controlled conditions.
See how DRIP runs reliable A/B tests at scale →