A/B Testing for Data-Driven Decisions
A/B testing transforms opinions into data. Done correctly, it provides statistically valid evidence for decision-making. Done poorly, it produces misleading conclusions.
The Basics
An A/B test compares two versions:
- Control (A): Current/baseline version
- Treatment (B): New variation
Users are randomly assigned to each group, and key metrics are compared.
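In practice, randomization is often implemented deterministically by hashing a stable user ID, so the same user always sees the same variant. A minimal sketch (the experiment name, salting scheme, and 50/50 split here are illustrative assumptions, not a prescribed API):

import hashlib

def assign_variant(user_id: str, experiment: str = "exp-001") -> str:
    """Deterministically assign a user to 'control' or 'treatment'."""
    # Salting with the experiment name decorrelates assignments across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in 0-99
    return "treatment" if bucket < 50 else "control"  # 50/50 split (assumed)

print(assign_variant("user-42"))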
Statistical Foundation
Hypothesis Testing
- Null hypothesis (H₀): No difference between A and B (equal conversion rates)
- Alternative (H₁): B differs from A (two-sided)
Key Parameters
alpha = 0.05 # Significance level (false positive rate)
beta = 0.20 # False negative rate
power = 1 - beta # Probability of detecting true effect
MDE = 0.05 # Minimum detectable effect (5%; specify whether relative or absolute)
Sample Size Calculation
from statsmodels.stats.power import TTestIndPower
# Two-sample t-test power analysis
analysis = TTestIndPower()
# Calculate required sample size per group
sample_size = analysis.solve_power(
    effect_size=0.2,  # Standardized effect size (Cohen's d), not the raw MDE
    alpha=0.05,       # Significance level
    power=0.8,        # Desired power
    alternative='two-sided'
)
print(f"Required sample per group: {sample_size:.0f}")
Running the Test
from statsmodels.stats.proportion import proportions_ztest

# Collect data
control_conversions = 120
control_total = 2000
treatment_conversions = 145
treatment_total = 2000

# Proportions
p_control = control_conversions / control_total
p_treatment = treatment_conversions / treatment_total

# Two-sided z-test for the difference in proportions
counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]
z_stat, p_value = proportions_ztest(counts, nobs)

print(f"Control: {p_control:.2%}")
print(f"Treatment: {p_treatment:.2%}")
print(f"Lift: {(p_treatment / p_control - 1):.2%}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No statistically significant difference")
Common Pitfalls
- Peeking: Checking results before reaching sample size
- Multiple comparisons: Testing many metrics inflates false positives (see the correction sketch after this list)
- Selection bias: Non-random assignment
- Novelty effects: Short-term changes that don't persist
- Under-powered tests: Insufficient sample size
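To illustrate one standard correction for the multiple-comparisons pitfall, here is a sketch using statsmodels' multipletests with the Holm method; the p-values are made-up examples:

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from testing four metrics in one experiment
p_values = [0.012, 0.034, 0.210, 0.048]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')
for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw={raw:.3f} adjusted={adj:.3f} significant={sig}")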
Sequential Testing
For continuous monitoring without peeking problems, one option is the sequential probability ratio test (SPRT); always-valid p-values are another. The sketch below tests a simple null rate p0 against a simple alternative p1 on a stream of 0/1 outcomes; the signature and simple-vs-simple formulation are one choice among several:

import math

def sequential_test(data, p0, p1, alpha=0.05, beta=0.20):
    """
    Sequential probability ratio test (SPRT) on Bernoulli outcomes.
    Tests H0: p = p0 against H1: p = p1 and allows valid early stopping.
    """
    upper = math.log((1 - beta) / alpha)  # cross above: accept H1
    lower = math.log(beta / (1 - alpha))  # cross below: accept H0
    llr = 0.0  # cumulative log-likelihood ratio
    for x in data:  # x is 1 (conversion) or 0 (no conversion)
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1"
        if llr <= lower:
            return "accept H0"
    return "continue"  # boundaries not crossed; keep collecting data
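A quick illustrative run of the sketch on simulated data (the true and hypothesized rates are assumptions for the demo):

import numpy as np

rng = np.random.default_rng(0)
stream = rng.binomial(1, 0.075, size=10_000)  # simulated conversions
print(sequential_test(stream, p0=0.06, p1=0.075))  # typically stops well before 10,000 samples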
Beyond Simple A/B
- A/B/n testing: Multiple variants simultaneously
- Multi-armed bandits: Adaptive allocation (see the Thompson sampling sketch after this list)
- Factorial designs: Test multiple factors
- Causal inference: When randomization isn't possible
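As a sketch of adaptive allocation, here is minimal Thompson sampling for two Bernoulli variants; the true conversion rates are invented for the simulation:

import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.05, 0.065]   # hypothetical conversion rates per variant
successes = np.ones(2)       # Beta(1, 1) priors for each arm
failures = np.ones(2)

for _ in range(10_000):
    # Sample a plausible rate for each arm from its posterior, play the best
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("Traffic share:", (successes + failures - 2) / 10_000)
print("Posterior mean rates:", successes / (successes + failures))

Better-performing arms accumulate traffic automatically, which trades some statistical cleanliness for lower opportunity cost during the experiment.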
Invest in statistical rigor. Wrong conclusions from bad testing can be worse than no testing at all.