A/B Testing for Data-Driven Decisions

A/B testing transforms opinions into data. Done correctly, it provides statistically valid evidence for decision-making. Done poorly, it produces misleading conclusions.

The Basics

An A/B test compares two versions:

  • Control (A): Current/baseline version
  • Treatment (B): New variation

Users are randomly assigned to each group, and key metrics are compared.
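
In practice, random assignment is usually implemented as a deterministic hash of a user identifier, so a returning user always lands in the same group. A minimal sketch (the experiment name and user ID below are hypothetical):

import hashlib

def assign_variant(user_id: str, experiment: str = "exp_checkout_button") -> str:
    """Deterministically map a user to control or treatment by hashing."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # stable bucket in [0, 100)
    return "treatment" if bucket < 50 else "control"

print(assign_variant("user_12345"))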

Statistical Foundation

Hypothesis Testing

  • Null hypothesis (H₀): No difference between A and B
  • Alternative (H₁): B is different from A

Key Parameters

alpha = 0.05      # Significance level (false positive rate)
beta = 0.20       # False negative rate
power = 1 - beta  # Probability of detecting true effect
MDE = 0.05        # Minimum detectable effect (5%)

Sample Size Calculation

from statsmodels.stats.power import TTestIndPower

# Two-sample t-test power analysis
analysis = TTestIndPower()

# Calculate required sample size per group
sample_size = analysis.solve_power(
    effect_size=0.2,    # Expected effect size (Cohen's d)
    alpha=0.05,         # Significance level
    power=0.8,          # Desired power
    alternative='two-sided'
)

print(f"Required sample per group: {sample_size:.0f}")

Running the Test

import numpy as np
from scipy import stats

# Collect data
control_conversions = 120
control_total = 2000
treatment_conversions = 145
treatment_total = 2000

# Proportions
p_control = control_conversions / control_total
p_treatment = treatment_conversions / treatment_total

# Z-test for proportions
from statsmodels.stats.proportion import proportions_ztest

counts = [treatment_conversions, control_conversions]
nobs = [treatment_total, control_total]

z_stat, p_value = proportions_ztest(counts, nobs)

print(f"Control: {p_control:.2%}")
print(f"Treatment: {p_treatment:.2%}")
print(f"Lift: {(p_treatment/p_control - 1):.2%}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No statistically significant difference detected")

Common Pitfalls

  1. Peeking: Checking results before reaching sample size
  2. Multiple comparisons: Testing many metrics inflates false positives (see the correction sketch after this list)
  3. Selection bias: Non-random assignment
  4. Novelty effects: Short-term changes that don't persist
  5. Under-powered tests: Insufficient sample size
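
For pitfall 2, one common remedy (sketched here with hypothetical p-values) is to adjust for multiple comparisons, for example with statsmodels' Holm correction:

from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from checking several metrics in one experiment
p_values = [0.04, 0.008, 0.20, 0.03]

# Holm correction controls the family-wise false-positive rate at alpha
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')

for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant={sig}")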

Sequential Testing

For continuous monitoring without peeking problems, one classical option is Wald's sequential probability ratio test (SPRT); always-valid p-values and group-sequential designs are common alternatives. A minimal SPRT sketch for a single conversion rate:

import numpy as np

def sequential_test(successes, trials, p0, p1, alpha=0.05, beta=0.20):
    """
    Wald sequential probability ratio test (SPRT) for a conversion rate,
    testing H0: p = p0 against H1: p = p1 with valid early stopping.
    """
    # Log-likelihood ratio of H1 vs H0 after `trials` Bernoulli observations
    llr = (successes * np.log(p1 / p0)
           + (trials - successes) * np.log((1 - p1) / (1 - p0)))
    upper = np.log((1 - beta) / alpha)   # cross above: stop, reject H0
    lower = np.log(beta / (1 - alpha))   # cross below: stop, accept H0
    if llr >= upper:
        return "reject H0"
    if llr <= lower:
        return "accept H0"
    return "continue sampling"
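
A quick check with hypothetical running totals:

# Hypothetical running totals: 2,000 visitors and 132 conversions so far
print(sequential_test(successes=132, trials=2000, p0=0.06, p1=0.072))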

Beyond Simple A/B

  • A/B/n testing: Multiple variants simultaneously
  • Multi-armed bandits: Adaptive allocation (see the sketch after this list)
  • Factorial designs: Test multiple factors
  • Causal inference: When randomization isn't possible
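
As one example from the list above, a Thompson-sampling bandit adaptively shifts traffic toward the better-performing variant. A minimal simulation with hypothetical conversion rates:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true conversion rates for three variants (unknown in practice)
true_rates = [0.060, 0.072, 0.065]
successes = np.zeros(3)
failures = np.zeros(3)

for _ in range(10_000):
    # Draw a plausible rate for each arm from its Beta posterior (uniform prior)
    samples = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(samples))               # serve the arm that looks best
    converted = rng.random() < true_rates[arm]  # simulate the visitor's outcome
    successes[arm] += converted
    failures[arm] += 1 - converted

print("Visitors per arm:", (successes + failures).astype(int))
print("Observed rates:  ", np.round(successes / (successes + failures), 4))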

Invest in statistical rigor. Wrong conclusions from bad testing can be worse than no testing at all.
