Introduction to Statistical Analysis with Python
Python has become a de facto standard language for data analysis. Its scientific computing stack (NumPy, pandas, SciPy, and statsmodels) covers most of what rigorous statistical work requires, from descriptive summaries to hypothesis tests and regression models.
The Core Stack
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
Descriptive Statistics
Start with understanding your data's basic properties:
# Load and examine data
df = pd.read_csv('sales_data.csv')
# Central tendency
print(df['revenue'].mean())
print(df['revenue'].median())
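# Dispersion (pandas defaults to the sample estimate, ddof=1)
print(df['revenue'].std())
print(df['revenue'].var())
# Distribution shape (kurtosis() reports excess kurtosis: a normal distribution scores 0)
print(df['revenue'].skew())
print(df['revenue'].kurtosis())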
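The summary calls above can also be tried without a CSV file. The following is a minimal sketch on a small hand-made series; the revenue figures are invented purely for illustration and are not taken from `sales_data.csv`.

```python
import pandas as pd

# Hypothetical revenue figures, made up for illustration only
revenue = pd.Series([120.0, 135.0, 150.0, 160.0, 410.0], name="revenue")

# describe() bundles count, mean, std, min, quartiles, and max in one call
summary = revenue.describe()
print(summary)

# pandas divides by n - 1 by default (sample estimate);
# ddof=0 switches to the population formula, which divides by n
sample_std = revenue.std()
population_std = revenue.std(ddof=0)
print(sample_std, population_std)

# A mean well above the median hints at right skew,
# here driven by the single large value 410
print(revenue.mean(), revenue.median())
print(revenue.skew())
```

Comparing the mean with the median is a quick first check for outliers before relying on tests that are sensitive to them.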
Hypothesis Testing
Testing whether observed differences are statistically significant:
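# T-test: comparing two groups
group_a = df.loc[df['region'] == 'NORTH', 'revenue']
group_b = df.loc[df['region'] == 'SOUTH', 'revenue']
# Default is Student's t-test (equal variances assumed); pass equal_var=False for Welch's t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)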
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: significant difference")
else:
    print("Fail to reject null hypothesis")
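A significant p-value does not say how large the difference is. The sketch below pairs Welch's t-test (which does not assume equal variances) with Cohen's d as an effect size. The two samples are simulated stand-ins for regional revenue, and `cohens_d` is a helper written for this example, not a SciPy function.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for two regions (assumed values, not real data)
rng = np.random.default_rng(42)
north = rng.normal(loc=105.0, scale=10.0, size=200)
south = rng.normal(loc=100.0, scale=20.0, size=150)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(north, south, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)
    ) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


d = cohens_d(north, south)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Cohen's d: {d:.4f}")
```

Reporting the effect size alongside the p-value helps separate a difference that is merely detectable from one that matters in practice.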
Correlation Analysis
Understanding relationships between variables:
# Pearson correlation: strength of a linear relationship
corr, corr_p_value = stats.pearsonr(
    df['marketing_spend'],
    df['revenue']
)
# Spearman rank correlation: monotonic relationships, which need not be linear
rho, rho_p_value = stats.spearmanr(
    df['customer_rating'],
    df['repeat_purchases']
)
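The difference between the two coefficients is easiest to see on data that is perfectly monotonic but far from linear. A small sketch, using an exponential curve chosen only for illustration:

```python
import numpy as np
from scipy import stats

# y always increases with x, but along a curve rather than a straight line
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, pearson_p = stats.pearsonr(x, y)
rho, rho_p = stats.spearmanr(x, y)

# Spearman works on ranks, so any strictly increasing relationship scores 1;
# Pearson is pulled below 1 because the points do not lie on a line
print(f"Pearson r:    {pearson_r:.4f}")
print(f"Spearman rho: {rho:.4f}")
```

A large gap between the two values is a hint that the relationship is monotonic but curved, which a scatter plot can confirm.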
Linear Regression
Modeling relationships and making predictions:
# Simple linear regression with statsmodels
X = df['marketing_spend']
X = sm.add_constant(X)  # Add intercept column; sm.OLS does not include one by default
y = df['revenue']
model = sm.OLS(y, X).fit()
print(model.summary())
Key Statistical Concepts
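- P-values: Probability of observing results at least as extreme as the data, assuming the null hypothesis is true
- Confidence intervals: Range produced by a procedure that captures the true parameter at a stated rate (e.g., in 95% of repeated samples)
- Effect size: Magnitude of a difference or relationship, which conveys practical significance beyond p-values
- Power: Probability of detecting an effect if it exists, i.e., of rejecting a false null hypothesis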
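Two of the concepts above can be computed in a few lines. This sketch builds a 95% confidence interval for a mean from the t distribution, then uses statsmodels' `TTestIndPower` to estimate the per-group sample size for a two-sample t-test. The sample values, the effect size of 0.5, and the 80% power target are assumptions picked for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Small made-up sample, used only to show the calculation
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (ddof=1)

# 95% confidence interval for the mean, based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Mean: {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")

# Sample size per group to detect a medium effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05, two-sided
analysis = TTestIndPower()
n_per_group = float(
    analysis.solve_power(
        effect_size=0.5, alpha=0.05, power=0.8, ratio=1.0, alternative="two-sided"
    )
)
print(f"Required sample size per group: {int(np.ceil(n_per_group))}")

# Power actually achieved with 64 observations per group
achieved_power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05, ratio=1.0)
print(f"Power with 64 per group: {achieved_power:.3f}")
```

Running the power calculation before collecting data helps avoid studies that are too small to detect the effect they are looking for.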
Understanding these foundations is essential before moving to machine learning. Statistics provides the rigor that helps guard against spurious conclusions.