Introduction to Statistical Analysis with Python

2024.06.10 ANALYSIS

Python has become the de facto language for data analysis. Its scientific computing stack (NumPy, pandas, SciPy, and statsmodels) covers the core tasks of statistical work: describing data, testing hypotheses, and modeling relationships.

The Core Stack

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

Descriptive Statistics

Start with understanding your data's basic properties:

# Load and examine data
df = pd.read_csv('sales_data.csv')

# Central tendency
print(df['revenue'].mean())
print(df['revenue'].median())

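# Dispersion (pandas uses the sample formulas, ddof=1)
print(df['revenue'].std())
print(df['revenue'].var())

# Distribution shape (kurtosis() reports excess kurtosis: 0 for a normal distribution)
print(df['revenue'].skew())
print(df['revenue'].kurtosis())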
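
Because sales_data.csv is not included here, the following is a minimal sketch on synthetic data rather than the real file. The log-normal parameters and the seed are assumptions chosen only to produce a right-skewed "revenue" column. It shows a one-call summary, a robust measure of spread, and the difference between the pandas and NumPy defaults for standard deviation.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "revenue" values (illustrative only, not real sales data)
rng = np.random.default_rng(seed=0)
revenue = pd.Series(rng.lognormal(mean=10.0, sigma=0.5, size=1_000), name="revenue")

# One-call summary: count, mean, std, min, quartiles, max
summary = revenue.describe()
print(summary)

# Robust spread: interquartile range (less sensitive to outliers than std)
iqr = revenue.quantile(0.75) - revenue.quantile(0.25)

# pandas defaults to the sample standard deviation (ddof=1);
# NumPy defaults to the population version (ddof=0)
sample_std = revenue.std()
population_std = np.std(revenue.to_numpy())

print(f"IQR: {iqr:.2f}")
print(f"Sample std: {sample_std:.2f}, population std: {population_std:.2f}")
print(f"Mean: {revenue.mean():.2f}, median: {revenue.median():.2f}")
print(f"Skewness: {revenue.skew():.2f}")
```

With right-skewed data like this, the mean sits above the median, which is one reason to report both.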

Hypothesis Testing

Testing whether observed differences are statistically significant:

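# T-test: comparing two independent groups
# (ttest_ind assumes equal variances by default; pass equal_var=False for Welch's test)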
group_a = df[df['region'] == 'NORTH']['revenue']
group_b = df[df['region'] == 'SOUTH']['revenue']

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: significant difference")
else:
    print("Fail to reject null hypothesis")
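
The equal-variance assumption can matter when groups differ in size and spread. As a rough sketch on synthetic data (the group means, spreads, sizes, and seed below are made up for illustration), this compares SciPy's default Student's t-test with Welch's variant:

```python
import numpy as np
from scipy import stats

# Synthetic groups with unequal sizes and spreads (illustrative values only)
rng = np.random.default_rng(seed=42)
north = rng.normal(loc=105.0, scale=10.0, size=50)
south = rng.normal(loc=100.0, scale=20.0, size=200)

# Student's t-test pools the two variances (SciPy's default)
student_t, student_p = stats.ttest_ind(north, south)

# Welch's t-test estimates each group's variance separately
welch_t, welch_p = stats.ttest_ind(north, south, equal_var=False)

print(f"Student: t = {student_t:.3f}, p = {student_p:.4f}")
print(f"Welch:   t = {welch_t:.3f}, p = {welch_p:.4f}")
```

Here the smaller group also has the smaller variance, so pooling inflates the standard error and the two tests report different statistics. When group sizes are equal, the two t-statistics coincide and only the degrees of freedom differ.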

Correlation Analysis

Understanding relationships between variables:

# Pearson correlation: strength of a linear relationship
corr, corr_p = stats.pearsonr(
    df['marketing_spend'],
    df['revenue']
)

# Spearman rank correlation: monotonic relationships, linear or not
rho, rho_p = stats.spearmanr(
    df['customer_rating'],
    df['repeat_purchases']
)
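
To see how the two coefficients differ, here is a small sketch on constructed data (an exponential curve chosen purely for illustration). The relationship is perfectly monotonic but far from linear:

```python
import numpy as np
from scipy import stats

# Strictly increasing but strongly curved relationship (constructed example)
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman works on ranks, so it reports a perfect association here, while Pearson is pulled below 1 by the curvature.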

Linear Regression

Modeling relationships and making predictions:

# Simple linear regression with statsmodels
X = df['marketing_spend']
X = sm.add_constant(X)  # Add intercept
y = df['revenue']

model = sm.OLS(y, X).fit()
print(model.summary())

Key Statistical Concepts

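P-values: Probability of observing results at least as extreme as the data, assuming the null hypothesis is true
Confidence intervals: Range of plausible values for the true parameter, built so that a stated share (e.g., 95%) of such intervals would contain it
Effect size: Magnitude of a difference or relationship, indicating practical significance beyond p-values
Power: Probability of detecting an effect of a given size if it exists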
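
Three of these concepts can be computed directly. The following is a sketch under stated assumptions: the samples are synthetic, Cohen's d is used as one common effect-size measure among several, and the power calculation uses a hypothetical planning value of d = 0.5 rather than an observed effect.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Synthetic samples (illustrative only)
rng = np.random.default_rng(seed=7)
group_a = rng.normal(loc=105.0, scale=15.0, size=80)
group_b = rng.normal(loc=100.0, scale=15.0, size=80)

# 95% confidence interval for the mean of group_a, based on the t distribution
mean_a = group_a.mean()
ci_low, ci_high = stats.t.interval(
    0.95, len(group_a) - 1, loc=mean_a, scale=stats.sem(group_a)
)

# Effect size: Cohen's d with a pooled standard deviation
n_a, n_b = len(group_a), len(group_b)
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)
cohens_d = (mean_a - group_b.mean()) / np.sqrt(pooled_var)

# Power: chance of detecting a hypothetical effect of d = 0.5
# with 64 observations per group at alpha = 0.05 (two-sided)
analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05, ratio=1.0)

# Sample size per group needed to reach 80% power for that same effect
n_required = float(
    analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8, ratio=1.0)
)

print(f"Mean of group_a: {mean_a:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Power at n=64 per group: {power:.3f}")
print(f"Required n per group for 80% power: {n_required:.1f}")
```

Running the power calculation before collecting data is the usual order: it tells you how many observations are needed for a test to have a reasonable chance of detecting the effect you care about.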

Understanding these foundations is essential before moving to machine learning. Statistics provides the rigor that helps guard against spurious conclusions.