Machine Learning Feature Engineering Essentials
Feature engineering often has more impact on model performance than algorithm selection. Here are the techniques that consistently improve results across different problem types.
Numerical Features
Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Z-score standardization: zero mean, unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Min-max scaling to the [0, 1] range
minmax_scaler = MinMaxScaler()
X_normalized = minmax_scaler.fit_transform(X)
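One pitfall worth a sketch: fit the scaler on the training split only and reuse its statistics on the test split, otherwise test-set information leaks into training. A minimal example, assuming X and y already hold the feature matrix and labels:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics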
Transformations
import numpy as np
from scipy import stats
# Log transform for right-skewed data (log1p handles zeros safely)
df['log_revenue'] = np.log1p(df['revenue'])
# Square root compresses the tail of count data
df['sqrt_orders'] = np.sqrt(df['order_count'])
# Box-Cox finds the power transform that best normalizes the data;
# it requires strictly positive input, hence the +1 shift
df['boxcox_value'], lam = stats.boxcox(df['value'] + 1)
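Because Box-Cox rejects non-positive values, the +1 shift above only works when the column is non-negative. If values can go negative, scikit-learn's PowerTransformer with the Yeo-Johnson method is a drop-in alternative; a minimal sketch:

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson accepts any real values, no shift required
pt = PowerTransformer(method='yeo-johnson')
df['yj_value'] = pt.fit_transform(df[['value']]).ravel()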
Categorical Features
Encoding Strategies
import pandas as pd
# One-hot encoding for low-cardinality columns
df_encoded = pd.get_dummies(df, columns=['category'])
# Target encoding for high-cardinality columns
from category_encoders import TargetEncoder
encoder = TargetEncoder()
# NOTE: fitting on the full data leaks the target; see the
# out-of-fold sketch below
df['category_encoded'] = encoder.fit_transform(
    df['category'], df['target']
)
# Frequency encoding: replace each category with its relative frequency
freq = df['category'].value_counts() / len(df)
df['category_freq'] = df['category'].map(freq)
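Fitting a target encoder on the full dataset leaks label information into the feature. One common fix is out-of-fold encoding: each fold is encoded by an encoder fit on the remaining folds. A sketch under that scheme (the new column name is just illustrative):

import numpy as np
from sklearn.model_selection import KFold
from category_encoders import TargetEncoder

df['category_encoded_oof'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    enc = TargetEncoder()
    enc.fit(df['category'].iloc[train_idx], df['target'].iloc[train_idx])
    df.iloc[val_idx, df.columns.get_loc('category_encoded_oof')] = \
        enc.transform(df['category'].iloc[val_idx]).values.ravel()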
Temporal Features
# Extract calendar components from a datetime column
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Cyclical encoding so hour 23 sits next to hour 0
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
# Lag features for time series (the first rows become NaN)
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
# Shift before rolling so the window excludes the current row
df['rolling_mean_7'] = df['value'].shift(1).rolling(7).mean()
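When one frame holds many series at once (one per store, user, and so on), compute lags within each group so values don't bleed across entities. A sketch assuming a hypothetical entity_id column:

df = df.sort_values(['entity_id', 'timestamp'])
df['lag_1'] = df.groupby('entity_id')['value'].shift(1)
df['rolling_mean_7'] = (
    df.groupby('entity_id')['value']
      .transform(lambda s: s.shift(1).rolling(7).mean())
)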
Interaction Features
# Pairwise interaction terms (include_bias=False drops the constant column)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
# Manual domain-specific ratios; map zero denominators to NaN
# instead of letting the division produce inf
df['revenue_per_customer'] = df['revenue'] / df['customers'].replace(0, np.nan)
df['conversion_rate'] = df['purchases'] / df['visits'].replace(0, np.nan)
df['avg_basket'] = df['revenue'] / df['orders'].replace(0, np.nan)
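PolynomialFeatures returns a bare array, which makes downstream importances hard to read. get_feature_names_out recovers the column names; a short sketch, assuming X is a DataFrame:

import pandas as pd

X_interactions = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),
    index=X.index,
)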
Missing Value Strategies
# Indicator for missingness (the fact that a value is missing can itself be predictive)
df['value_missing'] = df['value'].isna().astype(int)
# Imputation
from sklearn.impute import SimpleImputer, KNNImputer
# Median is robust for skewed numerical data
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# KNN imputation fills gaps using the most similar rows
knn_imputer = KNNImputer(n_neighbors=5)
X_knn_imputed = knn_imputer.fit_transform(X)
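scikit-learn can also append the missingness indicators automatically instead of building them by hand; a short sketch using the add_indicator flag:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median', add_indicator=True)
# Output holds the imputed columns plus 0/1 indicator columns
X_imputed_with_flags = imputer.fit_transform(X)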
Feature Selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Keep the top K features by mutual information with the target
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_columns = X.columns[selector.get_support()]
# Feature importances from an already-fitted tree-based model
feature_importance = model.feature_importances_
important_features = X.columns[feature_importance > 0.01]
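Impurity-based importances are biased toward high-cardinality features. Permutation importance measured on a held-out split is a more honest alternative; a minimal sketch with a random forest (model choice and threshold are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
# Score drop when a column is shuffled on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
important_features = X.columns[result.importances_mean > 0.01]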
Good features encode domain knowledge into a form models can use. Spend time here before hyperparameter tuning.