Git Workflow for Data Projects
Standard Git workflows need adapting for data projects. Large files, Jupyter notebooks, and evolving datasets raise problems that conventional software projects rarely encounter.
Project Structure
Start with a clear, consistent structure:
project/
├── data/
│   ├── raw/            # Immutable original data
│   ├── processed/      # Cleaned, transformed data
│   └── external/       # Third-party data sources
├── notebooks/
│   ├── exploration/    # Initial analysis
│   └── reports/        # Final deliverables
├── src/
│   ├── data/           # Data processing scripts
│   ├── features/       # Feature engineering
│   └── models/         # Model training
├── tests/
├── .gitignore
├── requirements.txt
└── README.md
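The layout above can be created by hand, but a small script makes it repeatable across projects. The following is a minimal sketch using only the standard library; the `scaffold` function name and the `.gitkeep` placeholder files are assumptions for illustration (Git does not track empty directories, and `.gitkeep` is a common convention rather than a Git feature).

```python
from pathlib import Path

# Directories mirroring the tree above.
LAYOUT = [
    "data/raw",
    "data/processed",
    "data/external",
    "notebooks/exploration",
    "notebooks/reports",
    "src/data",
    "src/features",
    "src/models",
    "tests",
]

# Top-level files from the tree above.
TOP_LEVEL_FILES = [".gitignore", "requirements.txt", "README.md"]


def scaffold(root):
    """Create the project skeleton under `root` and return the root path."""
    root = Path(root)
    for rel in LAYOUT:
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        # Placeholder so the otherwise empty directory can be committed.
        (directory / ".gitkeep").touch()
    for name in TOP_LEVEL_FILES:
        # touch() creates the file if missing and leaves existing content alone.
        (root / name).touch()
    return root
```

Calling `scaffold("project")` from an empty working directory should produce the tree shown above; running it a second time leaves existing files untouched.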
Handling Large Files
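Git keeps every version of every file in the repository history, so large, frequently changing data files quickly bloat it and slow down clones. Two common solutions: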
Git LFS
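# One-time setup (requires the git-lfs package to be installed)
git lfs install
# Track data files by pattern; the rules are written to .gitattributes
git lfs track "*.csv"
git lfs track "*.parquet"
git add .gitattributes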
DVC (Data Version Control)
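# Often a better fit for ML workflows: the data stays out of Git, which tracks only a small .dvc pointer file
pip install dvc
dvc init
# Creates large_dataset.csv.dvc and adds the data file to data/raw/.gitignore
dvc add data/raw/large_dataset.csv
git add data/raw/.gitignore
git add data/raw/large_dataset.csv.dvc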
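Before choosing between Git LFS and DVC, it can help to see which files in the working tree are actually large. The sketch below walks a directory and lists files at or above a size threshold. The `find_large_files` name and the 50 MB default are assumptions for illustration, not limits taken from Git, Git LFS, or DVC.

```python
from pathlib import Path

# Hypothetical default threshold (50 MB); adjust to your hosting limits.
DEFAULT_THRESHOLD_BYTES = 50 * 1024 * 1024


def find_large_files(root, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return (relative_path, size_in_bytes) pairs at or above the threshold, largest first."""
    root = Path(root)
    results = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip Git's own internal files.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold_bytes:
                results.append((relative.as_posix(), size))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Running `find_large_files(".")` at the repository root gives a candidate list for `git lfs track` or `dvc add`.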
Notebook Challenges
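Jupyter notebooks are JSON files that embed cell outputs and execution counts, so even re-running a notebook without changing its code produces large, noisy diffs. Solutions: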
- nbstripout: Automatically strip output before committing
- Jupytext: Pair notebooks with .py files
- ReviewNB: GitHub integration for notebook diffs
# Set up nbstripout as a Git filter in the current repository
pip install nbstripout
nbstripout --install
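To make the idea behind output stripping concrete, the sketch below clears outputs and execution counts from a notebook loaded as a Python dict. It illustrates the concept only and is not how nbstripout is implemented; it assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`), and the `strip_outputs` name is hypothetical.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    # Round-trip through JSON for a deep copy so the input is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

After stripping, two versions of a notebook differ only where the source cells differ, which is what makes the diff readable.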
Branching Strategy
Adapt the branching model to data science work:
- main: Production-ready code and models
- develop: Integration branch for features
- experiment/*: Data exploration and modeling tests
- feature/*: Specific feature development
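A naming convention like the one above is easier to keep when it is checked automatically, for example in a hook or CI step. The following is a minimal sketch of such a check; the `is_valid_branch` name and the characters allowed after the prefix (lowercase letters, digits, `.`, `_`, `-`) are assumptions, not part of the convention itself.

```python
import re

# main and develop, plus the experiment/ and feature/ families.
# The allowed characters after the prefix are an assumption.
BRANCH_PATTERN = re.compile(
    r"(main|develop|(experiment|feature)/[a-z0-9][a-z0-9._-]*)"
)


def is_valid_branch(name):
    """Return True if the branch name follows the convention above."""
    return BRANCH_PATTERN.fullmatch(name) is not None
```

For instance, `is_valid_branch("experiment/xgboost-tuning")` is accepted, while a name with an unlisted prefix such as `hotfix/typo` is rejected.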
Commit Messages
Be specific about what changed and why:
# Good examples
git commit -m "feat(data): add customer segmentation features"
git commit -m "experiment: test XGBoost with tuned hyperparameters"
git commit -m "fix(pipeline): handle missing values in date column"
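The examples above share a `type(scope): summary` shape, with the scope optional. A sketch of a checker for that shape follows; the `parse_commit_subject` name is hypothetical, the type list contains only the three types that appear in the examples, and the characters allowed in a scope are an assumption.

```python
import re

# Types taken from the examples above; extend as your team sees fit.
ALLOWED_TYPES = ("feat", "fix", "experiment")

COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[a-z0-9_-]+)\))?"  # optional scope in parentheses
    r": (?P<summary>\S.*)$"           # colon, one space, non-empty summary
)


def parse_commit_subject(subject):
    """Return a dict with type, scope, and summary, or None if the subject does not match."""
    match = COMMIT_PATTERN.match(subject)
    if match is None:
        return None
    return match.groupdict()
```

A function like this could be called from a `commit-msg` hook to reject subjects that do not parse; when the scope is omitted, the returned `scope` value is `None`.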
.gitignore Essentials
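# Data (drop the *.csv and *.parquet lines if Git LFS tracks those files)
*.csv
*.parquet
*.pkl
data/raw/*
data/processed/*
# Keep directory placeholders and the files DVC asks you to commit
!data/raw/.gitkeep
!data/processed/.gitkeep
!data/*/*.dvc
!data/*/.gitignore
# Notebooks (a pattern without a leading slash matches at any depth)
.ipynb_checkpoints/
# Models (*.pkl is already covered above)
*.joblib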
# Environment
.env
venv/
*.pyc
__pycache__/
These practices matter because reproducibility in data science depends on knowing exactly which code and data produced a result.