Docker for Data Science Workflows
"It works on my machine" is the enemy of reproducibility. Docker solves this by packaging your code, dependencies, and environment into portable containers.
Why Docker for Data Science?
- Reproducibility: Same environment everywhere
- Dependency management: Isolate conflicting packages
- Collaboration: Share working environments
- Production parity: Dev matches prod
Basic Dockerfile for Python
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
        gcc \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY src/ ./src/
COPY notebooks/ ./notebooks/
# Default command
CMD ["python", "src/main.py"]
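The requirements.txt copied in the build step might look like the following. The packages and exact versions here are illustrative; pin whatever `pip freeze` reports in your working environment so builds stay reproducible:

```text
# requirements.txt -- versions are illustrative; pin your own
pandas==2.1.4
numpy==1.26.3
sqlalchemy==2.0.25
psycopg2-binary==2.9.9
```

Copying requirements.txt before the source code means the expensive `pip install` layer is cached until dependencies actually change.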
Jupyter Notebook Container
FROM jupyter/scipy-notebook:latest
USER root
# Install additional packages
RUN pip install \
        sqlalchemy \
        psycopg2-binary \
        plotly \
        scikit-learn
USER $NB_UID
EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]
Docker Compose for Multi-Service
version: '3.8'
services:
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/notebooks
      - ./data:/home/jovyan/data
    environment:
      - JUPYTER_TOKEN=mytoken
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: analyst
      POSTGRES_PASSWORD: secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
volumes:
  postgres_data:
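One detail that trips people up: inside the Compose network, services reach each other by service name, not localhost. A minimal sketch of building a Postgres connection string from code running in the jupyter container (credentials match the compose file above):

```python
# Compose service names ("postgres", "redis") act as DNS hostnames on the
# shared network -- do NOT use localhost from inside a container.
host = "postgres"  # the service name from the compose file
dsn = f"postgresql://analyst:secret@{host}:5432/analytics"
print(dsn)  # hand this to SQLAlchemy's create_engine, psycopg2, etc.
```

The same rule applies to Redis: connect to host `redis`, port 6379, from inside the network.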
GPU Support
For ML training with GPUs:
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install \
        torch \
        torchvision \
        transformers
# Run with: docker run --gpus all ...
Best Practices
- Use .dockerignore: Exclude data, .git, __pycache__
- Layer efficiently: Rarely-changing deps first
- Pin versions: Exact versions for reproducibility
- Multi-stage builds: Smaller production images
- Non-root user: Security best practice
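The last two points combine naturally in one Dockerfile. A minimal multi-stage sketch, assuming the same requirements.txt as above (stage and user names are illustrative):

```dockerfile
# Stage 1: build wheels with compilers available
FROM python:3.11-slim AS builder
WORKDIR /app
RUN apt-get update && apt-get install -y gcc libpq-dev \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime image -- no compilers, non-root user
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels \
    && useradd --create-home appuser
USER appuser
COPY src/ ./src/
CMD ["python", "src/main.py"]
```

Only the built wheels are copied forward, so gcc and the dev headers never reach the final image.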
Development Workflow
# Build image
docker build -t my-analysis:latest .
# Run with mounted volumes for development
docker run -it \
    -v $(pwd)/notebooks:/app/notebooks \
    -v $(pwd)/data:/app/data \
    -p 8888:8888 \
    my-analysis:latest
# Execute one-off command
docker run --rm my-analysis:latest python -m pytest
Docker adds initial complexity but pays dividends in reproducibility and deployment.