
Docker for Data Science Workflows

"It works on my machine" is the enemy of reproducibility. Docker solves this by packaging your code, dependencies, and environment into portable containers.

Why Docker for Data Science?

  • Reproducibility: Same environment everywhere
  • Dependency management: Isolate conflicting packages
  • Collaboration: Share working environments
  • Production parity: Dev matches prod

Basic Dockerfile for Python

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY notebooks/ ./notebooks/

# Default command
CMD ["python", "src/main.py"]

Jupyter Notebook Container

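# Pin a dated or versioned tag instead of latest for reproducible builds
FROM jupyter/scipy-notebook:latest

# Install additional packages as the image's default non-root user.
# scikit-learn already ships with scipy-notebook, so it is not reinstalled.
RUN pip install --no-cache-dir \
    sqlalchemy \
    psycopg2-binary \
    plotly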

EXPOSE 8888

CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]

Docker Compose for Multi-Service

version: '3.8'

services:
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/notebooks
      - ./data:/home/jovyan/data
    environment:
      # Placeholder token; set a real one outside version control
      - JUPYTER_TOKEN=mytoken

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: analyst
      # Placeholder password; load real credentials from an .env file
      POSTGRES_PASSWORD: secret
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:
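
Inside the Compose network, each service is reachable by its service name, so a notebook in the jupyter container can reach the database at host postgres on the default port 5432. Below is a minimal Python sketch of building the connection URL, assuming the credentials from the Compose file above. The postgres_url helper and the POSTGRES_HOST and POSTGRES_PORT variables are hypothetical names for illustration; the jupyter service as written does not set them, so the defaults apply.

```python
import os
from urllib.parse import quote_plus


def postgres_url(env=os.environ) -> str:
    """Build a SQLAlchemy-style URL for the Compose 'postgres' service.

    Defaults mirror the Compose file above. POSTGRES_HOST and
    POSTGRES_PORT are assumed variable names, not part of the original setup.
    """
    user = env.get("POSTGRES_USER", "analyst")
    password = env.get("POSTGRES_PASSWORD", "secret")
    host = env.get("POSTGRES_HOST", "postgres")  # Compose service name acts as hostname
    port = env.get("POSTGRES_PORT", "5432")      # PostgreSQL default port
    database = env.get("POSTGRES_DB", "analytics")
    # quote_plus escapes characters such as '@' that would break the URL
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    print(postgres_url({}))
```

The resulting string can be passed to sqlalchemy.create_engine from a notebook, since the Jupyter image installs sqlalchemy and psycopg2-binary.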

GPU Support

For ML training with GPUs:

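FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    torch \
    torchvision \
    transformers

# Run with: docker run --gpus all ...
# (the host needs the NVIDIA Container Toolkit installed)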
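
Once the container is running, it helps to confirm that PyTorch can actually see the GPU before starting a long training job. Here is a small Python sketch of such a check; gpu_status is a hypothetical helper name, and the function falls back to a plain message when torch is missing or no device is visible.

```python
import importlib.util


def gpu_status() -> str:
    """Describe whether PyTorch can see a CUDA device in this environment."""
    # Avoid an ImportError when the check runs outside the GPU image
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch

    if not torch.cuda.is_available():
        return "torch is installed, but no CUDA device is visible"
    return f"CUDA device 0: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(gpu_status())
```

If the message reports no visible device inside the container, the usual suspects are a missing --gpus flag on docker run or a host without the NVIDIA runtime set up.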

Best Practices

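  1. Use .dockerignore: Exclude data, .git, and __pycache__ from the build context
  2. Layer efficiently: Put rarely-changing steps (system and Python dependencies) before your code so the build cache is reused
  3. Pin versions: Exact package and base-image versions for reproducibility
  4. Multi-stage builds: Build in one stage and copy only the results into a smaller production image
  5. Non-root user: Run the container as an unprivileged user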
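
The "pin versions" rule is easy to check automatically. As a rough sketch, the following Python function flags lines in a requirements file that are not pinned with ==. The unpinned_requirements name and the sample packages and versions are illustrative assumptions, and the pattern is deliberately simplified: it does not handle extras, environment markers, or URL requirements.

```python
import re

# Matches a requirement pinned to one exact version, e.g. "pandas==2.1.4".
# Simplified on purpose: extras, markers, and URLs are not handled.
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9._!+]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r base.txt"
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    # Sample content for illustration only
    sample = "pandas==2.1.4\nnumpy>=1.24\nscikit-learn\n"
    print(unpinned_requirements(sample))
```

A check like this could run in CI before docker build, so an unpinned dependency is caught before it changes the image.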

Development Workflow

# Build image
docker build -t my-analysis:latest .

# Run with mounted volumes for development
docker run -it --rm \
    -v "$(pwd)/notebooks:/app/notebooks" \
    -v "$(pwd)/data:/app/data" \
    -p 8888:8888 \
    my-analysis:latest

# Execute one-off command
docker run --rm my-analysis:latest python -m pytest

Docker adds initial complexity but pays dividends in reproducibility and deployment.
