Infrastructure as Code Fundamentals

Manual infrastructure provisioning doesn't scale, and hand-applied changes drift out of sync with what anyone believes is deployed. Infrastructure as Code (IaC) makes your infrastructure reproducible, version-controlled, and self-documenting.

Why IaC?

  • Reproducibility: Same code = same infrastructure
  • Version control: Track and audit changes
  • Collaboration: Review infrastructure changes like code
  • Documentation: Code documents what exists

Terraform Basics

# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region
}

# Variables
variable "aws_region" {
  default = "us-east-1"
}

variable "environment" {
  type = string
}
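
If you want Terraform to reject bad inputs early, the environment variable can also carry a validation block. A minimal sketch; the allowed values (dev/staging/prod) are assumptions about your setup:

variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}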

Data Infrastructure Example

# S3 bucket for data lake
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-${var.environment}"

  tags = {
    Environment = var.environment
    Purpose     = "Data Lake"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}
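
Versioning protects against accidental deletes, but not data exposure. Since AWS provider v4, encryption at rest is configured as its own resource, mirroring the versioning pattern above. A sketch using SSE-S3; the algorithm choice is an assumption, swap in "aws:kms" with a customer-managed key if you need one:

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # SSE-S3; an assumption, not part of the original example
    }
  }
}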

# RDS for data warehouse
# (assumes aws_security_group.db and aws_db_subnet_group.main are defined elsewhere)
resource "aws_db_instance" "analytics" {
  identifier     = "analytics-${var.environment}"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.r6g.large"
  
  allocated_storage     = 100
  max_allocated_storage = 500
  
  db_name  = "analytics"
  username = var.db_username
  password = var.db_password # plain variable for brevity; see the secrets sketch below
  
  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name
  
  backup_retention_period = 7
  skip_final_snapshot     = false
}
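
Passing credentials through plain variables works, but best practice #4 below points to a secrets manager instead. One way to wire that up, as a sketch: it assumes a secret named analytics-db-credentials already exists in AWS Secrets Manager as a JSON object with username and password keys.

data "aws_secretsmanager_secret_version" "db" {
  secret_id = "analytics-db-credentials" # hypothetical secret name
}

locals {
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db.secret_string)
}

# Then, in aws_db_instance.analytics:
#   username = local.db_creds.username
#   password = local.db_creds.password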

Modules for Reusability

# modules/data-pipeline/main.tf
variable "pipeline_name" {}
variable "s3_bucket" {}

resource "aws_glue_job" "etl" {
  name     = var.pipeline_name
  role_arn = aws_iam_role.glue.arn # assumes an IAM role for Glue defined in this module
  
  command {
    script_location = "s3://${var.s3_bucket}/scripts/${var.pipeline_name}.py"
    python_version  = "3"
  }
}

# Usage
module "sales_pipeline" {
  source        = "./modules/data-pipeline"
  pipeline_name = "sales-etl"
  s3_bucket     = aws_s3_bucket.scripts.id
}
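
Modules pay off when they expose outputs that callers can wire into other resources. A sketch; the output name and where you'd consume it are assumptions:

# modules/data-pipeline/outputs.tf
output "job_name" {
  value = aws_glue_job.etl.name
}

# Caller side, e.g. to feed a trigger or a CloudWatch alarm:
# module.sales_pipeline.job_name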

Workflow

# Initialize
terraform init

# Preview changes
terraform plan

# Apply changes
terraform apply

# Destroy (careful!)
terraform destroy
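
To guarantee that what you apply is exactly what you reviewed, you can save the plan to a file and apply that artifact (the tfplan filename is arbitrary):

# Save the plan, review it, then apply exactly that plan
terraform plan -out=tfplan
terraform apply tfplan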

Best Practices

  1. Always use remote state storage
  2. Lock state to prevent concurrent modifications (see the sketch after this list)
  3. Use workspaces or directories for environments
  4. Keep sensitive values in secrets manager
  5. Review plan output before applying
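
For practices 1 and 2, the S3 backend from the example above can lock state through a DynamoDB table. A sketch; the table name is an assumption, and the table must have a string hash key named LockID:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # hypothetical table with a "LockID" hash key
  }
}

Recent Terraform releases can instead lock directly in S3 (use_lockfile = true), dropping the DynamoDB dependency.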

IaC is a fundamental skill for modern data engineering. Start small and expand coverage iteratively.
