Infrastructure as Code Fundamentals
Manual infrastructure provisioning doesn't scale and invites configuration drift. Infrastructure as Code (IaC) makes your infrastructure reproducible, version-controlled, and self-documenting.
Why IaC?
- Reproducibility: Same code = same infrastructure
- Version control: Track and audit changes
- Collaboration: Review infrastructure changes like code
- Documentation: Code documents what exists
Terraform Basics
```hcl
# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region
}

# Variables
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type = string
}

# Declared here because the RDS instance below references them
variable "db_username" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}
```
Data Infrastructure Example
```hcl
# S3 bucket for data lake
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-${var.environment}"

  tags = {
    Environment = var.environment
    Purpose     = "Data Lake"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}
```
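A data-lake bucket is usually also encrypted at rest. A sketch using the same standalone-resource pattern as the versioning block above (SSE-S3 shown; swap in a KMS key ARN for SSE-KMS):

```hcl
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```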
```hcl
# RDS for data warehouse
resource "aws_db_instance" "analytics" {
  identifier     = "analytics-${var.environment}"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.r6g.large"

  allocated_storage     = 100
  max_allocated_storage = 500

  db_name  = "analytics"
  username = var.db_username
  password = var.db_password

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = 7
  skip_final_snapshot     = false
  # Required when skip_final_snapshot is false
  final_snapshot_identifier = "analytics-${var.environment}-final"
}
```
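Rather than passing the database password around as a plain variable, the value can be read from AWS Secrets Manager at plan time. A sketch; the secret name and its JSON shape are assumptions:

```hcl
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "analytics/db-credentials" # hypothetical secret name
}

locals {
  # Assumes the secret stores JSON like {"username": "...", "password": "..."}
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db.secret_string)
}

# Then in aws_db_instance:
#   username = local.db_creds["username"]
#   password = local.db_creds["password"]
```

This keeps the credential out of `.tfvars` files and version control, though it will still appear in state, so the state backend should be encrypted.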
Modules for Reusability
```hcl
# modules/data-pipeline/main.tf
variable "pipeline_name" {
  type = string
}

variable "s3_bucket" {
  type = string
}

resource "aws_glue_job" "etl" {
  name     = var.pipeline_name
  role_arn = aws_iam_role.glue.arn # assumes an aws_iam_role "glue" defined in this module

  command {
    script_location = "s3://${var.s3_bucket}/scripts/${var.pipeline_name}.py"
    python_version  = "3"
  }
}
```
```hcl
# Usage
module "sales_pipeline" {
  source        = "./modules/data-pipeline"
  pipeline_name = "sales-etl"
  s3_bucket     = aws_s3_bucket.scripts.id
}
```
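One module definition can also stamp out many pipelines at once. A sketch using `for_each` (the pipeline names are illustrative):

```hcl
module "pipelines" {
  source   = "./modules/data-pipeline"
  for_each = toset(["sales-etl", "marketing-etl", "finance-etl"])

  pipeline_name = each.key
  s3_bucket     = aws_s3_bucket.scripts.id
}
```

Adding a pipeline then becomes a one-line change to the set rather than a new module block.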
Workflow
```shell
# Initialize
terraform init

# Preview changes
terraform plan

# Apply changes
terraform apply

# Destroy (careful!)
terraform destroy
```
Best Practices
- Always use remote state storage
- Lock state to prevent concurrent modifications
- Use workspaces or directories for environments
- Keep sensitive values in secrets manager
- Review plan output before applying
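For the state-locking practice above, the `s3` backend can lock via a DynamoDB table. A sketch extending the backend block shown earlier (the table name is an assumption, and the table must have a string hash key named `LockID`):

```hcl
backend "s3" {
  bucket         = "my-terraform-state"
  key            = "prod/terraform.tfstate"
  region         = "us-east-1"
  encrypt        = true
  dynamodb_table = "terraform-locks" # hypothetical table; hash key must be "LockID"
}
```

With locking enabled, a second concurrent `terraform apply` fails fast instead of corrupting state.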
IaC is a fundamental skill for modern data engineering. Start small and expand coverage iteratively.