Introduction
Infrastructure as Code (IaC) is how modern teams build reliable systems. Instead of manually clicking through cloud consoles or SSHing into servers, you define infrastructure in code: testable, version-controlled, repeatable. This guide shows you practical patterns for Terraform, Ansible, and Kubernetes with real examples, not just theory.
Why Infrastructure as Code?
Consider a production outage scenario:
Without IaC:
- Database server dies
- You manually recreate it through AWS console (30 minutes)
- Forgot to enable backups? Another 15 minutes
- Need to reconfigure custom security groups? More time
- Total recovery: 2-4 hours
- Risk of missing steps = still broken
With IaC:
- Database server dies
- Run `terraform apply` (5 minutes)
- Everything recreates: RDS instance, security groups, backups, monitoring, IAM roles
- Verify with `terraform plan` before applying
- Total recovery: 15-30 minutes
- Zero manual steps = zero mistakes
The difference is stark when you’re in the middle of an incident.
Terraform: State Management in Practice
The Problem With Manual Infrastructure
Imagine this scenario: Two SREs both need to provision servers, so they both log into AWS console and create them. Now you have:
- Two separate places defining infrastructure (the console and people’s memories)
- No version history of what changed
- No rollback capability
- No way to know why a security group exists
- Manual changes drift from documentation
- Disaster recovery means starting from scratch
This is configuration drift, and it’s your enemy.
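One practical way to surface drift (on newer Terraform versions) is a refresh-only plan, which compares the state file against what actually exists without proposing changes:
# Detect drift without proposing changes (Terraform >= 0.15.4)
$ terraform plan -refresh-only
# Anything modified outside Terraform (console clicks, manual CLI changes)
# shows up here as a difference between state and reality
# A normal plan then shows what Terraform would change to get back
# to the configuration described in code
$ terraform plan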
Remote State: The Foundation
Terraform’s `.tfstate` file tracks what exists in your cloud. Without it, Terraform can’t know whether to create, update, or delete resources.
The wrong way (local state):
# Your laptop
terraform apply
# Creates infrastructure AND stores state in local terraform.tfstate
# Problem: If laptop crashes, you lose the state file
# Problem: Team members have different state files (chaos)
# Problem: State contains secrets (passwords, API keys)
The right way (remote state with encryption):
# versions.tf
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
This simple configuration solves multiple problems:
State Storage:
- State file stored in S3 (survives laptop crashes)
- Encrypted at rest (secrets are protected)
State Locking:
- DynamoDB table prevents concurrent modifications
- When you run `terraform apply`, it locks the state
- Another team member can’t apply until you’re done
- Prevents conflicts and corruption
Example scenario:
# Alice starts deployment
$ terraform apply
# Bob tries to apply at same time
$ terraform apply
# Waits... waiting... (DynamoDB lock acquired by Alice)
# After Alice finishes, Bob can proceed
# This prevents both modifying the same infrastructure at once
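If a lock is ever left behind, for example because a CI job was killed mid-apply, Terraform can release it manually; LOCK_ID below is a placeholder for the ID printed in the lock error message:
# Only do this after confirming nobody is actually running an apply
$ terraform force-unlock LOCK_ID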
Organizing Infrastructure Into Modules
Monolithic infrastructure code becomes unmaintainable fast:
Bad: Single 5000-line main.tf
# main.tf (everything dumped here)
resource "aws_vpc" "main" { ... } # 50 lines
resource "aws_subnet" "public_1" { ... } # 20 lines
resource "aws_subnet" "public_2" { ... } # 20 lines
resource "aws_route_table" "public" { ... } # 15 lines
# ... 4900 more lines
# You can't find anything. New team members are confused.
Good: Organized modules
infrastructure/
├── modules/
│   ├── networking/
│   │   ├── main.tf (VPC, subnets, routing)
│   │   ├── variables.tf (inputs)
│   │   └── outputs.tf (what other modules need)
│   ├── compute/
│   │   ├── main.tf (EC2, security groups)
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── database/
│       ├── main.tf (RDS, backups)
│       ├── variables.tf
│       └── outputs.tf
│
├── environments/
│   ├── dev/
│   │   └── main.tf
│   ├── staging/
│   │   └── main.tf
│   └── production/
│       └── main.tf
│
└── versions.tf
Each module has a single responsibility. This enables reuse and clarity.
Step 1: Define Module Inputs (variables.tf)
What it does:
The `variables.tf` file declares what inputs your module accepts. Think of it like function parameters: it defines what values the module needs to work. This is where you validate inputs and set defaults.
# modules/networking/variables.tf
# STEP 1: Define what CIDR block the VPC should use
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
# Validation: Only accept valid CIDR notation
# Examples: 10.0.0.0/16, 172.16.0.0/16, 192.168.0.0/16
validation {
condition = can(cidrhost(var.vpc_cidr, 0))
error_message = "Must be valid CIDR notation (e.g., 10.0.0.0/16)"
}
}
# STEP 2: Define which availability zones to use
# AZs are physical data centers in a region (us-east-1a, us-east-1b, etc.)
# We need at least 2 for high availability
variable "azs" {
description = "List of availability zones"
type = list(string)
# Example: ["us-east-1a", "us-east-1b"]
# Having 2 means if one data center fails, the other still works
}
# STEP 3: Define environment name (used for naming resources)
variable "environment" {
description = "Environment name (dev, staging, production)"
type = string
# Validates that only valid environment names are used
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Must be dev, staging, or production"
}
}
# STEP 4: Optional feature flag for NAT Gateway
# In dev, we might skip this to save costs
# In production, we need it for security
variable "enable_nat_gateway" {
description = "Create NAT gateway for private subnets to access internet"
type = bool
# Default: false (skip it unless explicitly enabled)
# NAT Gateway costs ~$32/month, so dev doesn't need it
default = false
}
# STEP 5: Tags for cost tracking and organization
# Tags are labels that help you identify and organize resources
variable "tags" {
description = "Tags to apply to all resources"
type = map(string)
# Example: { Environment = "prod", Team = "platform", CostCenter = "ops" }
# These help with:
# - AWS billing (who spent what)
# - Resource identification (which team owns this)
# - Automation (find all prod resources)
default = {}
}
Why this matters:
- Reusability: Same module works with different inputs (dev vs prod)
- Validation: Prevents invalid inputs from being applied (e.g., invalid CIDR)
- Documentation: Each variable explains what it does
- Safety: Errors caught before applying to cloud
Step 2: Create Resources (main.tf)
What it does:
The `main.tf` file contains the actual resource definitions. This is where you say “create a VPC”, “create subnets”, etc. It references the variables from `variables.tf` to customize behavior.
# modules/networking/main.tf
# RESOURCE 1: Create the Virtual Private Cloud (VPC)
# A VPC is like a private network in AWS
# CIDR block (10.0.0.0/16) means:
# - 10.0.0.0 to 10.0.255.255 = 65,536 IP addresses available
# - Subnets will carve out smaller chunks (10.0.0.0/24 = 256 IPs each)
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr # Use the CIDR from variables
enable_dns_hostnames = true # Allows friendly DNS names
enable_dns_support = true # Enables DNS resolution
# Tags are crucial for organization
tags = merge(var.tags, {
Name = "${var.environment}-vpc" # Creates: "dev-vpc", "prod-vpc", etc.
})
}
# RESOURCE 2: Create Public Subnets
# Public subnets are where you put load balancers and NAT gateways
# They have internet access directly
# We create one per availability zone for redundancy
# count = length(var.azs) means: if we have 2 AZs, create 2 subnets
resource "aws_subnet" "public" {
count = length(var.azs) # Creates one subnet per availability zone
vpc_id = aws_vpc.main.id # Put subnet in the VPC we created
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index)
# cidrsubnet breaks the VPC CIDR into smaller chunks
# Example: 10.0.0.0/16 becomes 10.0.0.0/20, 10.0.16.0/20, etc.
availability_zone = var.azs[count.index] # First subnet in first AZ, etc.
map_public_ip_on_launch = true # Instances here get public IPs automatically
tags = merge(var.tags, {
Name = "${var.environment}-public-${count.index + 1}"
Type = "public" # Tag helps identify purpose
})
}
# RESOURCE 3: Create Private Subnets
# Private subnets are for application servers and databases
# They don't have direct internet access (safer)
# Applications reach internet through NAT Gateway (in public subnet)
# We place them in the upper half of the CIDR block (after public subnets)
resource "aws_subnet" "private" {
count = length(var.azs) # One per availability zone, just like public
vpc_id = aws_vpc.main.id
# offset by number of public subnets (count.index + length(var.azs))
# so they don't overlap with public subnets
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.azs))
availability_zone = var.azs[count.index]
tags = merge(var.tags, {
Name = "${var.environment}-private-${count.index + 1}"
Type = "private"
})
}
# RESOURCE 4: Create NAT Gateway (Optional)
# NAT Gateway allows private subnets to reach the internet
# But internet can't reach back into private subnets (secure)
# Only created if enable_nat_gateway = true
# count = var.enable_nat_gateway ? 1 : 0 means:
# - If true: create 1 NAT Gateway
# - If false: create 0 (none)
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? 1 : 0
# NAT Gateway needs an Elastic IP address (public IP that doesn't change)
allocation_id = aws_eip.nat[0].id
# Put NAT in a public subnet (so it's accessible from internet)
subnet_id = aws_subnet.public[0].id
tags = merge(var.tags, {
Name = "${var.environment}-nat"
})
# Terraform best practice: depend on internet gateway existing first
depends_on = [aws_internet_gateway.main]
}
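The NAT Gateway above references an internet gateway and an Elastic IP that this module must also declare. A minimal sketch of those two resources, named to match the references in the code (the conditional count mirrors enable_nat_gateway):
# modules/networking/main.tf (continued)
# Internet Gateway: gives the public subnets a route to the internet
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(var.tags, {
    Name = "${var.environment}-igw"
  })
}

# Elastic IP for the NAT Gateway (only created when NAT is enabled)
resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? 1 : 0
  domain = "vpc"

  tags = merge(var.tags, {
    Name = "${var.environment}-nat-eip"
  })

  depends_on = [aws_internet_gateway.main]
}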
Why this structure:
- Modular: Each resource is a building block
- Reusable: Same code works for different environments
- Readable: Clear what each resource does
- Maintainable: Easy to find and update specific resources
Step 3: Define Module Outputs (outputs.tf)
What it does:
The `outputs.tf` file specifies what values the module exposes to other modules. If `variables.tf` is the input, `outputs.tf` is the output. Other modules need IDs, ARNs, and addresses from this module.
# modules/networking/outputs.tf
# OUTPUT 1: VPC ID
# Other modules (compute, database) need this to know which VPC to use
# Example value: vpc-0a1b2c3d4e5f6g7h8
output "vpc_id" {
value = aws_vpc.main.id
description = "VPC ID for reference by other modules"
# This value is used by compute module to create EC2 instances in this VPC
}
# OUTPUT 2: Public Subnet IDs
# Load balancers and NAT gateways need public subnets
# Example value: ["subnet-0abc123", "subnet-0def456"]
output "public_subnet_ids" {
value = aws_subnet.public[*].id
description = "Public subnet IDs where load balancers go"
# The [*].id syntax means: extract the id from each subnet
}
# OUTPUT 3: Private Subnet IDs
# Application servers and databases go in private subnets
# Example value: ["subnet-0ghi789", "subnet-0jkl012"]
output "private_subnet_ids" {
value = aws_subnet.private[*].id
description = "Private subnet IDs where apps and databases go"
}
# OUTPUT 4: NAT Gateway IP (if created)
# Other resources might need to know the NAT Gateway's public IP
output "nat_gateway_eip" {
value = var.enable_nat_gateway ? aws_eip.nat[0].public_ip : null
description = "Public IP of NAT Gateway for firewall rules"
# null means "not created" if NAT Gateway is disabled
}
Why outputs matter:
- Module communication: Lets modules pass data to each other
- Dependency management: Terraform uses outputs to understand what depends on what
- Documentation: Shows what useful values the module provides
Step 4: Use the Module (environments/*)
What it does: Now that you have a reusable networking module, use it in different environments with different configurations. This is where the real power of modules shows.
# environments/dev/main.tf
# This is for the DEV environment
# Dev is cheaper, simpler, fewer redundancies
module "vpc" {
# Where is the module code?
source = "../../modules/networking"
# Input configuration for DEV
vpc_cidr = "10.0.0.0/16" # Small network for dev
environment = "dev"
enable_nat_gateway = false # Don't spend money on NAT in dev
azs = ["us-east-1a", "us-east-1b"] # 2 AZs
tags = {
Environment = "dev"
Team = "platform"
CostCenter = "engineering" # Charge to engineering budget
}
}
# Use the outputs
output "dev_vpc_id" {
value = module.vpc.vpc_id
}
output "dev_subnet_ids" {
value = module.vpc.private_subnet_ids
}
# environments/production/main.tf
# This is for the PRODUCTION environment
# Production needs redundancy, security, and monitoring
module "vpc" {
source = "../../modules/networking"
# Input configuration for PRODUCTION
vpc_cidr = "10.100.0.0/16" # Larger network
environment = "production"
enable_nat_gateway = true # Must have NAT for security and compliance
azs = ["us-east-1a", "us-east-1b", "us-east-1c"] # 3 AZs for extra redundancy
tags = {
Environment = "production"
Team = "platform"
CostCenter = "ops" # Charge to operations budget
}
}
# Use the outputs
output "prod_vpc_id" {
value = module.vpc.vpc_id
}
output "prod_private_subnets" {
value = module.vpc.private_subnet_ids
}
output "prod_public_subnets" {
value = module.vpc.public_subnet_ids
}
Key differences between dev and prod:
Aspect | Dev | Production |
---|---|---|
VPC CIDR | 10.0.0.0/16 | 10.100.0.0/16 |
NAT Gateway | Disabled (cost savings) | Enabled (security requirement) |
AZs | 2 zones | 3 zones (more redundancy) |
CostCenter | engineering | ops |
The power of this approach:
- Same module code for both environments (no duplication)
- Different configurations per environment (flexibility)
- Easy to maintain: bug fix in module helps both dev and prod
- Easy to add staging: just copy production’s config, change CIDR and name
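Following that suggestion, a hypothetical environments/staging/main.tf could look like this (the CIDR value here is an assumption, not one from the guide):
# environments/staging/main.tf (illustrative)
module "vpc" {
  source = "../../modules/networking"

  vpc_cidr           = "10.50.0.0/16"   # assumed: a range between dev and prod
  environment        = "staging"
  enable_nat_gateway = true             # mirror production behavior
  azs                = ["us-east-1a", "us-east-1b", "us-east-1c"]

  tags = {
    Environment = "staging"
    Team        = "platform"
  }
}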
Step 5: Connecting Modules Together
What it does: Modules don’t live in isolation. The compute module needs to know which VPC to create instances in. This is where module outputs become inputs to other modules.
# environments/production/main.tf (expanded)
# First: Create networking
module "vpc" {
source = "../../modules/networking"
vpc_cidr = "10.100.0.0/16"
environment = "production"
enable_nat_gateway = true
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
tags = {
Environment = "production"
Team = "platform"
}
}
# Then: Create compute infrastructure in that VPC
module "compute" {
source = "../../modules/compute"
# Use VPC outputs as inputs
vpc_id = module.vpc.vpc_id # ← VPC output becomes compute input
subnet_ids = module.vpc.private_subnet_ids # ← same here
instance_count = 3
instance_type = "t3.large"
environment = "production"
tags = {
Environment = "production"
}
}
# Then: Create database in that VPC
module "database" {
source = "../../modules/database"
# Use VPC outputs again
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
# Database specific settings
engine = "postgres"
instance_class = "db.r5.large"
environment = "production"
tags = {
Environment = "production"
}
}
# Output the complete stack
output "app_url" {
value = module.compute.load_balancer_dns
}
output "database_endpoint" {
value = module.database.endpoint
}
How modules work together:
modules/networking/outputs.tf
        │ (outputs)
        │ vpc_id, subnet_ids
        ▼
modules/compute/variables.tf   ← (receives as input)
modules/database/variables.tf  ← (receives as input)
        │
        │ Creates resources using these IDs
        ▼
Complete production stack!
Same module code, different configurations. You’re not copy-pasting; you’re reusing.
Testing Infrastructure Changes
Before applying to production, validate your code:
# Syntax check
$ terraform validate
# Format check (consistency)
$ terraform fmt -recursive
# Security/policy scanning (e.g., tfsec, or OPA policies via conftest)
$ tfsec .
# Show exactly what will change
$ terraform plan -out=tfplan
# Review the plan carefully before applying
$ terraform apply tfplan
Real-world workflow:
# 1. Local development
$ terraform plan -var-file=environments/dev/terraform.tfvars
# Output: 3 resources will be created
# 2. Version control
$ git add -A
$ git commit -m "Add NAT gateway to dev environment"
$ git push origin feature/add-nat-gateway
# 3. Code review
# Team reviews the changes, sees exactly what will change
# 4. CI/CD runs automated checks
# GitHub Actions runs:
# - terraform validate
# - terraform plan
# - tfsec (security scanning)
# If all pass, PR can be merged
# 5. Auto-apply in CI/CD
$ terraform apply # After merge to main
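A minimal sketch of the automated checks described above, expressed as a GitHub Actions workflow; the file name, trigger, and exact steps are illustrative assumptions, with plan and tfsec noted as follow-on steps:
# .github/workflows/terraform-checks.yml (illustrative sketch)
name: terraform-checks
on:
  pull_request:
    paths:
      - "**.tf"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
      # terraform plan and a tfsec scan would run here as well,
      # with cloud credentials supplied via repository secrets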
Ansible: Idempotent Configuration Management
The Challenge: Making Things Repeatable
Manual configuration work breaks easily:
Without Ansible (manual SSH):
$ ssh web1.example.com
$ sudo apt-get update
$ sudo apt-get install -y nginx
$ sudo systemctl start nginx
$ sudo systemctl enable nginx
$ ssh web2.example.com
# Repeat the same commands... forgot a step? Now they're inconsistent
# New team member doesn't know what's installed where
# Disaster recovery = manually SSH to each server
With Ansible (repeatable):
- Define desired state once
- Apply to 100 servers with one command
- Re-run anytime, same result (idempotent)
- Document what’s on each server
- Version controlled playbooks
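For context, the inventories referenced later in this section (inventory/dev.ini, inventory/production.ini) are plain INI files; a minimal hypothetical dev inventory (hostnames and group layout are assumptions):
# inventory/dev.ini (illustrative)
[web]
web1.example.com
web2.example.com

# Group used by the deploy playbook later in this guide
[web_servers:children]
web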
Building Idempotent Playbooks
Idempotency means: running the playbook multiple times = same result (no changes on subsequent runs).
Bad: Non-idempotent task
---
- name: Configure web server
hosts: web
tasks:
- name: Run setup script
shell: /opt/setup.sh
# Problem: Runs every time, even if already configured
# Problem: Could fail if run twice
- name: Append to config file
shell: echo "ServerLimit 256" >> /etc/apache2/apache2.conf
# Problem: Each run appends more lines!
Good: Idempotent tasks
---
- name: Configure web server
hosts: web
tasks:
- name: Ensure required packages installed
package:
name: "{{ item }}"
state: present # Ensures installed, idempotent
loop:
- nginx
- openssl
- curl
- name: Copy nginx configuration
copy:
src: files/nginx.conf
dest: /etc/nginx/nginx.conf
owner: root
group: root
mode: '0644'
backup: yes # Creates a backup copy if the file changes
notify: restart nginx # Only restart if changed
- name: Ensure nginx is running and enabled
systemd:
name: nginx
state: started # Only starts if not running
enabled: yes # Only enables if not enabled
daemon_reload: yes
handlers:
- name: restart nginx
systemd:
name: nginx
state: restarted
What makes this idempotent:
- `state: present` checks if already installed
- `copy` module compares files, only updates if different
- `systemd` checks current status before acting
- Handlers only run if a notify was triggered
Running this multiple times:
# First run:
$ ansible-playbook site.yml
# Output: 3 changed (installed packages, copied config, started nginx)
# Second run:
$ ansible-playbook site.yml
# Output: 0 changed (everything already in desired state)
# This is idempotency - safe to run repeatedly
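You can also rehearse a run without changing anything: --check reports what would change, and --diff shows the file-level differences:
# Dry run: report what would change, but change nothing
$ ansible-playbook site.yml --check --diff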
Real Configuration Example
Role structure for a web application:
roles/
└── web_app/
    ├── files/
    │   └── nginx.conf          # Static nginx config
    ├── templates/
    │   └── app-env.j2          # Template with variables
    ├── tasks/
    │   ├── main.yml
    │   ├── install.yml
    │   ├── configure.yml
    │   └── deploy.yml
    ├── handlers/
    │   └── main.yml            # Service restart handlers
    ├── defaults/
    │   └── main.yml            # Default variables
    └── vars/
        └── main.yml            # Role-specific variables
# roles/web_app/tasks/main.yml
---
- name: Install dependencies
package:
name: "{{ item }}"
state: present
loop: "{{ packages_to_install }}"
- name: Create application user
user:
name: appuser
home: /home/appuser
shell: /bin/bash
createhome: yes
state: present
- name: Create app directory
file:
path: /opt/myapp
state: directory
owner: appuser
group: appuser
mode: '0755'
- name: Copy application files
copy:
src: app/ # Resolved relative to the role's files/ directory
dest: /opt/myapp/
owner: appuser
group: appuser
mode: '0755'
- name: Generate environment configuration
template:
src: app-env.j2
dest: /opt/myapp/.env
owner: appuser
group: appuser
mode: '0600' # Secrets file, restrictive permissions
notify: restart app service
- name: Install Python dependencies
pip:
requirements: /opt/myapp/requirements.txt
virtualenv: /opt/myapp/venv
become_user: appuser
- name: Create systemd service file
template:
src: app-service.j2
dest: /etc/systemd/system/myapp.service
owner: root
group: root
mode: '0644'
notify: restart app service
- name: Enable and start application
systemd:
name: myapp
state: started
enabled: yes
daemon_reload: yes
# roles/web_app/templates/app-env.j2
# Generated from Ansible template
ENVIRONMENT={{ app_environment }}
DATABASE_URL=postgresql://{{ db_user }}:{{ db_password }}@{{ db_host }}/{{ db_name }}
LOG_LEVEL={{ log_level }}
SECRET_KEY={{ secret_key }}
API_TIMEOUT=30
# roles/web_app/handlers/main.yml
---
- name: restart app service
systemd:
name: myapp
state: restarted
listen: "app service needs restart"
# roles/web_app/defaults/main.yml
---
packages_to_install:
- python3
- python3-pip
- postgresql-client
- curl
app_environment: production
log_level: info
Using the role:
# playbooks/deploy.yml
---
- name: Deploy web application
hosts: web_servers
become: yes
roles:
- web_app
vars:
app_environment: "{{ target_env }}" # From -e flag
db_host: "db.example.com"
db_user: "app_user"
db_password: "{{ vault_db_password }}" # From Ansible vault
db_name: "app_db"
Deploying:
# Deploy to dev
$ ansible-playbook playbooks/deploy.yml \
-i inventory/dev.ini \
-e target_env=dev
# Deploy to production
$ ansible-playbook playbooks/deploy.yml \
-i inventory/production.ini \
-e target_env=production
This playbook is idempotent. Run it 10 times, same result.
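The vault_db_password value above is expected to come from Ansible Vault. One way to create and supply it (the literal password is a placeholder):
# Encrypt a single value and paste the output into a vars file
$ ansible-vault encrypt_string 'S3cr3tPassw0rd' --name 'vault_db_password'

# Supply the vault password when running the playbook
$ ansible-playbook playbooks/deploy.yml \
    -i inventory/production.ini \
    -e target_env=production \
    --ask-vault-pass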
Kubernetes: GitOps and IaC
The GitOps Pattern
Modern Kubernetes deployments follow GitOps: your Git repository is the source of truth for your entire cluster.
Without GitOps (manual kubectl):
$ kubectl apply -f app.yaml
$ kubectl set image deployment/app app=myapp:v2.1
$ kubectl scale deployment/app --replicas=5
$ kubectl port-forward pod/debug-xyz 8080:8080
# Now your cluster state is different from your Git repo
# No audit trail of who changed what
# New team member doesn't know how to recreate the cluster
# Disaster recovery = starting from scratch
With GitOps (all in Git):
gitops-repo/
├── base/
│   ├── kustomization.yaml
│   ├── app-deployment.yaml
│   ├── app-service.yaml
│   └── app-config.yaml
│
├── overlays/
│   ├── dev/
│   │   └── kustomization.yaml (3 replicas, dev domain)
│   ├── staging/
│   │   └── kustomization.yaml (5 replicas, staging domain)
│   └── production/
│       └── kustomization.yaml (10 replicas, prod domain)
│
└── .github/workflows/
    └── deploy.yml (ArgoCD watches this repo)
All changes flow through Git:
- Edit YAML in Git
- Create PR
- Team reviews
- Merge to main
- ArgoCD automatically syncs to cluster
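The ArgoCD side of this is itself declared as YAML. A sketch of an Application resource pointed at the production overlay (the repository URL and names are assumptions):
# argocd/app-production.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual kubectl changes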
Example: Deploying new app version
# base/app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
namespace: default
spec:
replicas: 3 # Overridden by overlays
selector:
matchLabels:
app: app
template:
metadata:
labels:
app: app
version: v2.1
spec:
containers:
- name: app
image: myregistry/app:v2.1 # Image tag in Git
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
To deploy a new version:
# Instead of: kubectl set image deployment/app app=myapp:v2.2
# Do this:
$ git checkout -b bump-app-version
# Edit: base/app-deployment.yaml, change image tag to v2.2
$ git add base/app-deployment.yaml
$ git commit -m "Bump app to v2.2"
$ git push origin bump-app-version
# Open PR → team reviews → merge
# ArgoCD automatically sees the change and syncs
# Result: New version running, change tracked in Git
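The overlays referenced above are ordinary Kustomize files. A sketch of what overlays/production/kustomization.yaml might contain (the replica count follows the tree above; the domain override, typically an Ingress patch, is omitted):
# overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

# Production runs 10 replicas instead of the base's 3
replicas:
  - name: app
    count: 10

# Pin the image tag for this environment
images:
  - name: myregistry/app
    newTag: v2.1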
Resource Limits and Requests
Kubernetes requires you to think about resource usage:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: production
spec:
replicas: 3
template:
spec:
containers:
- name: api
image: myapp:2.1
resources:
requests: # Minimum resources needed
cpu: 200m # 0.2 CPU cores
memory: 256Mi # 256 MB
limits: # Maximum resources allowed
cpu: 1000m # 1 CPU core
memory: 1Gi # 1 GB
# Note: exceeding the memory limit gets the container OOMKilled;
# exceeding the CPU limit only throttles the container
# Solution: use monitoring to adjust requests/limits over time
What happens:
Pod requests 200m CPU, limit 1000m
├── Scheduler places it on a node with at least 200m free
├── Pod can use 200m-1000m depending on contention
└── If the pod tries to use >1000m, it gets throttled (CPU is not a kill condition)

Pod requests 1Gi memory, limit 2Gi
├── Scheduler reserves 1Gi for placement decisions
├── Pod can use 1Gi-2Gi
└── If it tries to use >2Gi, the container is OOMKilled and restarted
Good practices:
# Requests = what the pod needs to run (for scheduling)
# Limits = maximum before it gets killed (for safety)
# Conservative but safe:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Watch your actual usage with:
# kubectl top pod POD_NAME
# kubectl top node
# Adjust after observing real usage for 1-2 weeks
Security with RBAC and Service Accounts
What is RBAC?
RBAC stands for Role-Based Access Control. In Kubernetes, it’s a security mechanism that answers three questions:
Who can do what?
- Who: Service accounts (identities for applications)
- Do what: Specific actions (get, list, create, delete)
- On what: Specific resources (pods, secrets, configmaps)
Without RBAC - The Problem:
Imagine your application is deployed in Kubernetes with no deliberate security configuration:
- Your app may run as the root user inside its container
- Its service account may have far broader permissions than it needs
- If a hacker compromises your app, they inherit all of those permissions
- They can steal secrets, delete workloads, access other applications
Compromised App → Broad Cluster Access → Data Breach
With RBAC - The Solution:
Your app has a service account with minimal permissions:
- Can only read its own ConfigMap
- Can only read its own Secret
- Cannot list other secrets
- Cannot delete pods
- Cannot access other namespaces
Compromised App → Limited to its own resources → Breach contained
Why Do We Need It?
Real-world scenario: Data breach through a compromised application
1. Attacker finds SQL injection in your app
2. Exploits it to run commands inside pod
3. WITHOUT RBAC:
- Attacker runs: kubectl get secrets -A
- Gets ALL secrets from ALL namespaces
- Finds database credentials for production database
- Accesses production data
- Data breach: 100 million users affected
4. WITH RBAC:
- Attacker runs: kubectl get secrets -A
- Kubernetes returns: "Error: permission denied"
- Attacker only has access to this app's one secret
- Cannot see other applications' credentials
- Breach limited to this one app's data
The principle: Principle of Least Privilege
- Give each application the MINIMUM permissions it needs
- If app only needs to read ConfigMap, don’t give it Secret access
- If app only needs one Secret, don’t give it all Secrets
- If app doesn’t need to delete pods, don’t give it that permission
Step 1: Create a Service Account
A service account is an identity for your application (like a user account for apps).
---
# Step 1: Create a service account
apiVersion: v1
kind: ServiceAccount
metadata:
name: app # Name of the service account
namespace: production # Only valid in this namespace
# Why you need this:
# - Each app should have its own identity
# - Kubernetes authenticates using this account
# - Audit logs show which account did what
# - Makes security easier to manage
What happens without a service account:
- Pod uses default service account
- Default account often has too many permissions
- Hard to track which app did what in logs
- Security risk if default is compromised
Step 2: Create a Role with Minimal Permissions
A role defines what actions are allowed on which resources.
---
# Step 2: Create a role with EXACTLY the permissions needed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: app # Role name
namespace: production # Only for this namespace
rules:
# RULE 1: Read ConfigMaps, but only the app's own ConfigMap
- apiGroups: [""] # Empty string = core API
resources: ["configmaps"] # Only ConfigMaps
resourceNames: ["app-config"] # ONLY this specific ConfigMap!
verbs: ["get", "list", "watch"] # Only read operations
# Example: App needs to read configuration
# Allowed: kubectl get configmap app-config
# Denied: kubectl get configmap other-config
# Denied: kubectl delete configmap app-config
# RULE 2: Read Secrets, but only the app's own Secret
- apiGroups: [""]
resources: ["secrets"] # Only Secrets
resourceNames: ["app-secret"] # ONLY this specific Secret!
verbs: ["get"] # Only get (not list, not delete)
# Example: App needs database password from Secret
# Allowed: kubectl get secret app-secret
# Denied: kubectl get secret admin-secret
# Denied: kubectl list secrets (can't see all secrets)
# This is important! Even listing secrets can be a leak!
# What's NOT in this role:
# - Can't create pods (can't spawn new containers)
# - Can't delete pods (can't break cluster)
# - Can't create secrets (can't store malicious data)
# - Can't access other namespaces (confined to production)
Real example of what happens:
# App inside pod tries to:
$ kubectl get secrets
# Result: Error! Permission denied
# Why: Role only allows "get" on specific secret "app-secret", not "list" all secrets
$ kubectl get secret app-secret
# Result: Success! App gets its credentials
# Why: Role specifically allows this
$ kubectl delete configmap app-config
# Result: Error! Permission denied
# Why: Role only allows "get, list, watch" - not "delete"
Step 3: Bind Role to Service Account
A RoleBinding connects a Role to a ServiceAccount, saying “this service account has this role”.
---
# Step 3: Create a RoleBinding (connect Role to ServiceAccount)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: app # Name of the binding
namespace: production # Same namespace as Role
roleRef: # Reference to the Role
apiGroup: rbac.authorization.k8s.io
kind: Role
name: app # Name of the Role we created above
subjects: # Who gets this role
- kind: ServiceAccount # It's a service account
name: app # The service account name
namespace: production # In this namespace
What this accomplishes:
- Service account “app” now has the permissions defined in Role “app”
- Any pod using this service account gets these permissions
- Multiple service accounts can have the same role
- Multiple roles can be applied to one service account
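You can check the result from your workstation by impersonating the service account; kubectl auth can-i asks the API server whether a given action would be allowed (names and namespace follow the manifests above):
# Should return "yes": the role allows get on its own secret
$ kubectl auth can-i get secret/app-secret \
    --as=system:serviceaccount:production:app -n production

# Should return "no": listing all secrets is not granted
$ kubectl auth can-i list secrets \
    --as=system:serviceaccount:production:app -n production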
Step 4: Use Service Account in Deployment
Now configure your pod to use this restricted service account.
---
# Step 4: Use the service account in your deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: app
namespace: production
spec:
replicas: 3
template:
spec:
# CRITICAL: Use the restricted service account
serviceAccountName: app # Use our service account!
# CRITICAL: Don't run as root
securityContext:
runAsNonRoot: true # Refuse to run as root
runAsUser: 1000 # Run as unprivileged user (UID 1000)
fsGroup: 2000 # File system group for volume mounts
containers:
- name: app
image: myapp:2.1
# Container-level security settings
securityContext:
# Prevent privilege escalation (sudo-like operations)
allowPrivilegeEscalation: false
# Read-only filesystem (app can't modify container)
# If compromised, attacker can't install tools or backdoors
readOnlyRootFilesystem: true
# Drop all Linux capabilities
# Prevents: mount, network operations, privilege escalation
capabilities:
drop:
- ALL # Drop EVERYTHING by default
# If your app needs specific capabilities, add them back:
# add:
# - NET_BIND_SERVICE # Only if it needs to bind to ports <1024
Why each security setting matters:
Setting | Protection |
---|---|
serviceAccountName | Only access allowed resources |
runAsNonRoot | Can’t run as root (blocks full cluster takeover) |
runAsUser: 1000 | Uses unprivileged user (limited damage) |
allowPrivilegeEscalation: false | Can’t escalate to root with sudo |
readOnlyRootFilesystem | Can’t install backdoors/malware |
drop: ALL capabilities | Can’t do system-level operations |
Real-World Security Scenario
Scenario: Kubernetes cluster with 5 applications
# App 1: Web Frontend (needs only to read config)
apiVersion: v1
kind: ServiceAccount
metadata:
name: web-frontend
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: web-frontend
rules:
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["frontend-config"]
verbs: ["get"]
---
# App 2: API Server (needs config + database secret + logging)
apiVersion: v1
kind: ServiceAccount
metadata:
name: api-server
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: api-server
rules:
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["api-config"]
verbs: ["get"]
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["db-credentials", "api-key"]
verbs: ["get"]
---
# App 3: Background Worker (needs only message queue secret)
apiVersion: v1
kind: ServiceAccount
metadata:
name: worker
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: worker
rules:
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["queue-credentials"]
verbs: ["get"]
If API Server is compromised:
Attacker can:
✓ Read api-config ConfigMap
✓ Read db-credentials Secret
✓ Read api-key Secret

Attacker CANNOT:
✗ Read frontend-config (not allowed)
✗ Read queue-credentials (not allowed)
✗ Create new secrets (no permission)
✗ Delete pods (no permission)
✗ Access admin account (different service account)
Damage contained! Without RBAC, attacker could access everything.
Multi-Tool Orchestration
Understanding the Complete Pipeline
Real infrastructure deployments combine multiple tools in sequence:
The complete flow (AWS best practice):
1. Infrastructure Layer (Terraform)
        ↓
2. Application Layer (Kubernetes / EKS)
        ↓
3. Running Applications
Using AWS EKS simplifies this dramatically:
Why EKS over manual Kubernetes on EC2?
Aspect | EC2 + Manual K8s | AWS EKS |
---|---|---|
Setup Time | 2-3 hours | 15-20 minutes |
Maintenance | You manage everything | AWS manages control plane |
Updates | Manual version upgrades | Automated updates |
Security Patches | You apply them | AWS applies them |
Multi-AZ | Manual setup | Built-in by default |
Cost | Lower (you manage it) | Higher (but less operational work) |
Best for | Learning, custom needs | Production, managed service |
With EKS, you only manage worker nodes. AWS manages:
- Kubernetes control plane (API server, etcd, scheduler)
- Master node availability
- Security patches
- Updates
- Backups
Terraform + EKS Deployment (Simplified)
File structure:
infrastructure/
├── main.tf        (Create VPC and EKS cluster)
├── variables.tf   (Input variables)
├── outputs.tf     (Cluster info for kubectl)
└── environments/
    ├── dev/
    │   └── terraform.tfvars
    └── production/
        └── terraform.tfvars
Step 1: Create VPC for EKS
# main.tf - Part 1: Network Infrastructure
# Create VPC
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.environment}-vpc"
}
}
# Create public subnets (for load balancers and NAT)
resource "aws_subnet" "public" {
count = length(var.azs)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index)
availability_zone = var.azs[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.environment}-public-${count.index + 1}"
"kubernetes.io/role/elb" = "1" # EKS needs this tag
}
}
# Create private subnets (for EKS worker nodes)
resource "aws_subnet" "private" {
count = length(var.azs)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.azs))
availability_zone = var.azs[count.index]
tags = {
Name = "${var.environment}-private-${count.index + 1}"
"kubernetes.io/role/internal-elb" = "1" # EKS needs this tag
}
}
# Create Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.environment}-igw"
}
}
# Create NAT Gateway for private subnet internet access
resource "aws_eip" "nat" {
count = length(var.azs)
domain = "vpc"
tags = {
Name = "${var.environment}-nat-eip-${count.index + 1}"
}
depends_on = [aws_internet_gateway.main]
}
resource "aws_nat_gateway" "main" {
count = length(var.azs)
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = {
Name = "${var.environment}-nat-${count.index + 1}"
}
depends_on = [aws_internet_gateway.main]
}
# Route tables for public subnets
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "${var.environment}-public-rt"
}
}
resource "aws_route_table_association" "public" {
count = length(aws_subnet.public)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
# Route tables for private subnets
resource "aws_route_table" "private" {
count = length(var.azs)
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
tags = {
Name = "${var.environment}-private-rt-${count.index + 1}"
}
}
resource "aws_route_table_association" "private" {
count = length(aws_subnet.private)
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
Step 2: Create IAM roles for EKS
# main.tf - Part 2: IAM Roles
# IAM role for EKS cluster (control plane)
resource "aws_iam_role" "eks_cluster_role" {
name = "${var.environment}-eks-cluster-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "eks.amazonaws.com" # EKS service can assume this role
}
}
]
})
}
# Attach required policy for cluster
resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.eks_cluster_role.name
}
# IAM role for EKS worker nodes
resource "aws_iam_role" "eks_worker_role" {
name = "${var.environment}-eks-worker-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com" # EC2 instances (nodes) can assume this role
}
}
]
})
}
# Attach required policies for worker nodes
resource "aws_iam_role_policy_attachment" "eks_worker_node_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.eks_worker_role.name
}
resource "aws_iam_role_policy_attachment" "eks_cni_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.eks_worker_role.name
}
resource "aws_iam_role_policy_attachment" "eks_registry_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.eks_worker_role.name
}
# Instance profile for worker nodes
resource "aws_iam_instance_profile" "eks_worker_profile" {
name = "${var.environment}-eks-worker-profile"
role = aws_iam_role.eks_worker_role.name
}
Step 3: Create EKS Cluster
# main.tf - Part 3: EKS Cluster
# Security group for EKS cluster
resource "aws_security_group" "eks_cluster" {
name = "${var.environment}-eks-cluster-sg"
description = "Security group for EKS cluster"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # In production, restrict this!
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.environment}-eks-cluster-sg"
}
}
# Create EKS Cluster
resource "aws_eks_cluster" "main" {
name = "${var.environment}-cluster"
role_arn = aws_iam_role.eks_cluster_role.arn
version = var.kubernetes_version # e.g., "1.27"
vpc_config {
subnet_ids = concat(aws_subnet.private[*].id, aws_subnet.public[*].id)
endpoint_private_access = true # Internal access
endpoint_public_access = true # External access via kubectl
security_group_ids = [aws_security_group.eks_cluster.id]
}
enabled_cluster_log_types = [
"api",
"audit",
"authenticator",
"controllerManager",
"scheduler"
]
tags = {
Name = "${var.environment}-eks-cluster"
}
depends_on = [
aws_iam_role_policy_attachment.eks_cluster_policy
]
}
# Create EKS Node Group (managed worker nodes)
resource "aws_eks_node_group" "main" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.environment}-node-group"
node_role_arn = aws_iam_role.eks_worker_role.arn
subnet_ids = aws_subnet.private[*].id
scaling_config {
desired_size = var.desired_node_count
max_size = var.max_node_count
min_size = var.min_node_count
}
instance_types = [var.node_instance_type] # e.g., "t3.medium"
tags = {
Name = "${var.environment}-node-group"
}
depends_on = [
aws_iam_role_policy_attachment.eks_worker_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
aws_iam_role_policy_attachment.eks_registry_policy
]
}
Step 4: Output Kubeconfig Information
# outputs.tf
output "cluster_name" {
value = aws_eks_cluster.main.name
description = "EKS cluster name"
}
output "cluster_endpoint" {
value = aws_eks_cluster.main.endpoint
description = "EKS cluster API endpoint"
}
output "cluster_version" {
value = aws_eks_cluster.main.version
description = "EKS cluster Kubernetes version"
}
# Command to update kubeconfig
output "configure_kubectl" {
value = "aws eks update-kubeconfig --region ${var.aws_region} --name ${aws_eks_cluster.main.name}"
description = "Command to configure kubectl"
}
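The configuration above references several input variables that live in variables.tf; a minimal sketch, with defaults that are assumptions rather than recommendations:
# variables.tf (illustrative)
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "azs" {
  type = list(string)
}

variable "kubernetes_version" {
  type    = string
  default = "1.27"
}

variable "node_instance_type" {
  type    = string
  default = "t3.medium"
}

variable "desired_node_count" {
  type    = number
  default = 2
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 3
}
A hypothetical environments/dev/terraform.tfvars then only sets what differs per environment:
# environments/dev/terraform.tfvars (illustrative)
environment        = "dev"
vpc_cidr           = "10.0.0.0/16"
azs                = ["us-east-1a", "us-east-1b"]
node_instance_type = "t3.medium"
desired_node_count = 2
min_node_count     = 1
max_node_count     = 3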
Simplified Deployment Script
File: deploy-infrastructure.sh
#!/bin/bash
# Simplified deployment: Terraform + EKS only
# No need for Ansible anymore!
set -e
ENVIRONMENT=${1:-dev}
REGION=${2:-us-east-1}
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββ"
echo "β Deploying EKS Infrastructure: ${ENVIRONMENT} β"
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββ"
# --------------------------------------------------------
# PHASE 1: PROVISION INFRASTRUCTURE WITH TERRAFORM
# --------------------------------------------------------
echo ""
echo "==> PHASE 1: Creating VPC and EKS cluster..."
cd terraform/
terraform init
terraform plan \
-var-file="environments/${ENVIRONMENT}/terraform.tfvars" \
-var="aws_region=$REGION" \
-out=tfplan
terraform apply tfplan
# Get cluster info
CLUSTER_NAME=$(terraform output -raw cluster_name)
KUBECONFIG_CMD=$(terraform output -raw configure_kubectl)
cd ..
echo " β’ EKS Cluster: $CLUSTER_NAME"
echo " β’ Region: $REGION"
# --------------------------------------------------------
# PHASE 2: CONFIGURE KUBECTL
# --------------------------------------------------------
echo ""
echo "==> PHASE 2: Configuring kubectl..."
# Update kubeconfig (AWS managed, no SSH needed!)
eval $KUBECONFIG_CMD
# Verify cluster access
kubectl get nodes
echo " β’ Cluster access configured"
echo " β’ Worker nodes ready"
# --------------------------------------------------------
# PHASE 3: DEPLOY APPLICATIONS
# --------------------------------------------------------
echo ""
echo "==> PHASE 3: Deploying applications..."
# Wait for nodes to be ready
kubectl wait --for=condition=Ready node --all --timeout=600s
# Deploy applications
kubectl apply -k kubernetes/overlays/${ENVIRONMENT}
echo " β’ Applications deployed"
echo " β’ Services configured"
# --------------------------------------------------------
# VERIFY DEPLOYMENT
# --------------------------------------------------------
echo ""
echo "==> VERIFICATION"
echo "  • Nodes:"
kubectl get nodes -o wide
echo ""
echo " β’ Pods:"
kubectl get pods -A
echo " β’ Services:"
kubectl get svc -A
# --------------------------------------------------------
# FINAL STATUS
# --------------------------------------------------------
echo ""
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββ"
echo "β β
DEPLOYMENT COMPLETE! β"
echo "β βββββββββββββββββββββββββββββββββββββββββββββββββ£"
echo "β EKS Cluster: $CLUSTER_NAME β"
echo "β Region: $REGION β"
echo "β β"
echo "β Access your applications: β"
echo "β β kubectl get svc -A β"
echo "β β kubectl port-forward svc/... β"
echo "β β"
echo "β View cluster: β"
echo "β β AWS Console: EKS β Clusters β"
echo "β β AWS CloudWatch Logs β"
echo "β β"
echo "β Kubectl context: β"
echo "β β kubectl config current-context β"
echo "β β kubectl cluster-info β"
echo "β β"
echo "β Timestamp: $(date) β"
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββ"
Why EKS is Better for AWS
Advantages:
- No Configuration Layer - EKS is fully managed, no Ansible needed
- Automated Control Plane - AWS handles master nodes, upgrades, patches
- Multi-AZ by Default - Spreads across 3 availability zones
- Integrated with AWS Services - RDS, ALB, IAM, CloudWatch, VPC
- Security Patches Automatic - AWS patches vulnerabilities immediately
- Simpler Backup/Recovery - Managed by AWS
- Compliance - Easier to meet regulatory requirements
Simpler deployment flow:
Terraform creates:
  ✓ VPC with subnets
  ✓ EKS cluster (control plane)
  ✓ Node groups (worker nodes)
  ✓ IAM roles

kubectl applies:
  ✓ Deployments
  ✓ Services
  ✓ Ingress
  ✓ ConfigMaps, Secrets
No Ansible needed because:
- EKS worker nodes come pre-configured
- Kubernetes, Docker, and CNI already installed
- Security hardening already applied by AWS
- Just add your applications with kubectl
Infrastructure as Code with EKS
Your final deployment:
Terraform     → Create cloud infrastructure (VPC, EKS, nodes)
kubectl/Helm  → Deploy containerized applications
ArgoCD/Flux   → Continuous GitOps synchronization
This is the AWS-native, production-recommended approach.