Lessons Learned Running Terraform at Scale - Nytra

Terraform is the lingua franca of infrastructure as code. But running it at scale — across multiple teams, environments, and cloud providers — introduces challenges that aren’t obvious when you’re managing a single main.tf file.

State management is everything

The single most important decision in a Terraform-at-scale setup is how you manage state. We’ve converged on a few principles:

One state file per environment per service. This limits blast radius and reduces lock contention.
Remote state with locking is non-negotiable: S3 with a DynamoDB lock table on AWS, GCS (which locks natively) on GCP.
State should be read-only to humans. All state modifications happen through CI/CD pipelines.
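
As a rough illustration of the first two principles, a backend block for one service in one environment might look like the sketch below. The bucket, key layout, and lock table names are placeholders rather than our actual configuration. The point is that the key encodes both service and environment, so each pair gets its own state file and its own lock.

```hcl
# Sketch only: bucket, key, and table names are hypothetical.
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical state bucket

    # One key per service per environment keeps state files small and isolated.
    key    = "payments/production/terraform.tfstate"
    region = "us-east-1"

    # Lock table; the S3 backend expects a string partition key named "LockID".
    dynamodb_table = "example-terraform-locks"
    encrypt        = true # encrypt the state object at rest
  }
}
```

Recent Terraform releases also offer an S3-native lock file option (use_lockfile), so check the backend documentation for the version you run before creating a lock table.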

Module design matters

Poorly designed modules create more problems than they solve. Our module design guidelines:

Modules should represent a single logical resource group (e.g., “VPC”, “EKS cluster”, “RDS instance”).
Pin module versions. Publish modules through a registry (such as the Terraform Cloud private registry) or reference tagged Git releases, so every consumer names an exact version.
Expose only the variables that consumers need. Internal implementation details should be hidden.

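module "vpc" {
  # Private registry source, pinned to an exact release so upgrades are deliberate.
  source  = "app.terraform.io/nytra/vpc/aws"
  version = "2.1.0"

  # The only inputs a consumer has to supply.
  name       = "production"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}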
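
To make the "expose only what consumers need" guideline concrete, here is a minimal sketch of what the inside of such a module could look like. It is not the real nytra/vpc module: the validation rules, the two-zone minimum, and the vpc_id output are assumptions chosen for illustration, and the subnet resources that would consume var.azs are omitted.

```hcl
# Hypothetical module internals (variables, resource, output shown together).

variable "name" {
  type        = string
  description = "Name tag applied to the VPC."
}

variable "cidr_block" {
  type        = string
  description = "IPv4 CIDR range for the VPC."

  validation {
    # cidrhost() errors on a malformed CIDR; can() turns that error into false.
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "The cidr_block value must be a valid CIDR range such as 10.0.0.0/16."
  }
}

variable "azs" {
  type        = list(string)
  description = "Availability zones to spread subnets across."

  validation {
    # Assumed rule for this sketch: require at least two zones.
    condition     = length(var.azs) >= 2
    error_message = "Provide at least two availability zones."
  }
}

resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true # implementation detail, deliberately not a variable

  tags = {
    Name = var.name
  }
}

output "vpc_id" {
  description = "ID of the VPC, for consumers that attach other resources to it."
  value       = aws_vpc.this.id
}
```

For the Git-based approach, the pin moves into the source address, because the version argument only applies to registry sources. The repository URL below is a placeholder:

```hcl
module "vpc" {
  # ref pins a Git tag; example.com stands in for a real Git host.
  source = "git::https://example.com/nytra/terraform-aws-vpc.git?ref=v2.1.0"

  name       = "production"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
```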

Policy as code

Every Terraform workspace should have policy checks that run before apply:

Cost estimation: Flag changes that would significantly increase spend.
Security scanning: Check for public S3 buckets, overly permissive security groups, and unencrypted resources.
Naming conventions: Enforce consistent resource naming across the organization.

We use Open Policy Agent (OPA) or Sentinel depending on the client’s toolchain.
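
OPA and Sentinel policies are written in their own languages (Rego and Sentinel), so they are not shown here. As a complement rather than a replacement, Terraform's native validation and precondition blocks can catch some of the same violations earlier, inside the module itself. A minimal sketch, where the variable names and the naming pattern are hypothetical:

```hcl
variable "service_name" {
  type = string

  validation {
    # Assumed convention: lowercase letters, digits, and hyphens only.
    condition     = can(regex("^[a-z][a-z0-9-]*$", var.service_name))
    error_message = "The service_name value must start with a lowercase letter and contain only lowercase letters, digits, and hyphens."
  }
}

variable "vpc_id" {
  type = string
}

variable "ingress_cidrs" {
  type        = list(string)
  description = "CIDR ranges allowed to reach the service over HTTPS."
}

resource "aws_security_group" "app" {
  name   = "${var.service_name}-app"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.ingress_cidrs
  }

  lifecycle {
    # Fails the plan if the rule would be open to the whole internet.
    precondition {
      condition     = !contains(var.ingress_cidrs, "0.0.0.0/0")
      error_message = "Ingress from 0.0.0.0/0 is not allowed. List specific CIDR ranges instead."
    }
  }
}
```

Checks like these live in the same code they guard, so a module author can remove them. An external policy engine in the pipeline is still what enforces rules across the organization.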

The CI/CD pipeline

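A production Terraform pipeline should run these stages in order:

terraform fmt -check: enforce formatting
terraform validate: catch syntax errors and internally inconsistent configuration, such as references to undeclared variables
terraform plan -out=tfplan: generate the execution plan and save it for review
Policy checks: security, cost, and compliance rules evaluated against the saved plan
Manual approval (for production)
terraform apply tfplan: apply exactly the plan that was reviewed and approved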

Drift detection

Infrastructure drift is inevitable. Teams modify resources through the console, automated processes create resources outside of Terraform, and state diverges from reality.

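Run terraform plan on a schedule (we recommend hourly for critical infrastructure) and alert on unexpected diffs. Adding the -detailed-exitcode flag makes this easy to automate: the command exits with 0 when nothing has changed, 1 on error, and 2 when the plan contains changes. This turns drift from a slow-moving disaster into a manageable operational task.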
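
The schedule itself can be managed in Terraform. The sketch below assumes AWS and a CodeBuild project that runs the plan and raises the alert. That choice of runner, the resource names, and both input variables are assumptions for illustration; any scheduler that can start a CI job would serve the same purpose.

```hcl
variable "drift_check_project_arn" {
  type        = string
  description = "ARN of a CodeBuild project that runs terraform plan (hypothetical)."
}

variable "events_role_arn" {
  type        = string
  description = "IAM role EventBridge assumes to start the build (hypothetical)."
}

# Fires once an hour, matching the cadence suggested for critical infrastructure.
resource "aws_cloudwatch_event_rule" "drift_check" {
  name                = "terraform-drift-check"
  description         = "Runs a read-only terraform plan every hour."
  schedule_expression = "rate(1 hour)"
}

# Starts the CodeBuild project each time the rule fires.
resource "aws_cloudwatch_event_target" "drift_check" {
  rule     = aws_cloudwatch_event_rule.drift_check.name
  arn      = var.drift_check_project_arn
  role_arn = var.events_role_arn
}
```

The role passed in needs permission to start the build, and the build job is where the plan's exit status gets turned into an alert.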