Lessons Learned Running Terraform at Scale
The patterns and pitfalls we've encountered while managing Terraform across dozens of client environments.
by Nytra Team
Terraform is the lingua franca of infrastructure as code. But running it at scale — across multiple teams, environments, and cloud providers — introduces challenges that aren’t obvious when you’re managing a single main.tf file.
State management is everything
The single most important decision in a Terraform-at-scale setup is how you manage state. We’ve converged on a few principles:
- One state file per environment per service. This limits blast radius and reduces lock contention
- Remote state with locking is non-negotiable: S3 with a DynamoDB lock table on AWS, GCS (which locks natively) on GCP
- State should be read-only to humans. All state modifications happen through CI/CD pipelines
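As a rough sketch of the first two principles on AWS, a backend block along these lines gives one environment and service its own state object with locking. The bucket name, key layout, and table name are placeholders, not values from a real setup, and the DynamoDB table is assumed to already exist with a string partition key named LockID.

```hcl
# backend.tf for the production VPC stack (all names below are hypothetical)
terraform {
  backend "s3" {
    bucket = "example-terraform-state"          # placeholder bucket name
    key    = "production/vpc/terraform.tfstate" # one key per environment per service
    region = "us-east-1"

    dynamodb_table = "terraform-locks" # assumed to exist, partition key "LockID" (string)
    encrypt        = true              # server-side encryption for the state object
  }
}
```

Backend blocks cannot reference input variables, so each stack either carries its own copy of this block or receives the differing values through terraform init -backend-config.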
Module design matters
Poorly designed modules create more problems than they solve. Our module design guidelines:
- Modules should represent a single logical resource group (e.g., “VPC”, “EKS cluster”, “RDS instance”)
- Pin module versions. Publish modules through a registry (such as Terraform Cloud's private registry) or from Git with tagged releases
- Expose only the variables that consumers need. Internal implementation details should be hidden
module "vpc" {
  source  = "app.terraform.io/nytra/vpc/aws"
  version = "2.1.0"

  name       = "production"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
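To illustrate the third guideline, here is a hypothetical sketch of what the inside of such a vpc module might look like. The variable names mirror the call above, but the resource layout, the subnet sizing, and the output name are assumptions made for illustration, not the contents of the real module.

```hcl
# variables.tf: the only inputs consumers can set
variable "name" {
  type        = string
  description = "Name tag applied to the VPC."
}

variable "cidr_block" {
  type        = string
  description = "CIDR range for the VPC."

  validation {
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "The cidr_block value must be a valid CIDR range, such as 10.0.0.0/16."
  }
}

variable "azs" {
  type        = list(string)
  description = "Availability zones; one private subnet is created in each."
}

# main.tf: implementation details that stay hidden from consumers
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
  tags       = { Name = var.name }
}

resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = var.azs[count.index]
  # Assumed sizing: adds 8 bits to the prefix, so a /16 VPC yields one /24 per zone
  cidr_block = cidrsubnet(var.cidr_block, 8, count.index)
}

# outputs.tf: expose only what downstream stacks need
output "vpc_id" {
  value = aws_vpc.this.id
}
```

Consumers see three inputs and one output. How subnets are carved up can change between module versions without touching any caller.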
Policy as code
Every Terraform workspace should have policy checks that run before apply:
- Cost estimation: Flag changes that would significantly increase spend
- Security scanning: Check for public S3 buckets, overly permissive security groups, unencrypted resources
- Naming conventions: Enforce consistent resource naming across the organization
We use Open Policy Agent (OPA) or Sentinel depending on the client’s toolchain.
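OPA and Sentinel policies are written in their own languages (Rego and Sentinel), so they are not shown here. As a lightweight complement rather than a replacement, a naming rule can also be checked at the module boundary with Terraform's built-in variable validation. The variable name and the naming pattern below are a made-up convention for illustration only.

```hcl
# Hypothetical convention: <env>-<service>, using lowercase letters, digits, and hyphens
variable "resource_name" {
  type        = string
  description = "Resource name in the form <env>-<service>."

  validation {
    # regex() raises an error when nothing matches; can() turns that into false
    condition     = can(regex("^(dev|staging|prod)-[a-z0-9-]+$", var.resource_name))
    error_message = "The resource_name value must look like prod-billing-api: an environment prefix followed by lowercase letters, digits, or hyphens."
  }
}
```

A check like this fails at plan time inside the module, while organization-wide rules still belong in the policy layer that runs before apply.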
The CI/CD pipeline
A production Terraform pipeline should include these stages, in order:
- terraform fmt -check: enforce formatting
- terraform validate: catch syntax errors and internally inconsistent configuration
- terraform plan -out=tfplan: generate, save, and review the execution plan
- Policy checks: security, cost, compliance
- Manual approval (for production)
- terraform apply tfplan: execute exactly the plan that was reviewed
Drift detection
Infrastructure drift is inevitable. Teams modify resources through the console, automated processes create resources outside of Terraform, and state diverges from reality.
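Run terraform plan -detailed-exitcode on a schedule (we recommend hourly for critical infrastructure) and alert whenever it exits with code 2, which means the plan contains changes. This turns drift from a slow-moving disaster into a manageable operational task. Keep in mind that plan only sees resources Terraform already manages, so resources created entirely outside Terraform need a separate inventory check.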