Terraform at Scale

Terraform is easy to start and easy to outgrow. The first main.tf is fine; the hundredth resource in a single state file is a liability. Scaling Terraform is mostly about structure and discipline — DRY modules, isolated state, and automation that makes drift visible. Here's the layout we use.

1. Separate state per environment

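One state file per environment, in a remote backend, keeps the blast radius small: a bad apply in dev cannot modify prod's state, provided the environments also use separate credentials. Use a GCS (or S3) backend with state locking and versioning. The GCS backend locks state natively; object versioning is a bucket setting you have to enable yourself.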

# prod/backend.tf
terraform {
  backend "gcs" {
    bucket = "acme-tfstate-prod"
    prefix = "platform"
  }
}
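
The backend block assumes the bucket already exists. A minimal sketch of how it might be created, ideally from a separate bootstrap configuration so the bucket is not managed by the state it stores. The file path, the resource name tfstate_prod, the project ID, and the location are assumptions for illustration, not part of the layout above.

```hcl
# bootstrap/state_bucket.tf (hypothetical path; applied once, separately)
resource "google_storage_bucket" "tfstate_prod" {
  name     = "acme-tfstate-prod" # must match the bucket in the backend block
  project  = "acme-prod"         # assumed project ID
  location = "EU"                # assumed location

  # Keep access IAM-only and never public.
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Keep earlier state objects so a bad write can be rolled back.
  versioning {
    enabled = true
  }

  # Make Terraform refuse to destroy the bucket by accident.
  lifecycle {
    prevent_destroy = true
  }
}
```

Locking needs no extra resource here, because the GCS backend takes its lock inside the same bucket.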

2. Reusable modules, thin environments

Push real logic into versioned modules; keep each environment a thin composition that passes variables. This is the DRY payoff: fix a bug once, then roll it out everywhere.

terraform/
├── modules/
│   ├── network/
│   ├── gke/
│   └── cloud-sql/
├── dev/
├── stage/
└── prod/

# prod/main.tf
module "gke" {
  source       = "../modules/gke"
  cluster_name = "prod-gke"
  private      = true
  network      = module.network.vpc_self_link
  environment  = "prod"
}
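
A thin environment only works if the module declares a clear interface. The sketch below shows what the inputs used above might look like inside the module; the types, descriptions, default, and the validation rule are assumptions, since the module's internals are not shown here.

```hcl
# modules/gke/variables.tf (sketch; types and defaults are assumptions)
variable "cluster_name" {
  type        = string
  description = "Name of the GKE cluster."
}

variable "private" {
  type        = bool
  description = "Whether the cluster uses private nodes."
  default     = true
}

variable "network" {
  type        = string
  description = "Self link of the VPC the cluster attaches to."
}

variable "environment" {
  type        = string
  description = "Deployment environment."

  # Reject typos such as "production" before any resource is planned.
  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "The environment must be one of: dev, stage, prod."
  }
}
```

Validation in the module means every environment gets the same check without repeating it.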

3. Pin everything

Unpinned providers and modules turn terraform init into a roll of the dice. Pin Terraform, providers, and module versions, and commit the lockfile (.terraform.lock.hcl), which records the exact provider versions and checksums that init selected.

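terraform {
  # "~> 1.9" allows any 1.x release from 1.9 up; use "~> 1.9.0" to allow patch releases only.
  required_version = "~> 1.9"
  required_providers {
    # "~> 5.40" allows 5.40 and later 5.x; the lockfile fixes the exact version.
    google = { source = "hashicorp/google", version = "~> 5.40" }
  }
}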
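
Module versions need their own pin, because the lockfile covers providers only. A local path such as ../modules/gke always resolves to whatever is checked out, and the version argument applies only to registry modules. One option, sketched below, is to source the module from a Git tag. The repository URL and the tag v1.4.0 are hypothetical.

```hcl
# prod/main.tf (sketch; repository URL and tag are hypothetical)
module "gke" {
  # "//modules/gke" selects a subdirectory of the repository;
  # "?ref=" pins a tag, branch, or commit.
  source = "git::https://example.com/acme/terraform-modules.git//modules/gke?ref=v1.4.0"

  cluster_name = "prod-gke"
  private      = true
  network      = module.network.vpc_self_link
  environment  = "prod"
}
```

With this arrangement dev can move to a new tag first and prod follows once the change has proven itself.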

4. Make CI run the plan — and humans approve it

No more laptop applies. CI runs fmt, validate, a security scan, and plan on every PR; apply happens only after review on the protected branch. Authenticate with Workload Identity Federation / OIDC — never long-lived keys.

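# .github/workflows/terraform.yml (excerpt)
# The job needs `permissions: id-token: write` for OIDC authentication.
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
    service_account: ${{ secrets.TF_SA }}
- run: terraform fmt -check -recursive
- run: terraform init -input=false   # validate and plan need an initialized directory
- run: terraform validate
- run: tfsec . && checkov -d .
- run: terraform plan -input=false -out=tfplan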

5. Add policy-as-code guardrails

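Catch dangerous changes before apply with OPA / Conftest or Sentinel: deny public buckets, require encryption, enforce tagging. The plan becomes the thing your policies evaluate. With Conftest, that means exporting the saved plan as JSON (terraform show -json tfplan) and running the policies against that file.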

6. Treat drift as a first-class signal

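Run a scheduled plan against production and alert on any non-empty diff. This is easy to script with terraform plan -detailed-exitcode, which exits with 0 when there is nothing to change and 2 when changes are pending. Drift means something changed outside Terraform, and that is exactly what you want to know about early.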

Scaling checklist

Remote backend with locking + versioning, state per environment
Logic in versioned modules; environments stay thin
Pinned Terraform, providers, and modules; lockfile committed
CI runs fmt / validate / scan / plan on every PR
OIDC / Workload Identity — no static credentials
Policy-as-code gates (OPA/Conftest, tfsec, Checkov)
Scheduled drift detection with alerts

Self-service Terraform with golden-path modules and policy guardrails is exactly what we're building into the ATechsCloud Infrastructure Automation Portal.