Terraform is easy to start and easy to outgrow. The first main.tf is fine; the hundredth resource in a single state file is a liability. Scaling Terraform is mostly about structure and discipline: DRY modules, isolated state, and automation that makes drift visible. Here's the layout we use.
1. Separate state per environment
One state file per environment, in a remote backend, keeps blast radius small. A bad apply in dev can never touch prod. Use a GCS (or S3) backend with state locking and versioning.
# prod/backend.tf
terraform {
  backend "gcs" {
    bucket = "acme-tfstate-prod"
    prefix = "platform"
  }
}
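The backend block only points at a bucket; versioning is a property of the bucket itself, while the GCS backend handles locking on its own. A minimal sketch of the bucket definition follows, assuming it lives in a separate bootstrap configuration. The file path, location, and retention count are illustrative assumptions, not part of the layout above.

```hcl
# bootstrap/state-bucket.tf (hypothetical path, applied once before the
# environments are initialized)
resource "google_storage_bucket" "tfstate_prod" {
  name     = "acme-tfstate-prod"
  location = "US" # assumed location

  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Object versioning lets you recover an earlier state file.
  versioning {
    enabled = true
  }

  # Keep a bounded history of old state versions (the count is an assumption).
  lifecycle_rule {
    condition {
      num_newer_versions = 20
      with_state         = "ARCHIVED"
    }
    action {
      type = "Delete"
    }
  }

  # Guard against deleting the state bucket by accident.
  lifecycle {
    prevent_destroy = true
  }
}
```

One bucket per environment, each with its own IAM bindings, also means CI credentials for dev cannot read or write prod state.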
2. Reusable modules, thin environments
Push real logic into versioned modules; keep each environment a thin composition that passes variables. This is the DRY payoff: fix a bug once, roll it everywhere.
terraform/
├── modules/
│   ├── network/
│   ├── gke/
│   └── cloud-sql/
├── dev/
├── stage/
└── prod/
# prod/main.tf
module "gke" {
  source       = "../modules/gke"
  cluster_name = "prod-gke"
  private      = true
  network      = module.network.vpc_self_link
  environment  = "prod"
}
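For the environment to stay thin, the module has to expose a small, typed interface. The sketch below shows what the input side of modules/gke could look like for the call above. The types, default, descriptions, and validation rule are assumptions inferred from the arguments passed in prod/main.tf.

```hcl
# modules/gke/variables.tf (sketch; types and defaults are assumptions)
variable "cluster_name" {
  type        = string
  description = "Name of the GKE cluster."
}

variable "private" {
  type        = bool
  default     = true
  description = "Whether the cluster is created as a private cluster."
}

variable "network" {
  type        = string
  description = "Self link of the VPC the cluster attaches to."
}

variable "environment" {
  type        = string
  description = "Deployment environment."

  # Reject typos such as "production" at plan time.
  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "The environment must be one of: dev, stage, prod."
  }
}
```

With the interface declared this way, dev/main.tf and stage/main.tf differ from prod/main.tf only in the values they pass.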
3. Pin everything
Unpinned providers and modules turn terraform init into a roll of the dice. Pin Terraform, providers, and module versions, and commit the lockfile (.terraform.lock.hcl), which records the exact provider versions and checksums that init selected.
terraform {
  required_version = "~> 1.9"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.40" }
  }
}
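The lockfile covers providers only, and a local path such as ../modules/gke is versioned only by whatever commit the repository is at. If modules need their own release cadence, one option is to source them from a tagged Git revision. The sketch below assumes a hypothetical repository URL and tag.

```hcl
# prod/main.tf (variant; the repository URL and the v1.4.0 tag are hypothetical)
module "gke" {
  # "//gke" selects a subdirectory of the repo; "?ref=" pins the revision.
  source = "git::https://github.com/acme/terraform-modules.git//gke?ref=v1.4.0"

  cluster_name = "prod-gke"
  private      = true
  network      = module.network.vpc_self_link
  environment  = "prod"
}
```

With this approach dev can move to a newer tag first while prod stays on the last known-good one. Modules published to a registry use a separate version argument instead of ?ref=.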
4. Make CI run the plan, and humans approve it
No more laptop applies. CI runs fmt, validate, a security scan, and plan on every PR; apply happens only after review on the protected branch. Authenticate with Workload Identity Federation / OIDC, never long-lived keys.
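# .github/workflows/terraform.yml (excerpt)
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
    service_account: ${{ secrets.TF_SA }}
# validate and plan both need an initialized working directory
- run: terraform init -input=false
- run: terraform fmt -check && terraform validate
# scan before planning, matching the order described above
- run: tfsec . && checkov -d .
- run: terraform plan -input=false -out=tfplan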
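The auth step only works if the Google side trusts GitHub's OIDC tokens. A sketch of that trust in Terraform follows, with the pool and provider IDs, the acme/infrastructure repository name, and the service account all chosen as placeholders. The workflow job also needs the id-token: write permission so GitHub will issue a token.

```hcl
# Hypothetical names throughout: pool/provider IDs, the "acme/infrastructure"
# repository, and the "terraform-ci" service account are placeholders.
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-actions"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"

  # Map claims from the GitHub token onto Google attributes.
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Only accept tokens issued for this one repository.
  attribute_condition = "assertion.repository == \"acme/infrastructure\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

resource "google_service_account" "terraform" {
  account_id   = "terraform-ci"
  display_name = "Terraform CI"
}

# Let workflows from that repository impersonate the service account.
resource "google_service_account_iam_member" "github_impersonation" {
  service_account_id = google_service_account.terraform.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/acme/infrastructure"
}
```

The provider's full resource name and the service account email are the values that would go into the WIF_PROVIDER and TF_SA secrets used in the workflow.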
5. Add policy-as-code guardrails
Catch dangerous changes before apply with OPA / Conftest or Sentinel: deny public buckets, require encryption, enforce tagging. The plan becomes the thing your policies evaluate.
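OPA/Conftest and Sentinel policies are written in their own languages and evaluate the plan from outside. As a complementary sketch in Terraform itself (a different technique, not a replacement for those tools), a module can encode the same three rules directly: no public access, encryption with a customer-managed key, and required labels. The variable names and the label keys below are assumptions.

```hcl
# modules/bucket/main.tf (hypothetical module; variable names are assumptions)
variable "bucket_name" {
  type = string
}

variable "location" {
  type = string
}

variable "kms_key_name" {
  type        = string
  description = "Cloud KMS key used as the bucket's default encryption key."
}

variable "labels" {
  type = map(string)
}

resource "google_storage_bucket" "data" {
  name     = var.bucket_name
  location = var.location
  labels   = var.labels

  # Deny public buckets.
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Require encryption with a customer-managed key.
  encryption {
    default_kms_key_name = var.kms_key_name
  }

  # Enforce labelling: the plan fails if a required key is missing.
  lifecycle {
    precondition {
      condition     = alltrue([for k in ["owner", "environment"] : contains(keys(var.labels), k)])
      error_message = "Buckets must carry both an owner and an environment label."
    }
  }
}
```

Checks like this only protect resources created through the module, which is why an external policy gate over the whole plan is still worth having.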
6. Treat drift as a first-class signal
Run a scheduled plan against production and alert on any non-empty diff. Drift means something changed outside Terraform, and that's exactly what you want to know about early.
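One way to sketch the scheduled plan is a separate GitHub Actions workflow that relies on terraform plan -detailed-exitcode, which exits 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The file name, cron schedule, and working directory below are assumptions; the failed run itself serves as the alert here, and routing it to chat or paging is left out.

```yaml
# .github/workflows/drift.yml (hypothetical file name and schedule)
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *" # daily at 06:00 UTC (assumed)
permissions:
  id-token: write # needed to request the OIDC token
  contents: read
jobs:
  plan-prod:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: prod # assumes prod/ sits at the repository root
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.TF_SA }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # use the raw binary so exit codes pass through unchanged
      - run: terraform init -input=false
      # Exit code 2 (drift) or 1 (error) fails the step and therefore the run.
      - run: terraform plan -input=false -detailed-exitcode
```

Because the job never applies, it can run with a read-only service account rather than the one used for deployments.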
Scaling checklist
- Remote backend with locking + versioning, state per environment
- Logic in versioned modules; environments stay thin
- Pinned Terraform, providers, and modules; lockfile committed
- CI runs fmt / validate / scan / plan on every PR
- OIDC / Workload Identity, no static credentials
- Policy-as-code gates (OPA/Conftest, tfsec, Checkov)
- Scheduled drift detection with alerts
Self-service Terraform with golden-path modules and policy guardrails is exactly what we're building into the ATechsCloud Infrastructure Automation Portal.