
Advanced Terraform State Management in Azure


Handling Drift and State Migration at Enterprise Scale

Two engineers. One Azure subscription. Both ran terraform apply within minutes of each other from separate laptops, each with a local state file. The result: duplicate resource groups, competing NSG rules, and a Terraform state that no longer reflects what exists. Nobody knows what the ground truth is.

This is not a contrived scenario. It is what happens when state management is treated as an afterthought rather than a design decision. In enterprise Azure environments (multiple subscriptions, dozens of teams, hundreds of modules), state management is the single most operationally consequential part of a Terraform practice. Get it wrong and you are debugging phantom resources, failed plans, and corrupted state files under production pressure.

This post covers how to structure state correctly in Azure from day one, how to detect and respond to drift systematically, and the modern toolkit for state migration and refactoring at scale.

The Azure Blob Backend: What You Actually Need to Configure

The azurerm backend stores Terraform state in Azure Blob Storage. Azure automatically acquires a blob lease before any write operation, which is Terraform’s native state lock on Azure. There is no DynamoDB equivalent to set up: locking is built in and cannot be disabled.

A minimal backend configuration looks straightforward, but production setups need several settings that are not in the default documentation examples.

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateenterprise"
    container_name       = "tfstate"
    key                  = "teams/platform/network/terraform.tfstate"
    use_azuread_auth     = true   # OAuth2 token, not shared key
    use_msi              = true   # use Managed Identity in CI/CD
  }
}

Setting use_azuread_auth = true is the critical change from default. Without it, Terraform authenticates using the storage account access key, which never expires, cannot be scoped to a specific container, and shows up in plain text in pipeline logs if you are not careful. With Entra-based auth, you can scope access precisely: assign Storage Blob Data Contributor to your service principal at the container level, not the storage account level.
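That container-level role assignment is a single Azure CLI call. A sketch, assuming the storage account and resource group names from the backend example above; the service principal object ID and subscription ID are placeholders you must substitute:

```
# Assign Storage Blob Data Contributor at the *container* scope only,
# not the storage account scope. Placeholder IDs below are illustrative.
az role assignment create \
  --assignee "<sp-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-terraform-state/providers/Microsoft.Storage/storageAccounts/tfstateenterprise/blobServices/default/containers/tfstate"
```

Each team's service principal gets an assignment scoped to its own container, which is what makes per-team state isolation enforceable rather than aspirational.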

Three blob storage settings that belong in every production state backend: blob versioning (every state change creates a recoverable snapshot), soft delete with a 14-day retention window (protection against accidental deletion), and blob change feed (an audit log of every read and write to the state file, useful for security investigations). None of these are enabled by default.
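All three settings can be enabled in one Azure CLI call. A sketch, assuming the storage account names from the backend example above:

```
# Enable versioning, 14-day soft delete, and change feed on the
# state storage account (none of these are on by default).
az storage account blob-service-properties update \
  --account-name tfstateenterprise \
  --resource-group rg-terraform-state \
  --enable-versioning true \
  --enable-delete-retention true \
  --delete-retention-days 14 \
  --enable-change-feed true
```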

PRO TIP
Isolate state storage in a dedicated subscription. Place the Azure Storage Account for Terraform state in a separate shared services or platform subscription, not in the subscription it manages. If an attacker compromises an application team’s service principal, they should not also get access to the state file for the entire landing zone. Scoping Storage Blob Data Contributor at the container level per team is the right access boundary. 

When a CI runner crashes mid-apply, the blob lease persists as an orphaned lock. Add this to your pipeline wrapper:

az storage blob lease break \
  --account-name tfstateenterprise \
  --container-name tfstate \
  --blob-name teams/platform/network/terraform.tfstate


Run this only when you have confirmed no active apply is in progress. Never automate it unconditionally.
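A safer wrapper checks the lease state first and requires explicit confirmation before breaking. A sketch; the confirmation prompt and the account, container, and blob names are illustrative:

```
#!/usr/bin/env bash
set -euo pipefail
# Break an orphaned state lock only after checking the lease and
# getting a human confirmation. Names below are illustrative.
ACCOUNT="tfstateenterprise"
CONTAINER="tfstate"
BLOB="teams/platform/network/terraform.tfstate"

STATE=$(az storage blob show \
  --account-name "$ACCOUNT" --container-name "$CONTAINER" \
  --name "$BLOB" --query 'properties.lease.state' -o tsv)

if [ "$STATE" != "leased" ]; then
  echo "No lease held; nothing to break."
  exit 0
fi

read -r -p "Lease is held. Confirm no apply is running, then type 'break': " ANSWER
[ "$ANSWER" = "break" ] || exit 1

az storage blob lease break \
  --account-name "$ACCOUNT" --container-name "$CONTAINER" --blob-name "$BLOB"
```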

Understanding Drift: Three Types That Need Different Responses

Drift gets used as a catch-all term, but treating all drift the same leads to the wrong responses. There are three distinct types in practice.

Configuration drift is when someone changed a resource directly in the Azure portal, via Azure CLI, or through another automation tool. The resource exists and is in state, but its actual properties no longer match what Terraform expects. A VM SKU changed by an ops engineer. A firewall rule added via the portal during an incident. Tags modified by Azure Policy. Terraform will detect this on the next plan and offer to revert the change.

State drift is when the state file no longer accurately reflects what Terraform deployed. Someone deleted a resource group through the portal without telling Terraform. A resource was deleted via the Azure CLI. An M&A event migrated subscriptions and resources changed IDs. Terraform’s state file still has entries for resources that no longer exist, which causes plans to fail with resource-not-found errors.

Coverage drift is when resources exist in Azure that Terraform has no knowledge of: no state entry, no configuration. Resources created manually, resources provisioned by another team, resources deployed by ARM templates or Bicep before IaC adoption. Coverage drift does not cause plan failures, which makes it the most dangerous: it is invisible until you audit your environment and realize your blast radius is larger than you thought.

WATCH OUT
The state file contains sensitive information. Resource IDs, connection strings, service principal credentials, and in some cases passwords for resources created with auto-generated credentials are stored in plain text in the state file. This is a known property of Terraform state, not a misconfiguration. It means every person with read access to the state storage account has access to this data. Treat the state file with the same sensitivity as a secrets vault.

Detecting Drift: refresh-only Plans and Scheduled Runs

The most important distinction in drift detection is between terraform plan and terraform plan -refresh-only. Understanding the difference prevents a common and expensive mistake.

terraform plan does two things: it refreshes the state (reads current resource properties from Azure) and then computes what changes are needed to match your configuration. If someone changed a VM SKU in the portal, a plain terraform plan will show a change that reverts it. The engineer reviewing the plan may not realize this is drift remediation rather than a new change.

terraform plan -refresh-only only does the first part: it refreshes state and shows you what changed in Azure, without computing any configuration delta. The output tells you exactly what drifted and in which direction, without mixing it with intended changes. This is the correct tool for drift auditing.

# Audit drift without mixing with configuration changes
terraform plan -refresh-only

# If drift is intentional (e.g. ops team correctly resized a VM):
terraform apply -refresh-only
# This updates state to match reality: no infra changes, just state sync

# Check plan exit code in CI (0 = no changes, 1 = error, 2 = changes present)
terraform plan -refresh-only -detailed-exitcode
EXIT_CODE=$?
if [ $EXIT_CODE -eq 2 ]; then
  echo "Drift detected" && send_alert
fi

For continuous drift detection at scale, run scheduled plans in CI every few hours across all workspaces. A plan that exits with code 2 means drift exists and someone needs to review it. The cost of running a read-only plan against Azure is negligible; the cost of discovering drift three weeks later during an incident is not.
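A minimal scheduled sweep might iterate every workspace directory and flag drift by exit code. A sketch; the stacks/<team>/<component> layout and the send_alert hook are assumptions, not a convention from this post:

```
#!/usr/bin/env bash
# Nightly drift sweep: refresh-only plan in every workspace directory.
# The directory layout and send_alert hook are illustrative.
set -uo pipefail   # no -e: we need to inspect plan's exit code 2
for dir in stacks/*/*/; do
  (
    cd "$dir"
    terraform init -input=false >/dev/null
    terraform plan -refresh-only -detailed-exitcode -input=false >/dev/null
    case $? in
      2) echo "DRIFT: $dir" && send_alert "$dir" ;;
      1) echo "ERROR: $dir" ;;
    esac
  )
done
```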

PRO TIP
Parse plan output with terraform show -json planfile | jq to programmatically categorize drifted resources by type, subscription, or team ownership. Build a dashboard from this output rather than reading raw plan text. In environments with 200+ state files, manual review of drift alerts does not scale. 
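As a sketch of that categorization step: Terraform's machine-readable plan format lists out-of-band changes under a top-level resource_drift key, so counting drifted resources by type is a few lines of Python. The sample input below is illustrative, not real plan output:

```python
import json
from collections import Counter

def categorize_drift(plan_json: dict) -> Counter:
    """Count drifted resources by Terraform resource type.

    `terraform show -json planfile` lists out-of-band changes under the
    top-level "resource_drift" key (Terraform 1.1+); each entry carries
    the resource's address, type, and before/after values.
    """
    return Counter(
        change.get("type", "unknown")
        for change in plan_json.get("resource_drift", [])
    )

# Illustrative sample; in practice feed in real plan JSON, e.g.:
#   terraform show -json planfile | python categorize_drift.py
sample = {
    "resource_drift": [
        {"type": "azurerm_subnet"},
        {"type": "azurerm_subnet"},
        {"type": "azurerm_virtual_network"},
    ]
}
for rtype, n in categorize_drift(sample).most_common():
    print(f"{n}  {rtype}")
# → 2  azurerm_subnet
# → 1  azurerm_virtual_network
```

The same Counter can be keyed by subscription or team tag instead of type to feed the dashboard described above.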

Some attributes drift legitimately at runtime: auto-scaling instance counts, dynamic tags applied by Azure Policy, certificate expiry dates. Use lifecycle { ignore_changes = [tags, capacity] } to exclude these attributes from drift detection rather than suppressing alerts. Over-broad ignore_changes masks real drift; scope it to the specific attribute.
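For example, a scale set whose instance count is owned by the autoscaler. A sketch; the resource name is illustrative and the other required arguments are elided:

```
resource "azurerm_linux_virtual_machine_scale_set" "app" {
  # ... other required arguments elided ...

  lifecycle {
    # The autoscaler owns the instance count and Azure Policy owns tags;
    # exclude only these attributes from drift detection.
    ignore_changes = [instances, tags]
  }
}
```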

Responding to Drift: Four Decisions

Once drift is detected, every instance requires one of four responses. The table below maps the decision to the mechanism and its risk profile.

Accept drift
  When to use: the manual change was intentional and correct; update state to match reality.
  Mechanism: terraform apply -refresh-only
  Risk: low. Only updates state, no infra changes.

Remediate drift
  When to use: the manual change was incorrect; revert to the desired state defined in code.
  Mechanism: terraform apply
  Risk: medium. Destroys the manual change. Confirm the plan first.

Ignore permanently
  When to use: the resource attribute changes dynamically at runtime (tags, scaling counts).
  Mechanism: lifecycle { ignore_changes = [...] }
  Risk: low if scoped tightly; high if over-used, because it masks real drift.

Import as intentional
  When to use: the resource was created outside Terraform; bring it under management.
  Mechanism: import block (v1.5+) with -generate-config-out
  Risk: medium. Generated config needs review before apply.

The most commonly misused response is terraform apply without reviewing the plan output carefully. When configuration drift has accumulated over weeks, a plain apply can destroy manual changes that were correct and intentional. Always use -refresh-only first to see what changed, then decide whether to accept or remediate.

The import block response (bringing unmanaged resources under Terraform control) deserves special attention for enterprise environments. The terraform plan -generate-config-out=generated.tf flag, introduced in Terraform 1.5, auto-generates resource configuration from an existing Azure resource. This is the fastest path from coverage drift to full IaC coverage, but the generated configuration always needs review: it includes computed attributes that should be removed, and lifecycle blocks that may not reflect your intent.
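As a sketch, importing a manually created resource group looks like this; the resource name and Azure resource ID are illustrative:

```
# import.tf -- bring an unmanaged resource group under management
import {
  to = azurerm_resource_group.legacy
  id = "/subscriptions/<sub-id>/resourceGroups/rg-legacy-manual"
}

# Then generate a first-draft configuration and review it before applying:
#   terraform plan -generate-config-out=generated.tf
```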

PRO TIP
When using -generate-config-out on Azure resources, the generated config often includes attributes like location as a hardcoded value rather than a variable, and may include read-only attributes that will cause a plan error. After generating, run terraform plan and check for “Values for these provider attributes cannot be configured” errors; those attributes need to be removed from the generated config, not fixed.

The State Migration Toolkit: Moving Past terraform state mv

Enterprise Terraform codebases eventually need refactoring: modules get reorganized, naming conventions change, workspaces get split as teams grow. Every one of these operations requires state changes. The old approach was terraform state mv, run manually in a terminal, with no record in version control and no review process. There is a better toolkit now.

terraform state mv
  What it does: moves a resource address in the state file imperatively.
  Requires: Terraform 1.0+
  Limitation: not in git, not reviewed in PR. Cross-state only with the -state-out flag.

moved block
  What it does: declarative rename/move within the same state file.
  Requires: Terraform 1.1+
  Limitation: same state only. No cross-workspace moves. No conditional logic.

import block + -generate-config-out
  What it does: brings an unmanaged resource into state with auto-generated config.
  Requires: Terraform 1.5+
  Limitation: generated config needs manual cleanup. One-time operation per resource.

removed block
  What it does: removes a resource from management without destroying it.
  Requires: Terraform 1.7+
  Limitation: does not move the resource to another state. Use before deleting the resource block.

tfmigrate
  What it does: cross-state operations as declarative HCL, GitOps-friendly.
  Requires: external binary
  Limitation: third-party tool; adds a dependency outside the HashiCorp ecosystem.

The moved Block: Declarative Refactoring in Git

The moved block (Terraform 1.1+) is the right tool for renaming resources or moving them between modules within the same state file. It belongs in your Terraform configuration, gets reviewed in a pull request, and is applied automatically during the next plan/apply. No terminal commands, no manual state manipulation.

# Rename a resource (added to main.tf, reviewed in PR)
moved {
  from = azurerm_virtual_network.vnet
  to   = azurerm_virtual_network.hub_vnet
}
# Move resource into a module
moved {
  from = azurerm_subnet.gateway
  to   = module.hub_network.azurerm_subnet.gateway
}
# Chain moves for multi-step refactors (all versions applied in sequence)
moved {
  from = azurerm_virtual_network.network
  to   = azurerm_virtual_network.vnet
}
moved {
  from = azurerm_virtual_network.vnet
  to   = module.hub_network.azurerm_virtual_network.vnet
}

Keep moved blocks in your codebase until you are certain all state files across all environments have been updated. In a shared module, remove them at the next major version bump. Never edit an existing moved block after it has been applied; create a new one chaining from the current address.

The removed Block: Deregistering Without Destroying

Before Terraform 1.7, removing a resource block from configuration would cause terraform apply to destroy the resource. The removed block changes this: it explicitly deregisters a resource from state without destroying the underlying infrastructure. Use it when a resource is being handed to another team, migrated to another state file, or should simply stop being managed by Terraform.

# Stop managing this resource without destroying it
removed {
  from = azurerm_resource_group.legacy_rg
  lifecycle {
    destroy = false
  }
}

Cross-State Migration with tfmigrate

The moved block cannot move resources between state files. For cross-state migrations, the options are terraform state mv (manual, imperative), the import+removed block combination (declarative but two-step), or tfmigrate (third-party, GitOps-native).

tfmigrate writes state operations as declarative HCL, commits them to git, and applies them as a sequence. This means state migrations are reviewed in pull requests, can be dry-run without affecting remote state, and are replayable. For teams managing 50+ state files, this matters: state migrations without a git record are a compliance and debugging liability.

# tfmigrate_split_network.hcl (reviewed in PR, applied in CI)
migration "multi_state" "split_network_module" {
  from_dir = "platform/hub"
  to_dir   = "platform/network"
  actions = [
    "mv azurerm_virtual_network.hub module.hub_network.azurerm_virtual_network.hub",
    "mv azurerm_subnet.gateway module.hub_network.azurerm_subnet.gateway",
  ]
}

PRO TIP
Before any cross-state migration, run terraform state pull > backup_$(date +%Y%m%d_%H%M%S).tfstate in both the source and destination workspace. Blob versioning means you can recover, but having a local timestamped copy costs nothing and removes a variable from the recovery process. 

To audit what is in a state file without running a plan, use: terraform state pull | jq '.resources[].type' | sort | uniq -c | sort -rn. This gives a frequency count of resource types, useful for deciding how to split a large state file and for estimating migration effort.

Splitting State Files at Enterprise Scale

A Terraform state file grows linearly with the number of managed resources. Plans get slower, concurrent apply operations conflict, and changes to one team’s resources block another team’s pipeline. The right time to split is before these symptoms appear, not after.

The boundary for splitting should follow ownership, not resource type. Splitting by resource type (one state for networking, one for compute, one for storage) creates cross-state dependencies that are hard to manage. Splitting by team or product boundary means each team can plan and apply independently.

The pattern for splitting a state file without downtime:

  1. Backup both source and destination state files before starting.
  2. In the source configuration, add removed blocks for every resource moving to the new state. Set destroy = false.
  3. In the destination configuration, add import blocks for each resource, referencing the Azure resource ID.
  4. Run terraform plan on both configurations. Verify the source shows only removals and the destination shows only imports, with no create/destroy actions.
  5. Apply source first (resources removed from management), then destination (resources imported). Both operations are non-destructive to the actual infrastructure.
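Steps 2 and 3 above look like this in code. A sketch; the resource name and Azure resource ID are illustrative:

```
# Source configuration: deregister without destroying (step 2)
removed {
  from = azurerm_subnet.gateway
  lifecycle {
    destroy = false
  }
}

# Destination configuration: import by Azure resource ID (step 3)
import {
  to = azurerm_subnet.gateway
  id = "/subscriptions/<sub-id>/resourceGroups/rg-hub/providers/Microsoft.Network/virtualNetworks/vnet-hub/subnets/snet-gateway"
}
```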

For cross-state data sharing after the split, terraform_remote_state data source can read outputs from another state file. Use this sparingly: hard dependencies between state files create coupling that defeats the purpose of splitting. Prefer sharing data through Azure resource IDs and naming conventions rather than state outputs where possible.
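When an output genuinely must be shared, the consumer side looks like this. A sketch, assuming the backend names from earlier in this post and a hypothetical app_subnet_id output on the network state:

```
data "terraform_remote_state" "network" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateenterprise"
    container_name       = "tfstate"
    key                  = "teams/platform/network/terraform.tfstate"
  }
}

# Reference an output (assumes the network state exposes app_subnet_id)
locals {
  app_subnet_id = data.terraform_remote_state.network.outputs.app_subnet_id
}
```

Note that the reader needs read access to the entire network state file, sensitive values included, which is another reason to prefer resource IDs and naming conventions.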

PRO TIP
When a state file has grown to 50+ resources and plan time exceeds three minutes, it is time to split. Running terraform state pull | jq '.resources | length' gives you the resource count. A plan time over five minutes in CI is a team productivity issue that compounds daily; split early.

Problems We Have Actually Run Into

The portal engineer problem. On every enterprise Azure engagement, there is at least one person with Owner access to subscriptions who makes regular changes through the portal. Network security rules opened during incidents and never closed. VM sizes changed without a change request. Tags updated by hand. The right response is not a process argument; it is terraform plan -refresh-only on a schedule, with results posted to a Slack channel. Visibility changes behavior faster than policy does.

Generated config from -generate-config-out is not production-ready. The flag generates a Terraform configuration file that will import the resource without destroying it, but the generated HCL consistently includes attributes that are read-only or computed. Applying it without review causes plan errors at best and unexpected modifications at worst. Treat generated config as a first draft that needs cleanup: remove id attributes, check that any sensitive attributes are moved to variables, and verify lifecycle blocks reflect actual intent.

Force-unlock is irreversible if you use the wrong lock ID. When terraform force-unlock is used, it releases the blob lease by lock ID. If you unlock the wrong workspace (common in monorepos with similar state file paths), you may unlock a state that has an active apply in progress in CI. The result is concurrent writes to the same state file, which can produce corruption. Always confirm the lock ID, the workspace, and who holds the lock before forcing an unlock. The Azure portal shows the blob’s current lease status; check it first.

Cross-state moves break the moved block. Teams accustomed to the moved block for same-state refactors sometimes try to use it for cross-state operations and hit a confusing error: “The moved block refers to a resource not in this configuration.” The moved block only works within a single state file. For cross-state moves, the correct pattern is removed (source) + import (destination), or tfmigrate if the move is complex enough to warrant a tool.

State file size causes plan timeouts in large subscriptions. A state file managing 300+ resources in an Azure subscription with complex networking can hit plan timeouts not because of Terraform but because the AzureRM provider makes hundreds of API calls during the refresh phase. Azure API rate limiting kicks in during these large refreshes. The fix is to split the state file before hitting this limit. The warning sign: plan times growing week over week without new resources being added.

Frequently Asked Questions

What is the difference between terraform plan -refresh-only and terraform plan?

terraform plan -refresh-only reads current resource state from Azure and shows what has changed since the last apply, without computing what Terraform needs to do to match your configuration. terraform plan does both: it refreshes state and computes the configuration delta. Use refresh-only for drift auditing; use the regular plan for deployment decisions.

When should I use the moved block vs terraform state mv?

Use the moved block for all same-state refactors in Terraform 1.1+. It is declarative, reviewable in PRs, and lives in git. Use terraform state mv only when you need to move resources between different state files (cross-state), which the moved block cannot do. terraform state mv should be treated as a last resort, not a routine tool.

How do I bring manually-created Azure resources under Terraform management without destroying them?

Use the import block (Terraform 1.5+) with terraform plan -generate-config-out=generated.tf. This generates a resource configuration that matches the existing Azure resource. Review and clean the generated HCL, then run terraform apply. The resource is imported into state with no infrastructure changes. The Azure resource ID required for import is available from az resource show or the Azure portal Properties blade.

How do I safely recover from an orphaned state lock on Azure Blob Storage?

First, check the blob lease status in the Azure portal or with az storage blob show --query 'properties.lease'. Confirm no active Terraform process holds the lock. Then either use terraform force-unlock <LOCK_ID> (the lock ID is shown in the lock error message) or break the blob lease directly with az storage blob lease break. Never force-unlock without confirming the locking process is dead.

What is the removed block and when should I use it?

The removed block (Terraform 1.7+) removes a resource from Terraform state without destroying the underlying infrastructure when destroy = false is set. Use it before deleting a resource block from configuration when you want to stop managing a resource but leave it running. Without a removed block, deleting a resource block causes Terraform to destroy the resource on the next apply.

How should state files be organized for a large Azure landing zone?

Split by team ownership and blast radius, not by resource type. Each team should have one state file per environment (dev/stage/prod) for the resources they own. Platform or networking state files should be owned by a platform team and consumed by other teams via remote state outputs or through shared naming conventions. Avoid a single state file for an entire landing zone; it creates a bottleneck and a single point of corruption for all teams.

Should I keep moved blocks in my codebase permanently?

Keep them until all state files across all environments have been updated, and until you are confident no one is running a Terraform version older than when the blocks were introduced. For shared modules, keep them until the next major version bump. For application code, keep them for at least one release cycle. Removing them too early causes terraform plan to show a destroy/create pair for resources whose state entry was never updated.

Key Takeaways

State management is infrastructure. Treat it as such from day one: isolated storage account, Entra-based auth, blob versioning, soft delete enabled. These are the settings that make recovery possible when something goes wrong. Without them, a corrupted state file is a crisis.

Drift is inevitable. The question is how quickly you detect it and what you do about it. Scheduled terraform plan -refresh-only runs with exit-code monitoring are a cheap detection layer that most teams skip until their first major drift incident. Build it into your CI before you need it.

The state manipulation toolkit has improved significantly: the moved block (v1.1) for same-state refactors, the import block with -generate-config-out (v1.5) for bringing existing resources under management, the removed block (v1.7) for clean deregistration. terraform state mv is now a tool of last resort, not a routine command. If your team still reaches for it by default, update the playbook.

State file boundaries define team autonomy. A state file that a team cannot plan against without blocking another team is a bottleneck. Split early, along ownership lines, before plan times become a daily complaint.


© 2026 Professnet. All rights reserved.