OVO Tech Blog

#122: Exploring Terraform through a Github issue

Introduction

Eleanor Nicholson


terraform gcp IaC

Posted by Eleanor Nicholson on .

Recently I came across some odd behaviour in the Google provider for Terraform: updating the contents of an object in Google Cloud Storage removed any access controls set on the object. Looking into the issue in more detail was an interesting way to explore the internals of Terraform. I'd like to share what I learnt on the journey with you.

Description of the issue

I had an object in a GCP Storage bucket (the config for a cloud key rotator, if you’re wondering) and I was giving a service account read access to it. I’d written the infrastructure in Terraform, something like this:

resource "google_storage_bucket_object" "config" {
  name    = "ckr-config.json"
  bucket  = "my-bucket"
  content = <<EOF
{
  "EnableKeyAgeLogging": true,
  "RotationMode": true,
  "CloudProviders": [{
    "Project": "${var.project_name}",
    "Name": "gcp"
  }]
}
EOF
}

resource "google_storage_object_access_control" "config_access" {
  object = google_storage_bucket_object.config.output_name
  bucket = "my-bucket"
  role   = "READER"
  entity = "user-key-rotator@my-project.iam.gserviceaccount.com"
}

I deployed the change, the Terraform applied in CD, and my key rotator service account was able to access its config. Success!

However, shortly afterwards I wanted to change the config for my key rotator. I edited the contents of the config object, redeployed the Terraform and ran my key rotator. It failed, unable to access its config object.

I went to check out what happened in the Terraform apply. The output was:

Terraform will perform the following actions:

# google_storage_bucket_object.config must be replaced
-/+ resource "google_storage_bucket_object" "config" {
    bucket         = "my-bucket"
  ~ content        = (sensitive value)
  ~ content_type   = "text/plain; charset=utf-8" -> (known after apply)
  ~ crc32c         = "3Vj39g==" -> (known after apply)
  ~ detect_md5hash = "w2BLW7aAg==" -> "different hash" # forces replacement
  ~ id             = "******************************************" -> (known after apply)
  ~ md5hash        = "w2BLW7aAg==" -> (known after apply)
  ~ media_link     = "https://storage.googleapis.com/download/storage/v1/b/my-bucket/o/ckr-config.json?generation=123456789&alt=media" -> (known after apply)
  - metadata       = {} -> null
    name           = "ckr-config.json"
  ~ output_name    = "ckr-config.json" -> (known after apply)
  ~ self_link      = "https://www.googleapis.com/storage/v1/b/my-bucket/o/ckr-config.json" -> (known after apply)
  ~ storage_class  = "STANDARD" -> (known after apply)
}

# google_storage_object_access_control.config_access will be updated in-place
  ~ resource "google_storage_object_access_control" "config_access" {
    bucket       = "my-bucket"
    email        = "key-rotator@my-project.iam.gserviceaccount.com"
    entity       = "user-key-rotator@my-project.iam.gserviceaccount.com"
    generation   = 123456789
    id           = "my-bucket/ckr-config.json/user-key-rotator@my-project.iam.gserviceaccount.com"
  ~ object       = "ckr-config.json" -> (known after apply)
    project_team = []
    role         = "READER"
}

Plan: 1 to add, 1 to change, 1 to destroy.

google_storage_bucket_object.config: Destroying... [id=ckr-config.json]
google_storage_bucket_object.config: Destruction complete after 0s
google_storage_bucket_object.config: Creating...
google_storage_bucket_object.config: Creation complete after 0s [id=ckr-config.json]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

Did you spot what's wrong? I certainly didn't. It took me another couple of head scratches before I looked more closely at the apply summary. The plan says there is one resource to add, one to change and one to destroy, but the apply summary reports zero changed: the access control resource the plan promised to update was never touched.

I raised an issue with the Google provider and gave up on my dream of object-level permission granularity for the key rotator bucket. However, when the lovely Google people responded to my issue, they explained that they couldn't fix it, as it was a problem with Terraform itself. They linked me to this issue raised in the Terraform repo. It turns out this behaviour manifests with a number of different resources and their associated ACLs (Access Control Lists). At this point I got curious. Terraform is something I use comfortably, but not something whose internals I feel confident about. That was reinforced when I read through the description of the underlying issue and didn't understand it. So I started digging deeper. (Incidentally, if you do understand that GitHub issue, then you're not going to learn anything new from this blog post. If you don't, read on!)

Terraform dependencies

So why doesn’t the ACL get recreated? In order to answer this question we need to learn more about how Terraform builds dependencies. At a high level Terraform runs an apply by:

  1. Scanning the Terraform configuration to build a dependency tree of resources
  2. Refreshing the Terraform state by reviewing the actual state of live resources
  3. Comparing the dependency tree with the state
  4. Working out the changes needed to bring the infrastructure (and with it the Terraform state) in line with the configuration
  5. Working through changes identified in dependency order, using the provider to make them
  6. Updating the state with the changes

How does Terraform build the dependency tree? In the Terraform configuration, some resources use fields from other resources to define themselves. For example, my google_storage_object_access_control resource uses the output_name of google_storage_bucket_object.config to define which object it is providing access to. This use of fields from other resources is called interpolation, and it is how Terraform builds up the dependency tree. Terraform can see that the name of the storage object is required for the access control resource to be defined, and so the storage object must be created first.
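To make the implicit dependency concrete, here is a minimal sketch (resource names as in my configuration above). Referencing an attribute of another resource is enough on its own; the commented-out depends_on shows the explicit equivalent, which is redundant here:

```hcl
resource "google_storage_object_access_control" "config_access" {
  # Interpolating config.output_name creates an implicit dependency:
  # Terraform knows the object must be created before this ACL.
  object = google_storage_bucket_object.config.output_name
  bucket = "my-bucket"
  role   = "READER"
  entity = "user-key-rotator@my-project.iam.gserviceaccount.com"

  # The explicit equivalent, redundant given the interpolation above:
  # depends_on = [google_storage_bucket_object.config]
}
```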

During the plan phase Terraform works out this dependency tree theoretically, as it does not yet know what actual value the name will take. During the apply phase, the storage object is recreated and its new name becomes known. This can then be used to concretely define the access control resource and any subsequent actions that need to be taken for that resource.

Terraform providers

The interplay between Terraform and the Google provider is also interesting. The provider doesn't know anything about the bigger picture of the infrastructure dependencies; that's all managed by Terraform. The provider defines the available resources and their fields, and provides an interface to the API for performing CRUD operations on those resources.

So what is the provider defining about the google_storage_bucket_object resource and the google_storage_object_access_control resource? We can have a look in the provider code to find out more.

Schema: map[string]*schema.Schema{
    "bucket": {
        Type:        schema.TypeString,
        Required:    true,
        ForceNew:    true,
        Description: `The name of the containing bucket.`,
    },
    "name": {
        Type:        schema.TypeString,
        Required:    true,
        ForceNew:    true,
        Description: `The name of the object. If you're interpolating the name of this object, see output_name instead.`,
    },
    // ...
}


This is a snippet of the schema for the google_storage_bucket_object resource (link). Each field on the object has some schema behaviours associated with it. The ForceNew behaviour is particularly relevant to the issue we're looking at: it lets Terraform know that the resource will need to be destroyed and recreated if that field changes. The reason my google_storage_bucket_object was marked as needing recreating was the detect_md5hash field, which computes an MD5 hash of the content of the object. This field has ForceNew set to true, so if the MD5 hash of the storage object in GCP differs from the MD5 hash of the content in the Terraform configuration, the storage object will be destroyed and recreated, exactly as happened in the Terraform I ran. If you're curious about what other schema behaviours are available, the documentation for them is here.
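To see ForceNew from the configuration side: since name is also marked ForceNew in the schema above, simply renaming the object would trigger the same destroy-and-recreate. A hypothetical sketch (content elided):

```hcl
resource "google_storage_bucket_object" "config" {
  # name is ForceNew, so changing "ckr-config.json" to "ckr-config-v2.json"
  # makes the plan show "must be replaced" rather than an in-place update.
  name    = "ckr-config-v2.json"
  bucket  = "my-bucket"
  content = "{}" # placeholder; the real config content goes here
}
```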

Explanation of the issue

So why is the config object's ACL removed during the Terraform apply? As I mentioned earlier, the google_storage_bucket_object.output_name value is used to fill the google_storage_object_access_control.object value. The object field isn't ForceNew, so we expect the access control resource to be updated in place, as the plan indicated. However, when the object is recreated, its output_name is the same as it was before the plan was applied. Because of this, there is no difference between the new value and the value in the Terraform state, so Terraform takes no action to change the resource.

This would be fine if access control resources were entirely independent entities in GCP. However, access control is a property of the storage object; it doesn't exist in its own right. When an object is destroyed, its access control properties are destroyed with it. So when Terraform recreates the object, the object's ACL is destroyed as a side effect, and Terraform never notices.

If you were to run terraform apply a second time, it would recognise that the storage object's access control had gone when it refreshed the state, and it would reapply the access control resource. The issue is seen on such a wide range of resources because it affects any resource which can be destroyed as a side effect of the removal of another resource, and which interpolates on a field whose value does not change when that other resource is replaced.
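One possible workaround, sketched here on the assumption that a genuinely changing output_name would give Terraform a real diff to act on, is to embed a hash of the content in the object name (var.ckr_config is a hypothetical variable holding the config JSON):

```hcl
resource "google_storage_bucket_object" "config" {
  # Including a content hash in the name means output_name changes whenever
  # the content changes, so the dependent ACL resource sees a real diff
  # and gets recreated against the new object.
  name    = "ckr-config-${md5(var.ckr_config)}.json"
  bucket  = "my-bucket"
  content = var.ckr_config
}
```

The trade-off is that the object's name is no longer stable. In more recent Terraform versions, a lifecycle replace_triggered_by block on the ACL resource may offer a more direct fix, if it is available to you.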

In summary

A google_storage_object_access_control resource is actually a property of the google_storage_bucket_object itself within GCP, whereas in Terraform they are modelled as two independent entities. So the google_storage_object_access_control was automatically deleted behind the scenes when the google_storage_bucket_object was destroyed and recreated, and when the key rotator came to use the ACL it no longer existed. Running Terraform a second time fixes this by triggering the creation of the ACL against the new config object.
