前提

Terraform によってインフラをチームで管理しようとすると、backend をリモートで管理することになってきます。 AWS を使用する場合は、backend として S3 を使用するのが定石のようでした。

その前に state とは何か

state は、Terraform の設定と、現実のインフラとの対応付けをするデータベースです。

現実のインフラの多く(特にクラウドプロバイダが提供してくれるインフラ)は API 経由で情報が取れるのだから、データベース不要ではないでしょうか。事実、Terraform のプロトタイプではそのような実装もされていたようです。しかし、結局は state というデータベースが採用されました。

その理由として以下のものが挙げられています (State | Terraform by HashiCorp参照)

リソース間の依存関係の追随
1. 例えばリソースを削除するときを考えます。Terraform の設定ファイル上で resource を削除してしまうと、state 無しにはその削除リソースに依存していたリソースがわからなくなります
パフォーマンス
1. クラウドプロバイダによっては、一度に現在のインフラリソースを取得する API がありません。これらの情報を知るために逐一 API を呼ぶのでは、パフォーマンスが劣化します。
チームでインフラ管理をする上では、各個人が同じ状態を見ることができなければなりません。

state の S3 管理

チームメンバーそれぞれで state file にアクセスするためには、以下の条件を満たさなければなりません。

全員が参照できる場所 (remote) に配置されること
state file に対する同期制御 (locking) が可能なこと
秘匿すべき情報 (DB のパスワード等) が漏れないこと
最新化忘れのようなミスが起こらないこと

これを満たす存在として、GCP なら Google Cloud Strage、Azure なら Azure Storage、AWS なら S3 があります。もちろん、Terraform Cloud などもありますが。ぼくは当面 AWS をターゲットにするので、S3 を選択するのが最適でしょう。

If you’re using Terraform with AWS, Amazon S3 (Simple Storage Service), which is Amazon’s managed file store, is typically your best bet as a remote backend

How to manage Terraform state. A guide to file layout, isolation, and… | by Yevgeniy Brikman | Gruntwork

backend となる S3 を Terraform 管理のインフラと一緒に管理して良いのか

一方で、backend となる S3 の bucket まで Terraform で管理して良いのかという疑問が湧いてきます。調べたところ、公式ドキュメントにおいては「Terraform が管理するインフラの外側で管理しろ」という答えが明記されていました。

Terraform is an administrative tool that manages your infrastructure, and so ideally the infrastructure that is used by Terraform should exist outside of the infrastructure that Terraform manages. This can be achieved by creating a separate administrative AWS account which contains the user accounts used by human operators and any infrastructure and tools used to manage the other accounts.

Backend Overview - Configuration Language | Terraform by HashiCorp

その方法として、管理用 AWS アカウントを別に作ってそこで管理すれば良いよ、という記載もあります。これは AWS Organizations を使用すればおそらくは達成できるのですが、まだ Organizations を勉強中ということもあり、まずは S3 bucket を自力で構成してみます。

backend の構築

backend を構築する main.tf を以下のように定義してみました。上記で述べた、backend にあった方が良いとされる「locking」については S3 だけでなく DynamoDB にて提供されるため、S3 と Dynamo DB が構築対象になります。

terraform {
  required_version = "~> 0.12.24"
  required_providers {
    aws = "~> 2.60.0"
  }
}

provider "aws" {
  region = "ap-northeast-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket_prefix = "terraform-state"
  acl           = "private"

  versioning {
    enabled = true
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "terraform_state_lock" {
  name           = "terraform_state_lock"
  read_capacity  = 1
  write_capacity = 1
  hash_key       = "lock_id"

  attribute {
    name = "lock_id"
    type = "S"
  }
}

S3 については ACL でアクセスを制限することにします。バージョニングを有効化することで、もし state file がおかしくなったりしたときも戻せるようにしておきます。また、秘匿情報が含まれるということもあり、サーバーサイド暗号化は有効化しておきました ¹。

ライフサイクルルールを設定し、古いバージョンの state file は Glacier にでも移した方が良いのかなと考えました。しかし、state file は大容量にはならない (むしろ、移行させたほうが高くつく…?) と思われるので、設定しないようにしています。また、lifecycle Meta-argument で暗黙的な削除はできないようにしました。

backend の適用 (locking なし)

それでは、この backend を実際に使ってみます。単純に EC2 を作ってみる main.tf を apply します。

ポイントは、backend.s3 の設定に dynamodb_table 設定がないこと。つまり、locking を有効にしていないことを意味しています。

terraform {
  required_version = "~> 0.12.24"
  required_providers {
    aws = "~> 2.60.0"
  }

  backend "s3" {
    bucket = "kiririmode-terraform-state"
    key    = "test/ec2"
    region = "ap-northeast-1"
  }
}

resource "aws_instance" "example" {
  ami           = "ami-0f310fced6141e627"
  instance_type = "t2.micro"

  user_data = <<EOF
    #!/bin/bash
    yum install -y httpd
    systemctl start httpd.service
  EOF

  tags = {
    Name = "Example"
  }
}

$ terraform apply --auto-approve
$ terraform destroy --auto-approve

S3 上の state file はきちんとバージョン管理されました。apply 時、destroy 時それぞれの state file が記録されています。

一方で 2 つのターミナルからそれぞれ main.tf をほぼ同時に apply すると、それぞれが正常終了してしまい、2 つの EC2 インスタンスが作られてしまいました。これは問題です。

backend の適用 (locking あり)

locking を有効にするため、dynamodb_table を設定します。以下が全体の backend 設定です。

  backend "s3" {
    bucket         = "kiririmode-terraform-state"
    key            = "test/ec2"
    region         = "ap-northeast-1"
    dynamodb_table = "terraform_state_lock"
  }

このように locking を有効化した後、先程と同様に 2 つのターミナルから terraform apply を実行してみました。すると、なんということでしょう。きちんと一方はエラーを返却してくれました。

$ terraform apply

Error: Error locking state: Error acquiring the state lock: ConditionalCheckFailedException: The conditional request failed
        status code: 400, request id: L6GE8NQBVE9JBIHPSOEG2G396BVV4KQNSO5AEMVJF66Q9ASUAAJG
Lock Info:
  ID:        f473a293-da17-8262-6e9b-7aa7cc93ff7f
  Path:      kiririmode-terraform-state/test/ec2
  Operation: OperationTypeApply
  Who:       kiririmode@kiririmodenoMacBook-Air.local
  Version:   0.12.24
  Created:   2020-05-04 04:23:52.157867 +0000 UTC
  Info:


Terraform acquires a state lock to protect the state from being written
by multiple users at the same time. Please resolve the issue above and try
again. For most commands, you can disable locking with the "-lock=false"
flag, but this is not recommended.

このようにして、排他制御が可能な backend を構築できました。