Introduction

There was a time when stopping and starting your EC2 instances meant a Lambda function and a CloudWatch Event that looked at tags. It worked well enough, but it was a hassle to set up, and I would hardly call it Terraform friendly. You would think a hyper-scale cloud provider that more or less set the bar for paying for infrastructure by the hour would have made this a point-and-click exercise years ago, but here we are.

Thankfully, we now have something that is as close to point and click as you're going to get in an API-driven, self-service environment; I guess they didn't want to make it too easy. So if you have a few dev servers you shouldn't be paying for through the night, or a workload that only gets used during office hours, this is for you. A combination of AWS EventBridge Scheduler, an IAM policy and role, and a few tags on your EC2 instances, and you're good to go. So let's get into it.

I'm not going to go over the prerequisites, but if you need a guide on getting set up, I've covered a lot of this in my series An Introduction to Advanced Terraform, particularly the first couple of pages. For the bean counters out there, the general rule of thumb is that while your stuff is shut down you're not paying for compute, but you'll still pay for storage.
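
To put a rough number on that (my figures, and pricing varies by region, so treat this as ballpark): a stopped t4g.micro bills nothing for compute, but the 20 GB gp3 root volume we define below still costs around 20 × $0.08 ≈ $1.60 per month, and an Elastic IP held by a stopped instance keeps billing too.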

Variables, Locals and EC2 Instances

Let's use the format from a previous article, Terraform data structures, Objects, Maps, Lists, Sets and Tuples. We'll start with a variable for the environment, a locals block for the EC2 instances and a resource block for the instances, but this time we'll add shutdown and start values to the locals and matching tags to the resource block. Note that I'm defaulting to UTC; you can use the schedule_expression_timezone attribute to change this if you wish.

variable "environment" {
  type        = string
  default     = "dev"
  description = "The environment to deploy the resources"
}

locals {
  ec2 = {
    dev = {
      arryw-web = {
        region           = "eu-west-1"
        vpc              = "arryw-dev-dublin"
        count            = 1
        az               = ["eu-west-1a", "eu-west-1b"]
        key_pair         = "arryw"
        root_volume_size = 20
        shutdown_time    = "22"
        start_time       = "07"
      }
    }
  }

  global_tags = {
    Environment = var.environment
    Department  = "devops"
    Owner       = "arryw"
  }
}

resource "aws_instance" "ec2_dub" {
  for_each = {
    for k, v in local.ec2_map : k => v
    if v.region == "eu-west-1"
  }
  provider             = aws.dublin
  ami                  = try(each.value.ami, data.aws_ami.ubuntu_dub.id)
  instance_type        = try(each.value.instance_type, "t4g.micro")
  subnet_id            = try(each.value.subnet, aws_subnet.public_subnet_dub["${each.value.vpc}-${each.value.az}"].id)
  iam_instance_profile = aws_iam_instance_profile.ec2_profile[each.value.ec2_key].name

  user_data                   = ""
  user_data_replace_on_change = try(each.value["user_data_replace_on_change"], false)
  key_name                    = each.value["key_pair"]

  root_block_device {
    encrypted             = try(each.value["root_volume_encrypted"], true)
    volume_type           = try(each.value["root_volume_type"], "gp3")
    volume_size           = try(each.value["root_volume_size"], 30)
    delete_on_termination = try(each.value["root_delete_on_termination"], true)
  }

  tags = merge(local.global_tags, {
    Name         = each.key
    ShutdownTime = each.value.shutdown_time
    StartTime    = each.value.start_time
  })
}

This is a good base to start from: the locals definitions are nice and readable, we're setting out some global tags, and the ec2 block is marked as regional from the off, allowing for expansion to other regions in the future. The ShutdownTime and StartTime tags are attached to the instance along with the global tags, and all is well with the world.

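One thing you may have spotted: the resource block iterates over local.ec2_map rather than local.ec2. That flattened, per-instance map comes from the data structures article mentioned above; if you're not following along from there, here is a sketch of what it could look like. The exact shape is my assumption, based on the fields the resource block expects: a per-instance key, an ec2_key back-reference for the instance profile lookup, and a single az string.

locals {
  ec2_map = merge([
    for ec2_key, ec2 in local.ec2[var.environment] : {
      for i in range(ec2.count) :
      "${ec2_key}-${i + 1}" => merge(ec2, {
        ec2_key = ec2_key                    # back-reference to the original map key
        az      = ec2.az[i % length(ec2.az)] # spread instances across the AZ list
      })
    }
  ]...)
}
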
If you need to look at setting up providers, this is covered in Terraform setup, Providers and Multi-Region Deployments.

IAM Policy and Role

EventBridge Scheduler needs permissions to stop and start instances. This is not AI (yet), so we don't expect it to go rogue any time soon, but it does need the rights to do the job you're setting it up to do. We'll need a policy, a role and an attachment.

resource "aws_iam_policy" "eventbridge_ec2_policy" {
  name        = "eventbridge_ec2_policy"
  description = "Policy to allow EventBridge to stop and start EC2 instances"
  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = [
          "ec2:StartInstances",
          "ec2:StopInstances",
          "ec2:RebootInstances"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_role" "eventbridge_ec2_role" {
  name               = "eventbridge_ec2_role"
  assume_role_policy = jsonencode({
    Version   = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = {
          Service = "events.amazonaws.com"
        }
        Action    = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "eventbridge_ec2_attachment" {
  role       = aws_iam_role.eventbridge_ec2_role.name
  policy_arn = aws_iam_policy.eventbridge_ec2_policy.arn
}

The "ec2:RebootInstances" action is probably not required, but there is a concept of retry_attempts in EventBridge, so why not. The role allows EventBridge to assume it, and we attach the two together. Simple.

EventBridge Schedules

Like the EC2 instance block, we'll use a for_each so that we're not repeating ourselves. I've also added an explicit depends_on to each schedule; strictly speaking, referencing the instance ID in the target already implies the dependency, but it does no harm to be explicit. Note, as with most things, there are plenty more arguments than I'm passing in here: you can pass JSON to the target to invoke APIs, and you can pass in SageMaker parameters and SQS config, but for now I'll keep it to what I think is probably common usage.

resource "aws_scheduler_schedule" "ec2_start" {
  for_each = {
    for k, v in local.ec2_map : k => v
    if v.region == "eu-west-1"
  }
  provider = aws.dublin

  name                = "${each.key}-start"
  description         = "Start ${each.key} at ${each.value.start_time}"
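  # cron fields: minutes hours day-of-month month day-of-week year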
  schedule_expression = "cron(0 ${each.value.start_time} ? * MON-FRI *)"
  flexible_time_window {
    mode = "OFF"
  }
  target {
    arn = "arn:aws:scheduler:::aws-sdk:ec2:startInstances"
    role_arn = aws_iam_role.eventbridge_ec2_role.arn

    input = jsonencode({
      InstanceIds = [
        aws_instance.ec2_dub[each.key].id
      ]
    })
    retry_policy {
      maximum_event_age_in_seconds = 60
      maximum_retry_attempts       = 2
    }
  }
  depends_on = [aws_instance.ec2_dub]
}

resource "aws_scheduler_schedule" "ec2_stop" {
  for_each = {
    for k, v in local.ec2_map : k => v
    if v.region == "eu-west-1"
  }
  provider = aws.dublin

  name                = "${each.key}-stop"
  description         = "Stop ${each.key} at ${each.value.shutdown_time}"
  schedule_expression = "cron(0 ${each.value.shutdown_time} ? * MON-FRI *)"
  flexible_time_window {
    mode = "OFF"
  }
  target {
    arn = "arn:aws:scheduler:::aws-sdk:ec2:stopInstances"
    role_arn = aws_iam_role.eventbridge_ec2_role.arn

    input = jsonencode({
      InstanceIds = [
        aws_instance.ec2_dub[each.key].id
      ]
    })
    retry_policy {
      maximum_event_age_in_seconds = 60
      maximum_retry_attempts       = 2
    }
  }
  depends_on = [aws_instance.ec2_dub]
}

This is a bit messy, but hopefully you can follow the logic: there are start and stop schedules, and the flexible_time_window block is required even when it's set to OFF. The target is the EC2 action, and we're passing in the role it needs to complete it. The input is the instance ID, and we're setting a retry policy just in case AWS is oversubscribed at the time... it does happen.
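
Two possible refinements before we wrap up, both sketches of my own rather than part of the setup above: if several instances share the same shutdown_time, you could batch them into a single schedule per time slot instead of one per instance, and the schedule_expression_timezone attribute I mentioned earlier lets you schedule in local time rather than UTC. The start variant would be symmetric.

resource "aws_scheduler_schedule" "ec2_stop_grouped" {
  # group instance IDs by shutdown hour, e.g. { "22" = [id1, id2, ...] }
  for_each = {
    for k, v in local.ec2_map :
    v.shutdown_time => aws_instance.ec2_dub[k].id... if v.region == "eu-west-1"
  }
  provider = aws.dublin

  name                         = "ec2-stop-${each.key}00"
  description                  = "Stop all instances scheduled for ${each.key}:00"
  schedule_expression          = "cron(0 ${each.key} ? * MON-FRI *)"
  schedule_expression_timezone = "Europe/Dublin" # local time instead of UTC
  flexible_time_window {
    mode = "OFF"
  }
  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ec2:stopInstances"
    role_arn = aws_iam_role.eventbridge_ec2_role.arn
    input    = jsonencode({ InstanceIds = each.value })
  }
}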

Code

The full code, combined with the code from my Advanced Terraform Wrap-Up article, is available on GitHub.

Conclusion

There we have it. By shutting your servers down for around nine hours a night you're saving nearly 40% of your compute costs, and because the cron expressions only fire Monday to Friday, the instances stay off all weekend too, pushing the saving past 50%. It's set-and-forget, so it seems like a no-brainer. You can expand on this by adding more regions, more instances and more schedules; you could scale down your Kubernetes cluster or stop your RDS instances. There are a lot of possibilities. I hope this has been useful. If you have any feedback, want me to cover something else or just want to say hi, please feel free to reach out to me on Email, X or Threads.