Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: How to avoid drift in AMI IDs #4460

Open
dgokcin opened this issue Mar 7, 2025 · 8 comments
Open

Question: How to avoid drift in AMI IDs #4460

dgokcin opened this issue Mar 7, 2025 · 8 comments

Comments

@dgokcin
Copy link

dgokcin commented Mar 7, 2025

Hello and thank you for this great project. I have the following multi-runner configuration where I use an AMI by runs-on since they are really close to the GitHub Actions runners and are updated regularly. One thing I have noticed is once the terraform module is applied, if the ami-id is deleted by the ami provider, the builds get stuck. And I need to re-apply the module to update the id in the launch template.

Is there a way to make this more dynamic or easier to maintain?

Thanks!

        #... multi_runner_config  
        enable_runner_binaries_syncer = true
        ami_owners                    = ["135269210855"] # runs-on ami account id

        userdata_template             = "./templates/user-data-runs-on.sh" #userdata for runs-on
        ami_filter = {
          name  = ["runs-on-v2.2-ubuntu24-full-x64-*"],
          state = ["available"]
        }
@npalm
Copy link
Member

npalm commented Mar 16, 2025

The only option the module currently supportss is to refer the AMI ref via an SSM paramater. In that case the one in the launch template is ingored. Terraform apply will still drift, but the AMI can be managed without runnig terraform.

@dgokcin
Copy link
Author

dgokcin commented Mar 17, 2025

I have written a lambda function that triggers daily and updates the ami ref in the ssm parameter. As you said, there is still drift but at least I don't have to remember updating AMI IDs every now and then.

@npalm do you think that this is a good approach and could be added to the base module? I can work on a PR if you think this is a good addition to the module capabilities.

@npalm
Copy link
Member

npalm commented Mar 19, 2025

I think we should make, also have a look at #4464

  • May we can move the AMI id completly to SSM.
  • Otherwise would be great if we find a way to avoid the drift if SSM is used in the Lambda.

Can you eloborate a bit more what the Lambda function does. Adding it to the base module could be fine. But let's also see if we can work on simply it all.

@claytonolley any thoughts?

@claytonolley
Copy link

Some background on my use case:

  • I build custom AMI's weekly using packer and they automatically update some SSM parameters marking them as "dev" runners
  • I have GitHub actions running periodic tests on these "dev" runners as well as some dev only workflows in other repos
  • If at the end of the week (prior to the next scheduled AMI build), all these tests are successful, then I have another action that will promote the "dev" SSM parameter value to the "prd" parameter which is what the bulk of my repos will use for their runners.
  • I don't use the ami_id at all, only ssm, but maybe we could just have some kind of toggle to not supply any ami_id to the launch template to prevent unnecessary noise in TF plans.

@dgokcin
Copy link
Author

dgokcin commented Mar 20, 2025

@npalm I reviewed #4464 and although I think the usage with image_id=resolve:ssm:parameter-name in the launch template is a more elegant way of using the existing ami_id than a data block lookup, there still needs to be an automation to update the ami id.

Attaching my workaround function and terraform resources.

AMI Updater Function
import boto3
import logging
import os
from botocore.exceptions import ClientError

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Get dry run mode from environment variable
DRY_RUN = os.environ.get("DRY_RUN", "true").lower() == "true"

# SSM parameter name
SSM_PARAMETER_NAME = "/github-action-runners/latest_ami_id"


def get_latest_ami(ec2_client):
    """Get the latest AMI ID based on the filter criteria."""
    try:
        response = ec2_client.describe_images(
            Owners=["135269210855"],
            Filters=[
                {"Name": "name", "Values": ["runs-on-v2.2-ubuntu24-full-x64-*"]},
                {"Name": "state", "Values": ["available"]},
            ],
        )

        if not response["Images"]:
            raise Exception("No matching AMIs found")

        # Sort by creation date to get the latest
        sorted_images = sorted(
            response["Images"], key=lambda x: x["CreationDate"], reverse=True
        )

        return sorted_images[0]["ImageId"]

    except ClientError as e:
        logger.error(f"Error getting latest AMI: {str(e)}")
        raise


def get_current_ami_id(ssm_client):
    """Get the current AMI ID from SSM Parameter Store."""
    try:
        response = ssm_client.get_parameter(Name=SSM_PARAMETER_NAME)
        return response["Parameter"]["Value"]
    except ClientError as e:
        if e.response["Error"]["Code"] == "ParameterNotFound":
            logger.info(f"Parameter {SSM_PARAMETER_NAME} not found")
            return None
        logger.error(f"Error getting current AMI ID from SSM: {str(e)}")
        raise


def update_ami_parameter(ssm_client, ami_id):
    """Update the SSM parameter with the new AMI ID."""
    try:
        current_ami = get_current_ami_id(ssm_client)

        if current_ami == ami_id:
            logger.info(f"SSM parameter already contains latest AMI ID: {ami_id}")
            return True, "Already using latest AMI"

        if DRY_RUN:
            logger.info(
                f"[DRY RUN] Would update SSM parameter from AMI {current_ami} to {ami_id}"
            )
            return True, "Would update AMI (Dry Run)"

        ssm_client.put_parameter(
            Name=SSM_PARAMETER_NAME, Value=ami_id, Type="String", Overwrite=True
        )

        logger.info(
            f"Successfully updated SSM parameter from AMI {current_ami} to {ami_id}"
        )
        return True, "Updated successfully"

    except ClientError as e:
        error_msg = str(e)
        logger.error(f"Error updating SSM parameter: {error_msg}")
        return False, f"Error: {error_msg}"


def lambda_handler(event, context):
    """Main Lambda handler."""
    ec2_client = boto3.client("ec2")
    ssm_client = boto3.client("ssm")

    logger.info(f"Running in {'DRY RUN' if DRY_RUN else 'LIVE'} mode")

    try:
        # Get the latest AMI ID
        latest_ami_id = get_latest_ami(ec2_client)
        logger.info(f"Found latest AMI: {latest_ami_id}")

        # Update SSM parameter
        success, message = update_ami_parameter(ssm_client, latest_ami_id)

        return {
            "statusCode": 200 if success else 500,
            "body": {
                "dry_run": DRY_RUN,
                "overall_success": success,
                "latest_ami": latest_ami_id,
                "message": f"{'[DRY RUN] ' if DRY_RUN else ''}{message}",
            },
        }

    except Exception as e:
        logger.error(f"Error in lambda execution: {str(e)}")
        return {
            "statusCode": 500,
            "body": {"dry_run": DRY_RUN, "error": str(e), "overall_success": False},
        }
Terraform Resources
data "aws_iam_policy_document" "ami_updater_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

data "aws_iam_policy_document" "ami_updater_permissions" {
  statement {
    actions = [
      "ec2:DescribeImages"
    ]
    resources = ["*"]
  }

  statement {
    actions = [
      "ssm:PutParameter",
      "ssm:GetParameter"
    ]
    resources = ["arn:aws:ssm:*:*:parameter/github-action-runners/latest_ami_id"]
  }

  statement {
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:*:*:*"]
  }
}

resource "aws_iam_role" "ami_updater" {
  name               = "${var.runner_prefix}-ami-updater"
  assume_role_policy = data.aws_iam_policy_document.ami_updater_assume_role.json
}

resource "aws_iam_role_policy" "ami_updater" {
  name   = "${var.runner_prefix}-ami-updater-policy"
  role   = aws_iam_role.ami_updater.id
  policy = data.aws_iam_policy_document.ami_updater_permissions.json
}

data "archive_file" "ami_updater" {
  type        = "zip"
  source_file = "${path.module}/ami_updater.py"
  output_path = "${path.module}/ami_updater.zip"
}

resource "aws_lambda_function" "ami_updater" {
  filename         = data.archive_file.ami_updater.output_path
  source_code_hash = data.archive_file.ami_updater.output_base64sha256
  function_name    = "${var.runner_prefix}-ami-updater"
  role             = aws_iam_role.ami_updater.arn
  handler          = "ami_updater.lambda_handler"
  runtime          = "python3.9"
  timeout          = 30

  environment {
    variables = {
      LOG_LEVEL = "INFO"
      DRY_RUN   = "false" # Set to "false" to enable actual updates
    }
  }

  tags = merge(local.github_runner_additional_tags, var.tags)
}

resource "aws_cloudwatch_event_rule" "ami_updater" {
  name                = "${var.runner_prefix}-ami-updater"
  description         = "Trigger AMI updater Lambda function daily"
  schedule_expression = "rate(1 day)"
}

resource "aws_cloudwatch_event_target" "ami_updater" {
  rule      = aws_cloudwatch_event_rule.ami_updater.name
  target_id = "TriggerAMIUpdaterLambda"
  arn       = aws_lambda_function.ami_updater.arn
}

resource "aws_lambda_permission" "ami_updater" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ami_updater.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ami_updater.arn
}

resource "aws_ssm_parameter" "latest_ami_id" {
  name        = "/github-action-runners/latest_ami_id"
  description = "Latest AMI ID for GitHub runners"
  type        = "String"
  value       = "placeholder" # Will be updated by Lambda function

  tags = merge(local.github_runner_additional_tags, var.tags)
}

I also have created a draft PR in #4488 and would love to contribute. I tried making the code look as close as possible to other functions but will probably need some guidance in testing. (I am still using an old version of the module so didn't want to apply and change in my running infrastructure.)

Looking forward for your input!

@npalm
Copy link
Member

npalm commented Mar 27, 2025

Here my thought for the direction we should move witht the module

  1. Let the launch template use a ssm paramater to look up the AMI, and remove the AMI complete from the launch template. This will solve the drift here.
  2. Introduce an object to let the user set a self managed paramater, or provide an AMI filter.
    • When a ssm parameter ref is provided. THe module will not manage any life-cycle at all. It is just to the user to create and update the value
    • Remain the filter option to allow an easy path, and keep the module backwards compatiblle. To gether with the filter the user should be able to specify to let the module manage the parameter. The filter will result in the module creates a paramater. When not managed the lifecyle hook of the parameter should set to ignore_changes. If managed terraform will evoluate the and keep the value up-to-date. On top of this still the lambda can be added to keep the paramater up to date. Should be a seperate option. Since it can break the runner as well without control or testing.
    • This new config object will be a breaking change. Since we should remove the old variables. We could keep them. But I also what to keep this complex module managable. In gneral I am again breaking changes. We could try to keep them for a short time and fill the object with locals based on the old variables for AMI id and filter.

Wondering what you think about this direction?

@npalm
Copy link
Member

npalm commented Mar 27, 2025

Some background on my use case:

  • I build custom AMI's weekly using packer and they automatically update some SSM parameters marking them as "dev" runners
  • I have GitHub actions running periodic tests on these "dev" runners as well as some dev only workflows in other repos
  • If at the end of the week (prior to the next scheduled AMI build), all these tests are successful, then I have another action that will promote the "dev" SSM parameter value to the "prd" parameter which is what the bulk of my repos will use for their runners.
  • I don't use the ami_id at all, only ssm, but maybe we could just have some kind of toggle to not supply any ami_id to the launch template to prevent unnecessary noise in TF plans.

We do more or less the same. We have automated piplelines that runs on our test environment piples after updating our AMI's once all good. THe prod got updated with the the new AMI id via SSM.

@npalm
Copy link
Member

npalm commented Mar 31, 2025

Created a PR to let the launch template use the AMI ID via SSM, See #4517

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants