Enabled instance_refresh for the ASG
Added additional script to the user data scripts to handle waiting for an available volume
Enabled always running the user data scripts (reboot, stop/start)
Fixed typo
Changed lb_health_check_path to /protocol
Fixed an issue with creating new disks when a disk was already attached.
Updated README.md
Added checks for cluster quorum.
Moved the logic for disk mounting so there are no redundant operations.
Moved instance_refresh_checkpoint_delay and enable_userdata_scripts_on_reboot variables to top level.
Renamed 08_node_rejoin.sh.tpl to 08_node_join.sh.tpl and removed unnecessary logic.
Refactored 01_disk_management.sh.tpl
Updated 02_dns_provisioning.sh.tpl node naming
Added tagging for EC2 instances
AWS module upgrade to 5.49.0
viktor-ribchev committed May 15, 2024
1 parent 7d1a7d7 commit 226dc4e
Showing 16 changed files with 404 additions and 248 deletions.
94 changes: 44 additions & 50 deletions .terraform.lock.hcl


44 changes: 35 additions & 9 deletions README.md
@@ -14,7 +14,7 @@ for more details.
- [Inputs](#inputs)
- [Usage](#usage)
- [Examples](#examples)
- [Updating configurations on an active deployment](#updating-configurations-on-an-active-deployment)
- [Updating configurations and GraphDB version on an active deployment](#updating-configurations-and-graphdb-version-on-an-active-deployment)
- [Local Development](#local-development)
- [Release History](#release-history)
- [Contributing](#contributing)
@@ -318,19 +318,45 @@ s3_enable_replication_rule = true

## Updating configurations on an active deployment

In case your license has expired, and you need to renew it, or you need to make some changes to the `graphdb.properties`
file, or other GraphDB related configurations, you will need to apply the changes via `terraform apply` and then either:
### Updating Configurations

- Terminate the instances one by one, starting with the follower nodes, and leaving the leader node to be the last
instance to be terminated
- Scale down to 0 and back to number of nodes you originally had.
When faced with scenarios such as an expired license, or the need to modify the `graphdb.properties` file or other
GraphDB-related configurations, you can apply the changes via `terraform apply` and then either:

- Manually terminate instances one by one, beginning with the follower nodes and concluding with the leader node
as the last instance to be terminated.
- Scale in the number of instances in the scale set to zero and then scale back up to the original number of nodes.
- Set the `graphdb_enable_userdata_scripts_on_reboot` variable to `true`. This ensures that user data scripts are executed
on each reboot, allowing you to update the configuration of each node (see the sketch at the end of this section).
The reboot option essentially achieves the same outcome as the termination and replacement approach, but it is still experimental.

```text
Please be aware that the latter option will result in some downtime.
Please note that scaling in to zero and back up will result in greater downtime than the other options.
```

Both actions would trigger the user data script to be run again and update all files and properties overrides with the
updated values.
These actions will trigger the user data script to run again, updating all files and property overrides with the new values.
Please note that changing the `graphdb_admin_password` via `terraform apply` will not update the password in GraphDB.
Support for this will be introduced in the future.
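
A minimal `tfvars` sketch of the reboot-based approach; the variable name matches the top-level module input shown in `main.tf` below:

```hcl
# Re-run the user data scripts on every reboot (experimental), so configuration
# changes applied with `terraform apply` take effect after a node restarts.
graphdb_enable_userdata_scripts_on_reboot = true
```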

### Upgrading GraphDB Version

To automatically update the GraphDB version with `terraform apply`, set `enable_instance_refresh` to `true`
in your `tfvars` file. This configuration replaces your already running instances with new ones,
one at a time.

Please note that by default, the instance refresh process will wait for one hour before moving on to update the next instance.
This is a precautionary measure as GraphDB may need time to sync with the other nodes.
You can control this delay by updating the value of `instance_refresh_checkpoint_delay`.
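
A hedged `tfvars` sketch of enabling instance refresh; note that the top-level inputs in `main.tf` below carry an `asg_` prefix (`asg_enable_instance_refresh`, `asg_instance_refresh_checkpoint_delay`), and the delay value here is illustrative:

```hcl
asg_enable_instance_refresh           = true
# Seconds to pause at each checkpoint before refreshing the next instance;
# 3600 mirrors the default one-hour wait described above (illustrative value).
asg_instance_refresh_checkpoint_delay = 3600
```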

It's important to note that if you have made changes to any GraphDB configurations,
they will be applied during the instance refresh process, with the exception of the `graphdb_admin_password`.
Support for this will be introduced in the future.

**Important:** Having `enable_instance_refresh` enabled when scaling up the GraphDB cluster may lead to data
replication issues, as existing instances will still undergo the refresh process.
Depending on the data size, the new nodes might fail to join the cluster due to the instance refresh.

**You must set `enable_instance_refresh` to `false` when scaling up the cluster!**

## Local Development

5 changes: 5 additions & 0 deletions main.tf
@@ -209,4 +209,9 @@ module "graphdb" {

logging_enable_replication = var.logging_enable_bucket_replication
backup_enable_replication = var.backup_enable_bucket_replication

# ASG instance deployment options
asg_enable_instance_refresh = var.asg_enable_instance_refresh
asg_instance_refresh_checkpoint_delay = var.asg_instance_refresh_checkpoint_delay
graphdb_enable_userdata_scripts_on_reboot = var.graphdb_enable_userdata_scripts_on_reboot
}
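
A hedged sketch of the top-level variable declarations these module inputs assume; the names come from the module call above, while the types, defaults, and descriptions are illustrative assumptions:

```hcl
variable "asg_enable_instance_refresh" {
  description = "Enable rolling instance refresh for the GraphDB ASG" # assumed wording
  type        = bool
  default     = false # illustrative default
}

variable "asg_instance_refresh_checkpoint_delay" {
  description = "Seconds to wait at each instance refresh checkpoint" # assumed wording
  type        = number
  default     = 3600 # illustrative: the one-hour delay described in the README
}

variable "graphdb_enable_userdata_scripts_on_reboot" {
  description = "Re-run the user data scripts on every instance reboot" # assumed wording
  type        = bool
  default     = false # illustrative default
}
```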
32 changes: 28 additions & 4 deletions modules/graphdb/iam.tf
@@ -55,7 +55,7 @@ resource "aws_iam_role_policy_attachment" "graphdb_systems_manager_policy" {
}

resource "aws_iam_role_policy" "graphdb_instance_ssm_iam_role_policy" {
name = var.resource_name_prefix
name = "${var.resource_name_prefix}-describe-ssm_params"
role = aws_iam_role.graphdb_iam_role.id
policy = data.aws_iam_policy_document.graphdb_instance_ssm.json
}
@@ -68,7 +68,31 @@ data "aws_iam_policy_document" "graphdb_instance_ssm" {
"ssm:DescribeParameters"
]

resources = ["arn:aws:ssm:${var.aws_region}:${var.aws_subscription_id}:*"]
resources = [
"arn:aws:ssm:${var.aws_region}:${var.aws_subscription_id}:*"
]
}
}

resource "aws_iam_role_policy" "graphdb_describe_resources_iam_role_policy" {
name = "${var.resource_name_prefix}-describe-resources"
role = aws_iam_role.graphdb_iam_role.id
policy = data.aws_iam_policy_document.graphdb_describe_resources.json
}

data "aws_iam_policy_document" "graphdb_describe_resources" {
statement {
effect = "Allow"

actions = [
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"autoscaling:DescribeInstanceRefreshes"
]

resources = [
"*"
]
}
}

@@ -80,8 +104,8 @@ data "aws_iam_policy_document" "graphdb_instance_volume" {
"ec2:CreateVolume",
"ec2:AttachVolume",
"ec2:DescribeVolumes",
"ec2:DescribeInstances",
"ec2:MonitorInstances"
"ec2:MonitorInstances",
"ec2:CreateTags"
]

resources = [
24 changes: 21 additions & 3 deletions modules/graphdb/main.tf
@@ -62,7 +62,7 @@ resource "aws_launch_template" "graphdb" {
aws_security_group.graphdb_security_group.id
]

ebs_optimized = "true"
ebs_optimized = true

iam_instance_profile {
name = aws_iam_instance_profile.graphdb_iam_instance_profile.id
@@ -73,10 +73,10 @@
http_tokens = "required"
}

update_default_version = "true"
update_default_version = true
}

resource "aws_autoscaling_group" "graphdb_auto_scalling_group" {
resource "aws_autoscaling_group" "graphdb_auto_scaling_group" {
name = var.resource_name_prefix
min_size = var.graphdb_node_count
max_size = var.graphdb_node_count
@@ -90,6 +90,24 @@
version = aws_launch_template.graphdb.latest_version
}

dynamic "instance_refresh" {
for_each = var.asg_enable_instance_refresh ? [1] : []
content {
strategy = "Rolling"

preferences {
min_healthy_percentage = var.asg_instance_refresh_min_healthy_percentage
instance_warmup = var.asg_instance_refresh_instance_warmup
skip_matching = var.asg_instance_refresh_skip_matching
checkpoint_delay = var.asg_instance_refresh_checkpoint_delay
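# With graphdb_node_count = 3, the expression below yields [33, 66, 100],
# pausing the refresh for checkpoint_delay after each replaced instance.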
checkpoint_percentages = [
for i in range(var.graphdb_node_count) :
floor((i + 1) * 100 / var.graphdb_node_count)
]
}
}
}

dynamic "tag" {
for_each = data.aws_default_tags.current.tags
content {
2 changes: 1 addition & 1 deletion modules/graphdb/outputs.tf
@@ -24,7 +24,7 @@ output "s3_iam_role_name" {

output "asg_name" {
description = "Name of autoscaling group"
value = aws_autoscaling_group.graphdb_auto_scalling_group.name
value = aws_autoscaling_group.graphdb_auto_scaling_group.name
}

output "launch_template_id" {
59 changes: 59 additions & 0 deletions modules/graphdb/templates/00_wait_node_count.sh.tpl
@@ -0,0 +1,59 @@
#!/usr/bin/env bash

# This script performs the following actions:
# * During an ASG instance refresh, waits for the instance's EC2 status checks to pass
# * Waits for an available EBS volume in the current availability zone before proceeding

set -o errexit
set -o nounset
set -o pipefail

# This handles instance refreshing where new and old nodes are both present.
# Waiting until the ASG nodes are equal to the expected node count and proceeding with the provisioning afterwards.
IMDS_TOKEN=$(curl -Ss -H "X-aws-ec2-metadata-token-ttl-seconds: 6000" -XPUT 169.254.169.254/latest/api/token)
AZ=$(curl -Ss -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" 169.254.169.254/latest/meta-data/placement/availability-zone)
ASG_NAME=${name}

instance_refresh_status=$(aws autoscaling describe-instance-refreshes --auto-scaling-group-name "$ASG_NAME" --query 'InstanceRefreshes[?Status==`InProgress`]' --output json)

if [ "$instance_refresh_status" != "[]" ]; then
echo "An instance refresh is currently in progress for the ASG: $ASG_NAME"
echo "$instance_refresh_status" | jq '.'

IMDS_TOKEN=$(curl -Ss -H "X-aws-ec2-metadata-token-ttl-seconds: 6000" -XPUT 169.254.169.254/latest/api/token)
INSTANCE_ID=$(curl -Ss -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" 169.254.169.254/latest/meta-data/instance-id)

echo "Waiting for default EC2 status check to pass for instance $INSTANCE_ID..."

# Loop until the default status check passes
while true; do
# Get the status of the default status checks for the instance
instance_status=$(aws ec2 describe-instance-status --instance-ids "$INSTANCE_ID" --query 'InstanceStatuses[0].InstanceStatus.Status' --output text)
system_status=$(aws ec2 describe-instance-status --instance-ids "$INSTANCE_ID" --query 'InstanceStatuses[0].SystemStatus.Status' --output text)

# Check if the status is "ok"
if [[ "$instance_status" == "ok" && $system_status == "ok" ]]; then
echo "Default EC2 status check passed for instance $INSTANCE_ID."
break
fi

# Sleep for a while before checking again
sleep 5
done

echo "Waiting for an available volume in $AZ"
# TODO This will hang forever when scaling out.
while true; do
# Get the list of volumes in the current availability zone
available_volumes=$(aws ec2 describe-volumes --filters "Name=availability-zone,Values=$AZ" "Name=status,Values=available" "Name=volume-type,Values=gp3" --query "Volumes[*].VolumeId" --output text)

# Check if any volumes are available
if [ -n "$available_volumes" ]; then
echo "Found an available volume in $AZ."
break
fi

sleep 5
done
else
echo "No instance refresh is currently in progress for the ASG: $ASG_NAME"
fi
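
A hedged sketch of how this template might be rendered into user data from Terraform; the actual wiring in this module may differ, and the `locals` name is hypothetical:

```hcl
locals {
  # Render the wait script, supplying the ASG name that the script polls;
  # the ASG is named var.resource_name_prefix in modules/graphdb/main.tf.
  wait_node_count_script = templatefile("${path.module}/templates/00_wait_node_count.sh.tpl", {
    name = var.resource_name_prefix # becomes ASG_NAME inside the script
  })
}
```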