Enabled instance_refresh for the ASG
Added additional script to the user data scripts to handle waiting for an available volume
Enabled always running the user data scripts (reboot, stop/start)
Fixed typo
Changed lb_health_check_path to /protocol
Fixed an issue with creating new disks when a disk was already attached.
Updated README.md
Added checks for cluster quorum.
Moved the logic for disk mounting so there are no redundant operations.
Moved instance_refresh_checkpoint_delay and enable_userdata_scripts_on_reboot variables to top level.
Renamed 08_node_rejoin.sh.tpl to 08_node_join.sh.tpl and removed unnecessary logic.
Refactored 01_disk_management.sh.tpl
Updated 02_dns_provisioning.sh.tpl node naming
Added tagging for EC2 instances
AWS module upgrade to 5.49.0
viktor-ribchev committed May 15, 2024
1 parent 7d1a7d7 commit 226dc4e
Showing 16 changed files with 404 additions and 248 deletions.
94 changes: 44 additions & 50 deletions .terraform.lock.hcl


44 changes: 35 additions & 9 deletions README.md
@@ -14,7 +14,7 @@ for more details.
- [Inputs](#inputs)
- [Usage](#usage)
- [Examples](#examples)
- [Updating configurations on an active deployment](#updating-configurations-on-an-active-deployment)
- [Updating configurations and GraphDB version on an active deployment](#updating-configurations-and-graphdb-version-on-an-active-deployment)
- [Local Development](#local-development)
- [Release History](#release-history)
- [Contributing](#contributing)
@@ -318,19 +318,45 @@ s3_enable_replication_rule = true

## Updating configurations on an active deployment

In case your license has expired, and you need to renew it, or you need to make some changes to the `graphdb.properties`
file, or other GraphDB related configurations, you will need to apply the changes via `terraform apply` and then either:
### Updating Configurations

- Terminate the instances one by one, starting with the follower nodes, and leaving the leader node to be the last
instance to be terminated
- Scale down to 0 and back to number of nodes you originally had.
When faced with scenarios such as an expired license, or the need to modify the `graphdb.properties` file or other
GraphDB-related configurations, you can apply the changes via `terraform apply` and then either:

- Manually terminate instances one by one, beginning with the follower nodes and concluding with the leader node
as the last instance to be terminated.
- Scale in the number of instances in the scale set to zero and then scale back up to the original number of nodes.
- Set the `graphdb_enable_userdata_scripts_on_reboot` variable to `true`. This ensures that user data scripts are executed
on each reboot, allowing you to update the configuration of each node (see the sketch at the end of this section).
The reboot option essentially achieves the same outcome as the termination and replacement approach, but it is still experimental.

```text
Please be aware that the latter option will result in some downtime.
Please note that scaling in to zero and back up will result in greater downtime than the other options.
```

Both actions would trigger the user data script to be run again and update all files and properties overrides with the
updated values.
These actions will trigger the user data script to run again, updating all files and property overrides with the new values.
Please note that changing the `graphdb_admin_password` via `terraform apply` will not update the password in GraphDB.
Support for this will be introduced in the future.
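
A minimal `tfvars` sketch of the reboot-based approach; the variable name matches the top-level module input shown in `main.tf` below:

```hcl
# Re-run the user data scripts on every reboot (experimental), so configuration
# changes applied with `terraform apply` take effect after a node restarts.
graphdb_enable_userdata_scripts_on_reboot = true
```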

### Upgrading GraphDB Version

To automatically update the GraphDB version with `terraform apply`, set `enable_instance_refresh` to `true`
in your `tfvars` file. This configuration replaces your already running instances with new ones,
one at a time.

Please note that by default, the instance refresh process will wait for one hour before moving on to update the next instance.
This is a precautionary measure as GraphDB may need time to sync with the other nodes.
You can control this delay by updating the value of `instance_refresh_checkpoint_delay`.
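
A hedged `tfvars` sketch of enabling instance refresh; note that the top-level inputs in `main.tf` below carry an `asg_` prefix (`asg_enable_instance_refresh`, `asg_instance_refresh_checkpoint_delay`), and the delay value here is illustrative:

```hcl
asg_enable_instance_refresh           = true
# Seconds to pause at each checkpoint before refreshing the next instance;
# 3600 mirrors the default one-hour wait described above (illustrative value).
asg_instance_refresh_checkpoint_delay = 3600
```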

It's important to note that if you have made changes to any GraphDB configurations,
they will be applied during the instance refresh process, with the exception of the `graphdb_admin_password`.
Support for this will be introduced in the future.

**Important:** Having `enable_instance_refresh` enabled when scaling up the GraphDB cluster may lead to data
replication issues, as existing instances will still undergo the refresh process.
Depending on the data size, the new nodes might fail to join the cluster due to the instance refresh.

**You must set `enable_instance_refresh` to `false` when scaling up the cluster!**

## Local Development

5 changes: 5 additions & 0 deletions main.tf
@@ -209,4 +209,9 @@ module "graphdb" {

logging_enable_replication = var.logging_enable_bucket_replication
backup_enable_replication = var.backup_enable_bucket_replication

# ASG instance deployment options
asg_enable_instance_refresh = var.asg_enable_instance_refresh
asg_instance_refresh_checkpoint_delay = var.asg_instance_refresh_checkpoint_delay
graphdb_enable_userdata_scripts_on_reboot = var.graphdb_enable_userdata_scripts_on_reboot
}
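
A hedged sketch of the top-level variable declarations these module inputs assume; the names come from the module call above, while the types, defaults, and descriptions are illustrative assumptions:

```hcl
variable "asg_enable_instance_refresh" {
  description = "Enable rolling instance refresh for the GraphDB ASG" # assumed wording
  type        = bool
  default     = false # illustrative default
}

variable "asg_instance_refresh_checkpoint_delay" {
  description = "Seconds to wait at each instance refresh checkpoint" # assumed wording
  type        = number
  default     = 3600 # illustrative: the one-hour delay described in the README
}

variable "graphdb_enable_userdata_scripts_on_reboot" {
  description = "Re-run the user data scripts on every instance reboot" # assumed wording
  type        = bool
  default     = false # illustrative default
}
```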
32 changes: 28 additions & 4 deletions modules/graphdb/iam.tf
@@ -55,7 +55,7 @@ resource "aws_iam_role_policy_attachment" "graphdb_systems_manager_policy" {
}

resource "aws_iam_role_policy" "graphdb_instance_ssm_iam_role_policy" {
name = var.resource_name_prefix
name = "${var.resource_name_prefix}-describe-ssm_params"
role = aws_iam_role.graphdb_iam_role.id
policy = data.aws_iam_policy_document.graphdb_instance_ssm.json
}
@@ -68,7 +68,31 @@ data "aws_iam_policy_document" "graphdb_instance_ssm" {
"ssm:DescribeParameters"
]

resources = ["arn:aws:ssm:${var.aws_region}:${var.aws_subscription_id}:*"]
resources = [
"arn:aws:ssm:${var.aws_region}:${var.aws_subscription_id}:*"
]
}
}

resource "aws_iam_role_policy" "graphdb_describe_resources_iam_role_policy" {
name = "${var.resource_name_prefix}-describe-resources"
role = aws_iam_role.graphdb_iam_role.id
policy = data.aws_iam_policy_document.graphdb_describe_resources.json
}

data "aws_iam_policy_document" "graphdb_describe_resources" {
statement {
effect = "Allow"

actions = [
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"autoscaling:DescribeInstanceRefreshes"
]

resources = [
"*"
]
}
}

@@ -80,8 +104,8 @@ data "aws_iam_policy_document" "graphdb_instance_volume" {
"ec2:CreateVolume",
"ec2:AttachVolume",
"ec2:DescribeVolumes",
"ec2:DescribeInstances",
"ec2:MonitorInstances"
"ec2:MonitorInstances",
"ec2:CreateTags"
]

resources = [
24 changes: 21 additions & 3 deletions modules/graphdb/main.tf
@@ -62,7 +62,7 @@ resource "aws_launch_template" "graphdb" {
aws_security_group.graphdb_security_group.id
]

ebs_optimized = "true"
ebs_optimized = true

iam_instance_profile {
name = aws_iam_instance_profile.graphdb_iam_instance_profile.id
@@ -73,10 +73,10 @@
http_tokens = "required"
}

update_default_version = "true"
update_default_version = true
}

resource "aws_autoscaling_group" "graphdb_auto_scalling_group" {
resource "aws_autoscaling_group" "graphdb_auto_scaling_group" {
name = var.resource_name_prefix
min_size = var.graphdb_node_count
max_size = var.graphdb_node_count
@@ -90,6 +90,24 @@
version = aws_launch_template.graphdb.latest_version
}

dynamic "instance_refresh" {
for_each = var.asg_enable_instance_refresh ? [1] : []
content {
strategy = "Rolling"

preferences {
min_healthy_percentage = var.asg_instance_refresh_min_healthy_percentage
instance_warmup = var.asg_instance_refresh_instance_warmup
skip_matching = var.asg_instance_refresh_skip_matching
checkpoint_delay = var.asg_instance_refresh_checkpoint_delay
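# With graphdb_node_count = 3, the expression below yields [33, 66, 100],
# pausing the refresh for checkpoint_delay after each replaced instance.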
checkpoint_percentages = [
for i in range(var.graphdb_node_count) :
floor((i + 1) * 100 / var.graphdb_node_count)
]
}
}
}

dynamic "tag" {
for_each = data.aws_default_tags.current.tags
content {
2 changes: 1 addition & 1 deletion modules/graphdb/outputs.tf
@@ -24,7 +24,7 @@ output "s3_iam_role_name" {

output "asg_name" {
description = "Name of autoscaling group"
value = aws_autoscaling_group.graphdb_auto_scalling_group.name
value = aws_autoscaling_group.graphdb_auto_scaling_group.name
}

output "launch_template_id" {
59 changes: 59 additions & 0 deletions modules/graphdb/templates/00_wait_node_count.sh.tpl
@@ -0,0 +1,59 @@
#!/usr/bin/env bash

# This script performs the following actions:
# * During an ASG instance refresh, waits for the instance's EC2 status checks to pass
# * Waits for an available EBS volume in the current availability zone before proceeding

set -o errexit
set -o nounset
set -o pipefail

# This handles instance refreshing where new and old nodes are both present.
# Waiting until the ASG nodes are equal to the expected node count and proceeding with the provisioning afterwards.
IMDS_TOKEN=$(curl -Ss -H "X-aws-ec2-metadata-token-ttl-seconds: 6000" -XPUT 169.254.169.254/latest/api/token)
AZ=$(curl -Ss -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" 169.254.169.254/latest/meta-data/placement/availability-zone)
ASG_NAME=${name}

instance_refresh_status=$(aws autoscaling describe-instance-refreshes --auto-scaling-group-name "$ASG_NAME" --query 'InstanceRefreshes[?Status==`InProgress`]' --output json)

if [ "$instance_refresh_status" != "[]" ]; then
echo "An instance refresh is currently in progress for the ASG: $ASG_NAME"
echo "$instance_refresh_status" | jq '.'

IMDS_TOKEN=$(curl -Ss -H "X-aws-ec2-metadata-token-ttl-seconds: 6000" -XPUT 169.254.169.254/latest/api/token)
INSTANCE_ID=$(curl -Ss -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" 169.254.169.254/latest/meta-data/instance-id)

echo "Waiting for default EC2 status check to pass for instance $INSTANCE_ID..."

# Loop until the default status check passes
while true; do
# Get the status of the default status checks for the instance
instance_status=$(aws ec2 describe-instance-status --instance-ids "$INSTANCE_ID" --query 'InstanceStatuses[0].InstanceStatus.Status' --output text)
system_status=$(aws ec2 describe-instance-status --instance-ids "$INSTANCE_ID" --query 'InstanceStatuses[0].SystemStatus.Status' --output text)

# Check if the status is "ok"
if [[ "$instance_status" == "ok" && $system_status == "ok" ]]; then
echo "Default EC2 status check passed for instance $INSTANCE_ID."
break
fi

# Sleep for a while before checking again
sleep 5
done

echo "Waiting for an available volume in $AZ"
# TODO This will hang forever when scaling out.
while true; do
# Get the list of volumes in the current availability zone
available_volumes=$(aws ec2 describe-volumes --filters "Name=availability-zone,Values=$AZ" "Name=status,Values=available" "Name=volume-type,Values=gp3" --query "Volumes[*].VolumeId" --output text)

# Check if any volumes are available
if [ -n "$available_volumes" ]; then
echo "Found an available volume in $AZ."
break
fi

sleep 5
done
else
echo "No instance refresh is currently in progress for the ASG: $ASG_NAME"
fi
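
A hedged sketch of how this template might be rendered into user data from Terraform; the actual wiring in this module may differ, and the `locals` name is hypothetical:

```hcl
locals {
  # Render the wait script, supplying the ASG name that the script polls;
  # the ASG is named var.resource_name_prefix in modules/graphdb/main.tf.
  wait_node_count_script = templatefile("${path.module}/templates/00_wait_node_count.sh.tpl", {
    name = var.resource_name_prefix # becomes ASG_NAME inside the script
  })
}
```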