Skip to content

Latest commit

 

History

History
21 lines (14 loc) · 2.69 KB

2022-07-28-aws.md

File metadata and controls

21 lines (14 loc) · 2.69 KB

CI Failures in AWS

  • 17:03 UTC - Our monitoring notices problems in AWS us-east-2
  • 17:09 UTC - We notice starved CI jobs because VM's can't be launched
  • 17:12 UTC - A few CI jobs fail that are hosted in us-east-2
  • 17:18 UTC - We decide to move all CI work over to us-west-2
  • 17:43 UTC - us-west-2 is provisioned and CI jobs are scheduled there
  • 19:30 UTC - Some issues were detected with the provisioning of new NFS servers. Issues were corrected manually and DevOps scripts were updated for future issues.

NOTE - Container builds should continue working like usual. OE builds will be slightly slower becaause they will lack the shared-state cache that lives in us-east-2 NFS servers.

AWS reporting from health.aws.com:

Instance Impairments

10:49 AM PDT We continue to see recovery of EC2 instances that were affected by the loss of power in a single Availability Zone in the US-EAST-2 Region. At this stage, the vast majority of affected EC2 instances and EBS volumes have returned to a healthy state and we continue to work on the remaining EC2 instances and EBS volumes. Elastic Load Balancing has shifted traffic away from the affected Availability Zone. Single-AZ RDS databases were also affected and will recover as the underlying EC2 instance recovers. Multi-AZ RDS databases would have mitigated impact by failing away from the affected Availability Zone. While the vast majority of Lambda functions continue operating normally, some functions are experiencing invocation failures and latencies, but we expect this to improve over the next 30 minutes. Power has been restored to all affected resources and remains stable. We expect the recovery of EC2 instances and EBS volumes to continue to improve over the next 45 minutes. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.

10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.

10:11 AM PDT We are investigating network connectivity issues for some instances and increased error rates and latencies for the EC2 APIs within the US-EAST-2 Region.