An important rule when building a highly available, highly reliable system is to design for failure. Almoayyed Computers uses the Amazon EC2 auto recovery feature to build systems that recover from failure. We arrange for an EC2 instance to be recovered automatically when a system status check fails because of a problem with the underlying hardware. The instance is rebooted (on new hardware if necessary) but retains its instance ID, IP address, Elastic IP addresses, EBS volume attachments, and other configuration details. For the recovery to complete, we also make sure that the instance automatically starts any required services or applications as part of its initialization process.
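Examples of problems that cause system status checks to fail include loss of network connectivity, loss of system power, software issues on the physical host, and hardware issues on the physical host that affect network reachability.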
This walkthrough involves the following AWS services: EC2, CloudWatch, CloudFormation, and SNS.
EC2 server before configuring auto-recovery
ACME has built a CloudFormation template that creates a CloudWatch alarm for instance auto recovery.
Below is the configuration of the alarm, which can also be set up from the AWS console.
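As a rough illustration of what such a template might contain, the sketch below builds a minimal CloudFormation template in Python and prints it as JSON. The metric (`StatusCheckFailed_System`) and the recover action ARN follow the documented pattern for EC2 auto recovery. The parameter name `InstanceId`, the resource names `RecoveryAlarm` and `AlertTopic`, and the period and evaluation counts are assumptions chosen for illustration, not values taken from the template described here.

```python
import json


def build_recovery_template(evaluation_periods: int = 2, period_seconds: int = 60) -> dict:
    """Return a CloudFormation template (as a dict) containing an auto-recovery alarm.

    The defaults (2 periods of 60 seconds) are illustrative assumptions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Hypothetical parameter name: the instance to protect.
            "InstanceId": {"Type": "AWS::EC2::Instance::Id"},
        },
        "Resources": {
            # Hypothetical topic that would forward alarm notifications.
            "AlertTopic": {"Type": "AWS::SNS::Topic"},
            "RecoveryAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": "Recover the instance when the system status check fails",
                    "Namespace": "AWS/EC2",
                    "MetricName": "StatusCheckFailed_System",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": {"Ref": "InstanceId"}}
                    ],
                    "Statistic": "Minimum",
                    "Period": period_seconds,
                    "EvaluationPeriods": evaluation_periods,
                    "ComparisonOperator": "GreaterThanThreshold",
                    "Threshold": 0,
                    "AlarmActions": [
                        # Built-in EC2 recover action for the stack's region.
                        {"Fn::Sub": "arn:aws:automate:${AWS::Region}:ec2:recover"},
                        # Notify the topic when the alarm fires.
                        {"Ref": "AlertTopic"},
                    ],
                    # Notify the topic again when the alarm returns to OK.
                    "OKActions": [{"Ref": "AlertTopic"}],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_recovery_template(), indent=2))
```

The printed JSON could be saved to a file and used as the stack template. Generating it from code is only one option; a hand-written YAML or JSON template would work equally well.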
After the CloudFormation stack has been created successfully, the alarm exists for the EC2 server.
When the alarm detects a state change, it triggers an AWS SNS topic, which in turn sends notifications to PagerDuty, a SaaS incident response platform.
As soon as the support engineer receives the notification, they take the necessary action if manual intervention is required.
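To make the notification step more concrete, the sketch below parses the JSON message that CloudWatch publishes to an SNS topic on an alarm state change and reduces it to the fields an on-call engineer would likely care about. The function name `summarize_alarm_message` and the sample payload are hypothetical. The field names (`AlarmName`, `OldStateValue`, `NewStateValue`, `NewStateReason`) are those CloudWatch includes in alarm notifications. How PagerDuty consumes the message is not shown here.

```python
import json


def summarize_alarm_message(sns_message: str) -> dict:
    """Extract the key fields from a CloudWatch alarm notification body."""
    alarm = json.loads(sns_message)
    old_state = alarm["OldStateValue"]
    new_state = alarm["NewStateValue"]
    return {
        "alarm": alarm["AlarmName"],
        "transition": f"{old_state} -> {new_state}",
        "reason": alarm["NewStateReason"],
        # Only a transition into ALARM should call for an engineer's attention.
        "needs_attention": new_state == "ALARM",
    }


if __name__ == "__main__":
    # Made-up sample payload for illustration only.
    sample = json.dumps(
        {
            "AlarmName": "NAME-OF-YOUR-ALARM",
            "OldStateValue": "OK",
            "NewStateValue": "ALARM",
            "NewStateReason": "Simulate an EC2 HW System Failure",
        }
    )
    print(summarize_alarm_message(sample))
```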
To test whether instance recovery works, simulate a hardware failure by forcing the alarm into the ALARM state with the following command:
Note: Instance should be able to
aws cloudwatch set-alarm-state --alarm-name "NAME-OF-YOUR-ALARM" --state-value ALARM --state-reason "Simulate an EC2 HW System Failure" --region me-south-1
SNS sends a notification to the responsible engineer by email and through PagerDuty.
The following screenshot shows that, after the command was executed, the alarm state changed from OK to “In alarm”. Once auto recovery of the instance completed, the alarm state returned from “In alarm” to OK.
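The same check can be made without a screenshot. The sketch below assumes the alarm's state values have already been collected in chronological order (for example from the output of `aws cloudwatch describe-alarm-history`) and tests whether they contain the expected OK, ALARM, OK sequence. The helper name `recovery_cycle_observed` is hypothetical. Intermediate or repeated states, such as `INSUFFICIENT_DATA`, are tolerated.

```python
from typing import Iterable


def recovery_cycle_observed(states: Iterable[str]) -> bool:
    """Return True if the states contain OK, then ALARM, then OK, in that order."""
    expected = ["OK", "ALARM", "OK"]
    position = 0  # index of the next state we are waiting for
    for state in states:
        if state == expected[position]:
            position += 1
            if position == len(expected):
                return True
    return False


if __name__ == "__main__":
    # Made-up state history for illustration only.
    history = ["OK", "ALARM", "ALARM", "OK"]
    print(recovery_cycle_observed(history))
```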