Using CloudFormation to enable Auto recovery for instances

An important rule when building a highly available and highly reliable system is to design for failure. Almoayyed Computers uses AWS EC2 auto recovery feature to build the great systems to recover from failure. We arrange for automatic recovery of an EC2 instance when a system status check of the underlying hardware fails. The instance will be rebooted (on new hardware if necessary) but will retain its Instance Id, IP Address, Elastic IP Addresses, EBS Volume attachments, and other configuration details. In order for the recovery to complete, we also make sure that the instance automatically starts up any services or applications as part of its initialization process.

Examples of problems that cause system status checks to fail include:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host that impact network reachability

This walkthrough involves the following AWS services:

  1. EC2 — Elastic compute cloud
  2. CloudWatch — Cloud & Network Monitoring Services
  3. SNS — Simple Notification Service
  4. CloudFormation -Infra as a code
  5. PagerDuty tool

EC2 server before configuring auto-recovery


ACME has built a CloudFormation template which creates CloudWatch Alarm for instance auto-recovery

Below is the configuration of alarm which can even be done using AWS console.


After successful creation of CloudFormation stack, alarm has been created for EC2 server



Notifications

On the detection of state change alarm will trigger AWS SNS topic which in turn send notifications to PagerDuty tool which is SaaS incident response platform.


As soon as notification is received to support engineer, he will take necessary actions in case manual intervention is required.

To test whether instance recovery works, execute articial harware failure command:

Note: Instance should be able to :\Users\Username> aws cloudwatch set-alarm-state \ --alarm-name "NAME-OF-YOUR-ALARM" \ --state-value ALARM \ --state-reason "Simulate an EC2 HW System Failure" --region me-south-1

Notification is sent out to respective engineer using SNS over mail and PagerDuty.


Following screenshot represents after execution of command instance state has changed which changed Alarm state from OK to “in Alarm”, thereafter auto recovery of instance has been completed and again Alarm state is back to OK from “in Alarm”