AWS Clean-Up Automation

With Amazon’s huge infrastructure behind them, enterprises today can use numerous AWS tools, such as Auto Scaling and Elastic Load Balancing, to scale their applications up or down based on demand. When using AWS, organizations generally focus on automating the bootstrapping of VMs and the deployment of applications onto them. However, once the purpose is served, very few bother to clean up dependent resources when an EC2 instance or other AWS resource is terminated or deleted. This becomes especially critical when working with Auto Scaling groups. Enterprises may need to perform various clean-up activities when an EC2 instance is terminated: removing Chef nodes, agent registrations (e.g. Zabbix agents) and security keys (e.g. OSSEC), deleting instance snapshots and S3 bucket data, and many other dependent activities. The following section describes a few approaches to performing such clean-up, with the pros and cons of each. There is no “fit for all” solution here; the right choice depends on your own environment and requirements.

The approaches below are compared in terms of the “Reliability”, “Complexity” and “Urgency” of the clean-up activities. Pick the approach that best fits your needs.

For each approach, the clean-up flow, pros, cons and comments are listed.
1. CloudWatch Event -> Lambda

Clean-up flow:
  1. CloudWatch triggers a Lambda function on instance termination
  2. Lambda performs the clean-up activity, e.g. Chef node clean-up, S3 data deletion etc.

Pros:
  • Simple flow
  • Easy to set up

Cons:
  1. If the clean-up activity requires access through keys, we need to maintain and secure those keys with Lambda. We can use KMS to encrypt/decrypt such keys.
  2. Unreliable: if the Lambda function fails, we lose the event
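
As a rough illustration of this flow, the sketch below shows what such a Lambda handler might look like in Python. It assumes the function is the target of a CloudWatch Events rule for EC2 instance state-change notifications. `cleanup_instance` is a hypothetical placeholder for the real clean-up (Chef node removal, S3 data deletion and so on), and the extra `cleanup` parameter exists only to make the handler easy to test.

```python
# Event pattern for a CloudWatch Events rule that matches EC2 terminations.
TERMINATION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}


def terminated_instance_id(event):
    """Return the instance id if the event reports a termination, else None."""
    detail = event.get("detail", {})
    if event.get("source") != "aws.ec2" or detail.get("state") != "terminated":
        return None
    return detail.get("instance-id")


def cleanup_instance(instance_id):
    """Hypothetical placeholder: the real clean-up logic would go here."""
    print("cleaning up resources of", instance_id)


def lambda_handler(event, context, cleanup=cleanup_instance):
    """Entry point invoked by Lambda with the CloudWatch event."""
    instance_id = terminated_instance_id(event)
    if instance_id is None:
        return {"cleaned": None}
    # An exception raised by the clean-up fails this invocation.
    cleanup(instance_id)
    return {"cleaned": instance_id}
```

The handler double-checks the state even though the rule pattern already filters on it, so a misconfigured rule cannot trigger clean-up for a running instance.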
2. CloudWatch -> SQS -> Cron job

Clean-up flow:
  1. CloudWatch posts a message to AWS SQS
  2. A cron job running on one of the servers periodically fetches messages from SQS and performs the clean-up activities

Pros:
  • Reliable: a message is deleted from the queue only once the clean-up has been performed

Cons:
  1. A cron job must be running on the desired server.
  2. Clean-up is not immediate: it runs every x minutes

Comments:
This is a simple and reliable solution if we don’t need immediate clean-up. Example: for Chef node clean-up, the cron job can run on the Chef server and delete nodes using knife. We don’t need to separately maintain Chef private keys in Lambda as described in #1.
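
The cron side of this approach could be sketched roughly as below. This is a minimal sketch under several assumptions: the SQS message body is the CloudWatch event JSON, the Chef node name equals the EC2 instance id, and `knife` is configured on the server running the job. `drain_queue` and `delete_chef_node` are illustrative names, and the `sqs` argument is expected to be a boto3 SQS client.

```python
import json
import subprocess


def delete_chef_node(instance_id):
    """Delete the Chef node, assuming its name equals the instance id."""
    subprocess.run(["knife", "node", "delete", instance_id, "-y"], check=True)


def drain_queue(sqs, queue_url, cleanup, max_batches=10):
    """Process pending termination messages; return the ids cleaned up."""
    cleaned = []
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5
        )
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            event = json.loads(message["Body"])
            instance_id = event["detail"]["instance-id"]
            try:
                cleanup(instance_id)
            except Exception as error:
                # Leave the message on the queue so a later run retries it.
                print("clean-up failed for", instance_id, error)
                continue
            # Delete only after the clean-up has succeeded.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            cleaned.append(instance_id)
    return cleaned
```

A script run from cron might call `drain_queue(boto3.client("sqs"), queue_url, delete_chef_node)`, where `queue_url` is the URL of your own queue.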

3. CloudWatch -> Run Command

Clean-up flow:
  1. CloudWatch triggers a Lambda function
  2. Lambda calls AWS Run Command
  3. Run Command triggers a remote command on the desired server to perform the clean-up

Pros:
  • Simple flow

Cons:
  1. In our experience, EC2 Run Command hasn’t been very reliable in the past. We have seen commands fail to complete for a long time and eventually time out. Its reliability has improved over time, but you should still do a fair amount of testing.

Comments:
AWS has recently added support for CloudWatch integration with EC2 Run Command. This approach is similar to #1, but the responsibility for the clean-up lies with the target EC2 instance rather than with Lambda.

http://docs.aws.amazon.com/systems-manager/latest/userguide/rc-cwe.html
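
To make the Lambda-to-Run-Command step concrete, here is a minimal sketch of a handler that forwards the clean-up to a target server through the SSM `send_command` API. It rests on assumptions not stated above: the target is a Chef server with `knife` configured, the Chef node name equals the instance id, and the target's instance id comes from a hypothetical `CLEANUP_TARGET_INSTANCE_ID` environment variable. The instance id is validated before it is placed in a shell command.

```python
import os
import re

# EC2 instance ids are "i-" followed by 8 or 17 hex characters.
INSTANCE_ID = re.compile(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def build_cleanup_commands(instance_id):
    """Shell commands for the target server (assumes node name == instance id)."""
    if not INSTANCE_ID.fullmatch(instance_id):
        raise ValueError("unexpected instance id: %r" % instance_id)
    return [
        "knife node delete %s -y" % instance_id,
        "knife client delete %s -y" % instance_id,
    ]


def send_cleanup_command(ssm, target_instance_id, terminated_instance_id):
    """Ask Run Command to execute the clean-up; return the command id."""
    response = ssm.send_command(
        InstanceIds=[target_instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": build_cleanup_commands(terminated_instance_id)},
        TimeoutSeconds=600,
        Comment="clean-up for %s" % terminated_instance_id,
    )
    return response["Command"]["CommandId"]


def lambda_handler(event, context):
    import boto3  # AWS SDK, available in the Lambda Python runtime

    target = os.environ["CLEANUP_TARGET_INSTANCE_ID"]  # hypothetical setting
    terminated = event["detail"]["instance-id"]
    return send_cleanup_command(boto3.client("ssm"), target, terminated)
```

Note that `send_command` returns as soon as the command is accepted, so this handler does not know whether the clean-up actually succeeded on the target.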

4. CloudWatch -> SQS -> Lambda -> EC2 Run Command

Clean-up flow:
  1. CloudWatch posts a message to AWS SQS
  2. Lambda fetches the message from SQS and executes EC2 Run Command
  3. Run Command executes a remote command on the desired EC2 machine to perform the clean-up
  4. Lambda deletes the SQS message

Pros:
  • Reliable, as SQS retains messages until the required action is performed
  • Immediate clean-up

Cons:
  1. Lots of moving parts (SQS, Lambda, EC2 Run Command)
  2. Complex

Comments:
This addresses all concerns but involves more AWS services.

You can also eliminate EC2 Run Command and have Lambda perform the clean-up, but that shifts the responsibility of securing the required keys etc. to Lambda, similar to #1.
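
The part that distinguishes approach 4 is that the SQS message is deleted only once Run Command reports success. A hedged sketch of that step follows. `CLEANUP_SCRIPT` is a hypothetical script path on the target server, the message body is assumed to be the CloudWatch event JSON, and `sqs` and `ssm` are expected to be boto3 clients. The polling budget is arbitrary and would need to fit within the Lambda function's timeout.

```python
import json
import shlex
import time

FINAL_STATES = {"Success", "Failed", "Cancelled", "TimedOut"}
CLEANUP_SCRIPT = "/opt/cleanup/cleanup.sh"  # hypothetical script on the target


def wait_for_command(ssm, command_id, instance_id, attempts=20, delay=3.0):
    """Poll Run Command; return the final status, or None if still unfinished."""
    for _ in range(attempts):
        time.sleep(delay)  # also gives a fresh invocation time to register
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if result["Status"] in FINAL_STATES:
            return result["Status"]
    return None


def process_message(sqs, ssm, queue_url, message, target_instance_id, delay=3.0):
    """Run the clean-up for one SQS message; delete it only on success."""
    event = json.loads(message["Body"])
    terminated = event["detail"]["instance-id"]
    command = "%s %s" % (CLEANUP_SCRIPT, shlex.quote(terminated))
    response = ssm.send_command(
        InstanceIds=[target_instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    status = wait_for_command(
        ssm, response["Command"]["CommandId"], target_instance_id, delay=delay
    )
    if status == "Success":
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    # Otherwise the message stays on the queue and becomes visible again later.
    return status
```

Because a failed or unfinished command leaves the message in the queue, the clean-up script should be safe to run more than once for the same instance.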

To mitigate the reliability concerns of these solutions, you can additionally run a batch script that performs regular clean-up irrespective of real-time events from CloudWatch. For example, for Chef node clean-up, a nightly batch job can list all registered Chef nodes, query AWS for the existence of each one, and delete any node whose instance is no longer alive. This ensures that even if you fail to process an instance termination event, the script will perform the required clean-up at a later time.
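
Such a nightly reconciliation job might look roughly like the sketch below. It assumes that Chef node names equal EC2 instance ids, that `knife` is configured where the job runs, and that all instances live in the single account and region of the boto3 EC2 client passed in as `ec2`. If nodes span several regions or accounts, this sketch would wrongly treat the others as stale. All function names are illustrative.

```python
import subprocess

LIVE_STATES = ["pending", "running", "stopping", "stopped"]


def live_instance_ids(ec2):
    """Collect the ids of all instances that have not been terminated."""
    ids = set()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": LIVE_STATES}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.add(instance["InstanceId"])
    return ids


def stale_nodes(chef_nodes, live_ids):
    """Nodes named like instance ids whose instance no longer exists."""
    return sorted(
        node for node in chef_nodes
        if node.startswith("i-") and node not in live_ids
    )


def list_chef_nodes():
    """Return the names of all registered Chef nodes, one per output line."""
    result = subprocess.run(
        ["knife", "node", "list"], check=True, capture_output=True, text=True
    )
    return result.stdout.split()


def delete_chef_node(node_name):
    subprocess.run(["knife", "node", "delete", node_name, "-y"], check=True)


def reconcile(ec2, list_nodes=list_chef_nodes, delete_node=delete_chef_node):
    """Delete Chef nodes without a live instance; return the removed names."""
    removed = stale_nodes(list_nodes(), live_instance_ids(ec2))
    for node in removed:
        delete_node(node)
    return removed
```

Nodes whose names do not start with `i-` are left alone, so manually registered machines are not removed by mistake.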