AWS Cost Optimization – Tools, Tips and Techniques

Your AWS cost depends on the services you use, and the cost of each service in turn depends on many factors. In this blog we will look at a few common workloads, the AWS services they typically use, and how to contain cost for each of them. The AWS Well-Architected Framework has a Cost Optimization pillar that gives high-level guidance and points to several tools for this purpose. We will look at how some of these services are priced. Finally, we will go through some practical tips, based on production experience, for shaving a few dollars off the bill.

Cost Optimization Pillar

Cost optimization is a continual process of refinement and improvement over the span of a workload’s lifecycle. It is based on five design principles:

  • We need to build the new capability of practicing Cloud Financial Management (CFM). It is based on:
    1. Establishing functional ownership of cost optimization
    2. Building a partnership between finance and technology
    3. Defining cloud budgets and forecasts. These must be dynamic to accommodate trends as well as business drivers.
    4. Implementing cost awareness in our organizational processes.
    5. Developing a cost-aware culture.
    6. Quantifying the business value added by cost optimization.
  • We need to understand the organization’s costs and their drivers. Its key factors are:
    1. Governance: this involves developing and enforcing policies, and setting cost and usage goals and targets for the organization. Accounts should be structured as management and member accounts. Groups and roles should be used for the organization’s users. Notifications should be set up when cost or usage violates the set policies.
    2. Monitoring cost and usage: the information should be gathered in the most granular form for effective analysis. Costs should be attributed through all phases, including learning, staff development, and idea creation, and given the business context of the organization. Billing and cost optimization tools should be configured with reports, notifications, current state, trends, forecasts, tracking against goals or targets, and analysis.
    3. Decommissioning: track resources over their entire lifetime and decommission them automatically when no longer needed.
  • We should make use of cost-effective resources. This includes:
    1. Evaluating cost when selecting services. You should:
      1. Identify organization requirements
      2. Analyze all workload components
      3. Use managed services effectively
      4. Make use of serverless or application-level services
      5. Analyze the workload for how its usage varies over time
      6. Try to eliminate licensing costs by making use of open source as much as possible
    2. Selecting the correct resource type, size, and number. This right sizing can be done by:
      1. Performing cost modeling: benchmark the workload under different predicted
        loads and compare the costs. The monitoring must accurately reflect the
        end-user experience, and we should select the granularity and time period of
        analysis required to cover any workload cycles.
      2. Metrics- or data-based selection, depending on resource characteristics.
    3. Selecting the best pricing model by understanding and making use of the AWS pricing models:
      1. On-Demand: the default pay-as-you-go pricing model.
      2. Spot: can offer up to a 90% discount on the regular On-Demand prices above.
      3. Commitment discounts (Reserved Instances and Savings Plans): these can be as deep as 72%.
      4. Geographic selection
      5. Third-party agreements and pricing
    4. Considering data transfer cost in the optimization.
  • Manage demand and supply resources: you can modify the demand using a throttle, buffer, or queue to smooth it and serve it with fewer resources. You can also leverage the elasticity of the cloud (typically Auto Scaling behind Elastic Load Balancing) to supply resources to meet changing demand, or have resources available only at the times your workload’s schedule requires them (see the sketch after this list).
  • Regularly review the workloads and implement new services and features as required.
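
As a minimal sketch of schedule-based supply, the boto3 snippet below adds two scheduled actions to an EC2 Auto Scaling group so capacity follows working hours; the group name, cron schedules (in UTC), and capacity sizes are all illustrative:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Scale up for business hours: 8 AM UTC on weekdays (names and sizes are illustrative).
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="my-web-asg",
        ScheduledActionName="scale-up-business-hours",
        Recurrence="0 8 * * 1-5",
        MinSize=2,
        MaxSize=10,
        DesiredCapacity=4,
    )

    # Scale down to a skeleton crew at 8 PM UTC every day.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="my-web-asg",
        ScheduledActionName="scale-down-off-hours",
        Recurrence="0 20 * * *",
        MinSize=0,
        MaxSize=2,
        DesiredCapacity=1,
    )
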
AWS Tools

To this end, AWS offers numerous tools and services such as:

  • AWS Cost Explorer
  • AWS Trusted Advisor
  • Cost and Usage Report (CUR) in Athena and QuickSight
  • AWS Budgets which can trigger notifications
  • Service Quotas
  • AWS Cost Management
  • AWS Config or AWS Systems Manager, which provide a detailed inventory of your AWS resources and configurations
  • AWS Tag Editor, which allows you to add, delete, and manage tags across multiple resources
  • AWS Cost Categories, which allows you to assign organizational meaning to your costs without requiring tags on resources
  • AWS Compute Optimizer, which can assist with cost modeling for running workloads
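
For example, AWS Budgets can notify you before a monthly cost budget is blown. A minimal boto3 sketch, where the account ID, budget amount, 80% threshold, and email address are all placeholders:

    import boto3

    budgets = boto3.client("budgets")

    budgets.create_budget(
        AccountId="123456789012",  # placeholder account ID
        Budget={
            "BudgetName": "monthly-cost-budget",
            "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                # Email when actual spend crosses 80% of the budget.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
                ],
            }
        ],
    )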

Tips for Different Workloads

Standard Workload

AWS Services used
  • EC2 (Elastic Compute Cloud)
  • EBS (Elastic Block Store)
  • EFS (Elastic File System)
  • RDS (Relational Database Service)
Checks
  • We should monitor idle or underutilized resources and delete or right-size them, respectively.
  • We should monitor usage of CPU, network, memory, and disk to make configuration recommendations (e.g. a lower-memory EC2 instance type, fewer provisioned IOPS on an EBS volume, etc.).
  • Based on the overall CPU, GPU, memory, network, and disk usage we can recommend the optimal EC2 instance type for the workload.
  • We can use tags like schedule or deleteafter. A schedule tag helps keep a VM up only during our working hours, in our time zone, as specified in the schedule. A deleteafter tag helps delete a VM after the specified time, say a month (see the sketch at the end of this checklist).
  • We can implement a timeout based on inactivity, say shutting down a VM if it has not been
    active for the last hour.
  • EC2
    1. Use the latest instance types, e.g. M5 rather than M4. Newer instance types often cost
      less.
    2. Consider using EFS for data storage rather than EBS when higher latency is acceptable.
  • EFS
    1. It is charged by storage size and duration. It has a provision to charge less for data
      that is accessed infrequently.
    2. Storage size comes with a default amount of available throughput. Provision additional throughput if needed.
    3. Configure infrequent-access parameters to reduce the cost of infrequently accessed data.
  • RDS
    1. Provision IOPS based on application requirements.
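
As a sketch of enforcing the deleteafter tag mentioned above, the script below finds instances whose tag date has passed and terminates them. Note that the tag name and its ISO-date value are our own convention, not an AWS feature, and the script would typically run on a daily schedule, e.g. from Lambda:

    import boto3
    from datetime import datetime, timezone

    ec2 = boto3.client("ec2")

    # Find instances carrying our (hypothetical) "deleteafter" tag; pagination omitted.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["deleteafter"]},
            {"Name": "instance-state-name", "Values": ["running", "stopped"]},
        ]
    )["Reservations"]

    now = datetime.now(timezone.utc)
    expired = []
    for res in reservations:
        for inst in res["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            # Tag value is assumed to be an ISO date, e.g. "2024-07-31".
            expiry = datetime.fromisoformat(tags["deleteafter"]).replace(tzinfo=timezone.utc)
            if expiry < now:
                expired.append(inst["InstanceId"])

    if expired:
        ec2.terminate_instances(InstanceIds=expired)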

Containerized applications

Additional Services
  • ECS (Elastic Container Service)
  • EKS (Elastic Kubernetes Service)
Additional checks

Make sure that:

  • Each container is a part of automated orchestration (e.g. ECS, EKS or custom)
  • ECS or EKS is auto scaled (see the sketch after this list)
  • The auto scaling settings of min, desired, and max are optimized
  • If the EC2 instances in an ECS / EKS cluster are often idle, the Fargate launch type is
    considered rather than EC2
  • Stateful / dedicated instances run on Reserved Instances, where we can make a long-term
    commitment
  • Spot Instances are used if no real-time response is required
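
For the auto scaling check, here is a minimal sketch that puts an ECS service’s desired count under target tracking with Application Auto Scaling; the cluster and service names, the capacity bounds, and the 60% CPU target are all illustrative:

    import boto3

    aas = boto3.client("application-autoscaling")

    # Register the service's DesiredCount as a scalable target (names are illustrative).
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/my-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=1,
        MaxCapacity=10,
    )

    # Target tracking: add or remove tasks to keep average CPU around 60%.
    aas.put_scaling_policy(
        PolicyName="cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId="service/my-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 60.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    )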

ML & Data science

Additional services
  • SageMaker
  • Lex
Additional checks

Make sure that:

  • Notebook instances are launched automatically when required and terminated when no longer required
  • Idle endpoints are monitored and optionally terminated (see the sketch below)
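
A minimal sketch of the idle-endpoint check: list the SageMaker endpoints, sum each one’s Invocations metric over the last day, and flag those with zero traffic. The 24-hour window and the AllTraffic variant name are assumptions to adjust, and the actual deletion is left commented out:

    import boto3
    from datetime import datetime, timedelta, timezone

    sm = boto3.client("sagemaker")
    cw = boto3.client("cloudwatch")

    now = datetime.now(timezone.utc)
    # Pagination omitted for brevity.
    for ep in sm.list_endpoints()["Endpoints"]:
        name = ep["EndpointName"]
        stats = cw.get_metric_statistics(
            Namespace="AWS/SageMaker",
            MetricName="Invocations",
            Dimensions=[
                {"Name": "EndpointName", "Value": name},
                {"Name": "VariantName", "Value": "AllTraffic"},  # assumed variant name
            ],
            StartTime=now - timedelta(hours=24),
            EndTime=now,
            Period=86400,
            Statistics=["Sum"],
        )
        if sum(dp["Sum"] for dp in stats["Datapoints"]) == 0:
            print(f"Idle endpoint, candidate for deletion: {name}")
            # sm.delete_endpoint(EndpointName=name)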

Big data

Additional services
  • Athena
  • Glue
  • Redshift, Redshift Spectrum
  • DynamoDB
  • QuickSight
  • Kinesis
  • EMR (Elastic MapReduce)
Additional checks

Make sure that:

  • Athena
    1. It is charged by the amount of data scanned (see the workgroup sketch after this list).
    2. Use limits on the number of rows returned by queries.
    3. Compress data, e.g. by using the Parquet format.
    4. Make use of partitions effectively, as per the most common query patterns.
    5. Preprocess tables to generate smaller intermediate tables that can be returned as
      results of queries.
  • Glue: it is charged by the time for which the crawler runs. Partition data optimally (say
    monthly) – neither too few partitions (say yearly) nor too many (say daily).
  • Redshift, Redshift Spectrum
    1. Redshift is charged by storage and compute.
    2. Make use of the newer RA3 nodes (as opposed to the earlier DS2 ones), which allow scaling
      compute and storage independently.
    3. Redshift Spectrum is a query engine of Redshift that can work with external storage, say
      in S3. Queries that can tolerate higher latencies are a good fit to move to Redshift
      Spectrum, saving on storage cost since S3 storage is cheaper than Redshift storage.
  • Check that the provisioned read and write capacity of all DynamoDB tables is in line with their usage.
  • Kinesis
    1. It is charged by ingest throughput, consumer (fan-out) throughput, and storage size and
      duration.
    2. We can choose the number of shards judiciously to control ingest throughput at our
      price/performance tradeoff point.
    3. Similarly, we can use enhanced fan-out to control consumer throughput.
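
Since Athena is charged by the data it scans, one practical guardrail is a workgroup with a per-query scan cutoff, so a runaway query cannot scan (and bill for) unbounded data. A minimal sketch, where the workgroup name, the roughly 1 GB cutoff, and the results bucket are illustrative:

    import boto3

    athena = boto3.client("athena")

    athena.create_work_group(
        Name="cost-capped",  # illustrative workgroup name
        Configuration={
            # Cancel any query that would scan more than ~1 GB.
            "BytesScannedCutoffPerQuery": 1_000_000_000,
            "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
            # Have queries in this workgroup actually use these settings.
            "EnforceWorkGroupConfiguration": True,
        },
    )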

Serverless

Additional services
  • Lambda
  • S3 (Simple Storage Service)
  • API (Application Programming Interface) Gateway
  • CloudFront
  • Fargate
  • SQS (Simple Queue Service)
Additional checks
  • Lambda
    1. Its charge is based on memory × time (GB-seconds).
    2. It is billed for the memory you configure, irrespective of the actual memory used. Hence,
      we should specify only as much memory as the execution requires. However, more memory can
      also enable parallelism within an invocation, wherever applicable. So, use the memory
      setting judiciously, which might even require some experimentation (see the configuration
      sketch at the end of this section).
    3. For time, it is billed for the actual duration taken and not the timeout specified. Hence
      the timeout value can be a little liberal, to avoid timeout errors.
    4. We can also make use of the provisioned concurrency setting to specify initially warmed-up
      instances. It is typically required by Java implementations, to guard against cold starts.
  • S3
    1. Use storage tiering based on access patterns.
    2. Move infrequently accessed (say once a month) data to S3 Standard-IA. Move rarely accessed
      (once a year) data to Glacier (see the lifecycle sketch at the end of this section).
    3. S3 is additionally charged per request. Hence, aggregate data into S3 objects based on
      your data and access patterns.
  • API Gateway
    1. It is charged based on the number of API calls and the amount of data transferred
      through these APIs.
    2. We can optimize the number of calls by aggregating some of them.
    3. We can evaluate the use of private APIs wherever applicable.
  • Fargate
    1. Do not use Fargate if your resource is going to run 24x7. It is typically useful for
      spiky, intermittent, varying workloads.
  • SQS
    1. Do not poll too frequently, as every request is charged. Prefer long polling (a higher
      WaitTimeSeconds on ReceiveMessage) to cut down on empty receives.
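
As a tiny sketch of the Lambda tuning above: memory and timeout are plain function configuration, set here with boto3. The function name and values are illustrative, and the right memory size usually comes out of experimentation:

    import boto3

    lam = boto3.client("lambda")

    lam.update_function_configuration(
        FunctionName="my-function",  # illustrative name
        MemorySize=512,   # MB; you are billed on this, so right-size it
        Timeout=30,       # seconds; billing uses actual duration, so this can be liberal
    )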
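
And a sketch of the S3 tiering rules: a lifecycle configuration that moves objects to Standard-IA after 30 days and to Glacier after a year. The bucket name and the day thresholds are illustrative; tune them to your access patterns:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-bucket",  # illustrative bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-by-access-pattern",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to the whole bucket
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 365, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )
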
Author
Sameer Mahajan | Principal Architect

Sameer Mahajan has 25 years of experience in the software industry. He has worked for companies like Microsoft and Symantec across areas like machine learning, storage, cloud, big data, networking and analytics in the United States & India.

Sameer holds 9 US patents and is an alumnus of IIT Bombay and Georgia Tech. He not only conducts hands-on workshops and seminars but also participates in panel discussions in upcoming technologies like machine learning and big data. Sameer is one of the mentors for the Machine Learning Foundations course at Coursera.