aws-samples/sample-spot-interruption-insights

Spot Interruption Insights - Serverless Event-Driven Spot Interruption Monitoring and Analytics Dashboard

Introduction

Spot Interruption Insights is a serverless, event-driven monitoring and analytics dashboard designed to provide near real-time visibility into EC2 Spot Instance interruptions. Spot Instances offer significant cost savings, but their unpredictable interruptions can lead to operational challenges. This solution captures Spot Interruption warnings via Amazon EventBridge, routes them through SQS, and processes them with AWS Lambda, storing rich event data in OpenSearch for visualization. This dashboard helps teams understand interruption patterns, improve capacity planning, design fallback mechanisms or diversify workloads more effectively, and make informed workload placement decisions.

Architecture

Diagram

Solution Overview

The architecture leverages a serverless, event-driven approach utilizing AWS native services for robust interruption monitoring. EC2 Spot interruption notices are captured via an Amazon EventBridge rule and routed to an SQS queue for reliable message handling. A Lambda function processes the events, fetching EC2 instance metadata and Auto Scaling Group (ASG) details by making optimized batch calls to the EC2 and Auto Scaling APIs. This design minimizes throttling risks on the control plane APIs, ensuring scalability. The Lambda function is appropriately configured with batching and concurrency limits to prevent overwhelming the API endpoints and the OpenSearch bulk indexing process. After processing, events are bulk-indexed into Amazon OpenSearch Service, enabling near real-time visibility and analytics. A Dead Letter Queue (DLQ) ensures no data is lost in case of failures, while IAM roles enforce least-privilege access between all components.
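The processing path described above can be sketched in a few helper functions. This is a minimal illustration, not the repository's actual Lambda code: the function names are hypothetical, the document fields are assumed to follow the index mapping used later in this guide, and the AWS and OpenSearch calls themselves are left out so the sketch stays self-contained.

```python
import json

# DescribeAutoScalingInstances accepts at most 50 instance IDs per call.
ASG_BATCH_SIZE = 50


def parse_sqs_records(event):
    """Extract interruption notices from EventBridge events wrapped in SQS records."""
    notices = []
    for record in event.get("Records", []):
        # EventBridge delivers the full event JSON as the SQS message body.
        body = json.loads(record["body"])
        notices.append({
            "instance_id": body["detail"]["instance-id"],
            "timestamp": body["time"],
            "region": body["region"],
        })
    return notices


def chunked(items, size):
    """Yield fixed-size slices so each API call carries a bounded batch of IDs."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_body(index_name, documents):
    """Serialize documents into the newline-delimited body of the _bulk API."""
    if not documents:
        return ""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

In a real handler, each slice produced by `chunked(instance_ids, ASG_BATCH_SIZE)` would be passed to one Auto Scaling API call, and the string returned by `build_bulk_body` would be sent to the domain's `_bulk` endpoint in a single request per Lambda batch.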

OpenSearch is deployed within the private subnets of a VPC, ensuring it is not publicly accessible. Access to OpenSearch Dashboards is routed through an Application Load Balancer (ALB) configured with an HTTPS listener, which forwards traffic to an NGINX proxy running on EC2 instances in an Auto Scaling group. This setup provides secure and scalable access. Authentication and authorization are enforced using OpenSearch’s internal user database, ensuring that only authorized users can access the dashboards.

OpenSearch Dashboards visualize interruption metrics, delivering actionable insights to support effective capacity planning and workload placement.

Extensibility and Alternative Analytics Tools

While this solution uses Amazon OpenSearch Service for storing and visualizing Spot Interruption data, the architecture is flexible and can be extended to support other analytics and observability platforms. You can modify the Lambda function to forward data to tools such as Amazon QuickSight, Amazon Timestream, Amazon Redshift, or external services like Datadog, Splunk, or Elastic Cloud, depending on your analytics and compliance needs. This enables teams to use their preferred tooling for building visualizations, setting alerts, or integrating with existing dashboards.

Important Note:

This application uses multiple AWS services, and there are associated costs beyond the Free Tier usage. Please refer to the AWS Pricing page for specific details. You are accountable for any incurred AWS costs. This example solution does not imply any warranty.

Requirements

  • Create an AWS account if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
  • AWS CLI installed and configured
  • Git installed
  • AWS Serverless Application Model (AWS SAM) installed

Deployment Instructions

Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:

git clone https://github.com/aws-samples/sample-spot-interruption-insights

Change directory to the solution directory:

cd sample-spot-interruption-insights

Prerequisites Checklist for Deployment

This section lists the required setup and configurations before deploying the solution stack using AWS SAM.

  • VPC Created - Ensure a VPC exists with DNS hostnames and DNS resolution enabled. You will need the VPC ID during deployment.
  • Public Subnets (2 or more) - Provide the IDs of two or more public subnets in different Availability Zones.
  • Private Subnets (2 or more) - Provide the IDs of two or more private subnets in different Availability Zones.
  • Outbound Internet Access for Private Subnets - Ensure the private subnets have outbound internet access through a NAT Gateway, because the NGINX proxy is installed on EC2 instances in a private subnet.
  • ALB Access - The CIDR IP range allowed to access the ALB (e.g., 1.2.3.4/32). This range controls who can reach the dashboard.
  • Certificate ARN for ALB HTTPS Listener - The ARN of the certificate used to configure the HTTPS listener.
  • AMIId - A valid EC2 AMI ID for the Region.

Deploy the solution - From the command line, use AWS SAM to build and deploy the AWS resources as specified in the template.yml file.

sam build
sam deploy --guided

During the prompts:

  • Stack Name: {Enter your preferred stack name}
  • AWS Region: {Enter your preferred region}
  • Parameter DomainName: {Enter the name of the OpenSearch domain where the index will be created and data will be pushed for analytics}
  • Parameter MasterUsername: {Admin username to login to the OpenSearch dashboard}
  • Parameter MasterUserPassword: {Admin password to login to the OpenSearch dashboard}
  • Parameter IndexName: {Index name where Spot interrupted instance related data will be pushed}
  • Parameter EventRuleName: {Amazon EventBridge rule name to capture EC2 Spot interruption notices}
  • Parameter CustomEventRuleName: {Amazon EventBridge custom rule name to capture EC2 Spot interruption notices. This will be used for verifying the solution}
  • Parameter TargetQueueName: {EventBridge Rule target SQS name}
  • Parameter SQSDLQQueueName: {Target SQS Dead Letter Queue name}
  • Parameter LambdaDLQQueueName: {Lambda Dead Letter Queue name}
  • Parameter VPCId: {Enter the VPCId where the resources will be deployed}
  • Parameter PublicSubnetIds: {Enter 2 or more Public SubnetIDs separated by comma}
  • Parameter PrivateSubnetIds: {Enter 2 or more Private SubnetIDs separated by comma}
  • Parameter RestrictedIPCidr: {IP address/CIDR for restricting ALB access in CIDR format (e.g., x.x.x.x/32)}
  • Parameter CertificateArn: {Certificate ARN for configuring ALB HTTPS Listener}
  • Parameter AMIId: {Valid EC2 AMI ID for the region}
  • Confirm changes before deploy: Y
  • Allow SAM CLI IAM role creation: Y
  • Disable rollback: N
  • Save arguments to configuration file: Y
  • SAM configuration file: {Press enter to use default name}
  • SAM configuration environment: {Press enter to use default name}

Note: The complete solution takes approximately 15-20 minutes to deploy. After the deployment completes, a few manual steps are required for the solution to function as expected. These post-deployment steps are listed below.

Post Deployment Instructions

The following steps need to be performed in the OpenSearch Dashboard after logging in:

Get the DNS name of the Application Load Balancer endpoint from the deployment output section or the ALB console. Access the OpenSearch dashboard using the ALB DNS name as follows:

https://[ALB-DNS-NAME]/_dashboards

You will be redirected to the login page. Log in using the Admin username and password you specified during deployment.

If this is the first time you are logging in, you may see a Welcome screen. Click 'Explore on my own' on the Welcome screen, then click 'Dismiss' on the next screen. If the 'Select your tenant' dialog appears with 'Global' preselected, click 'Confirm'. Otherwise, select 'Global' first and then click 'Confirm'.

Create Index and attribute mapping

  • On the Home screen, open the hamburger menu at the top left and select 'Dev Tools' at the bottom of the menu.

Diagram

In the Dev Tools console, paste the PUT command below and execute it by clicking the 'Click to send request' arrow. Note: The index name must match the IndexName value you entered during deployment. Change the name in the command accordingly before creating the index.

    PUT /spot-interruption-events
    {
      "mappings": {
        "properties": {
          "instance_id": { "type": "keyword" },
          "instance_name": { "type": "keyword" },
          "instance_type": { "type": "keyword" },
          "asg_name": { "type": "keyword" },
          "timestamp": { "type": "date" },
          "region": { "type": "keyword" },
          "availability_zone": { "type": "keyword" },
          "private_ip": { "type": "ip" },
          "public_ip": { "type": "ip" }
        }
      }
    }

Image shared below for reference

Diagram

Confirm that the index was created successfully.

Diagram
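To illustrate what a document for this index might look like, the sketch below maps one item from an EC2 DescribeInstances response onto the mapped fields. It is an assumption-laden example rather than the solution's actual code: `to_document` is a hypothetical helper, and the ASG name, timestamp, and Region are assumed to be supplied by the caller. The input keys (`InstanceId`, `InstanceType`, `Placement`, `PrivateIpAddress`, `PublicIpAddress`, `Tags`) are the ones returned by the EC2 API.

```python
import ipaddress


def to_document(instance, asg_name, timestamp, region):
    """Map one EC2 DescribeInstances item onto the fields of the index mapping."""
    tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
    doc = {
        "instance_id": instance["InstanceId"],
        "instance_name": tags.get("Name", ""),
        "instance_type": instance["InstanceType"],
        "asg_name": asg_name,
        "timestamp": timestamp,  # ISO 8601 string, accepted by the default date format
        "region": region,
        "availability_zone": instance["Placement"]["AvailabilityZone"],
    }
    # Fields of type "ip" reject empty strings, so add addresses only when
    # they are present and well formed.
    for field, key in (("private_ip", "PrivateIpAddress"), ("public_ip", "PublicIpAddress")):
        value = instance.get(key)
        if value:
            ipaddress.ip_address(value)  # raises ValueError on a malformed address
            doc[field] = value
    return doc
```

Leaving out `public_ip` for instances without a public address avoids a mapping error at index time, since an empty string is not a valid value for an `ip` field.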

Create Index Pattern and export

Open the hamburger menu at the top left and select 'Dashboard Management' near the bottom of the menu. Click 'Index Patterns', then click 'Create Index Pattern'.

Diagram

Enter the index pattern name so that it matches the index you created (for example, spot-interruption-events) and click 'Next step'.

Diagram

Select 'timestamp' as the primary time field and click 'Create index pattern'.

Diagram

Click the star icon to make this index pattern the default.

Diagram

Map the Lambda execution role to the 'all_access' role so it can perform operations on the index you created

Copy the Lambda execution role ARN from the Lambda function responsible for pushing Spot interruption event data to the OpenSearch index.

Access the Hamburger Menu icon on the top left and select 'Security' from the bottom menu.

Diagram

Select the 'all_access' role and click the 'Mapped Users' tab.

Diagram

Click 'Manage Mapping'.

Diagram

In the 'Backend roles' field, add the Lambda execution role ARN you copied above and click 'Map'.

Diagram

You can create additional users in the internal user database and grant them appropriate access to the visualizations and dashboards. The following steps create an internal user and a read-only role, then map the role to the user so that the user has read-only access to the Spot Interruption dashboard and visualizations.

Open the hamburger menu at the top left, select 'Security' near the bottom of the menu, select 'Internal Users', and then select 'Create Internal user'.

Diagram

Enter a username and set a password.

Diagram

Open the hamburger menu at the top left, select 'Security' near the bottom of the menu, select 'Roles', and then select 'Create Role'.

Set the Cluster Permissions and Index Permissions as seen in the image.

Diagram

Set the Tenant Permissions as seen in the image.

Diagram

Select the 'Mapped Users' tab and click 'Manage Mapping'.

Diagram

Under 'Users', select the user you created above and click 'Map'.

Diagram

Configure and deploy sample visualisations and dashboard

Sample visualizations and a starter dashboard are provided under the data folder in the file named spot-interruption-dashboard-visualisations.ndjson. To import them:

  • Navigate to Saved Objects under Dashboard Management in OpenSearch Dashboards.

  • Import the spot-interruption-dashboard-visualisations.ndjson file.

  • During the import, you may encounter index pattern conflicts. Select the index pattern you created from the dropdown and click "Confirm all changes".

Diagram

Once imported, the sample visualizations and dashboard linked to your index pattern will be available. You can view the Spot Interruption Dashboard, which includes visualizations based on Availability Zones, Regions, Instance Types, Auto Scaling Groups (ASGs), and Interruptions over time. You can further customize by creating your own visualizations using the attributes available in the index or by editing/creating new dashboards.

Testing

A temporary event rule is created during deployment to match simulated EC2 Spot interruption notices. Its name is the value you specified for the CustomEventRuleName parameter during deployment.

To verify the solution, send a sample event as shown in the image below. Replace the instance-id value with the ID of a real instance that belongs to an ASG.

Diagram
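A test event can also be sent programmatically. The sketch below builds a PutEvents entry whose detail mirrors the shape of a real EC2 Spot Instance Interruption Warning (`instance-id` and `instance-action`). The source and detail-type values are assumptions made for illustration only: the event pattern of the rule named by CustomEventRuleName decides what actually matches, so check that pattern and adjust the values before relying on this. The helper names are hypothetical.

```python
import json

# Hypothetical values: replace them with whatever the custom rule's
# event pattern actually matches.
ASSUMED_SOURCE = "custom.spot.interruption"
ASSUMED_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def build_test_entry(instance_id, source=ASSUMED_SOURCE, detail_type=ASSUMED_DETAIL_TYPE):
    """Build one PutEvents entry that mimics a Spot interruption notice."""
    detail = {"instance-id": instance_id, "instance-action": "terminate"}
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
        "EventBusName": "default",
    }


def send_test_event(instance_id):
    """Send the entry to the default event bus (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_test_entry works without boto3

    response = boto3.client("events").put_events(Entries=[build_test_entry(instance_id)])
    # FailedEntryCount is 0 when EventBridge accepted every entry.
    return response["FailedEntryCount"] == 0
```

Calling `send_test_event` with the ID of an instance that belongs to an ASG should produce one document in the index if the rule matches the entry.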

  • Once the event is sent successfully, you can log in to the OpenSearch Dashboard and view the Spot Interruption Dashboard, where instance-related visualizations should appear. Alternatively, you can navigate to the "Discover" section via the hamburger menu to view the raw event details. Ensure the correct index pattern is selected, and adjust the time range if necessary (e.g., to the last 15 minutes) to view the latest data.

Sample visualisations and Dashboard

Once the OpenSearch Dashboard is set up and the sample visualizations are imported, you can explore the "Spot Interruption Dashboard", which has been pre-built using the indexed event data. This dashboard provides insights across key dimensions such as Availability Zones, Regions, Instance Types, Auto Scaling Groups, and interruption trends over time. Use it as a starting point to understand the kind of insights possible, and feel free to customize or create new visualizations based on the fields available in the index.

Diagram

Security and Cost Optimizations

This solution is designed to be secure and cost-efficient by default, but you can apply further optimizations to reduce cost and enhance security:

Security Best Practices

  • Amazon Cognito Authentication: Integrate Amazon Cognito with OpenSearch Dashboards to manage user authentication, enable MFA, and avoid hardcoding admin credentials.

  • Logging and Threat Detection: Enable AWS CloudTrail and Amazon GuardDuty to monitor for unauthorized activity or anomalies.

Cost Optimizations

  • Bulk Indexing with Throttling Controls: Lambda processes batches and respects throttling limits to avoid excessive OpenSearch usage.

  • Short Retention for CloudWatch Logs: Tune log retention periods to avoid unnecessary storage costs.

  • Optimize Visualizations: Design saved visualizations to avoid expensive queries (like wide time ranges and large aggregations).

  • Index State Management (ISM): Configure ISM policies in OpenSearch to delete or archive older interruption data.
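In OpenSearch, retention rules of this kind are expressed as Index State Management (ISM) policies. The sketch below builds a policy body that deletes indices once they reach a given age. It is an illustration under stated assumptions: `build_retention_policy` is a hypothetical helper, the state names are arbitrary, and the policy removes whole indices, so it only suits a setup where interruption data is written to dated or rolled-over indices, which the deployment described above does not configure by itself.

```python
import json


def build_retention_policy(index_pattern, retain_days):
    """Return an ISM policy body that deletes matching indices after retain_days."""
    return {
        "policy": {
            "description": f"Delete {index_pattern} indices after {retain_days} days",
            "default_state": "active",
            "states": [
                {
                    "name": "active",
                    "actions": [],
                    # Move to the "expired" state once the index is old enough.
                    "transitions": [
                        {
                            "state_name": "expired",
                            "conditions": {"min_index_age": f"{retain_days}d"},
                        }
                    ],
                },
                {"name": "expired", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to new indices matching the pattern.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


if __name__ == "__main__":
    # Paste the printed body into Dev Tools after: PUT _plugins/_ism/policies/<policy-id>
    print(json.dumps(build_retention_policy("spot-interruption-events-*", 90), indent=2))
```

The index pattern and the 90-day retention period are placeholder values; choose them to match your own index naming and retention requirements.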

Conclusion

Spot Interruption Insights gives teams the visibility needed to operate confidently with EC2 Spot Instances. By combining a serverless, event-driven architecture with secure, scalable analytics, this solution enables organizations to monitor interruption events, identify trends, and optimize workload strategies for resilience and cost-efficiency. With near real-time interruption data available, teams can make better-informed infrastructure decisions and maximize the benefits of Spot capacity while minimizing disruption risks.

Cleanup

Run the following command to delete the resources:

sam delete
