This project provides two different approaches for troubleshooting EKS (Elastic Kubernetes Service) issues:
- RAG-based Chatbot: A document retrieval chatbot that uses OpenSearch to retrieve relevant logs and provides intelligent troubleshooting assistance
- Strands-based Agentic Troubleshooting: An intelligent agent using AWS Strands Agent framework with EKS MCP server integration for real-time troubleshooting
Both solutions are deployed with Terraform, which provisions the required AWS resources, including an EKS cluster, monitoring, and application-specific infrastructure.
├── apps/ # Application code
│ ├── agentic-troubleshooting/ # Strands-based agentic troubleshooting agent
│ │ ├── src/agents/ # Strands agent implementations
│ │ ├── src/tools/ # EKS MCP tools integration
│ │ ├── helm/ # Kubernetes deployment charts
│ │ └── main.py # Strands agent entry point
│ └── chatbot/ # RAG-based chatbot with Gradio UI
│ ├── clients/ # OpenSearch, LLM, and K8s clients
│ ├── utils/ # Logging and utilities
│ └── app.py # Gradio application entry point
├── terraform/ # Infrastructure as Code
│ ├── main.tf # Main EKS cluster configuration
│ ├── agentic.tf # Strands agent deployment resources
│ ├── modules/ # Terraform modules
│ ├── variables.tf # Terraform variables
│ └── outputs.tf # Terraform outputs
├── static/ # Static assets
└── demo/ # Demo scripts and manifests
Before running this project, make sure you have the following installed:
- Terraform
- AWS CLI
- Python 3.8+
- Docker (for agentic deployment)
- Helm (for agentic deployment)
- Slack Webhook (Alert Manager notifications):
  - Create an incoming webhook in your Slack workspace
  - Note the webhook URL and target channel name
- Slack Bot Configuration:
  - Create a Slack app with the following Bot Token Scopes:
    - app_mentions:read - View messages mentioning the bot
    - channels:history - View messages in public channels
    - channels:read - View basic channel information
    - chat:write - Send messages as the bot
    - groups:history - View messages in private channels
    - groups:read - View basic private channel information
    - im:history - View direct messages
    - im:read - View basic DM information
  - Event Subscriptions (enable these events):
    - app_mention - Bot mentions
    - message.channels - Channel messages
    - message.groups - Private channel messages
    - message.im - Direct messages
  - Enable Socket Mode for real-time events
  - Note the Bot Token (xoxb-...), App Token (xapp-...), and Signing Secret
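Before copying the Slack credentials into Terraform, a quick local check can catch tokens pasted into the wrong field. The following is a minimal sketch, not part of the repository: the function name `check_slack_tokens` is hypothetical, and it only checks the `xoxb-` and `xapp-` prefixes mentioned above. It does not contact Slack or verify that the tokens are valid.

```python
from typing import Dict, List

# Prefixes taken from the prerequisites above; the variable names match
# the Terraform variables used for the agentic deployment.
EXPECTED_PREFIXES = {
    "slack_bot_token": "xoxb-",
    "slack_app_token": "xapp-",
}


def check_slack_tokens(tokens: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means the prefixes look right."""
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        value = tokens.get(name, "")
        if not value.startswith(prefix):
            problems.append(f"{name} should start with {prefix}")
    return problems
```

An empty list means both tokens carry the expected prefix; any other result names the field to re-check.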
The RAG-based approach uses an ingestion pipeline with Kinesis, OpenSearch, and intelligent troubleshooting capabilities.
- Clone the repository:

      git clone https://github.com/aws-samples/sample-eks-troubleshooting-rag-chatbot
      cd sample-eks-troubleshooting-rag-chatbot

- Configure Terraform variables: Create a terraform/terraform.tfvars file:

      deployment_type = "rag" # This is the default
      slack_webhook_url = "https://hooks.slack.com/services/[YOUR-WEBHOOK]"
      slack_channel_name = "alert-manager-alerts"

- Deploy infrastructure:

      cd terraform/
      ./install.sh

- Deploy test pods and access the chatbot:

      cd ../
      ./provision-delete-error-pods.sh -p db-migration
      # Forward the Chatbot service port to your local machine
      kubectl port-forward -n agentic-chatbot service/agentic-chatbot 7860:7860

- Access the chatbot: Open your browser to http://localhost:7860
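If you prefer to generate the terraform.tfvars file rather than write it by hand, the sketch below shows one way to do it. The helper `render_tfvars` is hypothetical and not part of the repository. It assumes every value is a plain string and relies on JSON string quoting, which matches HCL quoting for simple values but does not escape Terraform template sequences such as `${`.

```python
import json
from typing import Dict


def render_tfvars(values: Dict[str, str]) -> str:
    """Render string variables as `name = "value"` lines, one per variable."""
    # json.dumps adds the surrounding quotes and escapes embedded quotes/backslashes.
    lines = [f"{name} = {json.dumps(value)}" for name, value in values.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values copied from the RAG deployment step.
    print(render_tfvars({
        "deployment_type": "rag",
        "slack_webhook_url": "https://hooks.slack.com/services/[YOUR-WEBHOOK]",
        "slack_channel_name": "alert-manager-alerts",
    }), end="")
```

The output can be redirected to terraform/terraform.tfvars once the placeholder webhook URL has been replaced.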
The agentic approach uses the AWS Strands Agent framework with EKS MCP server integration for intelligent, real-time troubleshooting.
graph LR
U[User] --> S[Slack Interface]
S --> OA[K8s Orchestrator Agent]
OA --> MA[Memory Agent]
OA --> KS[K8s Specialist]
MA --> S3V[S3 Vectors DB]
KS --> EKS[EKS MCP Server]
KS --> K8S[Local K8s Tools]
EKS --> AWS[AWS EKS Cluster]
K8S --> KUBE[Kubernetes API]
style U fill:#1976d2,color:#fff
style S fill:#f57c00,color:#fff
style OA fill:#ff6f00,color:#fff
style MA fill:#7b1fa2,color:#fff
style KS fill:#388e3c,color:#fff
style EKS fill:#d32f2f,color:#fff
style S3V fill:#9c27b0,color:#fff
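The flow in the diagram can be illustrated with a small plain-Python sketch. This is not the repository's implementation and does not use the Strands SDK: all class and function names are hypothetical, a dictionary stands in for the S3 Vectors index, a keyword match stands in for vector search, and the "memory first, specialist second" ordering is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MemoryAgent:
    """Stand-in for the Memory Agent; a dict replaces the S3 Vectors index."""
    known_fixes: Dict[str, str] = field(default_factory=dict)

    def recall(self, question: str) -> Optional[str]:
        # Naive keyword match in place of a vector similarity search.
        lowered = question.lower()
        for keyword, fix in self.known_fixes.items():
            if keyword.lower() in lowered:
                return fix
        return None


@dataclass
class K8sSpecialist:
    """Stand-in for the K8s Specialist; tools are plain callables."""
    tools: List[Callable[[str], Optional[str]]]

    def investigate(self, question: str) -> str:
        # Try each tool in order and return the first finding.
        for tool in self.tools:
            finding = tool(question)
            if finding is not None:
                return finding
        return "No findings; more cluster context is needed."


@dataclass
class Orchestrator:
    memory: MemoryAgent
    specialist: K8sSpecialist

    def handle(self, question: str) -> str:
        # Assumed ordering: consult stored knowledge before live investigation.
        remembered = self.memory.recall(question)
        if remembered is not None:
            return f"[memory] {remembered}"
        return f"[specialist] {self.specialist.investigate(question)}"


def fake_pod_status_tool(question: str) -> Optional[str]:
    """Hypothetical tool; a real one would query the cluster."""
    if "crashloop" in question.lower():
        return "Pod is restarting; inspect the previous container logs."
    return None
```

In the real system the specialist's tools would be the EKS MCP server and local Kubernetes tools shown in the diagram; here they are stubbed so the routing logic can be read in isolation.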
- Set your AWS region:

      export AWS_REGION="us-east-1" # Change to your preferred region

- Create ECR repository manually:

      # Create ECR repository
      aws ecr create-repository --repository-name eks-llm-troubleshooting-agentic-agent --region $AWS_REGION
      # Get the repository URI
      export ECR_REPO_URL=$(aws ecr describe-repositories --repository-names eks-llm-troubleshooting-agentic-agent --region $AWS_REGION --query 'repositories[0].repositoryUri' --output text)
      echo "ECR Repository URL: $ECR_REPO_URL"

- Create S3 vector bucket and index:

      # Create S3 vector bucket with unique name
      export VECTOR_BUCKET="eks-llm-troubleshooting-vector-storage-$(date +%s)"
      aws s3vectors create-vector-bucket \
        --vector-bucket-name $VECTOR_BUCKET \
        --region $AWS_REGION
      # Create S3 Vectors index with 1024 dimensions
      aws s3vectors create-index \
        --vector-bucket-name $VECTOR_BUCKET \
        --index-name "k8s-troubleshooting" \
        --dimension 1024 \
        --data-type float32 \
        --distance-metric cosine \
        --region $AWS_REGION
      echo "Vector bucket: $VECTOR_BUCKET"
      echo "Index name: k8s-troubleshooting"

- Build and push the Docker image:

      cd apps/agentic-troubleshooting/
      # Login to ECR
      aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ECR_REPO_URL
      # Build and tag the image
      docker build --platform linux/amd64 -t $ECR_REPO_URL .
      # Push to ECR
      docker push $ECR_REPO_URL

- Configure Terraform variables: Create a terraform/terraform.tfvars file (replace with your actual values):

      deployment_type = "agentic"
      agentic_image_repository = "your-account.dkr.ecr.us-east-1.amazonaws.com/eks-llm-troubleshooting-agentic-agent"
      agentic_image_tag = "latest"
      slack_webhook_url = "https://hooks.slack.com/services/[YOUR-WEBHOOK]"
      slack_channel_name = "alert-manager-alerts"
      slack_bot_token = "xoxb-your-bot-token"
      slack_app_token = "xapp-your-app-token"
      slack_signing_secret = "your-signing-secret"
      bedrock_model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
      vector_bucket_name = "eks-llm-troubleshooting-vector-storage-1234567890" # Use the bucket created above
      vector_index_name = "k8s-troubleshooting"

- Deploy infrastructure:

      cd terraform/
      terraform init
      terraform apply -auto-approve
The agentic deployment will automatically:
- Create IAM roles with EKS MCP permissions
- Set up Pod Identity associations
- Deploy the Helm chart with the troubleshooting agent
- Configure Slack integration
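Two values from the steps above end up in terraform.tfvars: the ECR repository URI and the timestamped vector bucket name. The sketch below shows how both are composed, which can help when filling in the file by hand. The helper names are hypothetical, and the URI pattern assumes the standard commercial AWS partition (`amazonaws.com`).

```python
import time
from typing import Optional


def ecr_repository_uri(account_id: str, region: str, repository: str) -> str:
    """Compose the URI that `aws ecr describe-repositories` reports for a repository."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}"


def unique_bucket_name(prefix: str, now: Optional[float] = None) -> str:
    """Mirror the shell pattern "<prefix>-$(date +%s)" using whole Unix seconds."""
    timestamp = int(time.time() if now is None else now)
    return f"{prefix}-{timestamp}"
```

Passing `now` explicitly makes the bucket name reproducible, which is convenient when the same name has to appear in both the CLI commands and the tfvars file.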
- Log processing pipeline with OpenSearch indexing
- Semantic search through log data with Gradio interface
- Intelligent troubleshooting with kubectl command execution
- Multi-agent orchestration with EKS MCP integration
- S3 Vectors storage for tribal knowledge
- Slack bot integration with Pod Identity security
- Real-time cluster monitoring and troubleshooting
- deployment_type: "rag" (default) or "agentic"
- name: Project name (default: "eks-llm-troubleshooting")
- slack_webhook_url: Slack webhook for alerts (both deployments)
- slack_channel_name: Slack channel name (both deployments)
- agentic_image_repository: ECR repository for agent image (Agentic only)
- slack_bot_token: Slack bot token (Agentic only)
- bedrock_model_id: Bedrock model identifier (Agentic only)
- vector_bucket_name: S3 vector bucket name (Agentic only)
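A small pre-flight check can flag variables that are missing for the chosen deployment type before running Terraform. The sketch below is hypothetical and not part of the repository. It assumes that every variable listed above must be set explicitly; in practice some may have defaults in variables.tf, so treat the result as a reminder rather than a hard failure.

```python
from typing import Dict, List

# Variable names taken from the list above.
COMMON = ("slack_webhook_url", "slack_channel_name")
AGENTIC_ONLY = (
    "agentic_image_repository",
    "slack_bot_token",
    "bedrock_model_id",
    "vector_bucket_name",
)


def missing_variables(values: Dict[str, str]) -> List[str]:
    """Return the names of expected variables that are absent or empty."""
    # "rag" is the documented default when deployment_type is not set.
    deployment_type = values.get("deployment_type", "rag")
    if deployment_type not in ("rag", "agentic"):
        raise ValueError(f"unknown deployment_type: {deployment_type!r}")
    expected = COMMON + (AGENTIC_ONLY if deployment_type == "agentic" else ())
    return [name for name in expected if not values.get(name)]
```

An empty list means every variable expected for that deployment type has a non-empty value.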
- Deploy problematic pods:
./provision-delete-error-pods.sh -p db-migration
- Access Gradio interface at
http://localhost:7860/
- Query the chatbot about EKS issues
See Demo Script for complete testing instructions and example scenarios.
- Destroy infrastructure:

      cd terraform/
      terraform destroy --auto-approve

- Clean up additional resources (Agentic only):

      # Delete ECR repository
      aws ecr delete-repository --repository-name eks-llm-troubleshooting-agentic-agent --force --region $AWS_REGION
      # Delete S3 vector bucket (if created)
      aws s3vectors delete-index --vector-bucket-name $VECTOR_BUCKET --index-name k8s-troubleshooting --region $AWS_REGION
      aws s3vectors delete-vector-bucket --vector-bucket-name $VECTOR_BUCKET --region $AWS_REGION
- EKS Cluster with Fluent Bit → Kinesis → OpenSearch → Gradio Interface
- Bedrock for AI-powered responses
- Multi-agent system with EKS MCP integration
- S3 Vectors for knowledge storage
- Slack bot with Pod Identity security
- Access Denied (Bedrock): Ensure your AWS account has access to the specified Bedrock model
- Image Pull Errors: Verify ECR repository exists and credentials are correct
- Slack Integration: Check bot tokens and permissions
- Pod Identity: Ensure EKS Pod Identity Agent is enabled
This project uses:
- Gradio for the user interface
- Terraform AWS EKS Blueprints for infrastructure
- AWS Strands Agent Framework for multi-agent orchestration (Agentic deployment)
- EKS MCP Server for Kubernetes integration via Model Context Protocol (Agentic deployment)
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.