A Slack bot that enables self-service production diagnostics by integrating Claude Code CLI with Kubernetes, Loki, and observability tools. The bot provides autonomous investigation capabilities through YAML-based templates and custom MCP tools, delivering professional PDF reports directly in Slack.
- Claude Code Integration: Uses Claude Code CLI with MCP server for advanced tool use capabilities
- MCP Server: Custom tools for Loki queries, K8s access, GitHub integration, ECR scanning, and PDF generation
- Third-Party API Integration: Configuration-driven system for adding read-only API integrations via YAML, with no Go code required
- Investigation Skills: YAML-based templates define structured investigation workflows
- Kubernetes Access: Read-only access to pods, logs, ConfigMaps, Deployments, and Flux CRDs
- PDF Report Generation: Automated professional reports with company branding via Pandoc + LaTeX
- Log Sanitization: Comprehensive PII and secret redaction (13 regex patterns)
- Slack Socket Mode: Supports app mentions, threads, and conversational follow-ups
- Conversation State: In-memory tracking with 24-hour expiry and automatic file cleanup
- Prometheus Metrics: Full observability with counters for investigations, API calls, K8s queries, and tool usage
- Linter Compliance: 100% clean - all golangci-lint and namedreturns checks pass
- Test Coverage: Unit tests with race detection, coverage reporting to Codecov
- CI/CD: Automated linting, testing, Docker builds, and security scanning via GitHub Actions
- Multi-arch Support: Docker images for amd64 and arm64
- Security Scanning: Trivy vulnerability scanning with SARIF upload
diagnostic-slackbot/
├── cmd/
│   ├── bot/main.go                 # Main Slack bot entry point
│   └── mcp-server/main.go          # MCP server for Claude Code tools
├── pkg/
│   ├── bot/                        # Slack bot logic
│   │   ├── bot.go                  # Core bot initialization
│   │   ├── handlers.go             # Event handlers
│   │   ├── claudecode.go           # Claude Code CLI integration
│   │   ├── tools.go                # Dynamic tool availability config
│   │   └── state.go                # Conversation state management
│   ├── claude/                     # Direct Claude API integration (legacy)
│   │   ├── client.go               # API client with tool use support
│   │   ├── tools.go                # Tool definitions for Claude
│   │   └── prompts.go              # System prompt construction
│   ├── investigations/             # Investigation skill system
│   │   ├── template.go             # Skill data structures and loading
│   │   └── matcher.go              # Message matching logic
│   ├── k8s/                        # Kubernetes and Loki clients
│   │   ├── agent.go                # K8s resource access
│   │   ├── loki.go                 # Loki log query client
│   │   └── sanitizer.go            # Log sanitization
│   ├── apiconfig/                  # Third-party API integration framework
│   │   ├── config.go               # YAML config schema and loader
│   │   ├── client.go               # Generic HTTP client (auth, retry, rate limiting)
│   │   ├── validate.go             # Path parameter validation (traversal, injection)
│   │   ├── redact.go               # PII field redaction from JSON responses
│   │   └── tools.go                # MCP tool generation and dispatch
│   ├── mcp/                        # MCP server implementation
│   │   ├── server.go               # MCP protocol, tool registration, and handlers
│   │   ├── types.go                # MCP protocol types
│   │   ├── http_server.go          # HTTP/SSE transport server
│   │   ├── cloudwatch.go           # CloudWatch Logs tools
│   │   ├── prometheus.go           # Prometheus/PromQL tools
│   │   ├── grafana.go              # Grafana dashboard tools
│   │   ├── database.go             # Database query tools
│   │   ├── ecr.go                  # ECR vulnerability scanning
│   │   └── auth/                   # MCP server authentication
│   └── metrics/                    # Prometheus metrics
│       ├── metrics.go              # Metric definitions
│       └── server.go               # HTTP metrics server
├── apis/                           # Third-party API configs (YAML)
│   └── bitgo.yaml                  # Example: BitGo read-only wallet API
├── investigations/                 # YAML investigation skills
│   ├── modsecurity-block.yaml
│   ├── atlas-migration.yaml
│   ├── general-diagnostic.yaml
│   └── ecr-vulnerability-scan.yaml
├── latex-templates/                # PDF report templates
├── docs/
│   ├── ECR_INTEGRATION.md          # ECR vulnerability scanning guide
│   └── PROJECT_SPEC.md             # Original project specification
├── .github/workflows/              # CI/CD pipelines
│   ├── ci.yaml                     # Lint, test, build, Docker push
│   └── release.yaml                # GoReleaser for versioned releases
├── Dockerfile                      # Multi-stage Alpine build
├── Makefile                        # Build and lint targets
├── .golangci.yml                   # Linter configuration (strict enforcement)
├── .mcp.json                       # MCP server configuration
└── go.mod                          # Go module dependencies
The bot includes example investigation skills in the investigations/ directory. These are generic examples with substitution variables for adaptation to your environment.
Investigates Web Application Firewall blocks with the following capabilities:
- Queries Loki for ModSecurity audit logs
- Analyzes OWASP CRS rule triggers
- Categorizes blocks (false positive vs legitimate)
- Provides exact remediation with rule IDs and config snippets
- Performs IP geolocation via whois
Trigger patterns: modsec, modsecurity, waf blocked, 403.*waf
Diagnoses why Atlas migrations are not being applied in GitOps environments:
- Checks AtlasMigration CRD status
- Verifies GitRepository version pinning
- Inspects ConfigMap contents
- Analyzes Flux Kustomization reconciliation
- Identifies root cause (tag pinning, ConfigMap not regenerated, etc.)
Trigger patterns: atlas.*migration, migration.*not.*applied, migration.*failure
A systematic approach for general production issues:
- Defines investigation phases (scope, context gathering, hypothesis formation, targeted investigation, root cause analysis, remediation)
- Provides examples for querying Prometheus, Thanos, Loki, and Alertmanager
- Covers common patterns (WAF blocks, database migrations, Flux failures, pod crashes)
- Emphasizes security-first and GitOps principles
Trigger patterns: investigate, diagnostic, troubleshoot, issue, problem, error
Investigations are defined in YAML files with the following structure:
name: "Investigation Name"
description: "Brief description of what this investigation does"
trigger_patterns:
  - "pattern1"
  - "pattern2.*regex"
  - "specific phrase"
initial_prompt: |
  Multi-line prompt that Claude will use as context for this investigation.
  This should include:
  - Role definition ("You are a diagnostic agent for...")
  - Core principles (security, read-only, systematic approach)
  - Investigation methodology (phases, steps)
  - Available tools and when to use them
  - Output format expectations
  - Critical patterns to recognize
  - Communication style guidelines
kubernetes_resources:
  - type: "pod"
    namespace: "namespace-name"
    selector: "app=component"
    container: "container-name"
    since: "1h"
    grep: "ERROR"
    tail_lines: 500
require_approval: false

- name: Human-readable name shown in Slack
- description: Brief description (prefix with "EXAMPLE:" if this is a template for customization)
- trigger_patterns: Regex patterns matched against user messages. More specific patterns (longer solid pattern length) take precedence.
- initial_prompt: The system prompt Claude receives. This is the core of your investigation workflow.
- kubernetes_resources: (Optional) List of K8s resources to pre-fetch. Useful for common queries.
- require_approval: (Optional) If true, bot asks user for confirmation before starting investigation.
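
As a rough illustration of how these fields might be represented and checked in Go, the sketch below defines a hypothetical `Investigation` struct with a `Validate` method that compiles each trigger pattern. The type, the method, and the choice of which fields count as required are assumptions made for this example, not the definitions in pkg/investigations/template.go. YAML decoding is left out so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Investigation mirrors the documented YAML fields. The struct is
// hypothetical; the bot's real types live in pkg/investigations.
type Investigation struct {
	Name            string
	Description     string
	TriggerPatterns []string
	InitialPrompt   string
	RequireApproval bool
}

// Validate checks the fields this sketch treats as required and confirms
// that every trigger pattern compiles as a Go regular expression.
func (inv Investigation) Validate() error {
	if inv.Name == "" {
		return errors.New("name is required")
	}
	if inv.InitialPrompt == "" {
		return errors.New("initial_prompt is required")
	}
	if len(inv.TriggerPatterns) == 0 {
		return errors.New("at least one trigger pattern is required")
	}
	for _, p := range inv.TriggerPatterns {
		if _, err := regexp.Compile(p); err != nil {
			return fmt.Errorf("trigger pattern %q: %w", p, err)
		}
	}
	return nil
}

func main() {
	inv := Investigation{
		Name:            "Investigation Name",
		TriggerPatterns: []string{"pattern1", "pattern2.*regex"},
		InitialPrompt:   "You are a diagnostic agent for...",
	}
	if err := inv.Validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("valid")
}
```

Compiling patterns at load time surfaces a malformed regex when the file is read rather than when a user message arrives.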
Use {{VARIABLE_NAME}} syntax for environment-specific values that need to be customized for production deployment:
initial_prompt: |
  Query Prometheus at {{PROMETHEUS_URL}} for metrics.
  Check Loki at {{LOKI_URL}} for logs.
  The application runs in namespace {{APP_NAMESPACE}}.

Common substitution variables:
- {{PROMETHEUS_URL}} - Prometheus endpoint
- {{THANOS_URL}} - Thanos query endpoint (federated metrics)
- {{LOKI_URL}} - Loki log aggregation endpoint
- {{ALERTMANAGER_URL}} - Alertmanager endpoint
- {{REALM}} - Environment name (prod, staging, dev)
- {{NAMESPACE}} - Kubernetes namespace
- {{DOMAIN}} - Application domain
- {{GH_ORG}} - GitHub organization
- {{GH_REPO}} - GitHub repository name
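
A minimal Go sketch of the substitution step follows, in case you prefer a small helper to hand-editing. The `substitute` function and the rule that variable names are upper-case letters, digits, and underscores are assumptions for illustration; the bot itself is not documented as performing this substitution. It also reports any placeholders left unresolved, which is a cheap pre-deployment check.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// placeholder matches {{VARIABLE_NAME}} tokens. The allowed character set
// is an assumption based on the variable names listed in this README.
var placeholder = regexp.MustCompile(`\{\{([A-Z0-9_]+)\}\}`)

// substitute replaces every placeholder that has a value and returns the
// sorted names of placeholders that were left untouched.
func substitute(text string, values map[string]string) (string, []string) {
	missing := map[string]bool{}
	out := placeholder.ReplaceAllStringFunc(text, func(tok string) string {
		name := tok[2 : len(tok)-2] // strip the surrounding braces
		if v, ok := values[name]; ok {
			return v
		}
		missing[name] = true
		return tok
	})
	names := make([]string, 0, len(missing))
	for n := range missing {
		names = append(names, n)
	}
	sort.Strings(names)
	return out, names
}

func main() {
	prompt := "Check Loki at {{LOKI_URL}} for logs in {{APP_NAMESPACE}}."
	out, missing := substitute(prompt, map[string]string{
		"LOKI_URL": "http://localhost:3100",
	})
	fmt.Println(out)
	fmt.Println("unresolved:", missing)
}
```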
The bot uses a specificity algorithm to select the best matching investigation:
- Patterns are tested in order against the user's message
- If multiple patterns match, the most specific wins
- Specificity = length of longest solid pattern (non-regex characters)
Examples:
- "atlas.*migration" → specificity = 5 (longest solid: "atlas")
- "migration.*not.*applied" → specificity = 9 (longest solid: "migration")
- "database migration failure" → specificity = 26 (entire phrase is solid)
Best practices:
- Use specific phrases for narrow investigations
- Use regex patterns for flexibility within a domain
- Longer solid patterns win ties
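
One possible reading of the specificity rule, sketched in Go: score each pattern by its longest run of characters that are not regex metacharacters, then pick the matching pattern with the highest score. The function names and the metacharacter set are assumptions, and this is not the code in pkg/investigations/matcher.go. Under this reading `atlas.*migration` scores 9 (for "migration") rather than the 5 listed in the examples, so the bot's actual scoring may count differently, for instance by using only the leading run.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// longestLiteralRun returns the length of the longest run of characters
// that are not regex metacharacters (an assumed definition of "solid").
func longestLiteralRun(pattern string) int {
	best, cur := 0, 0
	for _, r := range pattern {
		if strings.ContainsRune(`.*+?()[]{}|^$\`, r) {
			cur = 0
			continue
		}
		cur++
		if cur > best {
			best = cur
		}
	}
	return best
}

// bestMatch returns the pattern that matches the message and has the
// highest specificity, or "" if nothing matches. Invalid patterns are skipped.
func bestMatch(message string, patterns []string) string {
	winner, top := "", -1
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil || !re.MatchString(message) {
			continue
		}
		if s := longestLiteralRun(p); s > top {
			winner, top = p, s
		}
	}
	return winner
}

func main() {
	patterns := []string{"migration.*failure", "database migration failure"}
	fmt.Println(bestMatch("database migration failure in prod", patterns))
}
```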
Your investigation prompt should reference these MCP tool names that Claude can autonomously use. Tool availability is dynamic: only tools with configured backing services are registered and shown in the prompt.
CRITICAL: Investigation skills MUST reference MCP tool names (e.g.,
cloudwatch_logs_query), NOT external CLI commands (e.g., aws logs start-query). Claude Code runs in
Logging (Loki) - requires LOKI_ENDPOINT:
- query_loki - Query Loki for cluster logs using LogQL syntax
  - Parameters: query, start, end (optional), limit (optional)

CloudWatch Logs - requires CLOUDWATCH_ACCOUNTS or CLOUDWATCH_ASSUME_ROLE:
- cloudwatch_logs_query - Execute CloudWatch Logs Insights queries across log groups
  - Parameters: query, log_groups, start_time, end_time (optional), region (optional), limit (optional), accounts (optional)
- cloudwatch_logs_list_groups - List available CloudWatch log groups
  - Parameters: prefix (optional), region (optional), limit (optional), accounts (optional)
- cloudwatch_logs_get_events - Get log events from a specific log stream
  - Parameters: log_group, log_stream, start_time (optional), end_time (optional), limit (optional), accounts (optional)

When CLOUDWATCH_ACCOUNTS is configured with multiple accounts, the accounts parameter filters which accounts to query. If omitted, all configured accounts are queried and results are labeled per account.
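
The account filtering described above could look roughly like the following Go sketch. The `selectAccounts` helper is hypothetical, and returning an error for an unknown account name is an assumption; the real tool may simply skip names it does not recognize.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// selectAccounts parses a CLOUDWATCH_ACCOUNTS-style JSON map (friendly
// name -> role ARN) and returns the requested subset. An empty request
// selects every configured account.
func selectAccounts(raw string, requested []string) (map[string]string, error) {
	all := map[string]string{}
	if err := json.Unmarshal([]byte(raw), &all); err != nil {
		return nil, fmt.Errorf("parse accounts: %w", err)
	}
	if len(requested) == 0 {
		return all, nil
	}
	picked := make(map[string]string, len(requested))
	for _, name := range requested {
		arn, ok := all[name]
		if !ok {
			// Assumption: unknown names are rejected rather than ignored.
			return nil, fmt.Errorf("unknown account %q", name)
		}
		picked[name] = arn
	}
	return picked, nil
}

func main() {
	raw := `{"dev":"arn:aws:iam::111:role/dev-reader","prod":"arn:aws:iam::222:role/prod-reader"}`
	accounts, err := selectAccounts(raw, []string{"prod"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for name, arn := range accounts {
		fmt.Printf("%s -> %s\n", name, arn)
	}
}
```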
Prometheus/Metrics - requires PROMETHEUS_URL or PROMETHEUS_<NAME>_URL:
- prometheus_query - Execute an instant PromQL query
- prometheus_query_range - Execute a range PromQL query for trend analysis
- prometheus_series - Find time series matching label selectors
- prometheus_label_values - Get all values for a given label name
- prometheus_list_endpoints - List configured Prometheus endpoints

Grafana - requires GRAFANA_URL + GRAFANA_API_KEY:
- grafana_list_dashboards - List all Grafana dashboards
- grafana_get_dashboard - Get a specific dashboard by UID
- grafana_create_dashboard - Create a new dashboard from queries
- grafana_update_dashboard - Update an existing dashboard
- grafana_delete_dashboard - Delete a dashboard

Database - requires DATABASE_URL or DATABASE_<NAME>_URL:
- database_query - Execute read-only SQL queries (SELECT, SHOW, DESCRIBE, EXPLAIN)
- database_list - List available databases

GitHub - requires GITHUB_TOKEN:
- github_get_file - Fetch a file from a GitHub repository
- github_list_directory - List files in a repository directory
- github_search_code - Search code across repositories

ECR (Container Security) - requires AWS_REGION or AWS_DEFAULT_REGION:
- ecr_scan_results - Query ECR for container image vulnerability scan results

Third-Party APIs - dynamically loaded from apis/ directory YAML configs:
- Tools are named {api_name}_{endpoint_name} (e.g., bitgo_list_wallets)
- Only available when the API's auth token env var is set
- See Third-Party API Integration for details

Utilities - always available:
- whois_lookup - IP geolocation, ISP, ASN lookup
- generate_pdf - Generate a PDF report from Markdown content (auto-uploaded to Slack)

Note: The legacy direct K8s tool calls (get_k8s_pod_logs, get_k8s_resource, etc.) are deprecated in favor of the MCP server architecture. Investigation templates should use the MCP tools above.
WARNING: Always use MCP tool names in your prompts. Claude Code runs in
aws logs start-query" or "use kubectl get pods", Claude will either fail silently or hallucinate output. Instead, say "use cloudwatch_logs_query" or "use query_loki". See the tool list above.
Your initial_prompt should:
- Define the role clearly: "You are a diagnostic agent for X platform investigating Y issues..."
- Set boundaries: Emphasize read-only access, security considerations, GitOps principles
- Provide methodology: Step-by-step investigation phases or decision trees
- Reference MCP tools by name: Tell Claude which MCP tools to use and when. Use the exact tool names from the "Available Claude Tools" section above (e.g., cloudwatch_logs_query, query_loki, prometheus_query). Never reference external CLIs like aws, kubectl, psql, etc.
- Define output format: Specify structure (Summary, Timeline, Root Cause, Remediation, Prevention)
- Include critical patterns: Known issues, false positive indicators, common root causes
- Set communication style: Technical depth, conciseness, use of severity labels
- Document substitution variables: List all {{VARIABLES}} at the end with descriptions

- Create your investigation YAML in investigations/my-investigation.yaml
- Substitute variables for local testing:

      # Use sed or your editor to replace {{VARIABLES}} with actual values
      sed -e 's|{{LOKI_URL}}|http://localhost:3100|g' \
          -e 's|{{NAMESPACE}}|default|g' \
          investigations/my-investigation.yaml > /tmp/test-investigation.yaml

- Run validation tests:

      # Tests validate YAML structure and pattern matching
      go test ./pkg/investigations/... -v

- Test with bot locally:

      export INVESTIGATION_DIR=/tmp
      export SLACK_BOT_TOKEN=xoxb-your-token
      export SLACK_APP_TOKEN=xapp-your-token
      export ANTHROPIC_API_KEY=sk-your-key
      export LOKI_ENDPOINT=http://localhost:3100
      go run cmd/bot/main.go

- Send a test message in Slack that matches your trigger patterns

For production deployment using Vault-backed secrets:
- Substitute production values in your investigation YAML (replace all {{VARIABLES}})
- Store in Vault (example for HashiCorp Vault):

      vault kv put infra/diagnostic-slackbot-inv-myinvestigation \
        my-investigation.yaml=@investigations/my-investigation.yaml

- Create VaultStaticSecret manifest:

      apiVersion: secrets.hashicorp.com/v1beta1
      kind: VaultStaticSecret
      metadata:
        name: diagnostic-slackbot-inv-myinvestigation
        namespace: diagnostic-slackbot
      spec:
        type: kv-v2
        mount: infra
        path: diagnostic-slackbot-inv-myinvestigation
        destination:
          name: diagnostic-slackbot-inv-myinvestigation
          create: true
        refreshAfter: 30s
        vaultAuthRef: diagnostic-slackbot
        rolloutRestartTargets:
          - kind: Deployment
            name: diagnostic-slackbot

- Mount secret in Deployment:

      volumeMounts:
        - name: inv-myinvestigation
          mountPath: /app/investigations/my-investigation.yaml
          subPath: my-investigation.yaml
          readOnly: true
      volumes:
        - name: inv-myinvestigation
          secret:
            secretName: diagnostic-slackbot-inv-myinvestigation

- Apply with GitOps: commit the manifests and let Flux reconcile

Let's create an investigation for Nginx ingress controller issues:
name: "Nginx Ingress Troubleshooting"
description: "Diagnoses Nginx ingress controller issues (502/504, SSL, routing)"
trigger_patterns:
  - "nginx.*ingress"
  - "502.*bad.*gateway"
  - "504.*gateway.*timeout"
  - "ingress.*not.*working"
initial_prompt: |
  You are diagnosing Nginx ingress controller issues in a Kubernetes environment.

  ## Investigation Steps

  1. **Check ingress controller pods**:
     - Use list_k8s_pods with namespace={{INGRESS_NAMESPACE}}, selector="app.kubernetes.io/name=ingress-nginx"
     - Look for restarts, OOMKilled, CrashLoopBackOff
  2. **Query recent logs**:
     - Use query_loki with: {namespace="{{INGRESS_NAMESPACE}}"} |= "error" or "warn"
     - Time range: last 1 hour
     - Look for upstream errors, SSL handshake failures, timeout messages
  3. **Check ingress resource**:
     - Use get_k8s_resource for the specific Ingress
     - Verify: backend service name, port, TLS config, annotations
  4. **Check backend pods**:
     - Use list_k8s_pods for backend service namespace
     - Verify pods are Running and Ready
  5. **Analyze error patterns**:
     - 502: Backend not responding (pod down, wrong service/port)
     - 504: Backend timeout (slow response, deadlock)
     - SSL errors: Certificate issues, TLS version mismatch

  ## Output Format

  [One-line description]
  [What the user is experiencing]
  [Specific cause identified]
  [Exact steps to fix with commands/configs]

  ## Substitution Variables

  - {{INGRESS_NAMESPACE}}: Namespace where ingress controller runs (e.g., "ingress-nginx")
kubernetes_resources:
  - type: "pod"
    namespace: "{{INGRESS_NAMESPACE}}"
    selector: "app.kubernetes.io/name=ingress-nginx"
    container: "controller"
    since: "1h"
    grep: "error"
    tail_lines: 200
require_approval: false

After creating this, substitute {{INGRESS_NAMESPACE}} with your actual namespace before deploying to production.
The bot uses Claude Code CLI with a custom MCP (Model Context Protocol) server that provides tools for autonomous investigation. Tool availability is dynamic: only tools backed by configured services are registered.
| Category | Env Var Required | Tools |
|---|---|---|
| Loki (Logging) | LOKI_ENDPOINT | query_loki |
| CloudWatch Logs | CLOUDWATCH_ACCOUNTS or CLOUDWATCH_ASSUME_ROLE | cloudwatch_logs_query, cloudwatch_logs_list_groups, cloudwatch_logs_get_events |
| Prometheus | PROMETHEUS_URL or PROMETHEUS_<NAME>_URL | prometheus_query, prometheus_query_range, prometheus_series, prometheus_label_values, prometheus_list_endpoints |
| Grafana | GRAFANA_URL + GRAFANA_API_KEY | grafana_list_dashboards, grafana_get_dashboard, grafana_create_dashboard, grafana_update_dashboard, grafana_delete_dashboard |
| Database | DATABASE_URL or DATABASE_<NAME>_URL | database_query, database_list |
| GitHub | GITHUB_TOKEN | github_get_file, github_list_directory, github_search_code |
| ECR | AWS_REGION or AWS_DEFAULT_REGION | ecr_scan_results |
| Third-Party APIs | Per-API token env var (e.g., BITGO_ACCESS_TOKEN) | Dynamically generated from YAML configs in apis/ directory |
| Utilities | (always available) | whois_lookup, generate_pdf |

See pkg/bot/tools.go for the env var detection logic and pkg/mcp/server.go for conditional tool registration.
The MCP server has two transport modes:
- Stdio (for Claude Code subprocess): Claude Code runs in --print mode, which does not load MCP servers registered via claude mcp add. Instead, the bot passes --mcp-config with the /app/mcp-server binary using stdio transport. This spawns a dedicated MCP server process per investigation.
- HTTP/SSE (for external clients): When MCP_HTTP_ENABLED=true, the bot starts a persistent HTTP/SSE server on the configured port (default 8090). This serves external MCP clients like IDE integrations or other services.

Both transports use the same Server struct and tool implementations. The stdio binary and the HTTP server register identical tools based on the same environment configuration.
The bot is configured via environment variables:
Required:
- SLACK_BOT_TOKEN - Bot OAuth token (xoxb-...)
- SLACK_APP_TOKEN - App-level token for Socket Mode (xapp-...)
- ANTHROPIC_API_KEY - Claude API key (sk-ant-...)

Optional:
- KUBECONFIG - Path to kubeconfig (default: uses in-cluster config)
- INVESTIGATION_DIR - Path to investigation skills (default: ./investigations)
- CLAUDE_MD_PATH - Path to engineering standards (default: ./docs/CLAUDE.md)
- COMPANY_NAME - Company name for PDF report branding (default: Company)
- FILE_RETENTION - File cleanup interval (default: 24h)
- MCP_HTTP_ENABLED - Enable HTTP/SSE MCP server (default: false, set to true for production)
- MCP_HTTP_PORT - Port for HTTP MCP server (default: 8090)

MCP Server Authentication (multiple methods supported, configure one or more):
- MCP_AUTH_TOKEN - Static bearer token for simple authentication (default: empty = no auth)
- MCP_JWT_SECRET - JWT signing secret for JWT bearer token authentication
- MCP_JWT_ALGORITHM - JWT algorithm (default: HS256, also supports RS256)
- MCP_API_KEYS - API key authentication in format key1:user1,key2:user2
- MCP_OIDC_ISSUER_URL - OIDC issuer URL for token validation (e.g., Dex endpoint)
- MCP_OIDC_AUDIENCE - Expected OIDC audience claim
- MCP_OIDC_ALLOWED_GROUPS - Comma-separated list of authorized groups
- MCP_OIDC_SKIP_ISSUER_VERIFY - Skip issuer verification (default: false, use only for testing)
- MCP_MTLS_CA_CERT_PATH - Path to CA certificate for mutual TLS authentication
- MCP_MTLS_VERIFY_CLIENT - Verify client certificates against CA (default: true)

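
For the simplest of these methods, the static bearer token, a check might look like the Go sketch below. `checkBearer` is a hypothetical helper, not the code in pkg/mcp/auth, and it covers only the MCP_AUTH_TOKEN case: when other methods are configured, an empty static token would not by itself mean that requests are unauthenticated.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// checkBearer reports whether an Authorization header value carries the
// expected static token. An empty expected token disables this check,
// mirroring the documented "empty = no auth" default.
func checkBearer(header, expected string) bool {
	if expected == "" {
		return true
	}
	scheme, token, ok := strings.Cut(header, " ")
	if !ok || !strings.EqualFold(scheme, "Bearer") {
		return false
	}
	// Constant-time comparison avoids leaking how many leading bytes match.
	return subtle.ConstantTimeCompare([]byte(token), []byte(expected)) == 1
}

func main() {
	fmt.Println(checkBearer("Bearer s3cret", "s3cret"))
	fmt.Println(checkBearer("Bearer wrong", "s3cret"))
}
```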
Tool Backing Services (each enables a set of MCP tools; see Tool Categories):
- LOKI_ENDPOINT - Loki gateway endpoint (enables query_loki)
- CLOUDWATCH_ACCOUNTS - JSON map of friendly name to full IAM role ARN for multi-account CloudWatch access (enables CloudWatch tools). Example: {"dev":"arn:aws:iam::111:role/dev-reader","prod":"arn:aws:iam::222:role/prod-reader"}
- CLOUDWATCH_ASSUME_ROLE - IAM role ARN to assume for single-account CloudWatch queries (legacy, enables CloudWatch tools). Use CLOUDWATCH_ACCOUNTS for multi-account support.
- CLOUDWATCH_EXTERNAL_ID - External ID for cross-account role assumption (optional, used with both CLOUDWATCH_ACCOUNTS and CLOUDWATCH_ASSUME_ROLE)
- PROMETHEUS_URL - Prometheus endpoint (enables Prometheus tools). Multiple endpoints: use PROMETHEUS_<NAME>_URL pattern
- GRAFANA_URL - Grafana instance URL (requires GRAFANA_API_KEY)
- GRAFANA_API_KEY - Grafana API key (required with GRAFANA_URL, enables Grafana tools)
- DATABASE_URL - Database connection string (enables Database tools). Multiple databases: use DATABASE_<NAME>_URL pattern
- GITHUB_TOKEN - Personal access token for GitHub tools
- AWS_REGION or AWS_DEFAULT_REGION - AWS region (enables ECR tools)
- AWS_* - Standard AWS credentials for ECR vulnerability scanning
- API_CONFIG_DIR - Directory containing third-party API YAML configs (default: ./apis)

The bot supports adding read-only API integrations via YAML configuration files, with no Go code required. Drop a YAML file in the apis/ directory, set the auth token env var, and the MCP server automatically registers the endpoints as tools that Claude can call.
- On startup, the MCP server loads all .yaml files from API_CONFIG_DIR (default ./apis/)
- Each config defines an API with endpoints, parameters, auth, and rate limiting
- Endpoints become MCP tools named {api_name}_{endpoint_name} (e.g., bitgo_list_wallets)
- If the auth token env var is not set, that API's tools are silently skipped
- Claude can call these tools during investigations just like built-in tools

name: myapi                          # API name (used as tool name prefix)
description: "My API description"
base_url: https://api.example.com
auth:
  type: bearer                       # "bearer", "header", or "none"
  token_env: MY_API_TOKEN            # env var containing the auth token
headers:                             # optional custom headers
  User-Agent: "diagnostic-slackbot/1.0"
rate_limit:
  max_concurrent: 5                  # semaphore size (default: 5)
  retry_on_429: true                 # honor Retry-After header (default: true)
  max_retries: 3                     # max 429 retries (default: 3)
defaults:
  limit: 25                          # default pagination limit
  max_limit: 100                     # server-enforced max limit
endpoints:
  - name: list_items                 # becomes tool "myapi_list_items"
    description: "List all items"
    method: GET
    path: /api/v1/items
    params:
      - name: status
        type: string
        description: "Filter by status"
        required: false
    redact_fields: [email, phone]    # PII fields to redact from responses
  - name: get_item
    description: "Get item by ID"
    method: GET
    path: /api/v1/items/{item_id}    # path parameters use {placeholder} syntax
    params:
      - name: item_id
        type: string
        description: "Item ID"
        required: true
        in: path                     # "path" or "query" (default: "query")
        validate: "[a-f0-9]{24,}"    # regex validation pattern
    redact_fields: [email, phone, ssn]

- Read-only: Only GET requests are supported
- Path traversal protection: All path parameters are validated against ../ and \.. patterns
- Query injection protection: Path parameters are blocked from containing ?, &, #
- Regex validation: Per-parameter regex patterns reject invalid input before any HTTP request
- PII redaction: Configurable field-level redaction walks nested JSON and replaces sensitive values with [redacted]
- Rate limiting: Semaphore-based concurrency control prevents Claude's fan-out from overwhelming APIs
- 429 retry: Honors Retry-After headers with exponential backoff (capped at 30s)
- Response size cap: Responses limited to 5 MB
- Graceful degradation: Missing auth token means tools are hidden, not erroring
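
The path-parameter checks in this list can be sketched in Go as follows. This is an illustration under stated assumptions, not the contents of pkg/apiconfig/validate.go: `validatePathParam` is a hypothetical name, and anchoring the per-parameter regex so that it must match the whole value is a choice made here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validatePathParam applies the documented checks in order: traversal
// sequences, query-injection characters, then the optional regex pattern.
func validatePathParam(name, value, pattern string) error {
	if value == "" {
		return fmt.Errorf("%s: value is empty", name)
	}
	if strings.Contains(value, "../") || strings.Contains(value, `\..`) {
		return fmt.Errorf("%s: path traversal sequence rejected", name)
	}
	if strings.ContainsAny(value, "?&#") {
		return fmt.Errorf("%s: query characters rejected", name)
	}
	if pattern == "" {
		return nil
	}
	// Assumption: the pattern must match the entire value, so it is anchored.
	re, err := regexp.Compile(`^(?:` + pattern + `)$`)
	if err != nil {
		return fmt.Errorf("%s: invalid validation pattern: %w", name, err)
	}
	if !re.MatchString(value) {
		return fmt.Errorf("%s: value does not match %q", name, pattern)
	}
	return nil
}

func main() {
	const pattern = "[a-f0-9]{24,}"
	for _, v := range []string{"5f2b9c0d1e3a4b5c6d7e8f90", "../etc/passwd", "abc?x=1"} {
		fmt.Printf("%q -> %v\n", v, validatePathParam("item_id", v, pattern))
	}
}
```

Running every check before building the request URL means a rejected value never reaches the network.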
- Create a YAML file in apis/ (e.g., apis/fireblocks.yaml)
- Set the auth token env var in your deployment (e.g., FIREBLOCKS_API_TOKEN)
- Deploy: tools appear automatically in Claude's tool list

No PRs to the core repo needed. Investigation skills can reference the new tools immediately.
The apis/bitgo.yaml config provides read-only access to BitGo custodial wallet APIs:
| Tool | Description |
|---|---|
| bitgo_list_enterprises | List accessible enterprises |
| bitgo_list_wallets | List wallets with filters |
| bitgo_get_wallet | Get wallet details |
| bitgo_get_wallet_balance | Get balances for a coin |
| bitgo_list_wallet_addresses | List receive addresses |
| bitgo_list_wallet_transfers | List transfers with filters |
| bitgo_get_transfer | Get transfer details |
| bitgo_list_enterprise_transfers | Enterprise-wide transfers |
| bitgo_list_pending_approvals | List pending multi-sig approvals |
| bitgo_get_pending_approval | Get pending approval details |

Requires BITGO_ACCESS_TOKEN env var. All endpoints redact email, phone, and IP address fields.
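
Field-level redaction of this kind can be sketched in Go by walking the decoded JSON. The `redact` and `redactJSON` helpers below are hypothetical and are not taken from pkg/apiconfig/redact.go; matching on exact key names at any nesting depth is an assumption about how the redact_fields setting behaves.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redact walks decoded JSON and replaces the value of every key listed in
// fields with "[redacted]", at any nesting depth.
func redact(v any, fields map[string]bool) any {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			if fields[k] {
				t[k] = "[redacted]"
			} else {
				t[k] = redact(child, fields)
			}
		}
		return t
	case []any:
		for i, child := range t {
			t[i] = redact(child, fields)
		}
		return t
	default:
		return v // scalars are returned unchanged
	}
}

// redactJSON decodes a response body, redacts the named fields, and
// re-encodes it. json.Marshal emits object keys in sorted order.
func redactJSON(body []byte, fieldNames ...string) ([]byte, error) {
	var doc any
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	fields := make(map[string]bool, len(fieldNames))
	for _, f := range fieldNames {
		fields[f] = true
	}
	return json.Marshal(redact(doc, fields))
}

func main() {
	body := []byte(`{"id":"w1","user":{"email":"a@b.c","phone":"123"},"items":[{"email":"x@y.z"}]}`)
	out, err := redactJSON(body, "email", "phone")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```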
# Run linters
make lint
# Run tests
make test
# Build binary
make build
# Build Docker image
make docker-build

- @bot help - List available investigation types
- @bot <problem description> - Start an investigation
- Reply in thread to continue conversation

ModSecurity Investigation:
User: @bot I'm getting 403 errors from the WAF
Bot: Starting ModSecurity WAF Block Diagnosis investigation...
Bot: [Queries Loki for recent 403s, analyzes rules, categorizes blocks]
Bot: Found 3 blocked requests in the last hour. Rule 930120 (OS File Access Attempt)
triggered on cookie data. This appears to be a FALSE POSITIVE - the session
cookie contains Base64 data that matches the pattern.
Recommended fix:
```nginx
SecRuleUpdateTargetById 930120 "!REQUEST_COOKIES:/__Secure-session-token/"
```
User: Can you check if this is happening in staging too?
Bot: [Queries Loki for staging realm, analyzes results]
Atlas Migration Investigation:
User: @bot migration 20251029093301 not showing up in dev-01
Bot: Starting Atlas Migration Troubleshooting investigation...
Bot: [Checks GitRepository, ConfigMap, AtlasMigration CRD]
Bot: Root cause identified:
dev-01 environment is pinned to tag 0.0.667 in control-kubernetes
Migration 20251029093301 was added in tag 0.0.673
Resolution: Update config to reference tag 0.0.673
- Read-Only Access: Bot has no write permissions to Kubernetes
- Log Sanitization: All logs are sanitized before sending to Claude API
  - API keys, tokens, passwords redacted
  - JWT tokens redacted
  - Email addresses redacted
  - Credit card numbers redacted
  - Private keys redacted
- Audit Trail: All K8s queries and Claude API calls are logged
- GitOps Principles: Never suggests direct cluster modifications
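
As an illustration of regex-based log sanitization, the Go sketch below applies three redaction rules in sequence. The patterns and placeholder strings are examples chosen for this README and are not the bot's 13 patterns from pkg/k8s/sanitizer.go; a production rule set would need broader coverage (credit card numbers, private keys, and so on).

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a pattern with its replacement text.
type rule struct {
	re   *regexp.Regexp
	with string
}

// rules are applied in order. All three are illustrative assumptions.
var rules = []rule{
	// JWTs: three dot-separated base64url segments, the first starting "eyJ".
	{regexp.MustCompile(`eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+`), "[REDACTED_JWT]"},
	// Email addresses.
	{regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`), "[REDACTED_EMAIL]"},
	// key=value or key: value secrets; the key and separator are preserved.
	{regexp.MustCompile(`(?i)\b(password|token|api[_-]?key|secret)(\s*[=:]\s*)\S+`), "${1}${2}[REDACTED]"},
}

// sanitize runs every rule over a single log line.
func sanitize(line string) string {
	for _, r := range rules {
		line = r.re.ReplaceAllString(line, r.with)
	}
	return line
}

func main() {
	fmt.Println(sanitize("login user=alice@example.com password=hunter2"))
}
```

Keeping the key name in the output ("password=[REDACTED]") preserves enough context for diagnosis while dropping the value.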
Deploy as a Kubernetes Deployment with:
- Single replica (in-memory state)
- Non-root container user (UID 1000)
- ServiceAccount with minimal RBAC:
  - get, list on pods, pods/log, configmaps, services
  - get, list on deployments
  - get, list on Flux CRDs (gitrepositories, kustomizations)
  - get, list on Atlas CRDs (atlasmigrations)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: diagnostic-slackbot
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "services"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
  - apiGroups: ["source.toolkit.fluxcd.io"]
    resources: ["gitrepositories"]
    verbs: ["get", "list"]
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources: ["kustomizations"]
    verbs: ["get", "list"]
  - apiGroups: ["db.atlasgo.io"]
    resources: ["atlasmigrations"]
    verbs: ["get", "list"]

The bot exposes Prometheus metrics on port 9090 for scraping:
- investigations_started_total{type} - Total investigations initiated by template type
- investigations_resolved_total{type} - Total investigations completed
- claude_api_calls_total{status} - Claude API call counts (success/error)
- claude_api_tokens_total{type="input|output"} - Token usage tracking
- k8s_queries_total{namespace,resource_type} - Kubernetes query audit trail
- loki_queries_total{status} - Loki query tracking
- tool_executions_total{tool_name,status} - MCP tool usage statistics
- conversations_active - Current number of active conversations (gauge)

Structured JSON logging via log/slog with:
- Severity levels (Debug, Info, Warn, Error)
- Context-aware logging (conversation IDs, user IDs)
- Automatic correlation for investigations
Docker images are automatically built and published to GitHub Container Registry:
# Pull latest image
docker pull ghcr.io/nikogura/diagnostic-slackbot:latest
# Pull specific version
docker pull ghcr.io/nikogura/diagnostic-slackbot:v1.0.0
# Run locally
docker run -e SLACK_BOT_TOKEN=xoxb-... \
  -e SLACK_APP_TOKEN=xapp-... \
  -e ANTHROPIC_API_KEY=sk-... \
  ghcr.io/nikogura/diagnostic-slackbot:latest

The project uses GitHub Actions for continuous integration and deployment:
- CI Pipeline (.github/workflows/ci.yaml):
  - Runs on every push and pull request
  - Linting (golangci-lint + namedreturns)
  - Testing with race detection and coverage
  - Binary build
  - Multi-arch Docker image build (amd64, arm64)
  - Security scanning with Trivy
  - Publishes to ghcr.io/nikogura/diagnostic-slackbot
- Release Pipeline (.github/workflows/release.yaml):
  - Triggered on version tags (v*)
  - GoReleaser for multi-platform binaries
  - Docker images with semantic versioning
  - Automated changelog generation
  - GitHub Release creation
- Dependabot (.github/dependabot.yml):
  - Weekly dependency updates
  - Grouped updates for AWS SDK and Kubernetes

Apache License 2.0
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linters: make test && make lint
- Submit a pull request