| name | datadog-insights |
|---|---|
| description | Investigate Gram production health and post a digest to Slack |
You are producing a health report for Gram's production services. The report must be actionable and visually structured — critical issues must stand out immediately, tabular data must use code blocks, and every section must be separated by a divider.
Before starting: activate the datadog skill for Gram service names, MCP tools, and query guidelines.
⚠️ MANDATORY FORMAT RULES — READ BEFORE COMPOSING THE MESSAGE:
- Every major section MUST be preceded by a Unicode divider line:
──────────────────────────────────────on its own line, with a blank line above and below.- Top endpoints, error type breakdowns, and latency tables MUST use triple-backtick code blocks — never bullet points for tabular data.
- Code block tables must have aligned columns using spaces. Minimum widths: endpoint 38 chars, count 8 chars, err% 6 chars, p95 8 chars.
- Each monitor in alert MUST get its own paragraph — never combine multiple monitors into one block.
- Do NOT collapse or omit data to save space. If there are 8 monitors, show all 8.
These take priority over everything else. If any exist, they become the top of the digest.
- Open incidents —
search_datadog_incidentsforstate:(active OR stable)in the last 24h - Monitors in alert —
search_datadog_monitorsforstatus:alert - Error spikes — Use
analyze_datadog_logswith SQL:Filter:SELECT service, status, count(*) FROM logs GROUP BY service, status ORDER BY count(*) DESC
env:prod status:(error OR critical OR alert OR emergency), last 24h. Compare the last 6h vs. the previous 18h to detect spikes.
If there are critical issues, investigate each one:
- Get a sample of the actual error logs (
search_datadog_logs) - Follow trace IDs with
get_datadog_traceto find root causes Grepinserver/internal/for the error message to find the source code location
For top error message breakdown, use analyze_datadog_logs:
SELECT message, count(*) as cnt
FROM logs
WHERE service = 'gram-server' AND status IN ('error', 'critical')
GROUP BY message
ORDER BY cnt DESC
LIMIT 10Use search_datadog_spans for service:gram-server env:prod over the last 24h, or:
sum:trace.http.server.request.hits{service:gram-server,env:prod} by {resource_name}.rollup(sum, 86400)
Collect the top 10 endpoints with:
- Request count
- Error rate (% of requests returning 4xx/5xx)
- p95 latency
Compare traffic between two 12h windows:
- Current 12h:
from: now-12h, to: now - Previous 12h:
from: now-24h, to: now-12h
Use get_datadog_metric with:
sum:trace.http.server.request.hits{service:gram-server,env:prod}.rollup(sum, 43200)
Report:
- Total requests in the last 24h
- % change between the two 12h periods (flag if > 30% change)
- Per-service breakdown (
gram-server,gram-worker,gram,fly)
p50:trace.http.server.request{service:gram-server,env:prod} by {resource_name}
p95:trace.http.server.request{service:gram-server,env:prod} by {resource_name}
p99:trace.http.server.request{service:gram-server,env:prod} by {resource_name}
Over the last 24h with .rollup(avg, 86400).
Report:
- Global latency: p50, p95, p99 across all endpoints
- Slowest 5 endpoints by p95 latency (with their p50 for comparison)
- Flag any endpoint where p95 > 2s or p99 > 5s
Call create_datadog_notebook with name "Gram Health Digest — <DAY> <DATE>" (e.g. "Gram Health Digest — Fri 2026-03-27"). Use absolute_time: true with start_time = 24h ago and end_time = now. One notebook is created per run — old ones accumulate and can be manually deleted periodically.
The notebook cells must be wrapped in {"cells": [...]}. Include:
- Summary markdown cell:
{ "type": "notebook_cells", "attributes": { "definition": { "type": "markdown", "text": "One paragraph verdict with key numbers." } } } - Error rate timeseries cell:
{ "type": "notebook_cells", "attributes": { "definition": { "type": "timeseries", "title": "gram-server Error Rate (1h buckets)", "requests": [ { "q": "sum:trace.http.server.request.errors{service:gram-server,env:prod}.rollup(sum, 3600)", "display_type": "bars", "style": { "palette": "warm" } } ], "show_legend": true, "yaxis": { "scale": "linear" }, "markers": [ { "value": "y = 500", "display_type": "warning dashed", "label": "Elevated" } ] } } } - Traffic volume timeseries cell:
{ "type": "notebook_cells", "attributes": { "definition": { "type": "timeseries", "title": "gram-server Traffic Volume (1h buckets)", "requests": [ { "q": "sum:trace.http.server.request.hits{service:gram-server,env:prod}.rollup(sum, 3600)", "display_type": "area", "style": { "palette": "dog_classic" } } ], "show_legend": true, "yaxis": { "scale": "linear" } } } } - p95 latency by endpoint timeseries cell:
{ "type": "notebook_cells", "attributes": { "definition": { "type": "timeseries", "title": "Top Endpoint p95 Latency", "requests": [ { "q": "p95:trace.http.server.request{service:gram-server,env:prod} by {resource_name}.rollup(avg, 3600)", "display_type": "line", "style": { "palette": "dog_classic" } } ], "show_legend": true, "yaxis": { "scale": "linear" }, "markers": [ { "value": "y = 2", "display_type": "error dashed", "label": "2s threshold" } ] } } } - gram-worker error rate timeseries cell:
{ "type": "notebook_cells", "attributes": { "definition": { "type": "timeseries", "title": "gram-worker Error Rate (1h buckets)", "requests": [ { "q": "sum:trace.http.server.request.errors{service:gram-worker,env:prod}.rollup(sum, 3600)", "display_type": "bars", "style": { "palette": "warm" } } ], "show_legend": true, "yaxis": { "scale": "linear" } } } } - gram (frontend) trace errors timeseries cell —
gramis an APM service, so use trace metrics:{ "type": "notebook_cells", "attributes": { "definition": { "type": "timeseries", "title": "gram (frontend) Trace Errors (1h buckets)", "requests": [ { "q": "sum:trace.http.server.request.errors{service:gram,env:prod}.rollup(sum, 3600)", "display_type": "bars", "style": { "palette": "warm" } } ], "show_legend": true, "yaxis": { "scale": "linear" } } } } - fly (functions) error log stream cell —
flyis a log source (not an APM service), so use a log stream, not a trace metric:{ "type": "notebook_cells", "attributes": { "definition": { "type": "log_stream", "title": "fly (functions) Error Logs (24h)", "query": "source:fly env:prod status:error", "columns": ["timestamp", "host", "message"], "message_display": "inline", "show_date_column": true, "show_message_column": true, "sort": { "column": "timestamp", "order": "desc" } } } } - Slow endpoints + top errors markdown table cell with the real data from Steps 1–4.
- All Gram services error log stream cell — includes
source:flyfor Gram Functions logs:{ "type": "notebook_cells", "attributes": { "definition": { "type": "log_stream", "query": "(service:(gram-server OR gram-worker OR gram) OR source:fly) env:prod status:error", "columns": ["timestamp", "host", "service", "message"], "message_display": "inline", "show_date_column": true, "show_message_column": true, "sort": { "column": "timestamp", "order": "desc" } } } }
Save the notebook URL — you will link it in the Slack message footer.
Based on all the data gathered, write one concrete recommendation for the on-call engineer. Be specific:
- If errors are spiking: name the error type, the likely cause based on code grep, and the first action to take.
- If a slow endpoint is flagged: name it and suggest where to look (query, external call, etc.).
- If everything is healthy: say "No action needed. Monitor X for Y."
- If errors are declining: say so and advise continued monitoring.
This recommendation goes into the Slack message as a dedicated section.
Use the slack_send_message MCP tool with the message parameter (Slack markdown). The tool does not support Block Kit blocks — use plain Slack markdown only.
Slack markdown syntax: *bold*, _italic_, `code`, ```code block```, <URL|link text>.
The message opens with three lines — no separate header block:
*Gram Health Digest — <DAY> <DATE>*
<VERDICT_EMOJI> *<One-line overall verdict>*
<!subteam^S09EXM6DPCY|dev-mcp-oncall>
Then append sections in this exact order, each preceded by the Unicode divider line on its own line with a blank line above and below:
──────────────────────────────────────
Each monitor gets its own paragraph. Do NOT combine multiple monitors.
──────────────────────────────────────
🚨 *MONITORS IN ALERT*
🔴 *<Monitor name>* — <type>
<What it means and why it matters>
*Notifying:* `#<channel>`
🔴 *<Next monitor name>* — <type>
<What it means>
*Notifying:* `#<channel>`
──────────────────────────────────────
⚠️ *Error Spike (last 6h vs prior 18h)*
• `gram-server`: X errors in 6h (Y/h) vs Z in prior 18h (W/h) — *~Nx spike*
*Top gram-server error types (24h):*
```
error message count pct
mcp server not found 1,288 17.8%
token exchange failed 447 6.2%
not found 273 3.8%
```
⚠️ The error type breakdown MUST be a code block table with aligned columns. Never use bullet points for this data.
──────────────────────────────────────
📊 *Traffic (24h)*
• `gram-server`: ~Xk requests — ↑Y% vs previous 12h
• `gram` (frontend): ~X error log events
• `gram-worker`: ~Xk
• `fly` (functions): X errors
──────────────────────────────────────
🔝 *Top Endpoints — gram-server (last ~Xh)*
```
Endpoint Hits Err% p95
POST /mcp/{mcpSlug} 191,311 0.1% 21ms
GET /.well-known/oauth-protected-resource 39,262 0.1% 3ms
GET /.well-known/oauth-authorization-server 39,158 0.0% 5ms
GET /mcp/{mcpSlug} 13,211 0.0% 2ms
POST /rpc/hooks.claude 1,267 0.0% 80ms
```
⚠️ This section MUST use a triple-backtick code block with space-aligned columns. NEVER use bullet points for the endpoint table.
Column widths: endpoint 48 chars, hits 8 chars, err% 6 chars, p95 8 chars, notes remaining. Flag any endpoint with Err% > 5% with
──────────────────────────────────────
🐢 *Slow Endpoints (p95 > 2s)*
```
Endpoint p95 p50 Hits
GET /rpc/usage.getperiodusage 2,040ms 817ms 6 ⚠️
GET /rpc/auth.info 638ms 8ms 6
```
*Global:* p50: Xms · p95: Xms · p99: Xms
⚠️ The slow endpoint table MUST use a code block with aligned columns. Never use bullet points.
If all endpoints are fast, use instead:
⚡ *Latency* — All endpoints healthy. p50: Xms · p95: Xms · p99: Xms 🟢
──────────────────────────────────────
💡 *Recommendation*
<Specific, concrete recommendation for the on-call engineer. One or two sentences. Name the action and where to look.>
──────────────────────────────────────
🔴 Critical 🟡 Warning 🟢 Healthy | <NOTEBOOK_URL|View charts in Datadog> | Generated by Claude Code + Datadog MCP
Replace NOTEBOOK_URL with the notebook URL from Step 5.
- 🔴 if any active incidents or monitors in ALERT state
- 🟡 if elevated error rates (>2x normal), notable latency, or monitors in WARNING state
- 🟢 if everything looks healthy
Use slack_send_message with channel_id: C0AKLE930BX and the composed message string. If the MCP tool is unavailable, print the message to the terminal instead.