| name | observability-k8s-investigation |
|---|---|
| description | Investigate Kubernetes workload, node, and control-plane issues using OTel telemetry (EDOT). Use when diagnosing pod failures (CrashLoopBackOff, OOMKilled, Error), node pressure, resource exhaustion, image pull failures, admission rejections, autoscaling anomalies, or correlating K8s state with application signals. OTel ingest path only; the legacy ECS Kubernetes integration shape is out of scope. |
| metadata | |

Diagnose Kubernetes issues using OTel telemetry collected via EDOT (Elastic Distribution of OpenTelemetry) and the kube-stack collector. Correlate cluster state, pod runtime metrics, K8s events, application logs, and APM to identify root cause across the workload, node, and control-plane layers.
In scope: OTel-receiver-namespaced indices (metrics-kubeletstatsreceiver.otel-*,
metrics-k8sclusterreceiver.otel-*, logs-k8seventsreceiver.otel-*, logs-k8sobjectsreceiver.otel-*) and OTel
semantic conventions (k8s.pod.name, k8s.namespace.name, k8s.container.restarts).
Out of scope:
- The legacy Elastic Agent Kubernetes integration (metrics-kubernetes.*, logs-kubernetes.*, kubernetes.* fields). It is being deprecated; do not author queries against these paths.
- APM-layer analysis (service SLO breaches, transaction error rates, upstream dependency health). Different domain: once a K8s root cause is ruled in or out, APM investigation continues outside this skill.
- Cluster provisioning, capacity planning, cost optimization. Different domain.
These apply to every investigation. When in doubt, re-read them before writing the synthesis.
Absence of evidence is not evidence. Do not confabulate from empty results. If log queries return 0 rows, logs are
likely not collected or the pod has no recent lines — this does not mean "dependency unavailable" or any other
specific failure mode. Report no_logs_available and weight remaining signals accordingly.
Empty dependency data ≠ upstream healthy. Services without APM instrumentation (load generators, workers) emit no
destination metrics. Report insufficient_dependency_data, not "upstreams OK."
Co-symptoms are not causes. Two services degrading simultaneously usually share an upstream, not a causal link. Only attribute causation when (a) one service's degradation clearly precedes the other's, and (b) the delta is large (>5× error rate, >3× latency).
OOMKilled ≠ memory leak by default. The limit may simply be undersized for the workload's working set. Compare against a 7-day baseline at the same hour-of-day before claiming a leak.
Error-termination ≠ application bug by default. Check k8s.pod.cpu_limit_utilization first. CFS throttling driving
liveness probe timeouts is the most common misdiagnosis in this space.
Average CPU hides throttling. A pod can look healthy at 40–60% average cpu_limit_utilization while being throttled
severely at p99. Linux enforces CPU limits in 100ms periods; bursty workloads hit quota mid-period and stall. Look at
max and p95, not just average.
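
To make the averaging problem concrete, here is a minimal Python sketch over hypothetical k8s.pod.cpu_limit_utilization samples. The sample values and the nearest-rank percentile helper are assumptions for illustration, not collector output.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-scrape utilization ratios for one pod:
# mostly idle, with two bursts that exceed the CPU limit.
samples = [0.3] * 18 + [1.4, 1.6]

summary = {
    "avg": round(mean(samples), 2),
    "p95": percentile(samples, 95),
    "max": max(samples),
}
print(summary)
```

The average lands in the "healthy" band while p95 and max are both above 1.0, which is the pattern the paragraph above warns about.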
Restart count is boolean, not a counter. k8s.container.restarts is pulled directly from the K8s API and may be
pruned by the kubelet at any time, so the absolute value is unreliable. Treat it as == 0 (no recent restarts) vs > 0
(recently restarting); do not derive backoff timing or "linear vs exponential" patterns from it. Confirm the restart
pattern via K8s Killing / BackOff events instead.
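
A minimal Python sketch of that rule, assuming the restart value and the event timestamps have already been fetched. The function names, state labels, and timestamps are hypothetical.

```python
from datetime import datetime

def restart_state(restarts):
    """Collapse k8s.container.restarts to the only distinction treated as reliable here."""
    return "recently_restarting" if restarts > 0 else "no_recent_restarts"

def event_gaps_seconds(timestamps):
    """Seconds between consecutive Killing/BackOff events, given ISO 8601 timestamps."""
    parsed = sorted(datetime.fromisoformat(ts) for ts in timestamps)
    return [(later - earlier).total_seconds() for earlier, later in zip(parsed, parsed[1:])]

# Hypothetical BackOff event timestamps for one container.
events = [
    "2024-01-01T10:00:00",
    "2024-01-01T10:00:20",
    "2024-01-01T10:01:00",
    "2024-01-01T10:02:20",
]
print(restart_state(7), event_gaps_seconds(events))
```

The restart pattern is read from the event gaps; the magnitude of the counter is never used.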
Prefer to report uncertainty over manufacturing confidence. If the evidence is ambiguous, the synthesis should say so. Competing hypotheses are a valid output.

| Signal | Index pattern | Use |
|---|---|---|
| Pod/container runtime | metrics-kubeletstatsreceiver.otel-* | CPU, memory, network, filesystem. Utilization ratios. |
| Cluster state | metrics-k8sclusterreceiver.otel-* | Restarts, phase, last-terminated reason, HPA, quota, node condition |
| K8s events | logs-k8seventsreceiver.otel-* | Killing, BackOff, FailedScheduling, Evicted, image pull events |
| K8s object snapshots | logs-k8sobjectsreceiver.otel-* | Deployment/service/configmap state over time |
| Application logs | logs-*.otel-* | body.text, severity_text, filtered by k8s.pod.name |
| APM | traces-*.otel-*, metrics-service_*.otel-default | Correlate via service.name + K8s resource attrs |
| ML anomalies | .ml-anomalies-* | Memory-growth, restart-rate, throttle jobs (if configured) |

Flat OTel paths work in ES|QL. Prefer the flat form for readability; the nested resource.attributes.* form is for raw
log documents only.

| Field | Index | What it is |
|---|---|---|
| k8s.pod.name | all k8s | Pod name |
| k8s.namespace.name | all k8s | Namespace |
| k8s.container.name | all k8s | Container within pod |
| k8s.deployment.name | k8sclusterreceiver + others | Parent deployment |
| k8s.pod.phase | k8sclusterreceiver | Pending=1/Running=2/Succeeded=3/Failed=4/Unknown=5 |
| k8s.container.restarts | k8sclusterreceiver | Total container restart count |
| k8s.container.status.last_terminated_reason | k8sclusterreceiver | OOMKilled, Error, Completed, ContainerCannotRun |
| k8s.pod.status_reason | k8sclusterreceiver | Pod-level reason (Evicted, NodeLost) |
| k8s.pod.memory_limit_utilization | kubeletstatsreceiver | 0.0–1.0+ (can exceed 1 transiently before OOM) |
| k8s.pod.cpu_limit_utilization | kubeletstatsreceiver | 0.0–N (frequently >1 under CFS throttling) |
| k8s.pod.memory.usage / .working_set | kubeletstatsreceiver | Bytes |
| k8s.node.condition_memory_pressure | k8sclusterreceiver | 1 = pressure, 0 = ok |
| k8s.node.condition_ready | k8sclusterreceiver | 0 = NotReady |
| k8s.hpa.current_replicas / .desired_replicas | k8sclusterreceiver | HPA state |
| attributes.k8s.event.reason | k8seventsreceiver | Event reason (filter on this) |
| body.text | k8seventsreceiver / logs | Event message / log message |
| k8s.object.name | k8seventsreceiver | involvedObject name (log attribute, use flat form) |

Several fields above are off by default in stock kube-stack collectors and require explicit configuration. Verify presence before relying on them; if absent, fall back as noted and call out the substitution in the synthesis.

| Field | Why it might be missing | Fall-back |
|---|---|---|
| k8s.container.status.last_terminated_reason | Optional metric in k8sclusterreceiver; gated behind metrics_collected.metadata config. | Infer from K8s Killing / OOMKilling events in logs-k8seventsreceiver.otel-* and exit codes in app logs. |
| k8s.pod.status_reason | Same: optional metric on k8sclusterreceiver. | Infer from events: Evicted, NodeLost, Preempted. |
| k8s.pod.cpu_limit_utilization / memory_limit_utilization | Only emitted when the pod has the corresponding limit set, and the kubeletstatsreceiver metric is enabled. | Compute manually as k8s.pod.cpu.usage / <limit> from k8sclusterreceiver, or use absolute usage trending against a baseline. |
| k8s.node.condition_memory_pressure | Gated behind k8sclusterreceiver node_conditions_to_report (default omits this). | Compare k8s.node.memory.usage against k8s.node.allocatable_memory, or look for Evicted events on the node. |

If a fall-back is used, note it in the synthesis (e.g. (via memory.usage; limit_utilization not collected)) so the
reader knows the signal is indirect.
Before writing queries, know these. Each of them silently produces wrong answers rather than failing loudly.
VALUES() returns a scalar for a single distinct value and an array for multiple. Templating that assumes array shape
(e.g. a | first filter) extracts the first character of the string when the result is scalar. Use MV_FIRST(VALUES(...)) or handle both.
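
When the query cannot be changed, the "handle both" option can be done on the consuming side. A minimal Python sketch, assuming the result cell arrives as a string, a list, or None; the helper names are hypothetical, and treating a missing cell as empty is an assumption.

```python
def as_list(value):
    """Normalize a VALUES() result cell to a list regardless of its shape."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def first_value(value):
    """First distinct value, or None when the cell is empty."""
    items = as_list(value)
    return items[0] if items else None

# The failure mode: indexing a scalar string yields its first character.
print("OOMKilled"[0])               # O
print(first_value("OOMKilled"))     # OOMKilled
print(first_value(["Error", "OOMKilled"]))  # Error
```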
PERCENTILE does not work on OTel histogram type (as of 8.15). For APM duration percentiles, use AVG on the
aggregate_metric_double summary field (AVG(transaction.duration.summary) divides sum by value_count). For true
percentiles, fall back to Kibana Query DSL.
COUNT(agg_metric_double) returns value_count (events), not doc count. SUM(field) gives the sum component;
AVG(field) gives sum/value_count. Do not use SUM(transaction.duration.summary) as an event-count proxy — it returns
total duration.
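
The aggregation semantics described above can be modelled in a few lines of Python. This is a sketch of the arithmetic only, with a hypothetical two-document sample; it does not call Elasticsearch.

```python
from dataclasses import dataclass

@dataclass
class AggregateMetricDouble:
    total: float       # the "sum" sub-field of the summary
    value_count: int   # number of events folded into this document

# Hypothetical documents: one pre-aggregated over 4 events, one over 1 event.
docs = [
    AggregateMetricDouble(total=1200.0, value_count=4),
    AggregateMetricDouble(total=300.0, value_count=1),
]

def count_agg(documents):
    """What COUNT(field) returns: events, not documents."""
    return sum(d.value_count for d in documents)

def sum_agg(documents):
    """What SUM(field) returns: total duration."""
    return sum(d.total for d in documents)

def avg_agg(documents):
    """What AVG(field) returns: sum divided by value_count."""
    return sum_agg(documents) / count_agg(documents)

print(len(docs), count_agg(docs), sum_agg(docs), avg_agg(docs))
```

Two documents yield a count of 5 and a sum of 1500.0, which is why the sum is not usable as an event-count proxy.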
K8s metrics use flat OTel field paths in ES|QL. k8s.pod.name, not resource.attributes.k8s.pod.name. The nested
form is for raw log documents.
Vocabulary for classification, not a decision tree. Use the pivotal-signal column to recognize which mode you're looking at; use "Investigate" to know what else should corroborate.

| Mode | Pivotal signal | Investigate |
|---|---|---|
| OOMKilled | last_terminated_reason == "OOMKilled" + memory_limit_utilization → 1.0 | Monotonic rise (leak) vs. load-driven spike? Compare current trend to 7-day baseline. Check heap metrics (JVM, Go, Node) for GC pressure. |
| CPU throttling → Error exit | cpu_limit_utilization > 1.0 + last_terminated_reason == "Error" | Liveness/readiness probe timeouts from CFS throttling. Average CPU can look fine (40–60%) while p99 throttle is severe. Check probe timeouts vs observed startup/health latency. |
| Liveness probe misconfiguration | Restarts without resource pressure; initialDelaySeconds < startup time | K8s events show Unhealthy / Killing. kubectl logs --previous typically shows healthy startup before kill. |
| CrashLoopBackOff (generic) | BackOff events + rising k8s.container.restarts | Branch on last_terminated_reason; this is a meta-mode. OOMKilled → memory path; Error → logs + throttling; ContainerCannotRun → image/exec. |
| ImagePullBackOff | K8s events Failed with image name + 429 or not found | Registry rate limit? Missing tag? Wrong imagePullSecret? Check recency of Pulling/Pulled events. |
| Stuck rollout | New pods Pending/not-Ready > progressDeadlineSeconds; old pods still serving | Check k8s.deployment.available vs .desired. Admission rejection? Readiness probe failing on new pods? HPA not scaling? |
| Termination signal race | Brief 5xx bursts correlated with rolling deploys | Endpoint removal races termination. New requests can hit the pod after SIGTERM starts. NGINX gotcha: STOPSIGNAL SIGTERM triggers fast shutdown, not graceful; use STOPSIGNAL SIGQUIT for graceful drain. Check ingress 502 rate vs rollout timing. |

| Mode | Pivotal signal | Investigate |
|---|---|---|
| Node NotReady cascade | k8s.node.condition_ready == 0 + mass Evicted events | Memory pressure? Disk pressure? Network partition from API server? Inspect kubelet logs, k8s.node.condition_* history. |
| Resource eviction | status_reason == "Evicted" + condition_memory_pressure == 1 on node | Node-level noisy neighbor. QoS order: BestEffort → Burstable → Guaranteed. Identify which pod drove node memory up. |
| Node affinity/selector conflict | Mass unschedulable pods after label change | K8s events show FailedScheduling. Often triggered by cluster upgrades (e.g. node-role.kubernetes.io/master → control-plane). |

| Mode | Pivotal signal | Investigate |
|---|---|---|
| etcd I/O cascade | API server latency spike + cluster-wide kubelet heartbeat failures | Disk IOPS, fsync latency (must be <10ms). Cloud-burst-credit exhaustion is common. |
| Admission webhook block | Mass FailedCreate across namespaces; deployments frozen | failurePolicy:Fail webhook pod crashed. Check webhook pod health + API server TCP connection cache (caches dead connections ~15 min). |
| Priority preemption storm | Production pods terminating with preempted-by annotation | New PriorityClass with globalDefault:true caused cascade. Check kube-scheduler events. |
| PDB drain deadlock | Node drain stuck indefinitely; HTTP 429 from Eviction API | PDB minAvailable/maxUnavailable too strict. No default drain timeout. Manual PDB deletion unblocks. |

| Mode | Pivotal signal | Investigate |
|---|---|---|
| HPA unready-pod dampening | Load rising, HPA not scaling; unready pods included in calculation | HPA averages CPU across all replicas including unready (0% contribution). Check k8s.hpa.current_replicas vs .desired_replicas + pod readiness. |
| Resource quota silent 403 | Deployment stuck at n-1/n; FailedCreate on ReplicaSet | Namespace quota exhausted (often CronJob accumulation). Check k8s.resource_quota.used vs .hard_limit. |

| Mode | Pivotal signal | Investigate |
|---|---|---|
| StatefulSet split-brain | Duplicate pod identities across partitioned nodes | Network partition + eviction timeout race. Two instances of same ordinal running. No fencing by default. |
| CoreDNS OOMKill | CoreDNS restarts + cluster-wide DNS timeouts in app logs | Default CoreDNS memory (~170Mi) insufficient under query amplification (ndots:5, each external lookup → ~10 lookups). |

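
The ~10 lookups figure in the CoreDNS row follows from resolver search-list expansion. A minimal Python sketch, assuming the common glibc-style behaviour (a name with fewer dots than ndots is tried against every search domain before the literal name) and a hypothetical pod search list.

```python
def candidate_names(name, search_domains, ndots=5):
    """Names a stub resolver would try, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search-list expansion
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = f"{name}."
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded

# Hypothetical search list for a pod in namespace "shop".
search = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]
names = candidate_names("api.example.com", search)
queries = len(names) * 2  # assumes one A and one AAAA query per candidate name
print(names, queries)
```

With this three-entry search list one external lookup becomes 8 queries; one or two search domains inherited from the node bring it to 10 or 12, in line with the table's estimate.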
Real incidents often match two modes. Examples:
- OOMKilled pod with simultaneous CPU throttling — memory usually drives the kill, but verify by checking whether memory or CPU hit limit first.
- Stuck rollout with HPA dampening and resource quota near-exhaustion — both can freeze a deploy. Check which constraint is binding.
- Node NotReady with pods that were already crashing — the node issue may be incidental.
When two modes fit, name both in the synthesis and say which one you believe is causal and why. Do not force a single hypothesis when the evidence supports two.
- Monotonic rise over 30–60 min → leak. Check GC metrics for the language: JVM jvm.gc.duration, Go process.runtime.go.gc.pause_ns, Node v8js_gc_duration. Rising GC frequency/pause with stable live-set is the canonical leak signature.
- Diurnal / load-correlated spikes → load-driven, not leak. Consider HPA tuning or limit increase.
- Hits 1.0, then restart → OOMKilled confirmed. Exit code 137 (SIGKILL) in app logs is consistent with this.
- cpu_limit_utilization > 1.0 sustained → CFS throttling. Node has spare CPU; the pod is quota-blocked.
- Symptoms of throttling (not the throttle metric itself): liveness probe timeouts, p99 latency 4–16× p50, queue backpressure upstream, Error-reason container terminations.
- Average can look healthy while p95 is throttled. Do not trust average alone.
- restarts > 0 recently → workload has been restarting. Don't read magnitude into the count (see Restart count is boolean); confirm the pattern from K8s Killing/BackOff event timestamps in logs-k8seventsreceiver.otel-*.
- Restarts correlated with memory pressure (memory_limit_utilization → 1.0) → OOMKilled path.
- Restarts without memory/CPU pressure → probe misconfig, app bug, or startup dependency failure. Pull events for Unhealthy and Killing.
- OOMKilled → memory path.
- Error → non-zero exit. Check app logs; if empty/minimal, check CPU throttling before attributing to app logic.
- Completed → ran to completion. Normal for Jobs/CronJobs/init containers; anomalous otherwise.
- ContainerCannotRun → runtime/image/exec issue. Check image pull events.
An investigation is not a checklist. The sections below describe a typical arc — compress, skip, or revisit them based on what you find. Terminate as soon as you have enough evidence to synthesize at a known confidence. Chasing signals past the point of diminishing returns is a failure mode, not thoroughness.
Resolve the target: k8s.pod.name, k8s.namespace.name, optionally k8s.deployment.name and service.name. If no
time window is given, default to the last hour for pod-level investigations, last 2 hours for event correlation, last 6
hours for ongoing/unresolved incidents.
If the alert payload already tells you the failure mode (e.g., it fires specifically on OOMKilled), note that and skip
classification; move to confirmation and baseline comparison.
Get the shape of the workload's recent behavior: restart count, termination reasons, phase, utilization. One or two queries usually suffice.
```esql
// Cluster-state view: restart signal, termination reasons, and pod phase over the last hour.
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts),
        term_reasons = VALUES(k8s.container.status.last_terminated_reason),
        phase = MAX(k8s.pod.phase)
```

```esql
// Runtime view: peak memory and CPU utilization against limits, as percentages.
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND @timestamp > NOW() - 15 minutes
| STATS mem_pct = ROUND(MAX(k8s.pod.memory_limit_utilization) * 100, 1),
        cpu_pct = ROUND(MAX(k8s.pod.cpu_limit_utilization) * 100, 1)
```
Use the taxonomy. The pivotal signal should match; the "Investigate" column tells you what corroboration to seek.
When two modes fit, note both and proceed with the one that has the stronger pivotal signal. You may revise during corroboration.
Pull the evidence your classification predicts you'll find. Typical sources:
K8s events for the namespace and window:
```esql
FROM logs-k8seventsreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>"
  AND @timestamp > NOW() - 2 hours
  AND attributes.k8s.event.reason IN (
    "BackOff", "Killing", "Unhealthy", "Failed",
    "FailedScheduling", "Evicted", "SuccessfulRescale",
    "Pulling", "Pulled", "Started", "Created"
  )
| SORT @timestamp DESC
| KEEP @timestamp, attributes.k8s.event.reason, body.text, k8s.object.name
| LIMIT 30
```
Application logs if available — look at the 200 most recent lines before the termination timestamp. If absent, flag
no_logs_available; do not invent a log pattern.
APM if the pod runs an instrumented service — resolve service.name from pod resource attributes for later
correlation. SLO / latency / error-rate analysis itself is APM-layer work and out of scope for this skill.
Baseline comparison — for utilization-based findings, compare current values to 7-day-prior at the same hour-of-day. "High memory" is meaningful only relative to what's normal for this workload.
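
A minimal Python sketch of the baseline window arithmetic, assuming naive UTC timestamps and a one-hour current window. The function names and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

def baseline_window(window_start, window_end, days_back=7):
    """Shift the current window back by whole days so the hour-of-day matches."""
    shift = timedelta(days=days_back)
    return window_start - shift, window_end - shift

def ratio_to_baseline(current, baseline):
    """Return current/baseline, or None when there is no usable baseline."""
    if baseline is None or baseline == 0:
        return None
    return current / baseline

now = datetime(2024, 5, 8, 14, 0)  # hypothetical investigation time
start, end = baseline_window(now - timedelta(hours=1), now)
print(start, end)
print(ratio_to_baseline(0.92, 0.46))
```

Returning None when the baseline is missing keeps an absent baseline distinct from a normal one, so the synthesis can state that no comparison was possible.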
Only pursue if the symptom pattern suggests it. Threshold: upstream error rate >5× baseline or latency >3× baseline, AND degradation started before the symptom on the target service. Co-symptoms do not establish causation.
If metrics-service_destination.1m.otel-default has no rows for the service, report insufficient_dependency_data —
not "upstreams healthy."
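
The dependency threshold and the empty-data rule can be expressed as one check. A minimal Python sketch, assuming the rates, latencies, and onset times have already been computed; the function name and the two non-source labels (possible_upstream_cause, co_symptom_only) are hypothetical.

```python
def upstream_verdict(has_destination_rows, error_rate, baseline_error_rate,
                     latency, baseline_latency, upstream_onset, target_onset):
    """Classify an upstream as a possible cause, a co-symptom, or unknown."""
    if not has_destination_rows:
        # No destination metrics at all: say so rather than assume health.
        return "insufficient_dependency_data"
    large_delta = (error_rate > 5 * baseline_error_rate
                   or latency > 3 * baseline_latency)
    preceded = upstream_onset < target_onset
    if large_delta and preceded:
        return "possible_upstream_cause"
    return "co_symptom_only"

# Onsets are epoch seconds here; any comparable timestamps would work.
print(upstream_verdict(True, 0.12, 0.02, 200, 180, 100, 160))
print(upstream_verdict(True, 0.12, 0.02, 200, 180, 170, 160))
print(upstream_verdict(False, 0.0, 0.0, 0, 0, 0, 0))
```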
SuccessfulCreate / Pulled events in the last 2 hours often correlate with deploys. logs-k8sobjectsreceiver.otel-*
shows configmap/secret/deployment spec changes. A change within 15 minutes of the symptom onset is a strong correlation,
but still a correlation — verify it plausibly explains the mode you've classified.
Synthesize as soon as you have enough evidence to support a hypothesis at known confidence. You do not need to complete every section above — investigation terminates when either:
- You have a high-confidence hypothesis with corroboration, or
- You have a low/medium-confidence hypothesis and further queries are unlikely to change the picture (e.g., logs are unavailable, APM isn't instrumented, no recent changes found).
Default structure:
```
HYPOTHESIS (confidence: high | medium | low)
<One paragraph: service, symptom, most likely cause. Name the failure mode from the taxonomy.>

EVIDENCE
- <Finding from characterization, with the concrete metric or value.>
- <Finding from events / logs / APM.>
- <Finding from baseline comparison, dependency check, or change correlation if pursued.>

CONFIDENCE NOTE
<Only if not 'high'. What specific evidence is missing or ambiguous.>

RECOMMENDED NEXT STEPS
1. <Most actionable — typically a config check or metric to observe.>
2. <Secondary.>

DOWNSTREAM IMPACT
<Services depending on this workload, or 'No downstream dependencies identified.'>
```
When two hypotheses are live: replace HYPOTHESIS with COMPETING HYPOTHESES; list both, say which you lean toward and why, and list the evidence that would disambiguate them.
When no incident is found (symptom resolved, or alert appears spurious): say so directly.
ALERT FIRED BUT SYSTEM APPEARS HEALTHY is a valid output. List what you checked and what you didn't find.
Start at high and downgrade based on what's missing:
- Downgrade to medium if: primary signal is clear but corroboration is missing (no logs, no APM, no baseline comparison possible). Or: two modes fit and you can't disambiguate.
- Downgrade to low if: only a single signal supports the hypothesis, signals conflict, or the mode requires evidence you couldn't fetch.
Never return high when application log data was absent and the hypothesis depends on application behavior. Absence of evidence does not corroborate a hypothesis.
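
One way to read the downgrade rules is as an ordered check, lowest level first. A minimal Python sketch; the parameter names are hypothetical, and capping at medium for the "never return high" rule is an interpretation rather than something the rules state.

```python
def confidence(*, supporting_signals, signals_conflict=False,
               required_evidence_missing=False, corroboration_missing=False,
               modes_ambiguous=False, logs_absent=False,
               depends_on_app_behavior=False):
    """Start at high and downgrade based on what is missing."""
    if supporting_signals <= 1 or signals_conflict or required_evidence_missing:
        return "low"
    if corroboration_missing or modes_ambiguous:
        return "medium"
    if logs_absent and depends_on_app_behavior:
        # Assumption: "never high" is applied as a cap at medium.
        return "medium"
    return "high"

print(confidence(supporting_signals=3))
print(confidence(supporting_signals=3, logs_absent=True, depends_on_app_behavior=True))
print(confidence(supporting_signals=1))
```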
```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE k8s.namespace.name == "<ns>" AND @timestamp > NOW() - 1 hour
| STATS restarts = MAX(k8s.container.restarts) BY k8s.pod.name, k8s.container.status.last_terminated_reason
| WHERE restarts > 0
| SORT restarts DESC
| LIMIT 20
```
```esql
FROM metrics-kubeletstatsreceiver.otel-*
| WHERE k8s.pod.name == "<pod>" AND @timestamp > NOW() - 30 minutes
| STATS max_cpu_ratio = ROUND(MAX(k8s.pod.cpu_limit_utilization), 2),
        avg_cpu_ratio = ROUND(AVG(k8s.pod.cpu_limit_utilization), 2),
        max_cpu_cores = ROUND(MAX(k8s.pod.cpu.usage), 3)
```
Sustained ratio >1.0 = throttling. Transient >1.0 with avg <0.5 is usually benign burst.
```esql
FROM metrics-k8sclusterreceiver.otel-*
| WHERE @timestamp > NOW() - 15 minutes AND k8s.node.condition_memory_pressure == 1
| STATS ts = MAX(@timestamp) BY k8s.node.name
| SORT ts DESC
```
```esql
FROM logs-k8seventsreceiver.otel-*
| WHERE @timestamp > NOW() - 1 hour
  AND (attributes.k8s.event.reason == "FailedCreate"
       OR body.text LIKE "*admission webhook*"
       OR body.text LIKE "*exceeded quota*")
| SORT @timestamp DESC
| KEEP @timestamp, k8s.namespace.name, attributes.k8s.event.reason, body.text
| LIMIT 30
```
```
GET /api/alerting/rules/_find?search=k8s&search_fields=tags&filter=alert.attributes.executionStatus.status:active
```
Characterize first: get restart count, termination reason, memory and CPU utilization.
- If last_terminated_reason == "OOMKilled" and memory utilization hit 1.0 → memory path. Corroborate with 7-day baseline: monotonic rise over days = leak; spiky = load-driven. Check GC metrics if language is known.
- If last_terminated_reason == "Error" and cpu_limit_utilization > 1.0 → CPU throttling path. Corroborate with liveness probe config (initialDelaySeconds, timeoutSeconds) and K8s events for Unhealthy.
- If last_terminated_reason == "Error" and CPU is fine → application-logic path. Pull recent logs before termination.
- If last_terminated_reason == "ContainerCannotRun" → image/exec path. Check K8s events for Failed pull events.
Synthesize with appropriate confidence. If logs were unavailable on the Error path, downgrade to medium and say so.
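
The branching above can be sketched as a small Python function. This is an illustration of the decision order, not the workflow's implementation; the function name and path labels are hypothetical, and anything the branches do not cover is returned as unclassified for manual review.

```python
def classify_path(last_terminated_reason, mem_util, cpu_util):
    """Map characterization results to the investigation path to follow."""
    if last_terminated_reason == "OOMKilled" and mem_util is not None and mem_util >= 1.0:
        return "memory"
    if last_terminated_reason == "Error":
        if cpu_util is not None and cpu_util > 1.0:
            return "cpu_throttling"
        return "application_logic"
    if last_terminated_reason == "ContainerCannotRun":
        return "image_exec"
    # Includes OOMKilled without an observed utilization peak and missing fields.
    return "unclassified"

print(classify_path("OOMKilled", 1.02, 0.4))
print(classify_path("Error", 0.5, 1.3))
print(classify_path("Error", 0.5, 0.6))
print(classify_path("ContainerCannotRun", None, None))
```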
Authoritative signal: k8s.deployment.available < k8s.deployment.desired for > 10 minutes.
Diagnose the constraint:
- K8s events on the new ReplicaSet: FailedCreate → admission rejection (quota, webhook, PSP). FailedScheduling → no node fits.
- New-pod utilization: all at 0% memory → never started (image pull failure); high CPU with low memory → slow startup hitting readiness probe.
- HPA state: stable current_replicas < desired_replicas under load → unready-pod dampening.
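
The "available < desired for > 10 minutes" signal is a sustained-condition check rather than a point-in-time one. A minimal Python sketch, assuming time-ordered (timestamp, available, desired) samples have already been pulled; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta

def rollout_stuck(samples, threshold=timedelta(minutes=10)):
    """True when available < desired holds continuously for longer than threshold.

    samples: (timestamp, available, desired) tuples sorted by timestamp ascending.
    """
    shortfall_start = None
    for ts, available, desired in samples:
        if available < desired:
            if shortfall_start is None:
                shortfall_start = ts
            if ts - shortfall_start > threshold:
                return True
        else:
            shortfall_start = None  # recovered: restart the clock
    return False

t0 = datetime(2024, 5, 8, 14, 0)  # hypothetical rollout start
stuck = [(t0, 2, 3), (t0 + timedelta(minutes=5), 2, 3), (t0 + timedelta(minutes=11), 2, 3)]
recovered = [(t0, 2, 3), (t0 + timedelta(minutes=5), 3, 3), (t0 + timedelta(minutes=11), 2, 3)]
print(rollout_stuck(stuck), rollout_stuck(recovered))
```

Resetting the clock on recovery keeps a rollout that briefly dips twice from being reported as stuck.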
Possible and worth naming explicitly. Check:
- Has the symptom resolved? Compare current utilization/restart rate to the alert trigger point.
- Was the alert a transient spike that's already decayed?
- Is the alert tuned appropriately (e.g., too-short evaluation window)?
Output: ALERT FIRED BUT SYSTEM APPEARS HEALTHY with what you checked. Recommend alert tuning if the pattern is
recurrent.
- Workflow: K8s CrashLoopBackOff Investigation (alert-triggered automated version of the pod-level path above). Runs deterministic ES|QL + branches; this skill provides the interpretation layer the workflow lacks.
- Forge genome library: 16 K8s failure scenarios (OOMKill cascade, CPU throttling, probe misconfig, node NotReady, admission webhook block, etc.) validating this skill's coverage.