name kibana-anomaly-detection
description Elastic ML anomaly detection skill — investigation/RCA, score explanation, job operations (create, datafeed, start/stop, results), and troubleshooting (missing docs, memory limits, datafeed health, lifecycle). Operates against Kibana Agent Builder MCP tools (`ad_*`) on `.ml-anomalies-*`, `.ml-config`, `.ml-notifications-*`, `.ml-annotations-*`. Use when answering "what broke?"/"which entity?"/RCA, "why is score high/low?"/renormalization, "datafeed stopped"/"memory limit", or any request to set up or configure an ML anomaly detection job.
metadata author elastic · version 0.2.0
compatibility Kibana 8.x–9.x with Agent Builder and Workflows; Elasticsearch 8.x–9.x with machine learning

Elastic ML Anomaly Detection

Single skill covering all anomaly detection work against Kibana Agent Builder MCP at {KIBANA_URL}/api/agent_builder/mcp. Use the Mode Selector below to pick the right approach for the user's question — modes share the same tool surface and concepts.

Platform

  • Read path: ES|QL against .ml-anomalies-*, .ml-config, .ml-notifications-*, .ml-annotations-*
  • Always-available: platform.core.execute_esql (plus additional platform tools for search, index mapping, and documentation — see scripts/agent_builder_constants.json)
  • ML API spec (if available): .kibana_ai_openapi_spec_elasticsearch — see references/anomaly-detection-openapi-spec-discover.md for discovery pattern.
  • Run ad_validate_ml_tool_permissions first when tools return empty/misleading results — missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.

Mode Selector

User intent Mode
"What broke?" / RCA / cross-job / blast radius / influencers / log categories Investigate
"Why score high/low?" / renormalization / model bounds / forecasts Explain
Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars Troubleshoot
Create a job / configure a datafeed / start analysis / retrieve results Manage
Security framing (attack chains, MITRE, exfil) Investigate + references/security-anomaly-expert.md
Observability/SRE framing (degradation, capacity, deployment regression) Investigate + references/observability-anomaly-expert.md

When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving on.


Score Quick Reference

  • record_score bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
  • multi_bucket_impact ≥ 3 → sustained shift (not a transient spike)
  • initial_record_score >> record_score → renormalization (model saw worse anomalies later)
  • actual << typical with count/low_count/low_mean → absence/outage, not just low value
  • Low scores across many jobs > one high score — composite cross-job signal often beats single-detector severity

Full score definitions, renormalization mechanics, and anomaly_score_explanation components: references/score-reference.md.
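The bands and the sustained-shift rule above can be sketched as two small helpers. This is a minimal illustration, not part of the skill's tooling: the function names are hypothetical, and the handling of the exact boundary values (75, 50, 25) is an assumption because the bands above do not say which side they fall on.

```python
def score_band(record_score: float) -> str:
    """Map a record_score (0-100) to the band names used in this skill.

    Assumption: boundary values belong to the band that starts at them,
    except 75, which stays in "warning" because critical is strictly > 75.
    """
    if record_score > 75:
        return "critical"
    if record_score >= 50:
        return "warning"
    if record_score >= 25:
        return "minor"
    return "informational"


def is_sustained(multi_bucket_impact: float) -> bool:
    """multi_bucket_impact >= 3 is read as a sustained shift, not a spike."""
    return multi_bucket_impact >= 3


print(score_band(82.0), score_band(60.0), score_band(10.0))
print(is_sustained(4.0))
```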

Core concepts

Treat .ml-anomalies-* as three layers, accessed via result_type:

  • bucket — bucket-level unusualness per bucket_span. anomaly_score is the aggregate across all detectors.
  • record — finest-grained rows with actual vs typical, probability, record_score, anomaly_score_explanation.
  • influencer — entity contributions ranked within a bucket (influencer_score).

Read scores this way:

  • anomaly_score / record_score = current normalized values (move as the model sees new extremes).
  • initial_anomaly_score / initial_record_score = immutable snapshots from detection time.
  • Compare actual to typical; use probability for raw likelihood.
  • Map entities via partition_field_value / by_field_value / over_field_value.
  • Read multi_bucket_impact (-5 to +5) to separate single-bucket spikes from sustained trends.
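As a rough sketch of the reading rules above, the helpers below pull the entity out of a record and compare actual to typical. The helper names and the field precedence (partition, then by, then over) are assumptions made for illustration; the record is a plain dict shaped like a `result_type: record` document.

```python
from typing import Optional

# Assumed precedence when several entity fields are populated.
ENTITY_FIELDS = ("partition_field_value", "by_field_value", "over_field_value")


def entity_of(record: dict) -> Optional[str]:
    """Return the first populated entity field, or None if there is none."""
    for field in ENTITY_FIELDS:
        value = record.get(field)
        if value:
            return str(value)
    return None


def _scalar(value):
    """actual/typical may arrive as single-element lists; unwrap them."""
    return value[0] if isinstance(value, list) else value


def direction(record: dict) -> str:
    """Say whether the observed value sits above or below the typical value."""
    actual, typical = _scalar(record["actual"]), _scalar(record["typical"])
    if actual > typical:
        return "above typical"
    if actual < typical:
        return "below typical"
    return "at typical"


record = {"by_field_value": "host-7", "actual": [3.0], "typical": [120.0]}
print(entity_of(record), direction(record))
```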

Mode: Investigate — RCA

When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.

Tool chain

Phase Tools
Discovery ad_get_available_metadata, ad_get_jobs, ad_discover_related_jobs, ad_discover_jobs_by_datafeed_index
Timeline / scope ad_query_anomaly_timeline
Cross-job / entities ad_rca_cross_job_entity_match, ad_rca_multi_job_entities, ad_rca_entity_profile
Records / influencers ad_query_anomaly_records, ad_query_influencers
RCA depth ad_rca_detector_fingerprint, ad_rca_correlation, ad_rca_blast_radius, ad_rca_score_reassessment
Evidence / categories ad_get_job_datafeed_config, ad_rca_source_evidence, ad_get_categories, ad_search_log_category_examples

Protocol

Follow the 14-step sequence in references/protocols/investigation.md. High level: ad_get_available_metadata → pair ad_discover_jobs_by_datafeed_index with ad_discover_related_jobs → ad_query_anomaly_timeline → rank with ad_rca_multi_job_entities (min_job_count=2) → ad_rca_detector_fingerprint → drill with ad_query_anomaly_records + ad_query_influencers (low min_score=25) → profile with ad_rca_entity_profile → order with ad_rca_correlation → confirm with ad_rca_source_evidence. When by_field_name == "mlcategory", compare with ad_get_categories + paired ad_search_log_category_examples (baseline vs. anomaly window).

Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.

Rules

  1. Multi-job entities are prime suspects; single-job entities are usually victims. Use min_job_count=2.
  2. Earliest anomaly timestamp wins — sort ad_rca_correlation by timestamp; first-appearing entity = origin.
  3. multi_bucket_impact ≥ 3 = sustained behavioral shift, weight higher than transient spikes.
  4. Never close an RCA without ad_rca_source_evidence — raw source documents are ground truth.
  5. Use low min_score (25 or lower) for influencer queries — high thresholds miss correlated entities.
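Rules 1 and 2 can be illustrated with a small in-memory ranking. This is only a sketch of the reasoning, not what ad_rca_multi_job_entities or ad_rca_correlation actually run: the function name is hypothetical, and the input rows are assumed to be pre-flattened dicts with `entity`, `job_id`, and an epoch `timestamp`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def rank_suspects(records: Iterable[dict], min_job_count: int = 2) -> List[str]:
    """Keep entities seen in >= min_job_count jobs, earliest anomaly first."""
    jobs: Dict[str, Set[str]] = defaultdict(set)
    first_seen: Dict[str, int] = {}
    for row in records:
        entity = row["entity"]
        jobs[entity].add(row["job_id"])
        ts = row["timestamp"]
        if entity not in first_seen or ts < first_seen[entity]:
            first_seen[entity] = ts
    suspects = [e for e, seen in jobs.items() if len(seen) >= min_job_count]
    # Earliest first; ties go to the entity present in more jobs.
    return sorted(suspects, key=lambda e: (first_seen[e], -len(jobs[e])))


rows = [
    {"entity": "host-a", "job_id": "cpu", "timestamp": 100},
    {"entity": "host-a", "job_id": "errors", "timestamp": 130},
    {"entity": "host-b", "job_id": "cpu", "timestamp": 90},
    {"entity": "host-c", "job_id": "errors", "timestamp": 110},
    {"entity": "host-c", "job_id": "latency", "timestamp": 120},
]
print(rank_suspects(rows))
```

Here host-b appears earliest but only in one job, so it is treated as a likely victim and dropped before ordering.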

Mode: Explain — Score / model behavior

When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".

Score types

Field Scope Meaning
record_score Single record Normalized severity after renormalization.
initial_record_score Single record Score at detection time. Gap vs record_score = renormalization drift.
anomaly_score Bucket Aggregate severity across all detectors in a bucket.
influencer_score Entity × bucket How anomalous a specific entity is in that bucket.

anomaly_score_explanation components

Component Effect What it means
anomaly_length ↑ score More consecutive anomalous buckets
single_bucket_impact ↑ score Lower probability → higher impact
multi_bucket_impact ↑ score Sustained pattern contribution
anomaly_characteristics_impact ↑ score Mean shift vs. variance change
high_variance_penalty ↓ score Noisy data → wide bounds → anomaly less surprising
incomplete_bucket_penalty ↓ score Bucket has less data than expected (ingest lag, sparse data)
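The table above can be turned into a short narration helper. The sketch below assumes the explanation arrives as a plain dict, that the score-raising components are numeric, and that the two penalties are truthy when applied; the function name and wording of the notes are illustrative only.

```python
from typing import List

RAISES = {
    "anomaly_length": "more consecutive anomalous buckets",
    "single_bucket_impact": "low probability in this bucket",
    "multi_bucket_impact": "sustained pattern contribution",
    "anomaly_characteristics_impact": "mean shift vs. variance change",
}
LOWERS = {
    "high_variance_penalty": "noisy data, wide bounds",
    "incomplete_bucket_penalty": "bucket has less data than expected",
}


def explain_factors(explanation: dict) -> List[str]:
    """List which anomaly_score_explanation components pushed the score."""
    notes = []
    for key, meaning in RAISES.items():
        if (explanation.get(key) or 0) > 0:
            notes.append(f"raises score: {key} ({meaning})")
    for key, meaning in LOWERS.items():
        if explanation.get(key):
            notes.append(f"lowers score: {key} ({meaning})")
    return notes


for note in explain_factors({"single_bucket_impact": 30, "high_variance_penalty": True}):
    print(note)
```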

Why a score looks wrong

  • Unexpectedly low: high_variance_penalty, renormalization, <3 weeks training for weekly seasonality, bucket_span too large, wrong detector function (mean vs high_mean), incomplete_bucket_penalty, suppression by custom_rules.
  • Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), use_null: true on a sparse field.

Tool chain

Purpose Tools
Records + explanation ad_query_anomaly_records (exact job_id_pattern)
Renormalization drift ad_rca_score_reassessment (score_drift = initial_record_score - record_score)
Model bounds (visual) ad_get_model_plot — actual outside model_lower/model_upper = anomaly
Forecast overlap ad_get_forecast_results
Influencer attribution ad_query_influencers
Config & detector ad_get_job_datafeed_config (bucket_span, function, custom_rules, use_null)
Categorization ad_get_categories
Model snapshots ad_get_model_snapshots
Structured diagnostic ad_wf_troubleshoot_anomaly_score (full decision tree)

Decision tree (ad_wf_troubleshoot_anomaly_score)

  1. ad_get_jobs — ≥3 weeks data for weekly seasonality?
  2. ad_ts_model_memory_health: memory_status healthy?
  3. ad_ts_delayed_data_annotations — no incomplete buckets?
  4. ad_query_anomaly_records — compare record_score vs initial_record_score.
  5. ad_get_job_datafeed_config: bucket_span, detector function, custom_rules, use_null.
  6. ad_get_model_plot — wide bounds → high_variance_penalty.
  7. ad_rca_score_reassessment — renormalization drift across history.
  8. Explain anomaly_score_explanation factors.

Rules

  1. Always show both initial_record_score and record_score — the gap is the renormalization story.
  2. Explain renormalization before diagnosing config — score drift is the most common "score dropped" cause and needs no config change.
  3. actual << typical with count/low_count is an absence anomaly — distinguish outages from value spikes.
  4. high_variance_penalty and incomplete_bucket_penalty explain most "low score" surprises without remediation.
  5. Weekly seasonality needs ≥3 weeks of training data — flag young jobs as the cause.
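Rules 1 and 3 reduce to two small checks on a record. The sketch below is illustrative: the function names are hypothetical, `actual` and `typical` are assumed to be scalars, and the 10% ratio used to stand in for "actual << typical" is an arbitrary assumption that should be tuned per dataset.

```python
ABSENCE_FUNCTIONS = {"count", "low_count", "low_mean"}


def score_drift(record: dict) -> float:
    """Renormalization drift: initial_record_score minus current record_score."""
    return record["initial_record_score"] - record["record_score"]


def is_absence(record: dict, ratio: float = 0.1) -> bool:
    """Flag absence/outage anomalies.

    Assumption: "actual << typical" is approximated as actual < ratio * typical.
    """
    if record.get("function") not in ABSENCE_FUNCTIONS:
        return False
    return record["actual"] < ratio * record["typical"]


rec = {
    "function": "low_count",
    "actual": 2,
    "typical": 100,
    "initial_record_score": 90.0,
    "record_score": 55.0,
}
print(score_drift(rec), is_absence(rec))
```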

For detector function selection details, see references/anomaly-detection-functions.md.


Mode: Troubleshoot — Job ops

When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.

Common issues → fast paths

Issue Fast path Full decision tree
Missing docs / query_delay warning ad_ts_delayed_data_annotations → ad_ts_bucket_event_gaps → ad_ts_ingest_latency_estimate → ad_update_datafeed_query_delay ad_wf_troubleshoot_query_delay
Memory soft_limit / hard_limit ad_ts_model_memory_health → ad_wf_ts_field_cardinality → ad_estimate_memory_requirement → ad_update_model_memory_limit ad_wf_troubleshoot_memory_limit
Datafeed not running / job state ad_get_jobs (state) → ad_get_job_messages → ad_manage_datafeed
CCS / remote_cluster: indices ad_ts_ccs_diagnostics
Score sanity check ad_wf_troubleshoot_anomaly_score

hard_limit corrupts model state and causes downstream missing-doc false alarms (categorizer silently skips events for unknown categories). Fix memory before fixing query_delay.

Memory concepts

Field Meaning
model_bytes Current memory used
peak_model_bytes High-water mark since job opened
model_bytes_memory_limit Configured model_memory_limit
memory_status ok / soft_limit (pruning) / hard_limit (critical)
total_by_field_count > 100k by_field cardinality too high — dominant driver
total_partition_field_count > 10k Partition explosion
total_category_count > 10k Too many distinct log patterns

Prefer ad_estimate_memory_requirement (samples cardinality from source, calls Estimate Model Memory API) over heuristics like peak_model_bytes * 1.3 — the heuristic ignores pure influencer and categorization memory.
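The thresholds in the memory table can be applied to a model_size_stats-style dict as a quick triage. This is a sketch under the assumption that the stats are already fetched into a plain dict; the function name and finding messages are hypothetical and it does not replace ad_estimate_memory_requirement.

```python
from typing import List


def memory_findings(stats: dict) -> List[str]:
    """Apply the memory_status and cardinality thresholds from the table."""
    findings = []
    status = stats.get("memory_status", "ok")
    if status == "hard_limit":
        findings.append("hard_limit: critical, fix memory first")
    elif status == "soft_limit":
        findings.append("soft_limit: model is pruning")
    thresholds = (
        ("total_by_field_count", 100_000, "by_field cardinality too high"),
        ("total_partition_field_count", 10_000, "partition explosion"),
        ("total_category_count", 10_000, "too many distinct log patterns"),
    )
    for field, limit, meaning in thresholds:
        if stats.get(field, 0) > limit:
            findings.append(f"{field} > {limit}: {meaning}")
    return findings


print(memory_findings({"memory_status": "hard_limit", "total_by_field_count": 250_000}))
```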

Datafeed & timing concepts

  • query_delay: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default 60s–120s).
  • delayed_data_check_config: how aggressively the datafeed checks for late data.
  • bucket_span: analysis interval. Align with data granularity and detection window.
  • frequency: how often the datafeed queries while running in real time. When unset, Elasticsearch derives it from bucket_span (the bucket span itself for short spans, a fraction of it for longer ones).
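The "P95 ingest latency + buffer" guidance for query_delay can be worked through with a few latency samples. The sketch below uses a nearest-rank percentile; the function names and the 30-second default buffer are assumptions for illustration, and the samples would normally come from something like ad_ts_ingest_latency_estimate.

```python
import math
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def suggest_query_delay(latencies_s: Sequence[float], buffer_s: float = 30.0) -> str:
    """Return a query_delay string: P95 latency plus buffer, rounded up."""
    return f"{math.ceil(p95(latencies_s) + buffer_s)}s"


# Ingest latencies in seconds (event timestamp to index time), made-up samples.
print(suggest_query_delay([12, 15, 20, 22, 48]))
```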

Lifecycle for config changes (memory limit, query_delay)

  1. Stop datafeed: ad_manage_datafeed (action=_stop)
  2. Close job
  3. Update config: ad_update_model_memory_limit, ad_update_datafeed_query_delay, ad_update_delayed_data_check_config
  4. Open job: ad_open_job
  5. Start datafeed: ad_manage_datafeed (action=_start)
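The lifecycle above can be written out as an ordered list of Elasticsearch ML REST calls. This is a sketch that only builds the request plan and sends nothing; the mapping from the ad_* tools to these endpoints is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str, Optional[dict]]


def config_change_steps(
    job_id: str,
    datafeed_id: str,
    model_memory_limit: Optional[str] = None,
    query_delay: Optional[str] = None,
) -> List[Step]:
    """Return (method, path, body) tuples in the stop, close, update, open, start order."""
    steps: List[Step] = [
        ("POST", f"_ml/datafeeds/{datafeed_id}/_stop", None),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_close", None),
    ]
    if model_memory_limit is not None:
        body = {"analysis_limits": {"model_memory_limit": model_memory_limit}}
        steps.append(("POST", f"_ml/anomaly_detectors/{job_id}/_update", body))
    if query_delay is not None:
        steps.append(("POST", f"_ml/datafeeds/{datafeed_id}/_update", {"query_delay": query_delay}))
    steps.append(("POST", f"_ml/anomaly_detectors/{job_id}/_open", None))
    steps.append(("POST", f"_ml/datafeeds/{datafeed_id}/_start", None))
    return steps


for method, path, body in config_change_steps("web-errors", "datafeed-web-errors", "512mb"):
    print(method, path, body or "")
```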

Recover a corrupted period without resetting the whole model: ad_revert_model_snapshot.

Tool surface

Category Tools
Permissions / metadata ad_validate_ml_tool_permissions, ad_get_available_metadata, ad_get_jobs
Job + datafeed state ad_get_job_datafeed_config, ad_get_job_messages, ad_manage_datafeed, ad_preview_datafeed_with_latency
Timing / missing docs ad_ts_delayed_data_annotations, ad_ts_bucket_event_gaps, ad_ts_ingest_latency_estimate, ad_update_datafeed_query_delay, ad_update_delayed_data_check_config, ad_wf_troubleshoot_query_delay
Memory ad_ts_model_memory_health, ad_wf_ts_field_cardinality, ad_estimate_memory_requirement, ad_update_model_memory_limit, ad_wf_troubleshoot_memory_limit
Model / lifecycle ad_get_model_snapshots, ad_revert_model_snapshot, ad_open_job, ad_create_job
CCS ad_ts_ccs_diagnostics
Calendars ad_get_calendar_events, ad_create_calendar_event

Full parameter tables, ES|QL templates, and REST step lists: references/troubleshoot-anomaly-tool-reference.md.

Rules

  1. ad_validate_ml_tool_permissions first — missing privileges produce misleading empty results.
  2. Fix memory before query_delay. hard_limit corrupts state; query_delay fixes on a memory-limited job are wasted.
  3. Stop the datafeed before updating it. Updating a running datafeed is rejected.
  4. Close the job before updating memory limit. Sequence above.
  5. Prefer workflow tools (ad_wf_*) over manually chaining diagnostics for complex decisions.
  6. ad_preview_datafeed_with_latency before starting — confirm the datafeed returns data after config changes.

Mode: Manage — Create / configure jobs

When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".

4-step workflow

PUT  _ml/anomaly_detectors/<job_id>          # 1. Define job        (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>         # 2. Define datafeed   (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open    # 3a. Open job         (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start  # 3b. Start datafeed   (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results

Process

  1. Build configs. Parse the user request into job + datafeed JSON with no null fields.

  2. Apply smart defaults:

    Field Default Override when
    bucket_span "15m" User specifies a different span
    time_field "@timestamp" User names a different timestamp field
    index "logs-*" User specifies an index or pattern
    datafeed_query {"match_all": {}} User mentions filters, processes, or time windows
    influencers by/over/partition fields from detectors User adds extra influencer fields
    job_id Generated from user description User provides an explicit ID
    query_delay "60s" P95 ingest latency is higher
  3. Choose detector function from user intent — full table in references/anomaly-detection-functions.md:

    • "high CPU" / "unusually large" → high_mean or high_sum
    • "rare logins" / "unusual values" → rare (variants below)
    • "too many requests" / "spike in count" → high_count

    rare variants:

    • Infrequent globally → rare by_field_name: X
    • Infrequent vs peers → rare by_field_name: X over_field_name: Y
    • Infrequent per segment → rare by_field_name: X partition_field_name: Y
    • Infrequent per segment vs peers → rare by_field_name: X over_field_name: Y partition_field_name: Z
  4. Validate. platform.core.get_index_mapping on the target index to verify field existence/types → ad_validate_job_spec. If errors, fix and re-validate (max 3 attempts).

  5. Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If feedback, incorporate and re-present (up to 3 rounds).

  6. Deploy. After confirmation: ad_create_job → ad_create_datafeed → ad_open_job → ad_manage_datafeed (action=_start). Report final job_id and datafeed_id.

For batch analysis on historical data, pass start and end to the datafeed start call.

Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
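Steps 1 and 2 of the process can be sketched as a builder that applies the smart defaults and leaves out unset fields so that no nulls reach the API. The function name and argument layout are hypothetical; the body shapes follow the Elasticsearch create job and create datafeed APIs, and the datafeed ID follows the datafeed-<job_id> pattern from the workflow above.

```python
from typing import Optional, Tuple


def build_configs(
    job_id: str,
    function: str,
    field_name: Optional[str] = None,
    by_field_name: Optional[str] = None,
    over_field_name: Optional[str] = None,
    partition_field_name: Optional[str] = None,
    bucket_span: str = "15m",
    time_field: str = "@timestamp",
    index: str = "logs-*",
    query: Optional[dict] = None,
    query_delay: str = "60s",
) -> Tuple[dict, str, dict]:
    """Return (job_body, datafeed_id, datafeed_body) with defaults applied."""
    optional = {
        "field_name": field_name,
        "by_field_name": by_field_name,
        "over_field_name": over_field_name,
        "partition_field_name": partition_field_name,
    }
    detector = {"function": function}
    detector.update({k: v for k, v in optional.items() if v is not None})
    # Default influencers: the by/over/partition fields used by the detector.
    split_fields = (by_field_name, over_field_name, partition_field_name)
    influencers = [f for f in split_fields if f is not None]
    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query if query is not None else {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, f"datafeed-{job_id}", datafeed


job, feed_id, feed = build_configs("rare-user", "rare", by_field_name="user.name")
print(feed_id, job["analysis_config"]["detectors"])
```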

Rules

  1. Create job before datafeed. Datafeed references job by ID.
  2. Open job before starting datafeed. Start on a closed job is rejected.
  3. query_delay = P95 ingest latency + buffer (60s–120s safe default).
  4. Forecasts require non-population jobs. Jobs with over_field_name cannot be forecast; warn before attempting.
  5. by_field_name vs over_field_name: by compares entity to its own history; over compares to peer group in the same bucket. partition_field_name = fully independent sub-model with its own normalization.
  6. bucket_span matches detection granularity — 15m for high-frequency, 1h for operational metrics, 1d for daily patterns. Larger smooths short spikes; smaller increases noise.

Registration (Kibana Agent Builder)

Requires Node.js 18+. Defaults to elastic/changeme when no credentials are supplied.

cd skills/kibana/kibana-anomaly-detection

# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601

# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure

all register runs tools register, then workflows register, then skills register. Kibana allows at most five tool_ids per skill; the script fills them by scanning SKILL.md for tool mentions (in document order), then appends ids from references/kibana/tools/esql/*.json until the cap (workflow-only tools omitted by default). If you run skills register alone, run tools register first so those ids exist.
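The five-ID selection described above can be pictured roughly as follows. This is not the code in scripts/kibana-agent-builder.mjs: the regex, the `ad_wf_` exclusion prefix, and the function name are all assumptions made to illustrate the "document order, then fill from fallback until the cap" idea.

```python
import re
from typing import Iterable, List


def pick_tool_ids(
    skill_text: str,
    fallback_ids: Iterable[str],
    cap: int = 5,
    exclude_prefix: str = "ad_wf_",  # assumed marker for workflow-only tools
) -> List[str]:
    """Collect tool IDs in order of first mention, then top up from fallback."""
    picked: List[str] = []

    def consider(tool_id: str) -> None:
        if len(picked) >= cap or tool_id in picked:
            return
        if tool_id.startswith(exclude_prefix):
            return
        picked.append(tool_id)

    for match in re.finditer(r"\bad_[a-z0-9_]+", skill_text):
        consider(match.group(0))
    for tool_id in fallback_ids:
        consider(tool_id)
    return picked


text = (
    "Run ad_get_jobs, then ad_query_influencers and ad_get_jobs again; "
    "ad_wf_troubleshoot_memory_limit is a workflow."
)
print(pick_tool_ids(text, ["ad_a", "ad_b", "ad_c", "ad_d", "ad_e"]))
```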

Workflow tool exclusions and prefixes live in scripts/agent_builder_constants.json.

MCP API key permissions:

  • Kibana: read_onechat, space_read
  • Index: read, view_index_metadata on .ml-anomalies-*, .ml-annotations-*, .ml-notifications-*, .ml-config
  • For source evidence: read on source data indices

Tool inventory

ES|QL tool specs live under references/kibana/tools/esql/*.json; workflow definitions under references/kibana/workflows/*.yaml. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).

Key system indices

Index Relevant content
.ml-anomalies-* record, bucket, influencer, model_plot, model_forecast, model_snapshot, category_definition, model_size_stats
.ml-config job/datafeed documents (visible even for never-run jobs)
.ml-annotations-* delayed data (event == "delayed_data")
.ml-notifications-* job messages (level: info/warning/error)

Examples

RCA: "Something caused a spike in our error rate at 2pm. What broke?" → Investigate → ad_get_available_metadata → ad_query_anomaly_timeline → ad_rca_cross_job_entity_match → ad_rca_multi_job_entities → RCA report.

Score drop: "My anomaly score went from 90 to 55 — did the model change?" → Explain → ad_rca_score_reassessment for drift → explain renormalization if score_drift is large.

Memory limit: "Job status shows hard_limit and results look wrong." → Troubleshoot → ad_ts_model_memory_health → ad_wf_ts_field_cardinality → ad_estimate_memory_requirement → ad_update_model_memory_limit (lifecycle: stop datafeed → close → update → open → start).

New job: "Detect unusual error rates per host on nginx access logs." → Manage → high_count detector with by_field_name: "host.keyword" → validate → present → deploy.

Multi-mode: "We had an incident last night, scores were high but now low — is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if hard_limit or delayed data is suspected.


Guidelines

  1. Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
  2. ad_validate_ml_tool_permissions first on empty results — privileges are the most common false-negative cause.
  3. Score bands are absolute thresholds: >75 critical, 50–75 warning, 25–50 minor, <25 informational.
  4. Multi-job entities are prime suspects. Use min_job_count=2 in ad_rca_multi_job_entities.
  5. Show initial_record_score alongside record_score — the gap tells the renormalization story.
  6. Fix memory before query_delay. hard_limit invalidates downstream diagnostics.
  7. Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.
  8. Confirm RCAs with ad_rca_source_evidence. Raw source documents are ground truth.