[9.4] [Entity Store] Cap window size (#268170) #268446

Merged
romulets merged 2 commits into elastic:9.4 from romulets:backport/9.4/pr-268170 on May 8, 2026
romulets commented May 8, 2026

Backport

This will backport the following commits from main to 9.4:

Questions?

Please refer to the Backport tool documentation

In lagging environments the entity-store extraction window grows
without bound. `getExtractionWindow` always sets `toDateISO = now - delay`,
while `fromDateISO` advances only via `lastExecutionTimestamp` /
`paginationTimestamp`. If a run cannot keep up, each subsequent run sees
a wider window. The probe (`buildLogPaginationCursorProbeEsql`) sorts
every doc in that window in ES|QL, so probe cost grows with window size.
This feeds a death spiral: slow runs widen the window, which slows
the next run further.
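
To make the growth concrete, here is a minimal sketch of the window computation described above. It is not the actual `getExtractionWindow` code: the function name `sketchExtractionWindow`, its parameters, and the fallback to a look-back period when no checkpoint exists are all assumptions for illustration.

```typescript
// Illustrative sketch only: names and the look-back fallback are assumptions,
// not the actual Kibana implementation.
interface ExtractionWindow {
  fromDateISO: string;
  toDateISO: string;
}

function sketchExtractionWindow(
  nowMs: number,
  delayMs: number,
  lookbackMs: number,
  lastExecutionTimestamp?: string
): ExtractionWindow {
  // The upper bound always tracks the wall clock.
  const toMs = nowMs - delayMs;
  // The lower bound only moves when a checkpoint has been persisted.
  const fromMs =
    lastExecutionTimestamp !== undefined
      ? Date.parse(lastExecutionTimestamp)
      : nowMs - lookbackMs;
  return {
    fromDateISO: new Date(fromMs).toISOString(),
    toDateISO: new Date(toMs).toISOString(),
  };
}

function windowWidthMs(w: ExtractionWindow): number {
  return Date.parse(w.toDateISO) - Date.parse(w.fromDateISO);
}

// Three scheduled runs, one minute apart, with a checkpoint that never advances.
const ONE_MINUTE_MS = 60 * 1000;
const firstRunMs = Date.parse("2026-05-08T12:00:00.000Z");
const stalledCheckpoint = "2026-05-08T11:00:00.000Z";
const widthsMs = [0, 1, 2].map((run) =>
  windowWidthMs(
    sketchExtractionWindow(
      firstRunMs + run * ONE_MINUTE_MS,
      ONE_MINUTE_MS,
      180 * ONE_MINUTE_MS,
      stalledCheckpoint
    )
  )
);
```

With a stalled checkpoint, `widthsMs` grows by one minute per run, which is the uncapped behavior this PR bounds.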

We want a hard cap on the width of each probe's window so that probe
cost stays bounded regardless of how far behind the engine is. The cap
is purely a cost-bounding device for the probe — it does not
artificially defer catch-up to a later run.

Within a single `extractLogs` execution, once a capped sub-window is
drained we immediately advance to the next sub-window and continue,
until we reach the effective window end (`now - delay`). Only when a run
is interrupted (crash, abort, hitting a slow probe) do we resume on the
next scheduled run from the last persisted `lastExecutionTimestamp`.

### How it works

When the gap between `fromDateISO` and the effective window end (`now -
delay`) exceeds `maxTimeWindowSize + GRACE_PERIOD` (default `15m +
30s`), the run processes the time range as a sequence of capped
`[fromSub, toSub]` sub-windows of width `maxTimeWindowSize`, advancing
within a single execution until the effective end is reached.
Sub-windows are an in-memory iteration concept — the saved-object schema
is unaware of them. Crash recovery uses the per-slice persistence
emitted by the inner outer-loop (last `paginationTimestamp` /
`checkpointTimestamp` written). Manual `specificWindow` /
`windowOverride` runs bypass capping and run as a single pass.
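
The splitting rule above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `splitIntoSubWindows` and its parameter names are hypothetical, times are plain epoch milliseconds, and the handling of the final, shorter sub-window is an assumption. The real implementation iterates in memory and persists progress between slices, which this sketch omits.

```typescript
// Hypothetical helper for illustration; not the entity-store implementation.
const GRACE_PERIOD_MS = 30 * 1000; // 30s default taken from the description above

interface SubWindow {
  fromSub: number; // epoch milliseconds
  toSub: number; // epoch milliseconds
}

function splitIntoSubWindows(
  fromMs: number,
  effectiveEndMs: number,
  maxTimeWindowSizeMs: number
): SubWindow[] {
  if (maxTimeWindowSizeMs <= 0) {
    throw new RangeError("maxTimeWindowSizeMs must be positive");
  }
  if (effectiveEndMs <= fromMs) {
    return []; // nothing to process
  }
  // Within cap + grace period: run as a single pass, no capping.
  if (effectiveEndMs - fromMs <= maxTimeWindowSizeMs + GRACE_PERIOD_MS) {
    return [{ fromSub: fromMs, toSub: effectiveEndMs }];
  }
  // Otherwise walk the range in steps of at most maxTimeWindowSizeMs.
  // Assumption: the last sub-window is simply clamped to the effective end.
  const subWindows: SubWindow[] = [];
  let cursor = fromMs;
  while (cursor < effectiveEndMs) {
    const toSub = Math.min(cursor + maxTimeWindowSizeMs, effectiveEndMs);
    subWindows.push({ fromSub: cursor, toSub });
    cursor = toSub;
  }
  return subWindows;
}
```

Returning the whole list keeps the sketch easy to inspect. A caller that wants to stop early on abort would more likely advance one sub-window at a time.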

### Added

- `maxTimeWindowSize` parameter in the global configuration, available
on the install and update paths. It is also exposed via the status API.

### Why default 15m

`15m` seems to be a reasonable cap given the default `3h` look-back period.
A cap as short as `1m` would cause 180 queries to Elasticsearch to cover
that period. `15m` causes only 12.
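
The counts come from dividing the look-back period by the cap and rounding up. A small sketch of that arithmetic, with a hypothetical helper name and assuming one query per sub-window:

```typescript
// Hypothetical helper: number of capped sub-windows needed to cover a look-back period.
function subWindowCount(lookbackMs: number, capMs: number): number {
  if (capMs <= 0) {
    throw new RangeError("capMs must be positive");
  }
  return Math.ceil(lookbackMs / capMs);
}

const LOOKBACK_3H_MS = 3 * 60 * 60 * 1000;

const countAt1m = subWindowCount(LOOKBACK_3H_MS, 1 * 60 * 1000); // 180
const countAt15m = subWindowCount(LOOKBACK_3H_MS, 15 * 60 * 1000); // 12
```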

This will need to be tuned in heavy environments where `15m` worth
of data accounts for millions of logs.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 555237a)

# Conflicts:
#	src/core/server/integration_tests/ci_checks/saved_objects/check_registered_types.test.ts
#	x-pack/solutions/security/plugins/entity_store/server/routes/apis/status.ts
romulets requested a review from kibanamachine as a code owner May 8, 2026 13:43
romulets added the backport (This PR is a backport of another PR) label May 8, 2026
romulets enabled auto-merge (squash) May 8, 2026 13:43
@kibanamachine

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #118 / Cloud Security Posture - Group 3 (Dashboards + Vulns) Vulnerabilities Page - Grouping Default Grouping groups vulnerabilities by cloud account and sort by number of vulnerabilities desc
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - APM integration not installed but setup completed - Admin user
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Collector integration is not installed - collector integration missing
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Collector integration is not installed - Symbolizer integration is not installed
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Profiling is not setup and no data is loaded - Admin users
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Profiling is not setup and no data is loaded - Viewer users
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Profiling is setup and data is loaded - Admin user
  • [job] [logs] Scout Lane #9 - stateful-classic / default / local-stateful-classic - Profiling is setup and data is loaded - Viewer user
  • [job] [logs] Investigations - Security Solution Cypress Tests #5 / timeline overview search should show all timelines when no search term was entered should show all timelines when no search term was entered

Metrics [docs]

✅ unchanged

romulets merged commit dd70d94 into elastic:9.4 May 8, 2026
28 checks passed