[Security Solution] Randomize task schedules when bulk scheduling #269991
Resolves: elastic#195136

Bring `TaskScheduling.bulkSchedule` in line with `bulkEnable`: the first task in a batch still runs immediately, but subsequent enabled recurring tasks are scheduled with a randomized `runAt` between 1ms and `min(interval, 5m)` from "now". Ad-hoc and disabled tasks are left untouched and run immediately as before. This prevents the polling queue from being flooded with simultaneous claims when many recurring tasks are bulk-scheduled in one call.

`randomlyOffsetRunTimestamp` is replaced by a smaller pure helper `addJitter(interval?: string)` that returns just the timing fields, and the `bulkEnable` `i > 0` branch uses it too.

Co-authored-by: Cursor <cursoragent@cursor.com>
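For readers who want a concrete picture of the helper described above, here is a minimal sketch of what a pure `addJitter(interval?: string)` could look like. It is not the Task Manager source: `parseIntervalMs`, `UNIT_MS`, `MAX_JITTER_MS`, the `Scheduling` type, and the injectable `now` / `random` parameters are all hypothetical names introduced for illustration, and the interval format (`30s`, `5m`, `1h`, `1d`) is an assumption.

```typescript
// Hypothetical sketch only; names and interval format are assumptions.
const MAX_JITTER_MS = 5 * 60 * 1000; // the 5m cap mentioned in the description

const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Assumed format: a positive integer followed by s, m, h, or d (e.g. "30s").
function parseIntervalMs(interval: string): number {
  const match = /^([1-9][0-9]*)([smhd])$/.exec(interval);
  const amount = Number(match?.[1]);
  const unit = UNIT_MS[match?.[2] ?? ''];
  if (!Number.isFinite(amount) || unit === undefined) {
    throw new Error(`Invalid interval "${interval}"`);
  }
  return amount * unit;
}

interface Scheduling {
  runAt: Date;
  scheduledAt: Date;
}

// Returns only the timing fields, or undefined when no interval is supplied
// (an ad-hoc task). `now` and `random` are injectable so the helper is pure.
function addJitter(
  interval?: string,
  now: number = Date.now(),
  random: () => number = Math.random
): Scheduling | undefined {
  if (!interval) return undefined;
  const windowMs = Math.min(parseIntervalMs(interval), MAX_JITTER_MS);
  // random() is in [0, 1), so the offset falls in [1, windowMs] milliseconds.
  const offsetMs = 1 + Math.floor(random() * windowMs);
  const runAt = new Date(now + offsetMs);
  return { runAt, scheduledAt: new Date(runAt.getTime()) };
}
```

Returning `undefined` for a missing interval leaves the "run now" decision to the caller, which matches the description's point that the helper returns just the timing fields.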
Pinging @elastic/response-ops (Team:ResponseOps)
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)
const enabled = modifiedTask.enabled ?? true;
// Run the first task now. Run all other tasks a random number of ms in the future,
// with a maximum of 5 minutes or the task interval, whichever is smaller.
const runAt = enabled && i > 0 ? addJitter(modifiedTask.schedule?.interval) ?? {} : {};
nit: since the variable contains an object with more fields than `runAt` (`{ runAt: Date, scheduledAt: Date }`), shall we name it `jitterSchedule` or something similar?
Thanks @darnautov, addressed in 08379a0
I went with the variable name `scheduling` to cover both situations, with and without jitter.
});
const enabled = modifiedTask.enabled ?? true;
// Run the first task now. Run all other tasks a random number of ms in the future,
// with a maximum of 5 minutes or the task interval, whichever is smaller.
nit: I guess this comment was carried over from `bulkEnable`, but `bulkSchedule` doesn't actually set `runAt: new Date()` for `i == 0`; it just spreads `{}` and lets the store default to now.
Could we either adjust the wording to match what's happening, or mirror `bulkEnable` and set `runAt: new Date(), scheduledAt: new Date()` explicitly for `i == 0`?
I think it'd be easier to have consistent logic for both methods.
Thanks @darnautov, addressed in 08379a0
The logic is now the same across `bulkEnable` and `bulkSchedule`. You're right, the comment did not make it clear that the magic was happening inside the store. Setting the dates explicitly makes this clear on this end.
Also added this change to make sure that adhoc tasks are set with a date explicitly (to run now).
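The outcome of this review thread (explicit dates for the first task and for ad-hoc tasks, jitter for the rest) can be sketched as the map callback below. This is an illustration under stated assumptions, not the merged code: `applyScheduling`, `TaskInput`, and the injected `addJitter` / `now` parameters are hypothetical, and the sketch assumes disabled tasks also receive explicit "now" timestamps, which the thread does not confirm.

```typescript
// Hypothetical sketch of the per-task scheduling decision; not the merged code.
interface Scheduling {
  runAt: Date;
  scheduledAt: Date;
}

interface TaskInput {
  id: string;
  enabled?: boolean;
  schedule?: { interval: string };
}

type ScheduledTask = TaskInput & Scheduling;

function applyScheduling(
  tasks: TaskInput[],
  // Injected jitter helper: returns undefined when no interval is supplied.
  addJitter: (interval?: string) => Scheduling | undefined,
  now: () => Date = () => new Date()
): ScheduledTask[] {
  return tasks.map((task, i) => {
    const enabled = task.enabled ?? true;
    // Explicit "run now" timestamps instead of relying on the store default.
    const runNow: Scheduling = { runAt: now(), scheduledAt: now() };
    // First task runs now; later enabled recurring tasks are jittered.
    // Ad-hoc tasks (addJitter returns undefined) fall back to "run now".
    const scheduling: Scheduling =
      enabled && i > 0 ? (addJitter(task.schedule?.interval) ?? runNow) : runNow;
    return { ...task, ...scheduling };
  });
}
```

Passing the jitter helper in as a parameter keeps the branch logic testable with a deterministic stub.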
💛 Build succeeded, but was flaky
…astic#269991)

**Resolves: elastic#195136**
**Related to: elastic#264893**
**Related to: elastic#269340**

## Summary

`TaskScheduling.bulkSchedule` previously sent every task to the store with no `runAt`, which caused the store to default them all to "now". When a caller bulk-scheduled many recurring tasks at once, the polling queue was flooded with [simultaneous claims](https://en.wikipedia.org/wiki/Thundering_herd_problem).

This PR brings `bulkSchedule` in line with `bulkEnable` (see elastic#172742): the first task in the batch still runs immediately, but subsequent enabled recurring tasks are scheduled with a randomized `runAt`, spread over a window of up to `min(interval, 5m)` in the future. Ad-hoc tasks (no `schedule.interval`) and disabled tasks are left untouched and run immediately as before.

This also helps unblock upcoming work on `RulesClient.bulkCreate` (elastic#264893), where a single API call may schedule a large number of detection-rule tasks at once.

## What's included

- The existing `randomlyOffsetRunTimestamp()` helper is replaced by a smaller pure helper `addJitter()` that returns `{ runAt, scheduledAt }` (or `undefined` when no interval is supplied), so callers control the spread.
- The `bulkSchedule` map callback now receives the index `i` and uses `addJitter()` when `enabled && i > 0`.
- `bulkEnable`'s behavior is unchanged; its `i > 0` branch now uses the shared `addJitter()` helper.

## How to test

> [!IMPORTANT]
> There is no easy way to test TM `bulkSchedule()` randomizing without `v2` alerting. Because of this, the test below only covers task randomizing under `bulkEnable()`. If you apply debugging here, you will notice that enabling detection rules in bulk uses `bulkSchedule()` under the hood, but it does so in the `enabled: false` state. In other words, no jitter is applied until the following pass. This is expected: the test below primarily verifies that existing behavior is unaffected.
> The changes to behavior in `bulkSchedule(enabled:true)` will become more meaningful with upcoming work on alerting `v2` and `RulesClient.bulkCreate`.

1. Start ES + Kibana from this branch. Make sure you have a clean ES with no rules.
2. In Kibana, navigate to **Security → Rules → Detection rules (SIEM)** and click **Add Elastic Rules** to install the prebuilt detection rule set (~1850 rules). Leave them disabled.
   > Note: This is a good time to place some breakpoints if you're debugging locally.
3. Go back to the rules management screen. Under "Installed Rules", click the checkbox to select the first 20 rules, then choose `Bulk actions` > `Enable`. You should see a message saying "Successfully enabled 20 rules".
4. Verify the `runAt` / `scheduledAt` distribution using [`check-task-runtime.sh`](https://github.com/sdesalas/kibana-knowledge/blob/main/scripts/check-task-runtime.sh):
   ```bash
   $ ./check-task-runtime.sh
   ```
   Or, if you are using ports other than the standard `5601` and `9200`:
   ```bash
   $ KIBANA_DEV_PORT=5606 ES_DEV_PORT=9205 ./check-task-runtime.sh
   ```
5. Expected output: counts match, and the first-20 task timestamps are **spread across several minutes** rather than all stamped with the same "now":
   ```
   starting..
   KIBANA_URL=http://localhost:5601/kbn
   ES_URL=http://localhost:9200
   1. 2. 3. 4. 5. 6.
   rules: 1850
   rules_enabled: 20
   tasks: 20
   tasks_enabled: 20
   api_key_owner: 20
   apiKey present: 20
   first 20 tasks:
   taskType                 status  enabled  runAt                     scheduledAt
   alerting:siem.queryRule  idle    true     2026-05-19T16:20:08.323Z  2026-05-19T16:20:08.323Z
   alerting:siem.queryRule  idle    true     2026-05-19T16:21:28.689Z  2026-05-19T16:21:28.689Z
   alerting:siem.eqlRule    idle    true     2026-05-19T16:21:05.927Z  2026-05-19T16:21:05.927Z
   alerting:siem.queryRule  idle    true     2026-05-19T16:20:53.163Z  2026-05-19T16:20:53.163Z
   alerting:siem.queryRule  idle    true     2026-05-19T16:23:30.562Z  2026-05-19T16:23:30.562Z
   alerting:siem.esqlRule   idle    true     2026-05-19T16:23:45.295Z  2026-05-19T16:23:45.295Z
   ...
   ```

For every task, `runAt` should equal its matching `scheduledAt`. The timestamps should be distributed across the configured jitter window (`min(rule interval, 5m)`), confirming that jitter is applied per task. On `main` without this PR, every task's `runAt` collapses to the same value.

## Callers of `bulkSchedule`

For reference, the production callers of `TaskScheduling.bulkSchedule` and whether they exercise the new jitter:

| Caller | What it schedules | Hits the new jitter? |
|---|---|---|
| [`alerting_v2/.../rules_client.bulkEnableRules`](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting_v2/server/lib/rules_client/rules_client.ts) | enabled, recurring (`schedule.interval`, `enabled: true`) | **Yes** (requires alerting `v2`, so the manual test above does not exercise it) |
| [`alerting/.../bulk_enable_rules.ts`](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/application/rule/methods/bulk_enable/bulk_enable_rules.ts) (legacy) | recurring but `enabled: false`; the comment in that file explicitly says "we create the task as disabled, taskManager.bulkEnable will enable them by randomising their schedule datetime" | No (jitter applied later by `bulkEnable`) |
| [`workflows_execution_engine/server/plugin.ts`](https://github.com/elastic/kibana/blob/main/src/platform/plugins/shared/workflows_execution_engine/server/plugin.ts) | enabled but ad-hoc (no `schedule`) | No (ad-hoc, so it runs immediately, which is correct) |
| [`actions/create_execute_function.ts`](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/actions/server/create_execute_function.ts) | ad-hoc action tasks | No |
| [`actions/create_unsecured_execute_function.ts`](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/actions/server/create_unsecured_execute_function.ts) | ad-hoc action tasks | No |
| [`alerting/.../backfill_client.ts`](https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/alerting/server/backfill_client/backfill_client.ts) | ad-hoc tasks (the variable is literally named `adHocTasksToSchedule`) | No |

Of these, only the alerting_v2 `bulkEnableRules` path schedules enabled recurring tasks in bulk, so it is the only caller whose runtime behavior changes with this PR.

## Release note

skip

## Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### Identify risks

- Behavior change is scoped to recurring tasks at `i > 0`; single-task `bulkSchedule` calls and ad-hoc tasks retain the existing "run now" semantics.
- The `bulkEnable` path is unchanged in semantics; only the helper signature changed.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
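The manual check described under "How to test" (each `runAt` equals its `scheduledAt`, and the timestamps are spread across the jitter window rather than collapsed to one value) can also be expressed as a small function. This is a sketch only: `TaskTimes` and `summarizeSpread` are hypothetical names, and the sample data is just the six rows visible in the expected output above, not the full set of 20 tasks.

```typescript
// Hypothetical helper for eyeballing the jitter spread; not part of the PR.
interface TaskTimes {
  runAt: string; // ISO 8601 timestamp
  scheduledAt: string; // ISO 8601 timestamp
}

function summarizeSpread(tasks: TaskTimes[]): { allMatch: boolean; spreadMs: number } {
  const runTimes = tasks.map((t) => Date.parse(t.runAt));
  // Every task should have runAt equal to scheduledAt.
  const allMatch = tasks.every((t) => Date.parse(t.runAt) === Date.parse(t.scheduledAt));
  // Distance between the earliest and latest runAt, in milliseconds.
  const spreadMs = runTimes.length === 0 ? 0 : Math.max(...runTimes) - Math.min(...runTimes);
  return { allMatch, spreadMs };
}

// The six rows visible in the expected output.
const visibleRows: TaskTimes[] = [
  '2026-05-19T16:20:08.323Z',
  '2026-05-19T16:21:28.689Z',
  '2026-05-19T16:21:05.927Z',
  '2026-05-19T16:20:53.163Z',
  '2026-05-19T16:23:30.562Z',
  '2026-05-19T16:23:45.295Z',
].map((ts) => ({ runAt: ts, scheduledAt: ts }));

const summary = summarizeSpread(visibleRows);
console.log(summary); // { allMatch: true, spreadMs: 216972 }
```

A spread of 216972 ms (about 3m37s) sits inside the 5-minute cap; a spread of 0 would indicate that every task was stamped with the same "now".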