Commit dce6ae9
# Backport
This will backport the following commits from `main` to `9.4`:
- [[Profiling] Fix profiling API tests when running on test lanes (again) (#271733)](https://github.com/elastic/kibana/pull/271733)
<!--- Backport version: 12.0.0 -->
### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)
<!--BACKPORT [{"author":{"name":"Alex
Fernandez","email":"47327793+AlejandroFrndz@users.noreply.github.com"},"sourceCommit":{"committedDate":"2026-05-29T08:00:57Z","message":"[Profiling]
Fix profiling API tests when running on test lanes (again)
(#271733)\n\n## Summary\n\nThis PR is the result of investigating and
trying to fix the latest\nflakiness issues for profiling API suites as
reported in this
[issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe
error message we're dealing with this time
is\n```\nKbnClientRequesterError: [POST
http://localhost:5620/api/profiling/setup/es_resources] 500 Internal
Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\"}\n```\nWhich at first glance looked awfully similar to the
errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError:
[POST http://localhost:5620/api/profiling/setup/es_resources] 500
Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\",\"attributes\":{\"cause\":\"Saved object
[fleet-agent-policies/policy-elastic-agent-on-cloud] not
found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both
messages, we can notice the error now\ndoesn't have an explicit cause,
while before it complained about `Saved\nobject
[fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile
these are good news as at least we know #270623 did fix something,\nbut
the current issue merits deeper investigation\n\n## The Symptom: ES
returning a 408 response\n\n### Background\n\nThe Universal Profiling
setup flow in Kibana ends with a call to\n`GET
/_profiling/status?wait_for_resources_created=true`. This is
a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES
holds the\nconnection open and\nwatches for cluster state changes,
responding only once all profiling\nresources\n(index templates, ILM
policies, data streams) are confirmed created — or\nwhen it\ntimes
out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two
timeouts** that Kibana was losing\nevery time.\n\n**ES
side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses
the request and sets the server-side wait
budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\",
TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is
**30 seconds** when no `timeout=`\nquery parameter\nis
provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses
this value to register a `ClusterStateObserver` that waits up to\nthat
timeout\nfor `resolver::isResourcesCreated` to become true. When the
timeout\nelapses and\nresources still aren't ready, the `onTimeout`
callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue
timeout) {\n resolver.execute(clusterService.state(),
ActionListener.wrap(response -> {\n response.setTimedOut(true);\n
listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES
then sends back a **completed HTTP response** with status code 408,\nvia
this\nmapping
in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic
RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT :
RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: {
enabled: true }, resource_management: { enabled: true },\nresources: {
created: false } }`\n— the current partial status at
timeout.\n\n**Kibana side:** The call was configured with
`requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's
default server timeout is 30s and\nKibana's was 60s,\n**ES always won
the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s
client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408
response is a **completed HTTP exchange** from the
transport's\nperspective.\nThe `@elastic/transport` client only
auto-retries on `TimeoutError`\n(client-side\ntimeout before any
response arrives) and HTTP 502/503/504. The
existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both
gated on a\nclient-side\n`TimeoutError` — they never triggered because
Kibana received a valid\nHTTP response\n(just with status 408). The 408
was surfaced as a `ResponseError` and\nreturned to\nthe user as a
failure with no retry attempt.\n\n### The fix\n\nBy appending
`&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65
seconds** server-side before giving up. Kibana's
`requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before
ES would return a 408. The\nresulting\nclient-side `TimeoutError` is
exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already
designed to handle: up to 6 attempts, each with a fresh
60s\nbudget,\nwith bounded exponential backoff between retries. No code
path that\npreviously\nobserved a 60s per-attempt budget is
affected.\n\nThis not only affects test runs, but it will also correct
this behaviour\nin the actual profiling UI. In case ES is slow to
respond the request\nwill keep retrying for longer than 30s (as it was
always intended)\ninstead of failing and showing an error in the UI
after 30s.\n\nThe code for this fix is in
e8fe7a2. \n\n### Is the error
gone?\n\nWhile the problem described was an actual oversight we had in
Kibana,\nthis error is a symptom rather than the cause. After
implementing this\nchange we were no longer experiencing 408s but a new
problem came\naround. Tests weren't failing mid run anymore but now the
default test\nruntime timeout of 60s was being hit. My first instinct
was bumping the\ntimeout up, but even 180s wasn't enough. At this point
it was clear the\n408s coming from ES were just the way the issue was
surfacing, while the\nreal problem was why was ES taking so long to
complete the profiling\nsetup process\n\n## The Root Cause: Profiling is
constantly setup and cleaned up\n\n### Background\n\nThe API suites have
different requirements when it comes to profiling\nresources and/or data
existing or not during test execution. Because\nwe're also verifying how
the system behaves when setup or data is\nmissing as well as when
everything is already wired up, we can't have a\nsingle global setup
hook that pre-loads everything and leaves profiling\nsetup and data
ingested.\n\nThis is also the reason why these suites aren't parallel
and instead run\nsequentially. We can't ran suites concurrently because
test requirements\nwould conflict with one another. But, even so they
run sequentially,\neach test was being treated as a completely
independent unit. What this\nmeans is each test was setting up (or
cleaning up) all the resources it\nrequired as well as completely
cleaning up after each one. This\nsituation lead to elasticsearch being
continuously bombarded with\nrequest to setup resources quickly followed
by requests to delete those\nsame resources it just created, repeated
for each suite in the profiling\nconfig. Depending on how much work ES
is dealing with and the state it\nwas left in by other tests in the lane
it could lead to the situation\nobserved above where it starts to become
overwhelmed and the setup call\nwould take too long.\n\n### The
Fix\n\nThanks to test running sequentially, we can optimise the
execution\nprocess by carefully controlling the order in which each
suite runs. If\nwe can ensure suites expecting empty resources and/or
data run first and\nthen are followed by suites that do fully setup and
load data, we can\nreduce the amount of times we're loading and
offloading resources,\nreducing the impact tests have on ES. While
ideally each test would be\n100% independent from the rest, this is not
the environment tests will\nbe running on specially on CI with tests
lanes. Test shouldn't make the\nassumption that they're running on new
and clean environments, so this\nPR adapts these suites to be resilient
to polluted and overloaded\nenvironments.\n\nBy default, playwright runs
test in the same order they're discovered\nfrom the test dir, aka,
alphabetically. To ensure tests remain ordered\nas we expect them to be
going forward, each test file is now prefixed by\na number. The first
test to run is prefixed 00_* With this in mind, I've\nmodified and
ordered each suite to try to reduce the number of setup and\ncleanup
calls. While each tests will still make sure the environment is\nin the
state it's expect in a `beforeAll` hook, each setup call is
now\nguarded. This way, if resource or data already exist we won't try
to set\nthem up again. Also, I've removed the `afterAll` hooks that
were\ncleaning up everything profiling-related after each suite ran. If
test\n02 creates the ES resources, test 03 can reuse those instead of
having\nto create them all over again. And becase each test still checks
if\neverything they expect is in place, they remain independent of
each\nother if they need to.\n\nFor instance, if test 02 was skipped or
removed it wouldn't cause test\n03 to then break as 03 is capable of
setting up the environment as it\nexpects it to be. Each suite maintains
it's capability to set itself up\nfor success while also gaining the
ability to reuse pre-existing\nresources and data data other tests might
have already created
before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55","branchLabelMapping":{"^v9.5.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:version","v9.3.0","v9.4.0","Team:obs-presentation","v9.5.0"],"title":"[Profiling]
Fix profiling API tests when running on test lanes
(again)","number":271733,"url":"https://github.com/elastic/kibana/pull/271733","mergeCommit":{"message":"[Profiling]
Fix profiling API tests when running on test lanes (again)
(#271733)\n\n## Summary\n\nThis PR is the result of investigating and
trying to fix the latest\nflakiness issues for profiling API suites as
reported in this
[issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe
error message we're dealing with this time
is\n```\nKbnClientRequesterError: [POST
http://localhost:5620/api/profiling/setup/es_resources] 500 Internal
Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\"}\n```\nWhich at first glance looked awfully similar to the
errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError:
[POST http://localhost:5620/api/profiling/setup/es_resources] 500
Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\",\"attributes\":{\"cause\":\"Saved object
[fleet-agent-policies/policy-elastic-agent-on-cloud] not
found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both
messages, we can notice the error now\ndoesn't have an explicit cause,
while before it complained about `Saved\nobject
[fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile
these are good news as at least we know #270623 did fix something,\nbut
the current issue merits deeper investigation\n\n## The Symptom: ES
returning a 408 response\n\n### Background\n\nThe Universal Profiling
setup flow in Kibana ends with a call to\n`GET
/_profiling/status?wait_for_resources_created=true`. This is
a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES
holds the\nconnection open and\nwatches for cluster state changes,
responding only once all profiling\nresources\n(index templates, ILM
policies, data streams) are confirmed created — or\nwhen it\ntimes
out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two
timeouts** that Kibana was losing\nevery time.\n\n**ES
side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses
the request and sets the server-side wait
budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\",
TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is
**30 seconds** when no `timeout=`\nquery parameter\nis
provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses
this value to register a `ClusterStateObserver` that waits up to\nthat
timeout\nfor `resolver::isResourcesCreated` to become true. When the
timeout\nelapses and\nresources still aren't ready, the `onTimeout`
callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue
timeout) {\n resolver.execute(clusterService.state(),
ActionListener.wrap(response -> {\n response.setTimedOut(true);\n
listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES
then sends back a **completed HTTP response** with status code 408,\nvia
this\nmapping
in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic
RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT :
RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: {
enabled: true }, resource_management: { enabled: true },\nresources: {
created: false } }`\n— the current partial status at
timeout.\n\n**Kibana side:** The call was configured with
`requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's
default server timeout is 30s and\nKibana's was 60s,\n**ES always won
the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s
client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408
response is a **completed HTTP exchange** from the
transport's\nperspective.\nThe `@elastic/transport` client only
auto-retries on `TimeoutError`\n(client-side\ntimeout before any
response arrives) and HTTP 502/503/504. The
existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both
gated on a\nclient-side\n`TimeoutError` — they never triggered because
Kibana received a valid\nHTTP response\n(just with status 408). The 408
was surfaced as a `ResponseError` and\nreturned to\nthe user as a
failure with no retry attempt.\n\n### The fix\n\nBy appending
`&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65
seconds** server-side before giving up. Kibana's
`requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before
ES would return a 408. The\nresulting\nclient-side `TimeoutError` is
exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already
designed to handle: up to 6 attempts, each with a fresh
60s\nbudget,\nwith bounded exponential backoff between retries. No code
path that\npreviously\nobserved a 60s per-attempt budget is
affected.\n\nThis not only affects test runs, but it will also correct
this behaviour\nin the actual profiling UI. In case ES is slow to
respond the request\nwill keep retrying for longer than 30s (as it was
always intended)\ninstead of failing and showing an error in the UI
after 30s.\n\nThe code for this fix is in
e8fe7a2. \n\n### Is the error
gone?\n\nWhile the problem described was an actual oversight we had in
Kibana,\nthis error is a symptom rather than the cause. After
implementing this\nchange we were no longer experiencing 408s but a new
problem came\naround. Tests weren't failing mid run anymore but now the
default test\nruntime timeout of 60s was being hit. My first instinct
was bumping the\ntimeout up, but even 180s wasn't enough. At this point
it was clear the\n408s coming from ES were just the way the issue was
surfacing, while the\nreal problem was why was ES taking so long to
complete the profiling\nsetup process\n\n## The Root Cause: Profiling is
constantly setup and cleaned up\n\n### Background\n\nThe API suites have
different requirements when it comes to profiling\nresources and/or data
existing or not during test execution. Because\nwe're also verifying how
the system behaves when setup or data is\nmissing as well as when
everything is already wired up, we can't have a\nsingle global setup
hook that pre-loads everything and leaves profiling\nsetup and data
ingested.\n\nThis is also the reason why these suites aren't parallel
and instead run\nsequentially. We can't ran suites concurrently because
test requirements\nwould conflict with one another. But, even so they
run sequentially,\neach test was being treated as a completely
independent unit. What this\nmeans is each test was setting up (or
cleaning up) all the resources it\nrequired as well as completely
cleaning up after each one. This\nsituation lead to elasticsearch being
continuously bombarded with\nrequest to setup resources quickly followed
by requests to delete those\nsame resources it just created, repeated
for each suite in the profiling\nconfig. Depending on how much work ES
is dealing with and the state it\nwas left in by other tests in the lane
it could lead to the situation\nobserved above where it starts to become
overwhelmed and the setup call\nwould take too long.\n\n### The
Fix\n\nThanks to test running sequentially, we can optimise the
execution\nprocess by carefully controlling the order in which each
suite runs. If\nwe can ensure suites expecting empty resources and/or
data run first and\nthen are followed by suites that do fully setup and
load data, we can\nreduce the amount of times we're loading and
offloading resources,\nreducing the impact tests have on ES. While
ideally each test would be\n100% independent from the rest, this is not
the environment tests will\nbe running on specially on CI with tests
lanes. Test shouldn't make the\nassumption that they're running on new
and clean environments, so this\nPR adapts these suites to be resilient
to polluted and overloaded\nenvironments.\n\nBy default, playwright runs
test in the same order they're discovered\nfrom the test dir, aka,
alphabetically. To ensure tests remain ordered\nas we expect them to be
going forward, each test file is now prefixed by\na number. The first
test to run is prefixed 00_* With this in mind, I've\nmodified and
ordered each suite to try to reduce the number of setup and\ncleanup
calls. While each tests will still make sure the environment is\nin the
state it's expect in a `beforeAll` hook, each setup call is
now\nguarded. This way, if resource or data already exist we won't try
to set\nthem up again. Also, I've removed the `afterAll` hooks that
were\ncleaning up everything profiling-related after each suite ran. If
test\n02 creates the ES resources, test 03 can reuse those instead of
having\nto create them all over again. And becase each test still checks
if\neverything they expect is in place, they remain independent of
each\nother if they need to.\n\nFor instance, if test 02 was skipped or
removed it wouldn't cause test\n03 to then break as 03 is capable of
setting up the environment as it\nexpects it to be. Each suite maintains
it's capability to set itself up\nfor success while also gaining the
ability to reuse pre-existing\nresources and data data other tests might
have already created
before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55"}},"sourceBranch":"main","suggestedTargetBranches":["9.3","9.4"],"targetPullRequestStates":[{"branch":"9.3","label":"v9.3.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"9.4","label":"v9.4.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v9.5.0","branchLabelMappingKey":"^v9.5.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/271733","number":271733,"mergeCommit":{"message":"[Profiling]
Fix profiling API tests when running on test lanes (again)
(#271733)\n\n## Summary\n\nThis PR is the result of investigating and
trying to fix the latest\nflakiness issues for profiling API suites as
reported in this
[issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe
error message we're dealing with this time
is\n```\nKbnClientRequesterError: [POST
http://localhost:5620/api/profiling/setup/es_resources] 500 Internal
Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\"}\n```\nWhich at first glance looked awfully similar to the
errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError:
[POST http://localhost:5620/api/profiling/setup/es_resources] 500
Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server
Error\",\"message\":\"Error while setting up Universal
Profiling\",\"attributes\":{\"cause\":\"Saved object
[fleet-agent-policies/policy-elastic-agent-on-cloud] not
found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both
messages, we can notice the error now\ndoesn't have an explicit cause,
while before it complained about `Saved\nobject
[fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile
these are good news as at least we know #270623 did fix something,\nbut
the current issue merits deeper investigation\n\n## The Symptom: ES
returning a 408 response\n\n### Background\n\nThe Universal Profiling
setup flow in Kibana ends with a call to\n`GET
/_profiling/status?wait_for_resources_created=true`. This is
a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES
holds the\nconnection open and\nwatches for cluster state changes,
responding only once all profiling\nresources\n(index templates, ILM
policies, data streams) are confirmed created — or\nwhen it\ntimes
out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two
timeouts** that Kibana was losing\nevery time.\n\n**ES
side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses
the request and sets the server-side wait
budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\",
TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is
**30 seconds** when no `timeout=`\nquery parameter\nis
provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses
this value to register a `ClusterStateObserver` that waits up to\nthat
timeout\nfor `resolver::isResourcesCreated` to become true. When the
timeout\nelapses and\nresources still aren't ready, the `onTimeout`
callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue
timeout) {\n resolver.execute(clusterService.state(),
ActionListener.wrap(response -> {\n response.setTimedOut(true);\n
listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES
then sends back a **completed HTTP response** with status code 408,\nvia
this\nmapping
in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic
RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT :
RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: {
enabled: true }, resource_management: { enabled: true },\nresources: {
created: false } }`\n— the current partial status at
timeout.\n\n**Kibana side:** The call was configured with
`requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's
default server timeout is 30s and\nKibana's was 60s,\n**ES always won
the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s
client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408
response is a **completed HTTP exchange** from the
transport's\nperspective.\nThe `@elastic/transport` client only
auto-retries on `TimeoutError`\n(client-side\ntimeout before any
response arrives) and HTTP 502/503/504. The
existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both
gated on a\nclient-side\n`TimeoutError` — they never triggered because
Kibana received a valid\nHTTP response\n(just with status 408). The 408
was surfaced as a `ResponseError` and\nreturned to\nthe user as a
failure with no retry attempt.\n\n### The fix\n\nBy appending
`&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65
seconds** server-side before giving up. Kibana's
`requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before
ES would return a 408. The\nresulting\nclient-side `TimeoutError` is
exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already
designed to handle: up to 6 attempts, each with a fresh
60s\nbudget,\nwith bounded exponential backoff between retries. No code
path that\npreviously\nobserved a 60s per-attempt budget is
affected.\n\nThis not only affects test runs, but it will also correct
this behaviour\nin the actual profiling UI. In case ES is slow to
respond the request\nwill keep retrying for longer than 30s (as it was
always intended)\ninstead of failing and showing an error in the UI
after 30s.\n\nThe code for this fix is in
e8fe7a2. \n\n### Is the error
gone?\n\nWhile the problem described was an actual oversight we had in
Kibana,\nthis error is a symptom rather than the cause. After
implementing this\nchange we were no longer experiencing 408s but a new
problem came\naround. Tests weren't failing mid run anymore but now the
default test\nruntime timeout of 60s was being hit. My first instinct
was bumping the\ntimeout up, but even 180s wasn't enough. At this point
it was clear the\n408s coming from ES were just the way the issue was
surfacing, while the\nreal problem was why was ES taking so long to
complete the profiling\nsetup process\n\n## The Root Cause: Profiling is
constantly setup and cleaned up\n\n### Background\n\nThe API suites have
different requirements when it comes to profiling\nresources and/or data
existing or not during test execution. Because\nwe're also verifying how
the system behaves when setup or data is\nmissing as well as when
everything is already wired up, we can't have a\nsingle global setup
hook that pre-loads everything and leaves profiling\nsetup and data
ingested.\n\nThis is also the reason why these suites aren't parallel
and instead run\nsequentially. We can't ran suites concurrently because
test requirements\nwould conflict with one another. But, even so they
run sequentially,\neach test was being treated as a completely
independent unit. What this\nmeans is each test was setting up (or
cleaning up) all the resources it\nrequired as well as completely
cleaning up after each one. This\nsituation lead to elasticsearch being
continuously bombarded with\nrequest to setup resources quickly followed
by requests to delete those\nsame resources it just created, repeated
for each suite in the profiling\nconfig. Depending on how much work ES
is dealing with and the state it\nwas left in by other tests in the lane
it could lead to the situation\nobserved above where it starts to become
overwhelmed and the setup call\nwould take too long.\n\n### The
Fix\n\nThanks to test running sequentially, we can optimise the
execution\nprocess by carefully controlling the order in which each
suite runs. If\nwe can ensure suites expecting empty resources and/or
data run first and\nthen are followed by suites that do fully setup and
load data, we can\nreduce the amount of times we're loading and
offloading resources,\nreducing the impact tests have on ES. While
ideally each test would be\n100% independent from the rest, this is not
the environment tests will\nbe running on specially on CI with tests
lanes. Test shouldn't make the\nassumption that they're running on new
and clean environments, so this\nPR adapts these suites to be resilient
to polluted and overloaded\nenvironments.\n\nBy default, playwright runs
test in the same order they're discovered\nfrom the test dir, aka,
alphabetically. To ensure tests remain ordered\nas we expect them to be
going forward, each test file is now prefixed by\na number. The first
test to run is prefixed 00_* With this in mind, I've\nmodified and
ordered each suite to try to reduce the number of setup and\ncleanup
calls. While each tests will still make sure the environment is\nin the
state it's expect in a `beforeAll` hook, each setup call is
now\nguarded. This way, if resource or data already exist we won't try
to set\nthem up again. Also, I've removed the `afterAll` hooks that
were\ncleaning up everything profiling-related after each suite ran. If
test\n02 creates the ES resources, test 03 can reuse those instead of
having\nto create them all over again. And becase each test still checks
if\neverything they expect is in place, they remain independent of
each\nother if they need to.\n\nFor instance, if test 02 was skipped or
removed it wouldn't cause test\n03 to then break as 03 is capable of
setting up the environment as it\nexpects it to be. Each suite maintains
it's capability to set itself up\nfor success while also gaining the
ability to reuse pre-existing\nresources and data data other tests might
have already created
before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55"}}]}]
BACKPORT-->
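
The timeout race described in the source PR can be sketched as a small model. This is a hedged illustration only: `firstAttemptOutcome`, `buildStatusPath`, and the `RaceInput` shape are hypothetical names invented here, not code from Kibana or `@elastic/transport`. It assumes the profiling resources are not ready before either timeout elapses, and it treats equal timeouts as a server-side 408, which is a simplification.

```typescript
// Hypothetical model of the race between the ES server-side wait budget
// and the Kibana client-side requestTimeout. Illustrative names only.
interface RaceInput {
  serverTimeoutMs: number; // ES `timeout` query parameter (30s default per the PR)
  clientTimeoutMs: number; // client `requestTimeout`
  retryOnTimeout: boolean; // only applies to client-side timeouts
}

interface AttemptOutcome {
  kind: 'client-timeout' | 'http-408';
  retried: boolean;
}

// Decide what the first attempt looks like when resources never become ready.
function firstAttemptOutcome(input: RaceInput): AttemptOutcome {
  if (input.clientTimeoutMs < input.serverTimeoutMs) {
    // The client gives up first: a client-side timeout that the retry options can handle.
    return { kind: 'client-timeout', retried: input.retryOnTimeout };
  }
  // ES answers first with a completed 408 response, which is not retried.
  return { kind: 'http-408', retried: false };
}

// Build the status path, optionally appending an explicit server-side timeout.
function buildStatusPath(serverTimeoutSeconds?: number): string {
  const base = '/_profiling/status?wait_for_resources_created=true';
  return serverTimeoutSeconds === undefined
    ? base
    : `${base}&timeout=${serverTimeoutSeconds}s`;
}

// Before the fix: ES default of 30s against a 60s client budget.
const beforeFix = firstAttemptOutcome({
  serverTimeoutMs: 30_000,
  clientTimeoutMs: 60_000,
  retryOnTimeout: true,
});

// After the fix: `&timeout=65s` against the same 60s client budget.
const afterFix = firstAttemptOutcome({
  serverTimeoutMs: 65_000,
  clientTimeoutMs: 60_000,
  retryOnTimeout: true,
});

console.log(beforeFix, afterFix, buildStatusPath(65));
```

With the 65s server-side budget the client timeout fires first, which is the condition under which `retryOnTimeout` and `maxRetries` take effect according to the PR description.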
1 parent 889a481 commit dce6ae9
7 files changed
Lines changed: 155 additions & 167 deletions
File tree
- x-pack/solutions/observability/plugins/profiling
- server/utils
- test/scout
- api/tests
Lines changed: 6 additions & 2 deletions
Lines changed: 10 additions & 0 deletions
Lines changed: 111 additions & 0 deletions