Skip to content

Commit dce6ae9

Browse files
[9.4] [Profiling] Fix profiling API tests when running on test lanes (again) (#271733) (#271857)
# Backport This will backport the following commits from `main` to `9.4`: - [[Profiling] Fix profiling API tests when running on test lanes (again) (#271733)](#271733) <!--- Backport version: 12.0.0 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport) <!--BACKPORT [{"author":{"name":"Alex Fernandez","email":"47327793+AlejandroFrndz@users.noreply.github.com"},"sourceCommit":{"committedDate":"2026-05-29T08:00:57Z","message":"[Profiling] Fix profiling API tests when running on test lanes (again) (#271733)\n\n## Summary\n\nThis PR is the result of investigating and trying to fix the latest\nflakiness issues for profiling API suites as reported in this [issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe error message we're dealing with this time is\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\"}\n```\nWhich at first glance looked awfully similar to the errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\",\"attributes\":{\"cause\":\"Saved object [fleet-agent-policies/policy-elastic-agent-on-cloud] not found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both messages, we can notice the error now\ndoesn't have an explicit cause, while before it complained about `Saved\nobject [fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile these are good news as at least we know #270623 did fix something,\nbut the current issue merits deeper investigation\n\n## The Symptom: ES returning a 408 response\n\n### Background\n\nThe Universal Profiling setup flow in Kibana ends with a call to\n`GET /_profiling/status?wait_for_resources_created=true`. This is a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES holds the\nconnection open and\nwatches for cluster state changes, responding only once all profiling\nresources\n(index templates, ILM policies, data streams) are confirmed created — or\nwhen it\ntimes out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two timeouts** that Kibana was losing\nevery time.\n\n**ES side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses the request and sets the server-side wait budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\", TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is **30 seconds** when no `timeout=`\nquery parameter\nis provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses this value to register a `ClusterStateObserver` that waits up to\nthat timeout\nfor `resolver::isResourcesCreated` to become true. When the timeout\nelapses and\nresources still aren't ready, the `onTimeout` callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue timeout) {\n resolver.execute(clusterService.state(), ActionListener.wrap(response -> {\n response.setTimedOut(true);\n listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES then sends back a **completed HTTP response** with status code 408,\nvia this\nmapping in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT : RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: { enabled: true }, resource_management: { enabled: true },\nresources: { created: false } }`\n— the current partial status at timeout.\n\n**Kibana side:** The call was configured with `requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's default server timeout is 30s and\nKibana's was 60s,\n**ES always won the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408 response is a **completed HTTP exchange** from the transport's\nperspective.\nThe `@elastic/transport` client only auto-retries on `TimeoutError`\n(client-side\ntimeout before any response arrives) and HTTP 502/503/504. The existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both gated on a\nclient-side\n`TimeoutError` — they never triggered because Kibana received a valid\nHTTP response\n(just with status 408). The 408 was surfaced as a `ResponseError` and\nreturned to\nthe user as a failure with no retry attempt.\n\n### The fix\n\nBy appending `&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65 seconds** server-side before giving up. Kibana's `requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before ES would return a 408. The\nresulting\nclient-side `TimeoutError` is exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already designed to handle: up to 6 attempts, each with a fresh 60s\nbudget,\nwith bounded exponential backoff between retries. No code path that\npreviously\nobserved a 60s per-attempt budget is affected.\n\nThis not only affects test runs, but it will also correct this behaviour\nin the actual profiling UI. In case ES is slow to respond the request\nwill keep retrying for longer than 30s (as it was always intended)\ninstead of failing and showing an error in the UI after 30s.\n\nThe code for this fix is in e8fe7a2. \n\n### Is the error gone?\n\nWhile the problem described was an actual oversight we had in Kibana,\nthis error is a symptom rather than the cause. After implementing this\nchange we were no longer experiencing 408s but a new problem came\naround. Tests weren't failing mid run anymore but now the default test\nruntime timeout of 60s was being hit. My first instinct was bumping the\ntimeout up, but even 180s wasn't enough. At this point it was clear the\n408s coming from ES were just the way the issue was surfacing, while the\nreal problem was why was ES taking so long to complete the profiling\nsetup process\n\n## The Root Cause: Profiling is constantly setup and cleaned up\n\n### Background\n\nThe API suites have different requirements when it comes to profiling\nresources and/or data existing or not during test execution. Because\nwe're also verifying how the system behaves when setup or data is\nmissing as well as when everything is already wired up, we can't have a\nsingle global setup hook that pre-loads everything and leaves profiling\nsetup and data ingested.\n\nThis is also the reason why these suites aren't parallel and instead run\nsequentially. We can't ran suites concurrently because test requirements\nwould conflict with one another. But, even so they run sequentially,\neach test was being treated as a completely independent unit. What this\nmeans is each test was setting up (or cleaning up) all the resources it\nrequired as well as completely cleaning up after each one. This\nsituation lead to elasticsearch being continuously bombarded with\nrequest to setup resources quickly followed by requests to delete those\nsame resources it just created, repeated for each suite in the profiling\nconfig. Depending on how much work ES is dealing with and the state it\nwas left in by other tests in the lane it could lead to the situation\nobserved above where it starts to become overwhelmed and the setup call\nwould take too long.\n\n### The Fix\n\nThanks to test running sequentially, we can optimise the execution\nprocess by carefully controlling the order in which each suite runs. If\nwe can ensure suites expecting empty resources and/or data run first and\nthen are followed by suites that do fully setup and load data, we can\nreduce the amount of times we're loading and offloading resources,\nreducing the impact tests have on ES. While ideally each test would be\n100% independent from the rest, this is not the environment tests will\nbe running on specially on CI with tests lanes. Test shouldn't make the\nassumption that they're running on new and clean environments, so this\nPR adapts these suites to be resilient to polluted and overloaded\nenvironments.\n\nBy default, playwright runs test in the same order they're discovered\nfrom the test dir, aka, alphabetically. To ensure tests remain ordered\nas we expect them to be going forward, each test file is now prefixed by\na number. The first test to run is prefixed 00_* With this in mind, I've\nmodified and ordered each suite to try to reduce the number of setup and\ncleanup calls. While each tests will still make sure the environment is\nin the state it's expect in a `beforeAll` hook, each setup call is now\nguarded. This way, if resource or data already exist we won't try to set\nthem up again. Also, I've removed the `afterAll` hooks that were\ncleaning up everything profiling-related after each suite ran. If test\n02 creates the ES resources, test 03 can reuse those instead of having\nto create them all over again. And becase each test still checks if\neverything they expect is in place, they remain independent of each\nother if they need to.\n\nFor instance, if test 02 was skipped or removed it wouldn't cause test\n03 to then break as 03 is capable of setting up the environment as it\nexpects it to be. Each suite maintains it's capability to set itself up\nfor success while also gaining the ability to reuse pre-existing\nresources and data data other tests might have already created before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55","branchLabelMapping":{"^v9.5.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:version","v9.3.0","v9.4.0","Team:obs-presentation","v9.5.0"],"title":"[Profiling] Fix profiling API tests when running on test lanes (again)","number":271733,"url":"https://github.com/elastic/kibana/pull/271733","mergeCommit":{"message":"[Profiling] Fix profiling API tests when running on test lanes (again) (#271733)\n\n## Summary\n\nThis PR is the result of investigating and trying to fix the latest\nflakiness issues for profiling API suites as reported in this [issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe error message we're dealing with this time is\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\"}\n```\nWhich at first glance looked awfully similar to the errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\",\"attributes\":{\"cause\":\"Saved object [fleet-agent-policies/policy-elastic-agent-on-cloud] not found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both messages, we can notice the error now\ndoesn't have an explicit cause, while before it complained about `Saved\nobject [fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile these are good news as at least we know #270623 did fix something,\nbut the current issue merits deeper investigation\n\n## The Symptom: ES returning a 408 response\n\n### Background\n\nThe Universal Profiling setup flow in Kibana ends with a call to\n`GET /_profiling/status?wait_for_resources_created=true`. This is a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES holds the\nconnection open and\nwatches for cluster state changes, responding only once all profiling\nresources\n(index templates, ILM policies, data streams) are confirmed created — or\nwhen it\ntimes out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two timeouts** that Kibana was losing\nevery time.\n\n**ES side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses the request and sets the server-side wait budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\", TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is **30 seconds** when no `timeout=`\nquery parameter\nis provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses this value to register a `ClusterStateObserver` that waits up to\nthat timeout\nfor `resolver::isResourcesCreated` to become true. When the timeout\nelapses and\nresources still aren't ready, the `onTimeout` callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue timeout) {\n resolver.execute(clusterService.state(), ActionListener.wrap(response -> {\n response.setTimedOut(true);\n listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES then sends back a **completed HTTP response** with status code 408,\nvia this\nmapping in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT : RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: { enabled: true }, resource_management: { enabled: true },\nresources: { created: false } }`\n— the current partial status at timeout.\n\n**Kibana side:** The call was configured with `requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's default server timeout is 30s and\nKibana's was 60s,\n**ES always won the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408 response is a **completed HTTP exchange** from the transport's\nperspective.\nThe `@elastic/transport` client only auto-retries on `TimeoutError`\n(client-side\ntimeout before any response arrives) and HTTP 502/503/504. The existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both gated on a\nclient-side\n`TimeoutError` — they never triggered because Kibana received a valid\nHTTP response\n(just with status 408). The 408 was surfaced as a `ResponseError` and\nreturned to\nthe user as a failure with no retry attempt.\n\n### The fix\n\nBy appending `&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65 seconds** server-side before giving up. Kibana's `requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before ES would return a 408. The\nresulting\nclient-side `TimeoutError` is exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already designed to handle: up to 6 attempts, each with a fresh 60s\nbudget,\nwith bounded exponential backoff between retries. No code path that\npreviously\nobserved a 60s per-attempt budget is affected.\n\nThis not only affects test runs, but it will also correct this behaviour\nin the actual profiling UI. In case ES is slow to respond the request\nwill keep retrying for longer than 30s (as it was always intended)\ninstead of failing and showing an error in the UI after 30s.\n\nThe code for this fix is in e8fe7a2. \n\n### Is the error gone?\n\nWhile the problem described was an actual oversight we had in Kibana,\nthis error is a symptom rather than the cause. After implementing this\nchange we were no longer experiencing 408s but a new problem came\naround. Tests weren't failing mid run anymore but now the default test\nruntime timeout of 60s was being hit. My first instinct was bumping the\ntimeout up, but even 180s wasn't enough. At this point it was clear the\n408s coming from ES were just the way the issue was surfacing, while the\nreal problem was why was ES taking so long to complete the profiling\nsetup process\n\n## The Root Cause: Profiling is constantly setup and cleaned up\n\n### Background\n\nThe API suites have different requirements when it comes to profiling\nresources and/or data existing or not during test execution. Because\nwe're also verifying how the system behaves when setup or data is\nmissing as well as when everything is already wired up, we can't have a\nsingle global setup hook that pre-loads everything and leaves profiling\nsetup and data ingested.\n\nThis is also the reason why these suites aren't parallel and instead run\nsequentially. We can't ran suites concurrently because test requirements\nwould conflict with one another. But, even so they run sequentially,\neach test was being treated as a completely independent unit. What this\nmeans is each test was setting up (or cleaning up) all the resources it\nrequired as well as completely cleaning up after each one. This\nsituation lead to elasticsearch being continuously bombarded with\nrequest to setup resources quickly followed by requests to delete those\nsame resources it just created, repeated for each suite in the profiling\nconfig. Depending on how much work ES is dealing with and the state it\nwas left in by other tests in the lane it could lead to the situation\nobserved above where it starts to become overwhelmed and the setup call\nwould take too long.\n\n### The Fix\n\nThanks to test running sequentially, we can optimise the execution\nprocess by carefully controlling the order in which each suite runs. If\nwe can ensure suites expecting empty resources and/or data run first and\nthen are followed by suites that do fully setup and load data, we can\nreduce the amount of times we're loading and offloading resources,\nreducing the impact tests have on ES. While ideally each test would be\n100% independent from the rest, this is not the environment tests will\nbe running on specially on CI with tests lanes. Test shouldn't make the\nassumption that they're running on new and clean environments, so this\nPR adapts these suites to be resilient to polluted and overloaded\nenvironments.\n\nBy default, playwright runs test in the same order they're discovered\nfrom the test dir, aka, alphabetically. To ensure tests remain ordered\nas we expect them to be going forward, each test file is now prefixed by\na number. The first test to run is prefixed 00_* With this in mind, I've\nmodified and ordered each suite to try to reduce the number of setup and\ncleanup calls. While each tests will still make sure the environment is\nin the state it's expect in a `beforeAll` hook, each setup call is now\nguarded. This way, if resource or data already exist we won't try to set\nthem up again. Also, I've removed the `afterAll` hooks that were\ncleaning up everything profiling-related after each suite ran. If test\n02 creates the ES resources, test 03 can reuse those instead of having\nto create them all over again. And becase each test still checks if\neverything they expect is in place, they remain independent of each\nother if they need to.\n\nFor instance, if test 02 was skipped or removed it wouldn't cause test\n03 to then break as 03 is capable of setting up the environment as it\nexpects it to be. Each suite maintains it's capability to set itself up\nfor success while also gaining the ability to reuse pre-existing\nresources and data data other tests might have already created before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55"}},"sourceBranch":"main","suggestedTargetBranches":["9.3","9.4"],"targetPullRequestStates":[{"branch":"9.3","label":"v9.3.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"9.4","label":"v9.4.0","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v9.5.0","branchLabelMappingKey":"^v9.5.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/271733","number":271733,"mergeCommit":{"message":"[Profiling] Fix profiling API tests when running on test lanes (again) (#271733)\n\n## Summary\n\nThis PR is the result of investigating and trying to fix the latest\nflakiness issues for profiling API suites as reported in this [issue\ncomment](https://github.com/elastic/kibana/issues/248929#issuecomment-4563292721)\n\nThe error message we're dealing with this time is\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\"}\n```\nWhich at first glance looked awfully similar to the errors #270623 was\ntargeting which read\n```\nKbnClientRequesterError: [POST http://localhost:5620/api/profiling/setup/es_resources] 500 Internal Server Error -- {\"statusCode\":500,\"error\":\"Internal Server Error\",\"message\":\"Error while setting up Universal Profiling\",\"attributes\":{\"cause\":\"Saved object [fleet-agent-policies/policy-elastic-agent-on-cloud] not found\",\"name\":\"Error\"}}\n```\n\nBut looking more carefully at both messages, we can notice the error now\ndoesn't have an explicit cause, while before it complained about `Saved\nobject [fleet-agent-policies/policy-elastic-agent-on-cloud] not found`\nWhile these are good news as at least we know #270623 did fix something,\nbut the current issue merits deeper investigation\n\n## The Symptom: ES returning a 408 response\n\n### Background\n\nThe Universal Profiling setup flow in Kibana ends with a call to\n`GET /_profiling/status?wait_for_resources_created=true`. This is a\nlong-polling\nendpoint: when `wait_for_resources_created=true`, ES holds the\nconnection open and\nwatches for cluster state changes, responding only once all profiling\nresources\n(index templates, ILM policies, data streams) are confirmed created — or\nwhen it\ntimes out.\n\n### Why Kibana sees a 408\n\nThe problem is a **race between two timeouts** that Kibana was losing\nevery time.\n\n**ES side:**\n\n[`RestGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/rest/RestGetStatusAction.java)\nparses the request and sets the server-side wait budget:\n\n```java\nrestRequest.paramAsTime(\"timeout\", TimeValue.THIRTY_SECONDS)\n```\n\nThe default server-side timeout is **30 seconds** when no `timeout=`\nquery parameter\nis provided.\n\n[`TransportGetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/TransportGetStatusAction.java)\nuses this value to register a `ClusterStateObserver` that waits up to\nthat timeout\nfor `resolver::isResourcesCreated` to become true. When the timeout\nelapses and\nresources still aren't ready, the `onTimeout` callback fires:\n\n```java\n@Override\npublic void onTimeout(TimeValue timeout) {\n resolver.execute(clusterService.state(), ActionListener.wrap(response -> {\n response.setTimedOut(true);\n listener.onResponse(response);\n }, listener::onFailure));\n}\n```\n\nES then sends back a **completed HTTP response** with status code 408,\nvia this\nmapping in\n\n[`GetStatusAction.java`](https://github.com/elastic/elasticsearch/blob/1823a1f7dba79df95db4013c323c7afb219e4489/x-pack/plugin/profiling/src/main/java/org/elasticsearch/xpack/profiling/action/GetStatusAction.java):\n\n```java\npublic RestStatus status() {\n return timedOut ? RestStatus.REQUEST_TIMEOUT : RestStatus.OK;\n}\n```\n\nThe response body contains\n`{ profiling: { enabled: true }, resource_management: { enabled: true },\nresources: { created: false } }`\n— the current partial status at timeout.\n\n**Kibana side:** The call was configured with `requestTimeout: 60_000` —\na 60-second\nclient-side budget. Since ES's default server timeout is 30s and\nKibana's was 60s,\n**ES always won the race**: it responded with a 408 at ~30s, well before\nKibana's\n60s client timeout ever fired.\n\n### Why that 408 was not retried\n\nA 408 response is a **completed HTTP exchange** from the transport's\nperspective.\nThe `@elastic/transport` client only auto-retries on `TimeoutError`\n(client-side\ntimeout before any response arrives) and HTTP 502/503/504. The existing\noptions\n`retryOnTimeout: true` and `maxRetries: 5` are both gated on a\nclient-side\n`TimeoutError` — they never triggered because Kibana received a valid\nHTTP response\n(just with status 408). The 408 was surfaced as a `ResponseError` and\nreturned to\nthe user as a failure with no retry attempt.\n\n### The fix\n\nBy appending `&timeout=65s` to the request path, we explicitly tell ES\nto wait\n**65 seconds** server-side before giving up. Kibana's `requestTimeout:\n60_000` now\nwins the race — it fires 5 seconds before ES would return a 408. The\nresulting\nclient-side `TimeoutError` is exactly what `retryOnTimeout: true` +\n`maxRetries: 5`\nwas already designed to handle: up to 6 attempts, each with a fresh 60s\nbudget,\nwith bounded exponential backoff between retries. No code path that\npreviously\nobserved a 60s per-attempt budget is affected.\n\nThis not only affects test runs, but it will also correct this behaviour\nin the actual profiling UI. In case ES is slow to respond the request\nwill keep retrying for longer than 30s (as it was always intended)\ninstead of failing and showing an error in the UI after 30s.\n\nThe code for this fix is in e8fe7a2. \n\n### Is the error gone?\n\nWhile the problem described was an actual oversight we had in Kibana,\nthis error is a symptom rather than the cause. After implementing this\nchange we were no longer experiencing 408s but a new problem came\naround. Tests weren't failing mid run anymore but now the default test\nruntime timeout of 60s was being hit. My first instinct was bumping the\ntimeout up, but even 180s wasn't enough. At this point it was clear the\n408s coming from ES were just the way the issue was surfacing, while the\nreal problem was why was ES taking so long to complete the profiling\nsetup process\n\n## The Root Cause: Profiling is constantly setup and cleaned up\n\n### Background\n\nThe API suites have different requirements when it comes to profiling\nresources and/or data existing or not during test execution. Because\nwe're also verifying how the system behaves when setup or data is\nmissing as well as when everything is already wired up, we can't have a\nsingle global setup hook that pre-loads everything and leaves profiling\nsetup and data ingested.\n\nThis is also the reason why these suites aren't parallel and instead run\nsequentially. We can't ran suites concurrently because test requirements\nwould conflict with one another. But, even so they run sequentially,\neach test was being treated as a completely independent unit. What this\nmeans is each test was setting up (or cleaning up) all the resources it\nrequired as well as completely cleaning up after each one. This\nsituation lead to elasticsearch being continuously bombarded with\nrequest to setup resources quickly followed by requests to delete those\nsame resources it just created, repeated for each suite in the profiling\nconfig. Depending on how much work ES is dealing with and the state it\nwas left in by other tests in the lane it could lead to the situation\nobserved above where it starts to become overwhelmed and the setup call\nwould take too long.\n\n### The Fix\n\nThanks to test running sequentially, we can optimise the execution\nprocess by carefully controlling the order in which each suite runs. If\nwe can ensure suites expecting empty resources and/or data run first and\nthen are followed by suites that do fully setup and load data, we can\nreduce the amount of times we're loading and offloading resources,\nreducing the impact tests have on ES. While ideally each test would be\n100% independent from the rest, this is not the environment tests will\nbe running on specially on CI with tests lanes. Test shouldn't make the\nassumption that they're running on new and clean environments, so this\nPR adapts these suites to be resilient to polluted and overloaded\nenvironments.\n\nBy default, playwright runs test in the same order they're discovered\nfrom the test dir, aka, alphabetically. To ensure tests remain ordered\nas we expect them to be going forward, each test file is now prefixed by\na number. The first test to run is prefixed 00_* With this in mind, I've\nmodified and ordered each suite to try to reduce the number of setup and\ncleanup calls. While each tests will still make sure the environment is\nin the state it's expect in a `beforeAll` hook, each setup call is now\nguarded. This way, if resource or data already exist we won't try to set\nthem up again. Also, I've removed the `afterAll` hooks that were\ncleaning up everything profiling-related after each suite ran. If test\n02 creates the ES resources, test 03 can reuse those instead of having\nto create them all over again. And becase each test still checks if\neverything they expect is in place, they remain independent of each\nother if they need to.\n\nFor instance, if test 02 was skipped or removed it wouldn't cause test\n03 to then break as 03 is capable of setting up the environment as it\nexpects it to be. Each suite maintains it's capability to set itself up\nfor success while also gaining the ability to reuse pre-existing\nresources and data data other tests might have already created before\nthem","sha":"86bd4c5817c49814e938cd3a59f9d0acefcbce55"}}]}] BACKPORT-->
1 parent 889a481 commit dce6ae9

7 files changed

Lines changed: 155 additions & 167 deletions

File tree

x-pack/solutions/observability/plugins/profiling/server/utils/create_profiling_es_client.ts

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ import type {
1919
} from '@kbn/profiling-utils';
2020
import { withProfilingSpan } from './with_profiling_span';
2121

22+
const PROFILING_STATUS_TIMEOUT_SECONDS = 60;
23+
2224
export function cancelEsRequestOnAbort<T extends Promise<any>>(
2325
promise: T,
2426
request: KibanaRequest,
@@ -129,13 +131,15 @@ export function createProfilingEsClient({
129131
{
130132
method: 'GET',
131133
path: encodeURI(
132-
`/_profiling/status?wait_for_resources_created=${waitForResourcesCreated}`
134+
`/_profiling/status?wait_for_resources_created=${waitForResourcesCreated}&timeout=${
135+
PROFILING_STATUS_TIMEOUT_SECONDS + 5
136+
}s`
133137
),
134138
},
135139
{
136140
signal: controller.signal,
137141
meta: true,
138-
requestTimeout: 60000,
142+
requestTimeout: PROFILING_STATUS_TIMEOUT_SECONDS * 1000,
139143
maxRetries: 5,
140144
retryOnTimeout: true,
141145
}

x-pack/solutions/observability/plugins/profiling/test/scout/README.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ node scripts/scout.js start-server --arch serverless --domain [search|observabil
1313
### Run the Tests
1414

1515
**Note**: Profiling resources and test data are automatically set up when you run the tests. The global setup hook will:
16+
1617
- Set up profiling Elasticsearch resources via `/api/profiling/setup/es_resources`
1718
- Load test profiling data from 'x-pack/solutions/observability/packages/kbn-scout-oblt/src/playwright/fixtures/worker/profiling/test_data'
1819
- Skip setup if resources and data are already present
@@ -28,7 +29,16 @@ npx playwright test --config x-pack/solutions/observability/plugins/profiling/te
2829
## Test Categories
2930

3031
Tests are tagged with:
32+
3133
- `@local-stateful-classic` - Stateful tests
3234
- `@local-serverless-observability_complete` - Serverless tests - currently not supported in Universal Profiling
3335

3436
Test results are available in `x-pack/solutions/observability/plugins/profiling/test/scout/ui/output`
37+
38+
## Api Tests Execution Order
39+
40+
While UI tests run in parallel, api tests can't be parallelized as they test different setups that could conflict with one another. To avoid constant setups and cleanups of profiling resources, which can lead to performance issues and flaky tests, api tests must run sequentially and in a predefined order.
41+
42+
This strict ordering will ensure suites testing features for missing setups will run before those requiring partial setups, which in turn run before others requiring full setup and data. Thanks to it, the tests can limit the amount of calls to setup resources and cleanup, instead of having to setup the whole thing in each `beforeAll` as well as having to cleanup in every `afterAll`. The global teardown hook that runs after all suites will ensure cleanup is done before terminating the profiling suite execution ensuring tests cleanup after themselves and do not pollute test lanes when running in one.
43+
44+
By default, playwright runs tests in the order they're discovered when read from the test directory. This is why api test files are prefixed with a given number which indicates the order they're expected to run, being 00\_\* the first test to run. If you're adding a new test, consider your suite's needs in terms of setup and data. If your tests concern features when no setup and/or data is present be sure to run them **before** tests that do configure resources and data. And likewise, if your test is configuring resources or adding data make sure it runs **after** tests that expect empty config or data.

x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/has_no_setup.spec.ts renamed to x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/00_has_no_setup.spec.ts

Lines changed: 7 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -11,18 +11,17 @@ import type { RoleApiCredentials } from '@kbn/scout-oblt';
1111
import { apiTest } from '../../common/fixtures';
1212
import { esResourcesEndpoint } from '../../common/fixtures/constants';
1313

14-
// Failing: See https://github.com/elastic/kibana/issues/268400
15-
apiTest.describe.skip(
14+
apiTest.describe(
1615
'Profiling is not setup and no data is loaded',
1716
{ tag: tags.stateful.classic },
1817
() => {
1918
let viewerApiCreditials: RoleApiCredentials;
2019
let adminApiCreditials: RoleApiCredentials;
2120

2221
apiTest.beforeAll(async ({ profilingHelper, profilingSetup, requestAuth }) => {
23-
await profilingHelper.cleanupPolicies();
24-
await profilingHelper.installPolicies();
2522
await profilingSetup.cleanup();
23+
await profilingHelper.cleanupPolicies();
24+
2625
viewerApiCreditials = await requestAuth.getApiKey('viewer');
2726
adminApiCreditials = await requestAuth.getApiKey('admin');
2827
});
@@ -37,8 +36,8 @@ apiTest.describe.skip(
3736
});
3837
const adminStatus = adminRes.body;
3938
expect(adminStatus.has_setup).toBe(false);
40-
expect(adminStatus.has_data).toBe(false);
41-
expect(adminStatus.pre_8_9_1_data).toBe(false);
39+
expect(adminStatus.has_data).toBeDefined();
40+
expect(adminStatus.pre_8_9_1_data).toBeDefined();
4241
});
4342

4443
apiTest('Viewer users', async ({ apiClient }) => {
@@ -51,8 +50,8 @@ apiTest.describe.skip(
5150
});
5251
const readStatus = readRes.body;
5352
expect(readStatus.has_setup).toBe(false);
54-
expect(readStatus.has_data).toBe(false);
55-
expect(readStatus.pre_8_9_1_data).toBe(false);
53+
expect(readStatus.has_data).toBeDefined();
54+
expect(readStatus.pre_8_9_1_data).toBeDefined();
5655
expect(readStatus.has_required_role).toBe(false);
5756
});
5857
}

x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/has_setup_apm_not_installed.spec.ts renamed to x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/01_has_setup_apm_not_installed.spec.ts

Lines changed: 8 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -19,23 +19,17 @@ apiTest.describe(
1919
let adminApiCreditials: RoleApiCredentials;
2020

2121
apiTest.beforeAll(async ({ profilingHelper, profilingSetup, requestAuth }) => {
22-
// Ensure the agent policy that the cloud setup attaches the profiler_collector and
23-
// profiler_symbolizer package policies to exists. Without this, setupResources()
24-
// returns 500 when this spec runs ahead of (or independently of) has_no_setup.spec.
25-
await profilingHelper.installPolicies();
22+
const status = await profilingSetup.checkStatus();
2623

27-
if (!(await profilingSetup.checkStatus()).has_setup) {
24+
if (!status.has_setup) {
25+
await profilingHelper.installPolicies();
2826
await profilingSetup.setupResources();
2927
}
28+
3029
viewerApiCreditials = await requestAuth.getApiKey('viewer');
3130
adminApiCreditials = await requestAuth.getApiKey('admin');
3231
});
3332

34-
apiTest.afterAll(async ({ profilingSetup, profilingHelper }) => {
35-
await profilingHelper.cleanupPolicies();
36-
await profilingSetup.cleanup();
37-
});
38-
3933
apiTest('Admin user', async ({ apiClient }) => {
4034
const adminRes = await apiClient.get(esResourcesEndpoint, {
4135
headers: {
@@ -47,8 +41,8 @@ apiTest.describe(
4741

4842
const adminStatus = adminRes.body;
4943
expect(adminStatus.has_setup).toBe(true);
50-
expect(adminStatus.has_data).toBe(false);
51-
expect(adminStatus.pre_8_9_1_data).toBe(false);
44+
expect(adminStatus.has_data).toBeDefined();
45+
expect(adminStatus.pre_8_9_1_data).toBeDefined();
5246
});
5347

5448
apiTest('Viewer user', async ({ apiClient }) => {
@@ -62,8 +56,8 @@ apiTest.describe(
6256

6357
const readStatus = readRes.body;
6458
expect(readStatus.has_setup).toBe(true);
65-
expect(readStatus.has_data).toBe(false);
66-
expect(readStatus.pre_8_9_1_data).toBe(false);
59+
expect(readStatus.has_data).toBeDefined();
60+
expect(readStatus.pre_8_9_1_data).toBeDefined();
6761
expect(readStatus.has_required_role).toBe(false);
6862
});
6963
}

x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/has_setup_with_data.spec.ts renamed to x-pack/solutions/observability/plugins/profiling/test/scout/api/tests/02_has_setup_with_data.spec.ts

Lines changed: 13 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -16,26 +16,25 @@ apiTest.describe('Profiling is setup and data is loaded', { tag: tags.stateful.c
1616
let adminApiCreditials: RoleApiCredentials;
1717

1818
apiTest.beforeAll(async ({ requestAuth, profilingHelper, profilingSetup }) => {
19-
// Make this spec self-sufficient instead of relying on has_no_setup.spec running first:
20-
// ensure the cloud agent policy exists and profiling resources are set up before
21-
// attempting to load data.
22-
await profilingHelper.installPolicies();
19+
let status = await profilingSetup.checkStatus();
2320

24-
// Ensure the ES profiling data streams + aliases (`profiling-stackframes`, `profiling-stacktraces`, `profiling-executables`, `profiling-hosts`, `profiling-events-*`) exist before we bulk-write into them. Bulk `create` ops to non-existing aliases would otherwise create regular indices that later collide with the ES profiling plugin's alias creation (`InvalidAliasNameException`).
25-
if (!(await profilingSetup.checkStatus()).has_setup) {
21+
if (!status.has_setup) {
22+
await profilingHelper.installPolicies();
2623
await profilingSetup.setupResources();
2724
}
2825

29-
await profilingSetup.loadData(esArchiversPath);
26+
if (!status.has_data) {
27+
await profilingSetup.loadData(esArchiversPath);
28+
}
29+
30+
status = await profilingSetup.checkStatus();
31+
expect(status.has_setup).toBe(true);
32+
expect(status.has_data).toBe(true);
33+
3034
viewerApiCreditials = await requestAuth.getApiKey('viewer');
3135
adminApiCreditials = await requestAuth.getApiKey('admin');
3236
});
3337

34-
apiTest.afterAll(async ({ profilingSetup, profilingHelper }) => {
35-
await profilingHelper.cleanupPolicies();
36-
await profilingSetup.cleanup();
37-
});
38-
3938
apiTest('Admin user', async ({ apiClient }) => {
4039
const adminRes = await apiClient.get(esResourcesEndpoint, {
4140
headers: {
@@ -47,7 +46,7 @@ apiTest.describe('Profiling is setup and data is loaded', { tag: tags.stateful.c
4746
const adminStatus = adminRes.body;
4847
expect(adminStatus.has_setup).toBe(true);
4948
expect(adminStatus.has_data).toBe(true);
50-
expect(adminStatus.pre_8_9_1_data).toBe(false);
49+
expect(adminStatus.pre_8_9_1_data).toBeDefined();
5150
});
5251

5352
apiTest('Viewer user', async ({ apiClient }) => {
@@ -62,7 +61,7 @@ apiTest.describe('Profiling is setup and data is loaded', { tag: tags.stateful.c
6261
const readStatus = readRes.body;
6362
expect(readStatus.has_setup).toBe(true);
6463
expect(readStatus.has_data).toBe(true);
65-
expect(readStatus.pre_8_9_1_data).toBe(false);
64+
expect(readStatus.pre_8_9_1_data).toBeDefined();
6665
expect(readStatus.has_required_role).toBe(false);
6766
});
6867
});
Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,111 @@
1+
/*
2+
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
3+
* or more contributor license agreements. Licensed under the Elastic License
4+
* 2.0; you may not use this file except in compliance with the Elastic License
5+
* 2.0.
6+
*/
7+
8+
import { tags } from '@kbn/scout-oblt';
9+
import { expect } from '@kbn/scout-oblt/api';
10+
import type { ApiClientFixture, RoleApiCredentials } from '@kbn/scout-oblt';
11+
import { apiTest } from '../../common/fixtures';
12+
import { esArchiversPath, esResourcesEndpoint } from '../../common/fixtures/constants';
13+
14+
apiTest.describe('Collector integration is not installed', { tag: tags.stateful.classic }, () => {
15+
let viewerApiCreditials: RoleApiCredentials;
16+
17+
apiTest.beforeAll(async ({ requestAuth, profilingSetup, profilingHelper }) => {
18+
let ids = await profilingHelper.getPolicyIds();
19+
if (!ids.collectorId || !ids.symbolizerId) {
20+
await profilingHelper.installPolicies();
21+
await profilingSetup.setupResources();
22+
ids = await profilingHelper.getPolicyIds();
23+
}
24+
25+
expect(ids.collectorId).toBeDefined();
26+
expect(ids.symbolizerId).toBeDefined();
27+
28+
// Same fixture as has_setup_with_data: ensure profiling* has documents so we assert has_data like has_setup — indexed data remains after Fleet policy deletion.
29+
let status = await profilingSetup.checkStatus();
30+
if (!status.has_data) {
31+
await profilingSetup.loadData(esArchiversPath);
32+
status = await profilingSetup.checkStatus();
33+
}
34+
expect(status.has_data).toBe(true);
35+
36+
viewerApiCreditials = await requestAuth.getApiKey('viewer');
37+
});
38+
39+
// Fleet package_policies.delete is eventually consistent — a subsequent /api/profiling/setup/es_resources call may still see the policy as installed for a brief window. Polling until `has_setup` reflects the deletion (cloud: false; self-managed: unchanged true) keeps the test deterministic.
40+
const getViewerStatus = async (apiClient: ApiClientFixture) => {
41+
const res = await apiClient.get(esResourcesEndpoint, {
42+
headers: {
43+
...viewerApiCreditials.apiKeyHeader,
44+
'content-type': 'application/json',
45+
'kbn-xsrf': 'reporting',
46+
},
47+
});
48+
return res.body;
49+
};
50+
51+
const waitForExpectedHasSetup = async (apiClient: ApiClientFixture) => {
52+
// In cloud, has_setup requires collector + symbolizer Fleet policies — deleting either flips it to false. In self-managed/serverless, has_setup is ES-only and unaffected by Fleet policy deletion.
53+
await expect
54+
.poll(
55+
async () => {
56+
const status = await getViewerStatus(apiClient);
57+
return status.has_setup === (status.type !== 'cloud');
58+
},
59+
{
60+
timeout: 30_000,
61+
intervals: [500, 1000, 2000, 4000],
62+
message:
63+
'has_setup did not converge to expected value after Fleet package policy deletion',
64+
}
65+
)
66+
.toBe(true);
67+
};
68+
69+
apiTest('collector integration missing', async ({ profilingHelper, apiServices, apiClient }) => {
70+
const ids = await profilingHelper.getPolicyIds();
71+
const collectorId = ids.collectorId;
72+
73+
expect(collectorId).toBeDefined();
74+
75+
await apiServices.fleet.package_policies.delete(collectorId!);
76+
77+
const adminRes = await apiClient.get(esResourcesEndpoint);
78+
const adminStatus = adminRes.body;
79+
expect(adminStatus.has_setup).toBeUndefined();
80+
expect(adminStatus.has_data).toBeUndefined();
81+
expect(adminStatus.pre_8_9_1_data).toBeUndefined();
82+
83+
await waitForExpectedHasSetup(apiClient);
84+
85+
const readStatus = await getViewerStatus(apiClient);
86+
expect(readStatus.has_setup).toBe(readStatus.type !== 'cloud');
87+
expect(readStatus.has_data).toBe(true);
88+
expect(readStatus.pre_8_9_1_data).toBeDefined();
89+
expect(readStatus.has_required_role).toBe(false);
90+
});
91+
92+
apiTest(
93+
'Symbolizer integration is not installed',
94+
async ({ profilingHelper, apiClient, apiServices }) => {
95+
const ids = await profilingHelper.getPolicyIds();
96+
const symbolizerId = ids.symbolizerId;
97+
98+
expect(symbolizerId).toBeDefined();
99+
100+
await apiServices.fleet.package_policies.delete(symbolizerId!);
101+
102+
await waitForExpectedHasSetup(apiClient);
103+
104+
const readStatus = await getViewerStatus(apiClient);
105+
expect(readStatus.has_setup).toBe(readStatus.type !== 'cloud');
106+
expect(readStatus.has_data).toBe(true);
107+
expect(readStatus.pre_8_9_1_data).toBeDefined();
108+
expect(readStatus.has_required_role).toBe(false);
109+
}
110+
);
111+
});

0 commit comments

Comments
 (0)