[8.19] (backport #17028) Smoke test os retry event assertions #17067

Merged
merged 3 commits into from
Jun 3, 2025

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Jun 2, 2025

Motivation/summary

Smoke tests intermittently fail when asserting events have been indexed to ES:

2025-05-14T03:32:09.9526511Z -> Asserting logs-apm.error-* contains expected documents...
2025-05-14T03:32:09.9558871Z Didn't find 1 indexed documents error.id=9876543210abcdeffedcba0123456789, total hits 0
2025-05-14T03:32:09.9559796Z {"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
2025-05-14T03:32:09.9560360Z -> Smoke tests FAILED!!

The failures were initially believed to be caused by a data race where the smoke test was reading events before they were flushed and indexed. This change adds retries, with a wait between attempts, to the event assertion code so that the test does not fail on the first miss. The flush_interval was also decreased to reduce the amount of time the smoke test has to wait and retry.

The retry function was adapted from the buildkite scripts found here.
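
For readers without access to the linked script, here is a minimal sketch of a retry helper with exponential backoff. It is not the code in testing/smoke/lib.sh. The function name `retry`, its argument order, the `RETRY_SLEEP` override, and the "giving up" message are assumptions made for illustration. The retry message format and the doubling delay (1, 2, 4, 8 seconds) mirror the logs in this description.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and argument order are hypothetical,
# not copied from testing/smoke/lib.sh.

# retry MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the
# wait between attempts. Returns the exit code of the last attempt.
retry() {
  local max_attempts=$1
  shift
  local attempt=1
  local delay=1
  local rc=0

  while true; do
    if "$@"; then
      return 0
    else
      rc=$?   # exit code of the failed command
    fi

    if (( attempt >= max_attempts )); then
      echo "-> Retry cmd: '$*'; ${attempt}/${max_attempts} exited ${rc}, giving up" >&2
      return "$rc"
    fi

    echo "-> Retry cmd: '$*'; ${attempt}/${max_attempts} exited ${rc}, retrying in ${delay} seconds..." >&2
    # RETRY_SLEEP is a hypothetical hook so the wait can be stubbed out locally.
    "${RETRY_SLEEP:-sleep}" "$delay"

    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

Under these assumptions, an assertion such as the one in the logs would be wrapped as `retry 10 assert_document 'traces-apm-*' transaction.id 4340a8e0df1906ecbfa9 9.1.0`. Quoting the index pattern keeps the shell from expanding the glob before the assertion receives it.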

Please let me know if you think we need to adjust the number of retries, wait times, or the flush_interval.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

The script below was used to validate the retry control flow locally:

#!/bin/bash

source testing/smoke/lib.sh
data_stream_assert_events

The smoke tests os workflow was run multiple times.

Validated logs show missed events being retried.

example for one retry

-> Sending events to APM Server...
-> Asserting ingest pipelines...
-> Asserting logs-apm.error-* contains expected documents...
-> Asserted 1 error.id=9876543210abcdeffedcba0123456789 exists
-> Asserting traces-apm-* contains expected documents...
-> Asserted 1 span.id=1234567890aaaade exists
-> Asserting traces-apm-* contains expected documents...
Didn't find 1 indexed documents transaction.id=4340a8e0df1906ecbfa9, total hits 0
{"took":40,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
-> Retry cmd: 'assert_document traces-apm-* transaction.id 4340a8e0df1906ecbfa9 9.1.0';  1/10 exited 2, retrying in 1 seconds...
-> Asserting traces-apm-* contains expected documents...
-> Asserted 1 transaction.id=4340a8e0df1906ecbfa9 exists

example with multiple retries

-> Sending events to APM Server...
-> Asserting ingest pipelines...
-> Asserting logs-apm.error-* contains expected documents...
Didn't find 1 indexed documents error.id=9876543210abcdeffedcba0123456789, total hits 0
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0.0,"hits":[]}}
-> Retry cmd: 'assert_document logs-apm.error-* error.id 9876543210abcdeffedcba0123456789 9.1.0';  1/10 exited 2, retrying in 1 seconds...
{"error":{"root_cause":[{"type":"no_shard_available_action_exception","reason":null}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":".ds-logs-apm.error-default-2025.05.27-000001","node":null,"reason":{"type":"no_shard_available_action_exception","reason":null}}]},"status":503}
-> Asserting logs-apm.error-* contains expected documents...
Didn't find 1 indexed documents error.id=9876543210abcdeffedcba0123456789, total hits null
{"error":{"root_cause":[{"type":"no_shard_available_action_exception","reason":null}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":".ds-logs-apm.error-default-2025.05.27-000001","node":null,"reason":{"type":"no_shard_available_action_exception","reason":null}}]},"status":503}
-> Retry cmd: 'assert_document logs-apm.error-* error.id 9876543210abcdeffedcba0123456789 9.1.0';  2/10 exited 2, retrying in 2 seconds...
-> Asserting logs-apm.error-* contains expected documents...
Didn't find 1 indexed documents error.id=9876543210abcdeffedcba0123456789, total hits 0
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
-> Retry cmd: 'assert_document logs-apm.error-* error.id 9876543210abcdeffedcba0123456789 9.1.0';  3/10 exited 2, retrying in 4 seconds...
-> Asserting logs-apm.error-* contains expected documents...
Didn't find 1 indexed documents error.id=9876543210abcdeffedcba0123456789, total hits 0
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
-> Retry cmd: 'assert_document logs-apm.error-* error.id 9876543210abcdeffedcba0123456789 9.1.0';  4/10 exited 2, retrying in 8 seconds...

Related issues

Closes #16905


This is an automatic backport of pull request #17028 done by Mergify.

* smoke: Add retries and wait to data stream event assertions

This will improve the reliability of the smoke tests which intermittently fail due to data not yet being available in ES.

* smoke: decrease wait time after sending events since assertion now performs retries

* smoke: decrease elasticsearch flush_interval to 100ms from the 1s default

This will reduce event indexing latency, which helps stabilize the smoke tests and reduces the number of retries needed when asserting events.

* smoke: decreased retries for data stream event assertions

(cherry picked from commit b3381c2)

# Conflicts:
#	testing/smoke/lib.sh
@mergify mergify bot added backport conflicts There is a conflict in the backported pull request labels Jun 2, 2025
@mergify mergify bot requested a review from a team as a code owner June 2, 2025 17:09
Contributor Author

mergify bot commented Jun 2, 2025

Cherry-pick of b3381c2 has failed:

On branch mergify/bp/8.19/pr-17028
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit b3381c21.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   testing/infra/terraform/modules/standalone_apm_server/apm-server.yml.tftpl

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   testing/smoke/lib.sh

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Contributor

github-actions bot commented Jun 2, 2025

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify mergify bot merged commit d53757c into 8.19 Jun 3, 2025
15 checks passed
@mergify mergify bot deleted the mergify/bp/8.19/pr-17028 branch June 3, 2025 03:05