
Commit 887a0d0

[Scout][CI] Retry only failed specs on lane retry (#274844)
## Summary

On a Buildkite retry, re-run only the Scout spec files that failed instead of the whole config. Today, when a single spec fails, the entire config is marked failed and stored in build metadata, so a retry re-runs everything. That slows CI feedback and holds agents longer than needed.

Closes elastic/appex-qa-team#858.

## How it works

- After a config fails, parse the failed spec files from Playwright's JSON report and upload them as a build-scoped, per-config artifact, stamped with the attempt that produced it.
- On retry, restore the snapshot from the immediately-preceding attempt and re-run only those spec files (passed as positional filters). Any miss falls back to re-running the whole config.
- The attempt stamp handles the agent-lost case: if the previous attempt was lost before it could snapshot, there is nothing to restore for that attempt, so we safely re-run the whole config rather than trusting a stale snapshot.

## Notes

- Retry unit is the spec **file** plus its server config (a spec can pass under one server config and fail under another). Re-running whole files also keeps `describe.serial` blocks intact, since their cases share state and must run together.

## Verified in CI

Validated end-to-end in build [#459089](https://buildkite.com/elastic/kibana-pull-request/builds/459089) (both core cases observed on real lanes):

**1. Real test failure → only the failed spec file re-runs.** Lane #14 failed on the first attempt; the retry restored the snapshot (`hosts/hosts_flyout.spec.ts`) and re-ran only that file (`Retrying 1 failed spec file(s)` → `8 passed`), not the whole `infra` config.

<img width="844" height="664" alt="Screenshot 2026-06-25 at 10 11 16 AM" src="https://github.com/user-attachments/assets/3ff092be-0de9-4bfe-8bfc-44ea71548d13" />

**2. Agent lost → safe whole-config fallback.** Lane #58's first attempt was lost (exit -1) before it could snapshot, so the retry found nothing to restore and re-ran the whole config (`whole config will re-run` → `Running:`) instead of trusting stale data.

<img width="742" height="507" alt="Screenshot 2026-06-25 at 10 15 46 AM" src="https://github.com/user-attachments/assets/c601ef29-3dc1-4300-b3cf-d491d02bf96c" />
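
The attempt-stamp rule from "How it works" can be modelled without Buildkite. The following is a minimal sketch under stated assumptions, not the script from this commit: a local temporary directory stands in for the artifact store, and `STORE`, `snapshot_path`, and `pick_retry_specs` are hypothetical names introduced only for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the Buildkite artifact store: a local directory.
STORE="$(mktemp -d)"

# Snapshot path stamped with the attempt that produced it (illustrative naming).
snapshot_path() {
  local idx="$1" attempt="$2"
  echo "$STORE/failed-specs-${idx}-attempt-${attempt}.txt"
}

# Prints the spec files to re-run for config <idx> on attempt <retry_count>,
# one per line. Returns 1 on any miss so the caller re-runs the whole config.
pick_retry_specs() {
  local idx="$1" retry_count="$2"
  (( retry_count > 0 )) || return 1   # first attempt: nothing to restore
  local prev
  prev="$(snapshot_path "$idx" "$(( retry_count - 1 ))")"
  [[ -s "$prev" ]] || return 1        # previous attempt left no non-empty snapshot
  cat "$prev"
}

# Attempt 0 failed on one spec file and managed to snapshot it.
echo "hosts/hosts_flyout.spec.ts" > "$(snapshot_path 3 0)"

# Attempt 1 restores attempt 0's snapshot.
pick_retry_specs 3 1

# Attempt 1 was lost before snapshotting, so attempt 2 finds no attempt-1 file
# and does not fall back to the older attempt-0 snapshot.
pick_retry_specs 3 2 || echo "whole config will re-run"
```

Because the lookup only ever targets attempt `retry_count - 1`, an older snapshot can never be picked up by accident; a missing file is treated the same as a first attempt.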
1 parent 06f94ca commit 887a0d0

1 file changed

Lines changed: 78 additions & 1 deletion

.buildkite/scripts/steps/test/scout/run_test_lane.sh

@@ -12,6 +12,7 @@ PASSED_INDICES=""
 PASSED=()
 FAILED=()
 SKIPPED=()
+RETRY_SPEC_FILES=()
 
 # Fail early if any of the given environment variable names are unset or empty
 check_required_env_vars() {
@@ -36,6 +37,70 @@ download_test_lane_loads() {
   download_tmp_artifact "$SCOUT_TEST_LANE_LOADS_PATH" . "$BUILDKITE_BUILD_ID"
 }
 
+# Per-config, attempt-stamped artifact path for a config's failed-spec snapshot.
+failed_specs_artifact_path() {
+  local idx="$1" attempt="$2"
+  local key="${BUILDKITE_STEP_KEY//[^a-zA-Z0-9_-]/_}"
+  echo ".scout/failed-specs-${key}-${idx}-attempt-${attempt}.json"
+}
+
+# Per-config path for Playwright's JSON report; set via PLAYWRIGHT_JSON_OUTPUT_FILE so it resolves
+# against this script's cwd rather than the config's directory.
+json_report_path() {
+  echo ".scout/test-results-${1}.json"
+}
+
+# After a config fails, persist its failed spec files so the next attempt can re-run only those.
+# Spec-file level (not per-test) so state-sharing blocks (e.g. describe.serial) re-run together.
+snapshot_failed_specs() {
+  local idx="$1"
+
+  local report
+  report="$(json_report_path "$idx")"
+  if [[ ! -f "$report" ]]; then
+    echo "No $report produced for index $idx; whole config will re-run on retry"
+    return
+  fi
+
+  # `|| true`: a non-zero jq exit (e.g. truncated report after a crash) must not abort the lane.
+  local failed_specs
+  failed_specs="$(jq -r '[.suites[]? | recurse(.suites[]?) | .specs[]? | select(.ok == false) | .file] | unique[]?' "$report" 2>/dev/null)" || true
+
+  if [[ -z "$failed_specs" ]]; then
+    echo "No failed spec files parsed for index $idx; whole config will re-run on retry"
+    return
+  fi
+
+  local artifact_path
+  artifact_path="$(failed_specs_artifact_path "$idx" "${BUILDKITE_RETRY_COUNT:-0}")"
+  printf '%s\n' "$failed_specs" > "$artifact_path"
+  # Best-effort: on upload failure the retry just re-runs the whole config.
+  buildkite-agent artifact upload "$artifact_path" || echo "Failed to upload failed-specs snapshot for index $idx"
+}
+
+# On retry, load the previous attempt's failed spec files into RETRY_SPEC_FILES. Returns non-zero
+# on any miss (first attempt, agent lost before snapshot, download failure) so we re-run the whole config.
+restore_failed_specs() {
+  local idx="$1"
+  RETRY_SPEC_FILES=()
+
+  local retry_count="${BUILDKITE_RETRY_COUNT:-0}"
+  [[ "$retry_count" -gt 0 ]] || return 1
+
+  local artifact_path
+  artifact_path="$(failed_specs_artifact_path "$idx" "$(( retry_count - 1 ))")"
+
+  # --include-retried-jobs: the snapshot was uploaded by the now-superseded previous attempt, whose
+  # artifacts buildkite-agent excludes by default.
+  if ! download_artifact "$artifact_path" . --include-retried-jobs >/dev/null 2>&1; then
+    return 1
+  fi
+  [[ -f "$artifact_path" ]] || return 1
+
+  mapfile -t RETRY_SPEC_FILES < "$artifact_path"
+  [[ ${#RETRY_SPEC_FILES[@]} -gt 0 ]]
+}
+
 # Read the comma-separated list of previously passed load indices from Buildkite metadata
 load_passed_indices() {
   PASSED_INDICES=$(buildkite-agent meta-data get "$PASSED_LOAD_INDICES_META_KEY" --default "" 2>/dev/null)
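
To see what the jq filter in `snapshot_failed_specs` extracts, it can be run against a small report. This is a sketch under assumptions: the report below is a hypothetical, heavily trimmed stand-in for Playwright's JSON output (real reports carry many more fields), `hosts/hosts_table.spec.ts` is an invented file name, and `failed_spec_files` is an illustrative wrapper that does not appear in the commit. It requires `jq` on the `PATH`.

```bash
#!/usr/bin/env bash

# Hypothetical, trimmed report: one file with two failing cases in a nested
# suite, and one file whose only case passed.
report="$(mktemp)"
cat > "$report" <<'JSON'
{
  "suites": [
    {
      "title": "hosts/hosts_flyout.spec.ts",
      "specs": [],
      "suites": [
        {
          "title": "Hosts flyout",
          "specs": [
            { "title": "opens",  "ok": false, "file": "hosts/hosts_flyout.spec.ts" },
            { "title": "closes", "ok": false, "file": "hosts/hosts_flyout.spec.ts" }
          ]
        }
      ]
    },
    {
      "title": "hosts/hosts_table.spec.ts",
      "specs": [
        { "title": "renders", "ok": true, "file": "hosts/hosts_table.spec.ts" }
      ]
    }
  ]
}
JSON

# Same filter as in the diff: walk nested suites, keep specs with ok == false,
# and de-duplicate their file paths. `|| true` keeps a jq failure non-fatal.
failed_spec_files() {
  jq -r '[.suites[]? | recurse(.suites[]?) | .specs[]? | select(.ok == false) | .file] | unique[]?' "$1" 2>/dev/null || true
}

# Prints hosts/hosts_flyout.spec.ts once, even though two cases in it failed.
failed_spec_files "$report"
```

The `unique` step is what makes the retry unit a spec file rather than a test case: any number of failing cases in one file collapse to a single path.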
@@ -144,10 +209,19 @@ run_scout_tests() {
   local idx="$1"
   local config_path="$2"
 
-  echo "--- Running: $config_path"
+  # On retry, re-run only the previously-failed spec files (passed as positional path filters).
+  local spec_filter_args=()
+  if restore_failed_specs "$idx"; then
+    echo "--- Retrying ${#RETRY_SPEC_FILES[@]} failed spec file(s): $config_path"
+    printf ' %s\n' "${RETRY_SPEC_FILES[@]}"
+    spec_filter_args=("${RETRY_SPEC_FILES[@]}")
+  else
+    echo "--- Running: $config_path"
+  fi
 
   local pw_args=(
     test
+    "${spec_filter_args[@]+"${spec_filter_args[@]}"}"
     "--config=$config_path"
     "--grep=$PLAYWRIGHT_GREP_TAG"
     "--project=$PLAYWRIGHT_PROJECT"
@@ -158,6 +232,8 @@ run_scout_tests() {
     "SCOUT_TARGET_ARCH=$SCOUT_TEST_TARGET_ARCH"
     "SCOUT_TARGET_DOMAIN=$SCOUT_TEST_TARGET_DOMAIN"
     "NODE_OPTIONS=${NODE_OPTIONS:-} --require=@kbn/babel-register/install"
+    # Pin the JSON report to a path we control (see json_report_path).
+    "PLAYWRIGHT_JSON_OUTPUT_FILE=$(json_report_path "$idx")"
   )
 
   local start_time
@@ -184,6 +260,7 @@ run_scout_tests() {
       ;;
     *)
       upload_report_events "$config_path"
+      snapshot_failed_specs "$idx"
       FAILED+=("$config_path")
       echo "Exited with code $exit_code for $config_path"
       ;;
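
The guard around `spec_filter_args` in `pw_args` exists because an optional list of positional filters may be empty. A minimal sketch of that pattern follows. It uses the common `${arr[@]+"${arr[@]}"}` form of the guard, which is not byte-for-byte the expression in the diff, and `build_args` with its `example.config.ts` value is a hypothetical helper introduced only for this illustration.

```bash
#!/usr/bin/env bash
set -u

# Hypothetical helper: assembles an argument list with optional positional
# filters placed before the flags, then prints how many arguments resulted.
build_args() {
  local filters=("$@")
  local args=(
    test
    # Expands to nothing when filters is empty, and avoids the
    # "unbound variable" error that older bash raises under `set -u`.
    ${filters[@]+"${filters[@]}"}
    "--config=example.config.ts"
  )
  printf '%s\n' "${#args[@]}"
}

build_args                             # prints 2: no filter, no extra argument
build_args "a.spec.ts" "b c.spec.ts"   # prints 4: each path stays one argument
```

With no filters the command line is identical to a normal first-attempt run, which is what lets the same `pw_args` array serve both the retry path and the whole-config fallback.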
