
Clean viewer test#503

Open
YPandas wants to merge 138 commits into master from clean_viewer_test

Conversation


@YPandas YPandas commented Apr 2, 2026

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Yuqi Huang and others added 30 commits January 6, 2026 17:06
- Add Puppeteer-based storage viewer canary test with frame detection
- Add CloudWatch metrics integration for test monitoring
- Add Jenkins pipeline support for storage viewer testing
- Add shell script for automated test environment setup
- Update orchestrator and runner for storage viewer scenarios
- Add debug logs showing exactly which script and git hash are being used
- Extend master duration to 4x original (e.g., 40 minutes for 10-minute base)
- Add 3 join/leave cycles per viewer (3 minutes each, 1 minute gaps)
- Wait 15 minutes for master build on first iteration only
- Generate unique CLIENT_IDs per session for better tracking
- Reuse WebRTC build across multiple viewer sessions for efficiency
- Support both single and dual viewer scenarios with parallel execution
- Add JS_STORAGE_THREE_VIEWERS parameter and stage for 3-viewer testing
- Create reusable runViewerSessions() function to eliminate code duplication
- Organize functions: move runViewerSessions before buildStorageCanary
- Support parallel execution of 3 viewers with continuous master (4x duration)
- Add storageThreeViewers job configuration in orchestrator
- Maintain consistent timing: 10min wait for multi-viewer, 21min for single
- Replace retry logic with unique workspace isolation using ws()
- Each viewer gets isolated workspace: ${JOB_NAME}-${viewerId}-${BUILD_NUMBER}
- Prevents concurrent Git operations on same .git directory
- Eliminates pack file corruption and race conditions between parallel viewers
- Simplifies checkout logic by removing retry complexity
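The isolation scheme above can be sketched in plain shell (the real pipeline uses Jenkins' `ws()` step; the variable values below are illustrative):

```shell
# Hypothetical per-viewer workspace path, mirroring ${JOB_NAME}-${viewerId}-${BUILD_NUMBER}
JOB_NAME="storage-viewer"
VIEWER_ID="viewer1"
BUILD_NUMBER="42"
WS="/tmp/${JOB_NAME}-${VIEWER_ID}-${BUILD_NUMBER}"
# Each viewer clones into its own directory, so no two viewers ever
# touch the same .git directory concurrently
mkdir -p "$WS"
echo "workspace: $WS"
```

Because every workspace name embeds both the viewer ID and the build number, parallel viewers (and parallel builds) can never collide on Git pack files.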
- Add pre-cleanup stage to remove old workspaces and temp files at pipeline start
- Add try-finally blocks in runViewerSessions to ensure workspace cleanup on completion/failure
- Clean webrtc workspaces older than 1 hour, Jenkins temp files, and Git pack files
- Display disk usage before/after cleanup for monitoring
- Prevents /tmp disk space exhaustion from accumulated workspace directories
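A rough sketch of the pre-cleanup pass (the `/tmp` path and the 1-hour age cutoff come from the notes above; the sketch prints candidates instead of deleting them):

```shell
# Scan /tmp for webrtc workspace directories older than 60 minutes.
# The real stage removes them (rm -rf) and prints df -h before/after;
# here we only count them so the sketch is non-destructive.
STALE=$(find /tmp -maxdepth 1 -type d -name 'webrtc-*' -mmin +60 2>/dev/null | wc -l)
echo "stale webrtc workspaces older than 1h: $STALE"
```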
- Set fixed 23-minute duration (1380s) for continuous storage master
- Apply to all multi-viewer scenarios (single, two, three viewers)
- Add npm verification in setup-storage-viewer.sh to catch installation failures
- Update chrome-headless.js success criteria to only require storage session join
- Add comprehensive cache cleanup in Jenkins pipeline post section
- Fix Jenkins workspace cleanup paths to target correct directories
Remove conflicting Ubuntu Node.js packages (libnode72, nodejs) before
installing NodeSource Node.js 18 to prevent dpkg file overwrite errors.
Add npm verification to catch installation failures early.

Fixes "npm: command not found" errors caused by silent installation failures.
- Add JobName and RunnerLabel dimensions to CloudWatch metrics for runner isolation
- Start monitoring timer immediately after storage session join instead of waiting for ambiguous storageSessionActive condition
- Prevent false test failures when storage session successfully joins but times out waiting for additional conditions
- Track time between receiving SDP offer and sending SDP answer
- Capture timestamps from console logs for offer/answer events
- Publish OfferReceivedToAnswerSentTime metric to CloudWatch
- Helps monitor WebRTC negotiation performance and identify delays
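The metric itself is just a difference of two console-log timestamps; a minimal sketch (timestamp values are illustrative):

```shell
# Milliseconds at which the SDP offer was received and the answer was sent,
# as captured from the viewer's console logs
OFFER_MS=1700000000123
ANSWER_MS=1700000000456
ELAPSED=$(( ANSWER_MS - OFFER_MS ))
echo "OfferReceivedToAnswerSentTime: ${ELAPSED} ms"
```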
…acking

- Add 8 granular timing metrics to track WebRTC connection stages:
  * OfferReceivedToAnswerSentTime - SDP processing time
  * AnswerSentToFirstIceReceivedTime - ICE exchange initiation
  * FirstIceReceivedToFirstIceSentTime - ICE turnaround time
  * FirstIceSentToAllIceGeneratedTime - ICE gathering completion
  * AllIceGeneratedToConnectionEstablishedTime - Connection finalization
  * TotalConnectionEstablishmentTime - End-to-end connection time
  * JoinSSAsViewerTime - Storage session join time
  * TimeToFirstFrame - Video streaming readiness

- Add WebRTC connection retry tracking:
  * Track internal viewer connection retries via console log detection
  * WebRTCConnectionRetryCount metric for actual retry analysis
  * Remove script-level retry logic for cleaner failure handling

- Add StorageSessionSuccessRate percentage metric for ultimate success tracking

- Enhance CloudWatch integration:
  * Add publishPercentageMetric() for percentage-based metrics
  * Add publishCountMetric() for count-based metrics
  * Maintain consistent dimension structure across all metrics

- Improve observability for WebRTC connection bottleneck identification
  and reliability analysis in storage session viewer scenarios

This enables precise identification of connection stage bottlenecks and
provides visibility into both performance and reliability patterns.
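A dry-run sketch of the two new publish helpers; the real code shells out to `aws cloudwatch put-metric-data`, and the helper names below are taken from the notes while the argument layout is an assumption:

```shell
# Print the put-metric-data arguments instead of calling aws, so the
# sketch runs without credentials; units match the helper names
publish_percentage_metric() {
  echo "put-metric-data --metric-name $1 --unit Percent --value $2"
}
publish_count_metric() {
  echo "put-metric-data --metric-name $1 --unit Count --value $2"
}
P=$(publish_percentage_metric StorageSessionSuccessRate 100)
C=$(publish_count_metric WebRTCConnectionRetryCount 0)
echo "$P"
echo "$C"
```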
- Problem: StoragePeriodic jobs stopped rescheduling after adding pre/post cleanup stages
- Root cause: Aggressive cleanup was interfering with all jobs, including simple storage jobs that don't need it
- Solution: Only apply cleanup to JS storage jobs (JS_STORAGE_VIEWER_*) that cause disk space issues
- Impact: StoragePeriodic and other standard jobs preserve job state for proper rescheduling
- JS storage jobs still get necessary cleanup to prevent disk space exhaustion
…ime for viewer to increase the time interval between each test.
- enable endpoint configuration
Add support for custom endpoint and metric suffix parameters to enable
testing against gamma environments with distinguishable CloudWatch metrics.

Changes:
- chrome-headless.js: Add metricSuffix config option and getMetricName()
  helper to append suffix to all metric names (e.g., JoinSSAsViewerTime_Viewer1-gamma)
- setup-storage-viewer.sh: Pass ENDPOINT and METRIC_SUFFIX env vars to test
- runner.groovy: Add ENDPOINT and METRIC_SUFFIX parameters, pass to
  runViewerSessions() and Reschedule stage for persistence across retries
- js-storage-gamma-tests.groovy: Pass METRIC_SUFFIX="-gamma" to all test builds
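A sketch of the suffix mechanism (the real `getMetricName()` helper lives in chrome-headless.js; the `-gamma` suffix and example metric name come from the notes above):

```shell
# Append METRIC_SUFFIX when set, so gamma metrics are distinguishable
# from prod metrics in CloudWatch
METRIC_SUFFIX="-gamma"
get_metric_name() {
  echo "$1${METRIC_SUFFIX}"
}
NAME=$(get_metric_name JoinSSAsViewerTime_Viewer1)
echo "$NAME"
```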
Add CONTROL_PLANE_URI and AWS_DEFAULT_REGION to masterEnvs in
buildStorageCanary() so the C storage master can connect to custom
endpoints (e.g., gamma) when ENDPOINT parameter is provided.
- Fix endpoint parameter in chrome-headless.js to always be included
  in URLSearchParams
- Add VIEWER_WAIT_MINUTES parameter to runner.groovy to make the viewer
  wait time for master build configurable instead of hardcoded to 55
- Update orchestrator.groovy to pass VIEWER_WAIT_MINUTES=55 for
  production JS storage viewer tests
- Update js-storage-gamma-tests.groovy to pass VIEWER_WAIT_MINUTES=25
  for gamma tests
- Propagate VIEWER_WAIT_MINUTES through Reschedule stage for continuity
Yuqi Huang added 30 commits April 20, 2026 15:51
- Reschedule was a regular stage — if any earlier stage threw an
  uncaught exception, the pipeline skipped it and the canary stopped
- Now in post{always}, it fires regardless of build result
- Stop deleting consumer GetClip MP4 after verification; log full path
- Use timestamped filenames (clip-{timestamp}.mp4) to prevent overwrites
- Increase NUM_LOGS from 1 to 100 in seed.groovy so build logs aren't
  immediately deleted by the next scenario sharing the same runner
…s, 156s)

C code:
- Samples.h, Include.h: NUMBER_OF_H264_FRAME_FILES 1500→4676, DEFAULT_FPS_VALUE 25→30
- kvsWebRTCClientMaster.cpp: encoderStats resolution 640x480→1280x720

verify.py:
- FPS 25→30, TOTAL_SOURCE_FRAMES 4500→4676, EXPECTED_DURATION 180→155.87s
- TIMER_CROP updated to (25, 20, 145, 90) matching new video's sync box position

Durations (all set to 156s = 2 min 36 sec):
- runner.groovy: continuous master hardcoded duration, VIEWER_SESSION_DURATION_SECONDS
  default and fallback
- gamma_runner.groovy: same — master duration, DURATION_IN_SECONDS,
  VIEWER_SESSION_DURATION_SECONDS default
- orchestrator.groovy: STORAGE_PERIODIC_DURATION_IN_SECONDS,
  STORAGE_WITH_VIEWER_DURATION_IN_SECONDS
- chrome-headless.js: monitorConnection duration 180000→156000ms

Frame files:
- Replaced 4500 old frames with 4676 new frames from videoaudiosamplemedia.mp4
- Renamed from 0-indexed (gstreamer output) to 1-indexed (C code expects frame-0001)
- So that the consumer has time to do video verification
- Previous gstreamer extraction produced frames without byte-stream
  format, causing STATUS_SRTP_INVALID_NALU (0x5c000003) errors in the
  C SDK's writeFrame()
- Re-extracted with 'video/x-h264,stream-format=byte-stream' caps to
  ensure Annex B start codes (00 00 00 01) in each frame
- Renamed 0-indexed (frame-0000) to 1-indexed (frame-0001) to match
  C code's fileIndex % N + 1 pattern — frame-0001 now contains the
  SPS/PPS needed for decoder initialization
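The 0-indexed to 1-indexed rename can be sketched as a two-pass shell rename; the two passes avoid the collision where `frame-0000` would overwrite the still-existing `frame-0001` (file names follow the pattern above, the directory is a throwaway):

```shell
# Re-extraction used caps 'video/x-h264,stream-format=byte-stream' so each
# file starts with an Annex B start code (00 00 00 01); here we only
# demonstrate the rename step on empty stand-in files
DIR=$(mktemp -d)
cd "$DIR"
touch frame-0000.h264 frame-0001.h264 frame-0002.h264
# Pass 1: shift every index up by one, into temporary names
for f in frame-*.h264; do
  n=${f#frame-}; n=${n%.h264}
  mv "$f" "$(printf 'frame-%04d.h264.tmp' $((10#$n + 1)))"
done
# Pass 2: drop the .tmp suffix now that no collisions are possible
for f in *.tmp; do mv "$f" "${f%.tmp}"; done
ls "$DIR"
```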
- Fix missing return statement
- Fix: 'gConnectingStartTime' was not declared in the correct scope
- Extract ALL frames from reference video (not just 1fps) so each frame
  is directly addressable by index
- OCR each clip frame to read the sync box frame counter, then compare
  against the reference frame at that exact index via SSIM
- No more offset calculation or second-based alignment — the frame
  number IS the index
- OCR improved: upscale 4x, threshold, auto-crop to white box, digit
  whitelist
- Thresholds now match VideoVerificationComponent.java:
  duration >= 120s, max SSIM > 0.99, avg SSIM > 0.85, min SSIM > 0.03
- Reorder cleanup: stop recording → stop viewer → close browser → run
  verify.py. Previously verify.py ran while browser was still open,
  causing stale disconnects/reconnects during the 5+ minute verification
  that polluted UnexpectedDisconnectCount and ViewerReconnectCount
- Increase verify.py timeout from 300s to 600s — extracting all 4676
  reference frames + per-frame OCR can exceed 5 minutes on slow nodes
…hreshold

- Instead of extracting all 4676 reference PNGs, OCR clip frames first
  to get frame numbers, then extract only those ~120 reference frames
  via a single ffmpeg select filter — ~97% fewer frames to extract
- Fix dropped frame threshold: check total frame count in the received
  clip (via ffprobe) against 3176, not the number of SSIM comparisons
- Close browser before running verify.py to prevent stale WebRTC
  disconnects/reconnects during the long verification process
- Increase verify.py timeout from 300s to 600s
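The single-ffmpeg-invocation trick above amounts to building one `select` expression from the OCR'd frame numbers; a sketch (frame numbers and file names are illustrative):

```shell
# Build an ffmpeg select filter matching only the needed frame indices,
# so ~120 reference frames are extracted instead of all 4676
FRAMES="10 42 99"
EXPR=""
for n in $FRAMES; do
  EXPR="${EXPR}eq(n\\,${n})+"   # commas must be escaped inside -vf filters
done
EXPR=${EXPR%+}                   # strip the trailing '+'
echo "ffmpeg -i reference.mp4 -vf \"select='${EXPR}'\" -vsync 0 ref-%04d.png"
```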
- Viewer: ViewerSSIMAvg, ViewerSSIMMin, ViewerSSIMMax (Percent, 0-100)
- Consumer: ConsumerSSIMAvg, ConsumerSSIMMin, ConsumerSSIMMax (raw 0-1)
- Cleanup timeout was firing at 146s (duration-10), killing the viewer
  before the 156s monitoring completed — causing getTestResults() to
  never run and VIEWER_STATS to never be printed
- VIEWER_SESSION_DURATION_SECONDS is only used as a hard timeout for
  the viewer process, not the actual monitoring duration (hardcoded 156s)
- Set to 900s (15 min) in both prod and gamma to give ample room for
  monitoring (156s) + browser cleanup + verify.py (~2 min)
- verify.py was exiting with code 1 when availability=0, causing
  execSync/sh to throw an exception — the catch block logged "Video
  verification failed" and skipped the metric push entirely
- Now always exits 0 with --json; availability value is in the JSON
  for the caller to read and publish regardless of pass/fail
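The new contract can be sketched from the caller's side: verify.py always exits 0 and emits JSON, and the caller pulls `availability` out of that JSON to publish it either way (the sed-based parsing below is a crude stand-in for real JSON parsing):

```shell
# Stand-in for the JSON that verify.py --json prints on a failed run
RESULT='{"availability": 0, "pass": false}'
# Extract the availability value; the caller publishes it regardless of pass/fail
AVAILABILITY=$(echo "$RESULT" | sed -n 's/.*"availability": *\([0-9]*\).*/\1/p')
echo "publishing ConsumerStorageAvailability=$AVAILABILITY"
```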
… v20

Gamma runner:
- Add buildConsumerProject() and refactor buildStorageCanary to accept
  isConsumer parameter with full consumer path (Java GetClip, video
  verification, ConsumerStorageAvailability metric)
- Add IS_STORAGE, IS_STORAGE_SINGLE_NODE, VIDEO_VERIFY_ENABLED,
  CONSUMER_NODE_LABEL parameters
- Add "Build and Run Storage Master and Consumer" parallel stage

Gamma orchestrator:
- Add RUN_STORAGE_PERIODIC (156s), RUN_STORAGE_SUB_RECONNECT (2700s),
  RUN_STORAGE_SINGLE_RECONNECT (3900s) parameters and parallel stages

Gamma seed:
- Add the three new boolean parameters

Node.js:
- Upgrade from v18 to v20 in setup-storage-viewer.sh
- Add version check: auto-upgrade existing v18 installations to v20
  (some npm packages now require Node >= 20)
…angs

- C binary can hang indefinitely when JoinStorageSession fails but ICE
  agent threads keep running after main() exits (observed 4+ hour hang)
- Wrap sh step with timeout(DURATION_IN_SECONDS + 15 min) in both
  runner.groovy and gamma_runner.groovy
- withRunnerWrapper catches the timeout exception and marks unstable,
  allowing rescheduling to continue
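The hang guard above wraps the `sh` step in Jenkins' `timeout()`; plain coreutils `timeout` shows the same idea (durations are shortened for the sketch, and `sleep 30` stands in for the hanging C binary):

```shell
# Kill the child if it outlives DURATION + grace, then mark the run
# unstable so rescheduling continues instead of hanging for hours
DURATION_IN_SECONDS=1
GRACE_SECONDS=1
if timeout $(( DURATION_IN_SECONDS + GRACE_SECONDS )) sleep 30; then
  STATUS=ok
else
  STATUS=unstable   # timeout fired (non-zero exit)
fi
echo "run status: $STATUS"
```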
- Reschedule was a regular stage — if earlier stages failed, it was
  skipped and the canary stopped running
- Now in post{always}: reschedules on success or failure, skips only
  on manual abort or when RESCHEDULE=false
- Matches prod runner behavior
- Consumer was rejecting GammaStoragePeriodic/SubReconnect/SingleReconnect
  labels with "Improper canary label" exception
- Add GAMMA_PERIODIC_LABEL, GAMMA_SUB_RECONNECT_LABEL,
  GAMMA_SINGLE_RECONNECT_LABEL constants to CanaryConstants.java
- GammaStoragePeriodic falls into the periodic switch case
- Gamma reconnect labels pass the default case validation
- Add JS_PAGE_URL parameter to gamma runner, orchestrator, and seed
- chrome-headless.js uses JS_PAGE_URL env var if set, falls back to
  default GitHub Pages URL
- Pass through setup-storage-viewer.sh and reschedule parameters
- Enables testing JS SDK branches (e.g., ICE-offer-order-fix) against
  gamma without modifying code
- Log region, canary label, and run time at consumer startup alongside
  the existing stream name log
- Helps diagnose ResourceNotFoundException by confirming which region
  the consumer is targeting
- Consumer was using standard KVS endpoint, causing ResourceNotFoundException
  when running against gamma (stream exists in gamma control plane only)
- Read CONTROL_PLANE_URI env var; if set, use withEndpointConfiguration
  instead of withRegion for the KVS client
- Pass CONTROL_PLANE_URI from ENDPOINT param to consumerEnvs in both
  runner.groovy and gamma_runner.groovy
- Log control plane URI at startup for debugging
- raw.githack.com and jsdelivr CDNs don't work (interstitial page or
  broken relative paths)
- If JS_PAGE_URL is a branch name (no ://), setup-storage-viewer.sh
  clones the repo at that branch to /tmp and rewrites the URL to
  file:// path
- Add --allow-file-access-from-files to Puppeteer launch args for
  file:// origin API calls
- Usage: set JS_PAGE_URL=ICE-offer-order-fix in gamma orchestrator
- Prod unchanged: no JS_PAGE_URL set, uses default GitHub Pages URL
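The URL-vs-branch dispatch reduces to checking for `://`; a sketch (the branch name is the example from the notes, and the clone/rewrite steps are only described in comments):

```shell
# Anything without :// is treated as a JS SDK branch name
JS_PAGE_URL="ICE-offer-order-fix"
case "$JS_PAGE_URL" in
  *://*) MODE="url" ;;     # use the page URL as-is
  *)     MODE="branch" ;;  # clone the repo at this branch, rewrite to a file:// path
esac
echo "JS_PAGE_URL mode: $MODE"
```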
- Run `npm install` after cloning the JS SDK branch to install dependencies
- Start `npm run develop` dev server on port 3001 instead of using file:// protocol
- Add readiness check loop (up to 30s) to wait for dev server startup
- Auto-detect and pull remote changes when the local clone is stale
- Replace static "already cloned" skip with local/remote HEAD comparison
…iewer

- Kill the JS SDK dev server after chrome-headless.js finishes to prevent orphaned processes on CI runners
- Dynamically allocate a free port instead of hardcoding 3001 to avoid collisions with other processes
- Pass the selected port to webpack dev server via --port flag
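One way to get a free port dynamically is to let the OS assign one (the python3 one-liner below is a common shell idiom; the script's actual mechanism may differ):

```shell
# Bind to port 0 so the kernel picks any free port, then hand it to webpack
PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()')
echo "starting webpack dev server with --port $PORT"
```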
…ver readiness

- Split setup-storage-viewer.sh into prepare-storage-viewer.sh (installs deps,
  clones JS SDK) and run-storage-viewer.sh (starts dev server, runs test)
- Viewer nodes now install dependencies in parallel with master build instead
  of waiting until after master is ready
- Keep setup-storage-viewer.sh as a thin wrapper for backward compatibility
- Replace TCP/HTTP port polling with log-based readiness check — wait for
  webpack's "Server started" message instead of guessing from connection status
- Detect and report dev server crashes during startup with full log output
- Increase readiness timeout to 120s to handle slow webpack compiles on CI
- Update runner.groovy and gamma_runner.groovy to call prepare then run
- Rename JS_PAGE_URL parameter to JS_BRANCH (default: master) in gamma jobs
- Change grep from "Server started" to "compiled successfully"
- "Server started" was a browser console message, not webpack stdout
- "compiled successfully" is the actual webpack output confirming the
  server is ready to serve pages
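The corrected readiness loop can be sketched as polling the dev-server log for webpack's output line (the log path, timeout, and sample log line are illustrative):

```shell
# Poll the dev-server log for webpack's "compiled successfully" message,
# up to the 120s readiness timeout described above
LOG=$(mktemp)
echo "webpack compiled successfully in 4212 ms" >> "$LOG"   # stand-in for the real log
READY=false
for _ in $(seq 1 120); do
  if grep -q "compiled successfully" "$LOG"; then
    READY=true
    break
  fi
  sleep 1
done
echo "dev server ready: $READY"
```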
- Cache compiled WebRTC binaries to /tmp/kvs-webrtc-build-cache/ keyed
  by git commit hash and TLS backend (openssl/mbedtls)
- On subsequent runs, skip cmake+make if the cached hash matches HEAD
  and restore binaries from cache instead
- Cert setup always runs regardless of cache state
- Applies to all jobs using buildWebRTCProject in both runner.groovy
  and gamma_runner.groovy
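The cache key described above combines the commit hash and TLS backend; a sketch (the cache root and key layout follow the notes, the hash value is illustrative, and a temp dir stands in for `/tmp/kvs-webrtc-build-cache`):

```shell
CACHE_ROOT=$(mktemp -d)           # real path: /tmp/kvs-webrtc-build-cache
HASH="abc1234"                    # normally: git rev-parse HEAD
TLS_BACKEND="openssl"             # or mbedtls
CACHE_DIR="${CACHE_ROOT}/${HASH}-${TLS_BACKEND}"
if [ -d "$CACHE_DIR" ]; then
  ACTION="restore"                # cache hit: skip cmake+make, copy binaries back
else
  ACTION="build"                  # cache miss: cmake+make, then populate the cache
  mkdir -p "$CACHE_DIR"
fi
echo "cache action: $ACTION"
```

Cert setup would still run after either branch, since it is excluded from the cache.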
- Add explicit --region to aws cloudwatch put-metric-data calls for
  ConsumerStorageAvailability and ConsumerSSIM metrics so they land
  in the correct region instead of the Jenkins agent's default
