
Clean viewer test#503

Open
YPandas wants to merge 138 commits into master from clean_viewer_test

Conversation


@YPandas YPandas commented Apr 2, 2026

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Yuqi Huang and others added 30 commits January 6, 2026 17:06
- Add Puppeteer-based storage viewer canary test with frame detection
- Add CloudWatch metrics integration for test monitoring
- Add Jenkins pipeline support for storage viewer testing
- Add shell script for automated test environment setup
- Update orchestrator and runner for storage viewer scenarios
- Add debug logs showing exactly which script and git hash are being used
- Extend master duration to 4x original (e.g., 40 minutes for 10-minute base)
- Add 3 join/leave cycles per viewer (3 minutes each, 1 minute gaps)
- Wait 15 minutes for master build on first iteration only
- Generate unique CLIENT_IDs per session for better tracking
- Reuse WebRTC build across multiple viewer sessions for efficiency
- Support both single and dual viewer scenarios with parallel execution
- Add JS_STORAGE_THREE_VIEWERS parameter and stage for 3-viewer testing
- Create reusable runViewerSessions() function to eliminate code duplication
- Organize functions: move runViewerSessions before buildStorageCanary
- Support parallel execution of 3 viewers with continuous master (4x duration)
- Add storageThreeViewers job configuration in orchestrator
- Maintain consistent timing: 10min wait for multi-viewer, 21min for single
- Replace retry logic with unique workspace isolation using ws()
- Each viewer gets isolated workspace: ${JOB_NAME}-${viewerId}-${BUILD_NUMBER}
- Prevents concurrent Git operations on same .git directory
- Eliminates pack file corruption and race conditions between parallel viewers
- Simplifies checkout logic by removing retry complexity
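The isolation scheme above can be sketched in plain shell (the real pipeline uses Jenkins' `ws()` step; the variable values below are illustrative):

```shell
# Hypothetical per-viewer workspace path, mirroring ${JOB_NAME}-${viewerId}-${BUILD_NUMBER}
JOB_NAME="storage-viewer"
VIEWER_ID="viewer1"
BUILD_NUMBER="42"
WS="/tmp/${JOB_NAME}-${VIEWER_ID}-${BUILD_NUMBER}"
# Each viewer clones into its own directory, so no two viewers ever
# touch the same .git directory concurrently
mkdir -p "$WS"
echo "workspace: $WS"
```

Because every workspace name embeds both the viewer ID and the build number, parallel viewers (and parallel builds) can never collide on Git pack files.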
- Add pre-cleanup stage to remove old workspaces and temp files at pipeline start
- Add try-finally blocks in runViewerSessions to ensure workspace cleanup on completion/failure
- Clean webrtc workspaces older than 1 hour, Jenkins temp files, and Git pack files
- Display disk usage before/after cleanup for monitoring
- Prevents /tmp disk space exhaustion from accumulated workspace directories
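A rough sketch of the pre-cleanup pass (the `/tmp` path and the 1-hour age cutoff come from the notes above; the sketch prints candidates instead of deleting them):

```shell
# Scan /tmp for webrtc workspace directories older than 60 minutes.
# The real stage removes them (rm -rf) and prints df -h before/after;
# here we only count them so the sketch is non-destructive.
STALE=$(find /tmp -maxdepth 1 -type d -name 'webrtc-*' -mmin +60 2>/dev/null | wc -l)
echo "stale webrtc workspaces older than 1h: $STALE"
```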
- Set fixed 23-minute duration (1380s) for continuous storage master
- Apply to all multi-viewer scenarios (single, two, three viewers)
- Add npm verification in setup-storage-viewer.sh to catch installation failures
- Update chrome-headless.js success criteria to only require storage session join
- Add comprehensive cache cleanup in Jenkins pipeline post section
- Fix Jenkins workspace cleanup paths to target correct directories
Remove conflicting Ubuntu Node.js packages (libnode72, nodejs) before
installing NodeSource Node.js 18 to prevent dpkg file overwrite errors.
Add npm verification to catch installation failures early.

Fixes "npm: command not found" errors caused by silent installation failures.
- Add JobName and RunnerLabel dimensions to CloudWatch metrics for runner isolation
- Start monitoring timer immediately after storage session join instead of waiting for ambiguous storageSessionActive condition
- Prevent false test failures when storage session successfully joins but times out waiting for additional conditions
- Track time between receiving SDP offer and sending SDP answer
- Capture timestamps from console logs for offer/answer events
- Publish OfferReceivedToAnswerSentTime metric to CloudWatch
- Helps monitor WebRTC negotiation performance and identify delays
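The metric itself is just a difference of two console-log timestamps; a minimal sketch (timestamp values are illustrative):

```shell
# Milliseconds at which the SDP offer was received and the answer was sent,
# as captured from the viewer's console logs
OFFER_MS=1700000000123
ANSWER_MS=1700000000456
ELAPSED=$(( ANSWER_MS - OFFER_MS ))
echo "OfferReceivedToAnswerSentTime: ${ELAPSED} ms"
```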
…acking

- Add 8 granular timing metrics to track WebRTC connection stages:
  * OfferReceivedToAnswerSentTime - SDP processing time
  * AnswerSentToFirstIceReceivedTime - ICE exchange initiation
  * FirstIceReceivedToFirstIceSentTime - ICE turnaround time
  * FirstIceSentToAllIceGeneratedTime - ICE gathering completion
  * AllIceGeneratedToConnectionEstablishedTime - Connection finalization
  * TotalConnectionEstablishmentTime - End-to-end connection time
  * JoinSSAsViewerTime - Storage session join time
  * TimeToFirstFrame - Video streaming readiness

- Add WebRTC connection retry tracking:
  * Track internal viewer connection retries via console log detection
  * WebRTCConnectionRetryCount metric for actual retry analysis
  * Remove script-level retry logic for cleaner failure handling

- Add StorageSessionSuccessRate percentage metric for ultimate success tracking

- Enhance CloudWatch integration:
  * Add publishPercentageMetric() for percentage-based metrics
  * Add publishCountMetric() for count-based metrics
  * Maintain consistent dimension structure across all metrics

- Improve observability for WebRTC connection bottleneck identification
  and reliability analysis in storage session viewer scenarios

This enables precise identification of connection stage bottlenecks and
provides visibility into both performance and reliability patterns.
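A dry-run sketch of the two new publish helpers; the real code shells out to `aws cloudwatch put-metric-data`, and the helper names below are taken from the notes while the argument layout is an assumption:

```shell
# Print the put-metric-data arguments instead of calling aws, so the
# sketch runs without credentials; units match the helper names
publish_percentage_metric() {
  echo "put-metric-data --metric-name $1 --unit Percent --value $2"
}
publish_count_metric() {
  echo "put-metric-data --metric-name $1 --unit Count --value $2"
}
P=$(publish_percentage_metric StorageSessionSuccessRate 100)
C=$(publish_count_metric WebRTCConnectionRetryCount 0)
echo "$P"
echo "$C"
```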
- Problem: StoragePeriodic jobs stopped rescheduling after adding pre/post cleanup stages
- Root cause: Aggressive cleanup was interfering with all jobs, including simple storage jobs that don't need it
- Solution: Only apply cleanup to JS storage jobs (JS_STORAGE_VIEWER_*) that cause disk space issues
- Impact: StoragePeriodic and other standard jobs preserve job state for proper rescheduling
- JS storage jobs still get necessary cleanup to prevent disk space exhaustion
…ime for viewer to increase the time interval between each test.
- enable endpoint configuration
Add support for custom endpoint and metric suffix parameters to enable
testing against gamma environments with distinguishable CloudWatch metrics.

Changes:
- chrome-headless.js: Add metricSuffix config option and getMetricName()
  helper to append suffix to all metric names (e.g., JoinSSAsViewerTime_Viewer1-gamma)
- setup-storage-viewer.sh: Pass ENDPOINT and METRIC_SUFFIX env vars to test
- runner.groovy: Add ENDPOINT and METRIC_SUFFIX parameters, pass to
  runViewerSessions() and Reschedule stage for persistence across retries
- js-storage-gamma-tests.groovy: Pass METRIC_SUFFIX="-gamma" to all test builds
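A sketch of the suffix mechanism (the real `getMetricName()` helper lives in chrome-headless.js; the `-gamma` suffix and example metric name come from the notes above):

```shell
# Append METRIC_SUFFIX when set, so gamma metrics are distinguishable
# from prod metrics in CloudWatch
METRIC_SUFFIX="-gamma"
get_metric_name() {
  echo "$1${METRIC_SUFFIX}"
}
NAME=$(get_metric_name JoinSSAsViewerTime_Viewer1)
echo "$NAME"
```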
Add CONTROL_PLANE_URI and AWS_DEFAULT_REGION to masterEnvs in
buildStorageCanary() so the C storage master can connect to custom
endpoints (e.g., gamma) when ENDPOINT parameter is provided.
- Fix endpoint parameter in chrome-headless.js to always be included
  in URLSearchParams
- Add VIEWER_WAIT_MINUTES parameter to runner.groovy to make the viewer
  wait time for master build configurable instead of hardcoded to 55
- Update orchestrator.groovy to pass VIEWER_WAIT_MINUTES=55 for
  production JS storage viewer tests
- Update js-storage-gamma-tests.groovy to pass VIEWER_WAIT_MINUTES=25
  for gamma tests
- Propagate VIEWER_WAIT_MINUTES through Reschedule stage for continuity
Yuqi Huang added 30 commits April 20, 2026 15:51
- Reschedule was a regular stage — if any earlier stage threw an
  uncaught exception, the pipeline skipped it and the canary stopped
- Now in post{always}, it fires regardless of build result
- Stop deleting consumer GetClip MP4 after verification; log full path
- Use timestamped filenames (clip-{timestamp}.mp4) to prevent overwrites
- Increase NUM_LOGS from 1 to 100 in seed.groovy so build logs aren't
  immediately deleted by the next scenario sharing the same runner
…s, 156s)

C code:
- Samples.h, Include.h: NUMBER_OF_H264_FRAME_FILES 1500→4676, DEFAULT_FPS_VALUE 25→30
- kvsWebRTCClientMaster.cpp: encoderStats resolution 640x480→1280x720

verify.py:
- FPS 25→30, TOTAL_SOURCE_FRAMES 4500→4676, EXPECTED_DURATION 180→155.87s
- TIMER_CROP updated to (25, 20, 145, 90) matching new video's sync box position

Durations (all set to 156s = 2 min 36 sec):
- runner.groovy: continuous master hardcoded duration, VIEWER_SESSION_DURATION_SECONDS
  default and fallback
- gamma_runner.groovy: same — master duration, DURATION_IN_SECONDS,
  VIEWER_SESSION_DURATION_SECONDS default
- orchestrator.groovy: STORAGE_PERIODIC_DURATION_IN_SECONDS,
  STORAGE_WITH_VIEWER_DURATION_IN_SECONDS
- chrome-headless.js: monitorConnection duration 180000→156000ms

Frame files:
- Replaced 4500 old frames with 4676 new frames from videoaudiosamplemedia.mp4
- Renamed from 0-indexed (gstreamer output) to 1-indexed (C code expects frame-0001)
- So that the consumer has time to do video verification
- Previous gstreamer extraction produced frames without byte-stream
  format, causing STATUS_SRTP_INVALID_NALU (0x5c000003) errors in the
  C SDK's writeFrame()
- Re-extracted with 'video/x-h264,stream-format=byte-stream' caps to
  ensure Annex B start codes (00 00 00 01) in each frame
- Renamed 0-indexed (frame-0000) to 1-indexed (frame-0001) to match
  C code's fileIndex % N + 1 pattern — frame-0001 now contains the
  SPS/PPS needed for decoder initialization
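The 0-indexed to 1-indexed rename can be sketched as a two-pass shell rename; the two passes avoid the collision where `frame-0000` would overwrite the still-existing `frame-0001` (file names follow the pattern above, the directory is a throwaway):

```shell
# Re-extraction used caps 'video/x-h264,stream-format=byte-stream' so each
# file starts with an Annex B start code (00 00 00 01); here we only
# demonstrate the rename step on empty stand-in files
DIR=$(mktemp -d)
cd "$DIR"
touch frame-0000.h264 frame-0001.h264 frame-0002.h264
# Pass 1: shift every index up by one, into temporary names
for f in frame-*.h264; do
  n=${f#frame-}; n=${n%.h264}
  mv "$f" "$(printf 'frame-%04d.h264.tmp' $((10#$n + 1)))"
done
# Pass 2: drop the .tmp suffix now that no collisions are possible
for f in *.tmp; do mv "$f" "${f%.tmp}"; done
ls "$DIR"
```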
- Fix missing return statement
- Fix: 'gConnectingStartTime' was not declared in the correct scope
- Extract ALL frames from reference video (not just 1fps) so each frame
  is directly addressable by index
- OCR each clip frame to read the sync box frame counter, then compare
  against the reference frame at that exact index via SSIM
- No more offset calculation or second-based alignment — the frame
  number IS the index
- OCR improved: upscale 4x, threshold, auto-crop to white box, digit
  whitelist
- Thresholds now match VideoVerificationComponent.java:
  duration >= 120s, max SSIM > 0.99, avg SSIM > 0.85, min SSIM > 0.03
- Reorder cleanup: stop recording → stop viewer → close browser → run
  verify.py. Previously verify.py ran while browser was still open,
  causing stale disconnects/reconnects during the 5+ minute verification
  that polluted UnexpectedDisconnectCount and ViewerReconnectCount
- Increase verify.py timeout from 300s to 600s — extracting all 4676
  reference frames + per-frame OCR can exceed 5 minutes on slow nodes
…hreshold

- Instead of extracting all 4676 reference PNGs, OCR clip frames first
  to get frame numbers, then extract only those ~120 reference frames
  via a single ffmpeg select filter — ~97% fewer frames to extract
- Fix dropped frame threshold: check total frame count in the received
  clip (via ffprobe) against 3176, not the number of SSIM comparisons
- Close browser before running verify.py to prevent stale WebRTC
  disconnects/reconnects during the long verification process
- Increase verify.py timeout from 300s to 600s
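The single-ffmpeg-invocation trick above amounts to building one `select` expression from the OCR'd frame numbers; a sketch (frame numbers and file names are illustrative):

```shell
# Build an ffmpeg select filter matching only the needed frame indices,
# so ~120 reference frames are extracted instead of all 4676
FRAMES="10 42 99"
EXPR=""
for n in $FRAMES; do
  EXPR="${EXPR}eq(n\\,${n})+"   # commas must be escaped inside -vf filters
done
EXPR=${EXPR%+}                   # strip the trailing '+'
echo "ffmpeg -i reference.mp4 -vf \"select='${EXPR}'\" -vsync 0 ref-%04d.png"
```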
- Viewer: ViewerSSIMAvg, ViewerSSIMMin, ViewerSSIMMax (Percent, 0-100)
- Consumer: ConsumerSSIMAvg, ConsumerSSIMMin, ConsumerSSIMMax (raw 0-1)
- Cleanup timeout was firing at 146s (duration-10), killing the viewer
  before the 156s monitoring completed — causing getTestResults() to
  never run and VIEWER_STATS to never be printed
- VIEWER_SESSION_DURATION_SECONDS is only used as a hard timeout for
  the viewer process, not the actual monitoring duration (hardcoded 156s)
- Set to 900s (15 min) in both prod and gamma to give ample room for
  monitoring (156s) + browser cleanup + verify.py (~2 min)
- verify.py was exiting with code 1 when availability=0, causing
  execSync/sh to throw an exception — the catch block logged "Video
  verification failed" and skipped the metric push entirely
- Now always exits 0 with --json; availability value is in the JSON
  for the caller to read and publish regardless of pass/fail
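The new contract can be sketched from the caller's side: verify.py always exits 0 and emits JSON, and the caller pulls `availability` out of that JSON to publish it either way (the sed-based parsing below is a crude stand-in for real JSON parsing):

```shell
# Stand-in for the JSON that verify.py --json prints on a failed run
RESULT='{"availability": 0, "pass": false}'
# Extract the availability value; the caller publishes it regardless of pass/fail
AVAILABILITY=$(echo "$RESULT" | sed -n 's/.*"availability": *\([0-9]*\).*/\1/p')
echo "publishing ConsumerStorageAvailability=$AVAILABILITY"
```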
… v20

Gamma runner:
- Add buildConsumerProject() and refactor buildStorageCanary to accept
  isConsumer parameter with full consumer path (Java GetClip, video
  verification, ConsumerStorageAvailability metric)
- Add IS_STORAGE, IS_STORAGE_SINGLE_NODE, VIDEO_VERIFY_ENABLED,
  CONSUMER_NODE_LABEL parameters
- Add "Build and Run Storage Master and Consumer" parallel stage

Gamma orchestrator:
- Add RUN_STORAGE_PERIODIC (156s), RUN_STORAGE_SUB_RECONNECT (2700s),
  RUN_STORAGE_SINGLE_RECONNECT (3900s) parameters and parallel stages

Gamma seed:
- Add the three new boolean parameters

Node.js:
- Upgrade from v18 to v20 in setup-storage-viewer.sh
- Add version check: auto-upgrade existing v18 installations to v20
  (some npm packages now require Node >= 20)
…angs

- C binary can hang indefinitely when JoinStorageSession fails but ICE
  agent threads keep running after main() exits (observed 4+ hour hang)
- Wrap sh step with timeout(DURATION_IN_SECONDS + 15 min) in both
  runner.groovy and gamma_runner.groovy
- withRunnerWrapper catches the timeout exception and marks unstable,
  allowing rescheduling to continue
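The hang guard above wraps the `sh` step in Jenkins' `timeout()`; plain coreutils `timeout` shows the same idea (durations are shortened for the sketch, and `sleep 30` stands in for the hanging C binary):

```shell
# Kill the child if it outlives DURATION + grace, then mark the run
# unstable so rescheduling continues instead of hanging for hours
DURATION_IN_SECONDS=1
GRACE_SECONDS=1
if timeout $(( DURATION_IN_SECONDS + GRACE_SECONDS )) sleep 30; then
  STATUS=ok
else
  STATUS=unstable   # timeout fired (non-zero exit)
fi
echo "run status: $STATUS"
```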
- Reschedule was a regular stage — if earlier stages failed, it was
  skipped and the canary stopped running
- Now in post{always}: reschedules on success or failure, skips only
  on manual abort or when RESCHEDULE=false
- Matches prod runner behavior
- Consumer was rejecting GammaStoragePeriodic/SubReconnect/SingleReconnect
  labels with "Improper canary label" exception
- Add GAMMA_PERIODIC_LABEL, GAMMA_SUB_RECONNECT_LABEL,
  GAMMA_SINGLE_RECONNECT_LABEL constants to CanaryConstants.java
- GammaStoragePeriodic falls into the periodic switch case
- Gamma reconnect labels pass the default case validation
- Add JS_PAGE_URL parameter to gamma runner, orchestrator, and seed
- chrome-headless.js uses JS_PAGE_URL env var if set, falls back to
  default GitHub Pages URL
- Pass through setup-storage-viewer.sh and reschedule parameters
- Enables testing JS SDK branches (e.g., ICE-offer-order-fix) against
  gamma without modifying code
- Log region, canary label, and run time at consumer startup alongside
  the existing stream name log
- Helps diagnose ResourceNotFoundException by confirming which region
  the consumer is targeting
- Consumer was using standard KVS endpoint, causing ResourceNotFoundException
  when running against gamma (stream exists in gamma control plane only)
- Read CONTROL_PLANE_URI env var; if set, use withEndpointConfiguration
  instead of withRegion for the KVS client
- Pass CONTROL_PLANE_URI from ENDPOINT param to consumerEnvs in both
  runner.groovy and gamma_runner.groovy
- Log control plane URI at startup for debugging
- raw.githack.com and jsdelivr CDNs don't work (interstitial page or
  broken relative paths)
- If JS_PAGE_URL is a branch name (no ://), setup-storage-viewer.sh
  clones the repo at that branch to /tmp and rewrites the URL to
  file:// path
- Add --allow-file-access-from-files to Puppeteer launch args for
  file:// origin API calls
- Usage: set JS_PAGE_URL=ICE-offer-order-fix in gamma orchestrator
- Prod unchanged: no JS_PAGE_URL set, uses default GitHub Pages URL
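The URL-vs-branch dispatch reduces to checking for `://`; a sketch (the branch name is the example from the notes, and the clone/rewrite steps are only described in comments):

```shell
# Anything without :// is treated as a JS SDK branch name
JS_PAGE_URL="ICE-offer-order-fix"
case "$JS_PAGE_URL" in
  *://*) MODE="url" ;;     # use the page URL as-is
  *)     MODE="branch" ;;  # clone the repo at this branch, rewrite to a file:// path
esac
echo "JS_PAGE_URL mode: $MODE"
```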
- Run `npm install` after cloning the JS SDK branch to install dependencies
- Start `npm run develop` dev server on port 3001 instead of using file:// protocol
- Add readiness check loop (up to 30s) to wait for dev server startup
- Auto-detect and pull remote changes when the local clone is stale
- Replace static "already cloned" skip with local/remote HEAD comparison
…iewer

- Kill the JS SDK dev server after chrome-headless.js finishes to prevent orphaned processes on CI runners
- Dynamically allocate a free port instead of hardcoding 3001 to avoid collisions with other processes
- Pass the selected port to webpack dev server via --port flag
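One way to get a free port dynamically is to let the OS assign one (the python3 one-liner below is a common shell idiom; the script's actual mechanism may differ):

```shell
# Bind to port 0 so the kernel picks any free port, then hand it to webpack
PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()')
echo "starting webpack dev server with --port $PORT"
```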
…ver readiness

- Split setup-storage-viewer.sh into prepare-storage-viewer.sh (installs deps,
  clones JS SDK) and run-storage-viewer.sh (starts dev server, runs test)
- Viewer nodes now install dependencies in parallel with master build instead
  of waiting until after master is ready
- Keep setup-storage-viewer.sh as a thin wrapper for backward compatibility
- Replace TCP/HTTP port polling with log-based readiness check — wait for
  webpack's "Server started" message instead of guessing from connection status
- Detect and report dev server crashes during startup with full log output
- Increase readiness timeout to 120s to handle slow webpack compiles on CI
- Update runner.groovy and gamma_runner.groovy to call prepare then run
- Rename JS_PAGE_URL parameter to JS_BRANCH (default: master) in gamma jobs
- Change grep from "Server started" to "compiled successfully"
- "Server started" was a browser console message, not webpack stdout
- "compiled successfully" is the actual webpack output confirming the
  server is ready to serve pages
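The corrected readiness loop can be sketched as polling the dev-server log for webpack's output line (the log path, timeout, and sample log line are illustrative):

```shell
# Poll the dev-server log for webpack's "compiled successfully" message,
# up to the 120s readiness timeout described above
LOG=$(mktemp)
echo "webpack compiled successfully in 4212 ms" >> "$LOG"   # stand-in for the real log
READY=false
for _ in $(seq 1 120); do
  if grep -q "compiled successfully" "$LOG"; then
    READY=true
    break
  fi
  sleep 1
done
echo "dev server ready: $READY"
```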
- Cache compiled WebRTC binaries to /tmp/kvs-webrtc-build-cache/ keyed
  by git commit hash and TLS backend (openssl/mbedtls)
- On subsequent runs, skip cmake+make if the cached hash matches HEAD
  and restore binaries from cache instead
- Cert setup always runs regardless of cache state
- Applies to all jobs using buildWebRTCProject in both runner.groovy
  and gamma_runner.groovy
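The cache key described above combines the commit hash and TLS backend; a sketch (the cache root and key layout follow the notes, the hash value is illustrative, and a temp dir stands in for `/tmp/kvs-webrtc-build-cache`):

```shell
CACHE_ROOT=$(mktemp -d)           # real path: /tmp/kvs-webrtc-build-cache
HASH="abc1234"                    # normally: git rev-parse HEAD
TLS_BACKEND="openssl"             # or mbedtls
CACHE_DIR="${CACHE_ROOT}/${HASH}-${TLS_BACKEND}"
if [ -d "$CACHE_DIR" ]; then
  ACTION="restore"                # cache hit: skip cmake+make, copy binaries back
else
  ACTION="build"                  # cache miss: cmake+make, then populate the cache
  mkdir -p "$CACHE_DIR"
fi
echo "cache action: $ACTION"
```

Cert setup would still run after either branch, since it is excluded from the cache.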
- Add explicit --region to aws cloudwatch put-metric-data calls for
  ConsumerStorageAvailability and ConsumerSSIM metrics so they land
  in the correct region instead of the Jenkins agent's default
