Edge consistently times out when collecting from specific WPT segment #644
Description
When running tests in Edge via Sauce Labs, we currently divide WPT into 100 segments (referred to by the WPT infrastructure as "chunks"). On 2018-12-26, the collection from chunk 10 of 100 appeared to complete successfully, but the Buildbot master did not detect completion, and the job timed out after 20 minutes without output. The same chunk has timed out regularly in the weeks since.
Example of logs (from 2019-01-28)
2019-01-28 14:05:26,345 INFO validate-wpt-results wpt-run:stdout Testing ed53b0478faeb87abd71cd9972d4e2b7a6103241 == a7c05b88682fd5c1adb58984bb0a9edf46daee95
2019-01-28 14:05:26,345 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Got signal terminated
2019-01-28 14:05:26,442 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Cleaning up.
2019-01-28 14:05:26,442 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Removing tunnel 9f7ba991b73046a593aab88fbfebe745.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Waiting for any active jobs using this tunnel to finish.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Press CTRL-C again to shut down immediately.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Note: if you do this, tests that are still running will fail.
2019-01-28 14:05:31,267 INFO validate-wpt-results wpt-run:stdout 108:10.77 INFO Closing logging queue
2019-01-28 14:05:31,270 INFO validate-wpt-results wpt-run:stdout 108:10.77 INFO queue closed
2019-01-28 14:05:31,577 INFO validate-wpt-results WPT CLI exited with return code 1
2019-01-28 14:05:31,632 INFO validate-wpt-results Expected 1229 results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 1229 results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 0 unexpected results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 0 missing results
command timed out: 1200 seconds without output running ['run-and-verify.py', '--max-attempts', '3', '--log-wptreport', '/tmp/tmpXvRNyX/report.json', '--log-raw', '/tmp/tmpXvRNyX/log-raw.txt', '--', '--log-mach', '-', '--this-chunk', '10', '--total-chunks', '100', '--sauce-platform', 'windows 10', '--sauce-user', 'wpt-shiek', '--sauce-key', '<sauce_labs_key_shiek>', '--sauce-tunnel-id', 'shiek', '--sauce-connect-binary', 'sc', '--sauce-connect-arg=--logfile=/var/log/sauce-connect/sc.log', '--sauce-init-timeout', '45', '--no-restart-on-unexpected', '--run-by-dir', '3', 'sauce:MicrosoftEdge:17'], attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1
remoteFailed: [Failure instance: Traceback from remote host -- exceptions.RuntimeError: SIGKILL failed to kill process
]
Between Buildbot, the "retry" wrapper script, the WPT CLI, and the Sauce Connect binary, the process tree for this job is fairly deep. My initial thought is that some thread is still active and is preventing one of these processes from exiting, though the message "WPT CLI exited with return code 1" suggests the hang occurs fairly high in that tree, after the WPT CLI itself has returned.
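One way to test the lingering-thread theory is to list the threads that are still alive once the wrapper script has finished its work. The following is a minimal sketch, not code from this repository: it assumes Python 3, and the helper names (lingering_threads, report_lingering_threads) and the "stuck-worker" thread are hypothetical. The interpreter waits for every non-daemon thread before exiting, so any thread reported here could keep the process from terminating.

```python
import threading


def lingering_threads():
    """Return live non-daemon threads other than the main thread.

    The interpreter joins every such thread at shutdown, so any entry
    here can keep a script alive after its main code path has finished.
    """
    main = threading.main_thread()
    return [
        t for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    ]


def report_lingering_threads():
    """Print one line per lingering thread (hypothetical diagnostic)."""
    for t in lingering_threads():
        print("still alive: name=%s ident=%s" % (t.name, t.ident), flush=True)


if __name__ == "__main__":
    # Simulate a worker that outlives the main code path.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, name="stuck-worker")
    worker.start()

    report_lingering_threads()

    # Release the worker so this demo itself can exit.
    stop.set()
    worker.join()
```

Calling something like report_lingering_threads() just after the "Found 0 missing results" message would show whether a thread inside the wrapper is still running. If nothing is reported, the cause is more likely a child process elsewhere in the tree.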