Description
@bbockelm reports possible issues observed by the Coffea team when scaling past ~50 workers with TLS, as well as issues where auto-scaled-down workers are killed while still holding useful results in memory. We should investigate both issues on our setup and see if we can reproduce them.
I'm also interested in testing overall stability during large or long-running calculations by manually killing workers and checking whether Dask recovers dynamically and in a reasonable way (as it claims it can).
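As a starting point for the manual-kill test, here is a rough sketch that could be run locally before trying it on the real cluster. It assumes a `LocalCluster` with process-based workers on a POSIX machine, so that worker PIDs are reachable with `os.kill`; on our actual setup the kill step would have to go through the batch system instead. The names `pick_victims`, `slow_square`, and `run_kill_test` are made up for illustration, and the task counts and sleep times are arbitrary.

```python
import os
import random
import signal
import time


def pick_victims(addresses, fraction, seed=0):
    """Choose a reproducible subset of worker addresses to kill (hypothetical helper)."""
    addresses = sorted(addresses)
    # Kill at least one worker whenever any workers exist.
    count = max(1, int(len(addresses) * fraction)) if addresses else 0
    return random.Random(seed).sample(addresses, count)


def slow_square(x):
    """Stand-in for a real task; slow enough that work is still in flight at kill time."""
    time.sleep(0.2)
    return x * x


def run_kill_test(n_workers=4, n_tasks=200, fraction=0.25):
    """Kill a fraction of local workers mid-computation and check the results."""
    # Imported here so the helpers above stay usable without dask installed.
    from dask.distributed import Client, LocalCluster

    with LocalCluster(n_workers=n_workers, threads_per_worker=1, processes=True) as cluster:
        with Client(cluster) as client:
            futures = client.map(slow_square, range(n_tasks))
            time.sleep(2)  # let some results accumulate in worker memory

            # client.run returns {worker_address: return_value}.
            pids = client.run(os.getpid)
            for address in pick_victims(pids, fraction):
                # Abrupt kill (assumption: POSIX, workers on this machine).
                os.kill(pids[address], signal.SIGKILL)

            # If Dask recovers, lost results are recomputed and this completes.
            results = client.gather(futures)
            return results == [x * x for x in range(n_tasks)]
```

To try it, call `run_kill_test()` from a script under an `if __name__ == "__main__":` guard, since process-based workers need one. Note that `SIGKILL` models an abrupt loss; a graceful scale-down would instead go through something like `client.retire_workers`, which is meant to move in-memory results off a worker before it closes, so the two cases are worth testing separately.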