Undulating cluster auto-scaling #5931

Open
@orf

Description

What happened:

As a simple but long-running job with ~1200 workers at peak nears its end, the scheduler's auto-scaler seems to become confused and starts launching and destroying large numbers of workers.

What's concerning is that new workers joining the cluster is expensive: the scheduler's CPU spikes, and large numbers of workers connecting at the same time have caused us instability in the past. So this looks like a perfect storm: a long-running, expensive job starts to wind down, and the scheduler essentially starts trying to DDOS itself by launching and destroying hundreds of workers every minute or so.
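For reference, these are the adaptive-scaling knobs that look most relevant to how aggressively workers are launched and retired. This is only a minimal sketch: the LocalCluster class and every number below are illustrative assumptions, not our actual deployment.

from dask.distributed import Client, LocalCluster

# Illustrative only: any deploy class exposing .adapt() takes the same knobs.
cluster = LocalCluster(n_workers=0)

# Damp scaling decisions: re-evaluate less often and require several
# consecutive recommendations (wait_count) before retiring a worker, so
# short-lived swings in the desired worker count don't launch or kill
# hundreds of workers at once.
cluster.adapt(
    minimum=1,              # illustrative bounds
    maximum=1_200,
    interval="30s",         # how often the adaptive loop re-evaluates
    wait_count=10,          # consecutive recommendations before scaling down
    target_duration="10m",  # aim to finish the measured backlog in ~10 minutes
)

client = Client(cluster)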

The lower number of workers is probably the correct number, given the number of tasks remaining:

[Screenshot 2022-03-10 at 23:59:52]
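My (unverified) understanding of how the scheduler picks a target worker count is roughly: measured occupancy divided by a target duration, capped by the number of tasks actually available to run. Below is a sketch of that heuristic, assuming I'm reading the adaptive logic correctly; it is not the real implementation.

import math

def sketch_adaptive_target(total_occupancy_s, n_runnable_tasks, target_duration_s=5.0):
    # Rough sketch of the worker-count heuristic adaptive scaling appears to
    # use (cf. Scheduler.adaptive_target); not the actual implementation.
    # Enough workers to clear the measured backlog in ~target_duration ...
    cpu_bound = math.ceil(total_occupancy_s / target_duration_s)
    # ... but no more workers than there are tasks left to run.
    return max(0, min(cpu_bound, n_runnable_tasks))

# Near the end of the job the occupancy estimate can swing a lot between
# checks, which would move this target up and down sharply.
print(sketch_adaptive_target(total_occupancy_s=3_600.0, n_runnable_tasks=400))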

But the scheduler will spin up another ~400 workers that sit idle, before tearing them down again:

[Screen recording: Screen.Recording.2022-03-10.at.23.51.33.mov]

You can see this on Grafana, even though the numbers are slightly off and the resolution doesn't capture every peak and dip:

[Grafana graph]

The task graph isn't complex: it reads from Parquet, repartitions into small chunks, applies map_partitions, then splits the results again before writing back to Parquet:

import dask.dataframe as ddf

df = ddf.read_parquet("s3://...")
df = df.repartition(npartitions=4_462)
# Runtime for each partition: ~1 hour
df = df.map_partitions(expensive_function)
# Partitions produced are _big_, which can lead to OOM when writing parquet, so split them up again
df = df.repartition(npartitions=df.npartitions * 3)
df.to_parquet("s3://...", compute=False).compute()

Anything else we need to know?:

I'm not able to run a cluster state dump. I know this isn't helpful, but this is a very expensive, long-running job and I can't risk disrupting it as it nears completion. I can supply raw Dask scheduler logs via email or another method. I also can't get a raw client connection to the scheduler without a fair bit of effort.

I can probably run this job once more to collect specific debug information if you tell me exactly what is needed. For example, would dumping the cluster state at the end of the run be OK, or does it need to happen while the scheduler is exhibiting this issue?
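For reference, my assumption is that the dump you would want looks something like the sketch below; the scheduler address and filename are placeholders, and I'm assuming Client.dump_cluster_state is available in 2022.2.1.

from dask.distributed import Client

# Placeholder address: the real scheduler isn't easily reachable from my side.
client = Client("tcp://scheduler-address:8786")

# Writes a compressed msgpack dump of scheduler/worker state that could be
# attached to this issue.
client.dump_cluster_state("undulating-autoscaling-dump")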

Environment:

  • Dask version: 2022.2.1
  • Python version: 3.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Labels: adaptive (All things relating to adaptive scaling)