Diagnostics page for Adaptive decisions #3647

Open
@TomAugspurger

It can be hard to determine when and why the scheduler decides to scale the cluster in adaptive mode. Ideally, a dashboard page would shed some light on this.

We currently have /json/counts.json, which provides desired_workers. I think that's it.
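
For anyone who wants to poke at what exists today, here is a minimal sketch of reading that value. It assumes the scheduler dashboard is reachable at http://localhost:8787 (adjust for your deployment) and that the payload is a flat JSON object containing a desired_workers key; the helper names fetch_counts and desired_workers are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_counts(base_url="http://localhost:8787"):
    """Fetch the scheduler's /json/counts.json payload as a dict.

    The default base_url is an assumption; use your dashboard address.
    """
    with urlopen(f"{base_url}/json/counts.json", timeout=5) as response:
        return json.load(response)


def desired_workers(counts):
    """Return the desired worker count, or None if the key is absent."""
    return counts.get("desired_workers")


# Usage against a running scheduler (requires network access):
#     print(desired_workers(fetch_counts()))
```

This only gives the current target, with no history and no explanation, which is the gap the rest of this issue is about.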

I think there are two main pieces of information to convey:

  1. Stock: The current state of things, including current CPU load, current CPU capacity, and the current desired CPU capacity. Likewise for memory.
  2. Flow: The history of decisions on when to scale the cluster up or down, ideally with information on why those decisions were made (the state at that time).
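
To make the two pieces concrete, here is a rough sketch of the data a dashboard page might consume. Everything in it is hypothetical: ClusterSnapshot, ScalingDecision, DecisionLog, and all field names are invented for illustration and are not existing scheduler or Adaptive APIs.

```python
from collections import deque
from dataclasses import dataclass, field
from time import time


@dataclass
class ClusterSnapshot:
    """Stock: the state of the cluster at one instant (hypothetical fields)."""
    cpu_load: float              # cores currently busy
    cpu_capacity: float          # cores currently available
    desired_cpu_capacity: float  # cores the scheduler would like to have
    memory_used: int             # bytes
    memory_capacity: int         # bytes
    desired_memory_capacity: int # bytes


@dataclass
class ScalingDecision:
    """Flow: one scaling decision plus the state that motivated it."""
    action: str                  # "up" or "down"
    workers_before: int
    workers_after: int
    snapshot: ClusterSnapshot
    reason: str
    timestamp: float = field(default_factory=time)


class DecisionLog:
    """Bounded history of decisions; the oldest entries are dropped first."""

    def __init__(self, maxlen=1000):
        self._decisions = deque(maxlen=maxlen)

    def record(self, decision):
        self._decisions.append(decision)

    def recent(self, n=10):
        """Return up to the n most recent decisions, oldest first."""
        return list(self._decisions)[-n:]
```

Storing the snapshot alongside each decision is what would let a page answer "why", not just "when". The bounded deque keeps memory use flat on long-running clusters.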

Here's a rough sketch for number 1.

[Image: Adaptive sketch]

cc @rsignell-usgs, @jsignell for adaptive things, and @jacobtomlinson for dashboard design things.

Labels: adaptive (All things relating to adaptive scaling)
