[EPIC] Improve Cluster Observability #1426

Two major directions (as of now): 

## Collect Executor Statistics

As a starting point, we could collect and aggregate executor statistics, similar to the [Spark UI Executors tab](https://spark.apache.org/docs/3.5.7/web-ui.html#executors-tab).

![](https://spark.apache.org/docs/3.5.7/img/webui-exe-tab.png)

Some statistics, such as memory utilisation and disk usage, would be collected on the executor; others, such as shuffle read and write, may be collected on the scheduler side.
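As a rough illustration of how the two sources could be merged, the sketch below keeps one record per executor, overwriting executor-reported gauges (memory, disk) and accumulating scheduler-observed shuffle counters. All names here (`ExecutorStats`, `StatsAggregator`, the field names) are hypothetical and not taken from any existing API; the actual split between executor and scheduler still needs to be decided.

```python
from dataclasses import dataclass, asdict


@dataclass
class ExecutorStats:
    """Aggregated view of one executor (all field names are hypothetical)."""
    executor_id: str
    memory_used_bytes: int = 0    # latest value reported by the executor
    disk_used_bytes: int = 0      # latest value reported by the executor
    shuffle_read_bytes: int = 0   # running total observed by the scheduler
    shuffle_write_bytes: int = 0  # running total observed by the scheduler


class StatsAggregator:
    """Merges executor-side gauges with scheduler-side counters."""

    def __init__(self) -> None:
        self._stats: dict[str, ExecutorStats] = {}

    def _entry(self, executor_id: str) -> ExecutorStats:
        # Create the record on first sight of an executor.
        return self._stats.setdefault(executor_id, ExecutorStats(executor_id))

    def report_resources(self, executor_id, memory_used_bytes, disk_used_bytes):
        # Gauges: keep only the most recent sample.
        entry = self._entry(executor_id)
        entry.memory_used_bytes = memory_used_bytes
        entry.disk_used_bytes = disk_used_bytes

    def record_shuffle(self, executor_id, read_bytes=0, write_bytes=0):
        # Counters: accumulate over the executor's lifetime.
        entry = self._entry(executor_id)
        entry.shuffle_read_bytes += read_bytes
        entry.shuffle_write_bytes += write_bytes

    def snapshot(self) -> list[dict]:
        # Plain dicts, sorted by executor id, ready for serialization.
        ordered = sorted(self._stats.values(), key=lambda s: s.executor_id)
        return [asdict(s) for s in ordered]
```

Treating memory and disk as gauges and shuffle volume as counters mirrors how the Spark Executors tab presents them, but that choice is an assumption of this sketch.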

We would need to add a REST interface to expose the collected metrics.
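To make the shape of such an interface concrete, here is a minimal sketch using only the Python standard library. The route `/api/executors/stats` and the `{"executors": [...]}` payload are assumptions made for illustration; the issue does not define a path or a response format.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical route; the issue does not define a path or payload shape.
STATS_PATH = "/api/executors/stats"


def build_response(path: str, snapshot: list) -> tuple:
    """Return (status_code, json_body) for a GET request to `path`."""
    if path != STATS_PATH:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps({"executors": snapshot})


def make_handler(get_snapshot):
    """Create a handler class bound to a callable returning current stats."""

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = build_response(self.path, get_snapshot())
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return StatsHandler
```

Keeping `build_response` separate from the handler makes the routing and payload testable without opening a socket. To try it locally, the handler can be passed to `http.server.HTTPServer(("127.0.0.1", 8080), make_handler(lambda: []))` and served with `serve_forever()`.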

## Per Stage Flame Graph

Similar to the [Nvidia RAPIDS Per Stage Flame Graph](https://nvidia.github.io/spark-rapids/docs/additional-functionality/per-stage-flamegraph.html), collect stack samples for each stage and produce flame graphs from them.

Further investigation is needed to determine what this would involve.
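One possible starting point for that investigation: flame graph tools such as `flamegraph.pl` and `inferno` consume "folded" stacks, one line per unique stack in the form `root;child;leaf count`. The sketch below groups stack samples by stage and renders them in that format. How samples are captured and tagged with a stage id is left open, and the function names and the `(stage_id, frames)` input shape are assumptions of this example.

```python
from collections import Counter, defaultdict


def fold_stacks(samples):
    """Group stack samples by stage and count identical stacks.

    `samples` is an iterable of (stage_id, frames) pairs, where `frames`
    lists function names from the root frame to the leaf frame.
    """
    per_stage = defaultdict(Counter)
    for stage_id, frames in samples:
        per_stage[stage_id][";".join(frames)] += 1
    return per_stage


def to_folded_lines(counter):
    """Render one stage's counts as 'root;...;leaf count' lines."""
    return [f"{stack} {count}" for stack, count in sorted(counter.items())]
```

For example, three samples for stage 1 with stacks `run;scan`, `run;scan`, and `run;join` fold into the lines `run;join 1` and `run;scan 2`, which can be written to a file and passed to a flame graph renderer to get one graph per stage.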

