
Add comprehensive metrics instrumentation for scheduler and executor #10

Merged: lukekim merged 7 commits into spiceai-51 from spiceai-51-metrics on Jan 23, 2026

@phillipleblanc commented on Jan 22, 2026

Rationale for this change

This PR adds comprehensive metrics instrumentation to track scheduler and executor performance, enabling better observability into task scheduling, shuffle operations, and resource utilization.

What changes are included in this PR?

This PR includes 6 feature commits (plus a final lint fix, for 7 in total) that progressively build out the metrics infrastructure:

1. Add shuffle read metrics extraction and QueryStageExecutor::plan() method

  • Add public getter methods to PartitionStats (num_rows, num_batches, num_bytes)
  • Extend QueryStageExecutor trait with plan() method to access underlying ExecutionPlan
  • Add extract_shuffle_read_metrics() to walk plan tree and sum ShuffleReaderExec partition stats
  • Record shuffle read metrics (bytes, rows, duration) after successful task execution in executor
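
The plan-walking step described in the bullets above can be illustrated with a minimal sketch. It is not the code from this PR: `PlanNode` and `ShuffleReadTotals` are hypothetical stand-ins (the real implementation walks DataFusion `ExecutionPlan` nodes and looks for `ShuffleReaderExec`), and the `PartitionStats` fields are assumed to be optional counters.

```rust
/// Hypothetical stand-in for Ballista's `PartitionStats`; the field types
/// are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct PartitionStats {
    num_rows: Option<u64>,
    num_batches: Option<u64>,
    num_bytes: Option<u64>,
}

impl PartitionStats {
    pub fn new(num_rows: Option<u64>, num_batches: Option<u64>, num_bytes: Option<u64>) -> Self {
        Self { num_rows, num_batches, num_bytes }
    }

    // Public getters, mirroring the accessors the PR describes.
    pub fn num_rows(&self) -> Option<u64> {
        self.num_rows
    }
    pub fn num_batches(&self) -> Option<u64> {
        self.num_batches
    }
    pub fn num_bytes(&self) -> Option<u64> {
        self.num_bytes
    }
}

/// Simplified plan tree (assumption): either a shuffle reader with
/// per-partition stats, or any other operator with children.
pub enum PlanNode {
    ShuffleReader { partitions: Vec<PartitionStats> },
    Other { children: Vec<PlanNode> },
}

/// Hypothetical result type holding the summed shuffle read metrics.
#[derive(Debug, Default, PartialEq)]
pub struct ShuffleReadTotals {
    pub rows: u64,
    pub bytes: u64,
}

/// Walks the whole tree and sums the stats of every shuffle reader node.
pub fn extract_shuffle_read_metrics(plan: &PlanNode) -> ShuffleReadTotals {
    let mut totals = ShuffleReadTotals::default();
    accumulate(plan, &mut totals);
    totals
}

fn accumulate(node: &PlanNode, totals: &mut ShuffleReadTotals) {
    match node {
        PlanNode::ShuffleReader { partitions } => {
            for stats in partitions {
                // Missing stats are treated as zero in this sketch.
                totals.rows += stats.num_rows().unwrap_or(0);
                totals.bytes += stats.num_bytes().unwrap_or(0);
            }
        }
        PlanNode::Other { children } => {
            for child in children {
                accumulate(child, totals);
            }
        }
    }
}
```

Treating missing stats as zero keeps the totals a lower bound rather than failing the task; whether the PR does the same is not stated in the description.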

2. Add shuffle locality metrics to ExecutorMetricsCollector, SchedulerMetricsCollector, and ShuffleReaderExec

  • Add record_shuffle_read_local/remote methods to ExecutorMetricsCollector trait
  • Add record_task_shuffle_affinity_hit/miss methods to SchedulerMetricsCollector trait
  • Add ShuffleReadMetricsCallback trait in ballista-core for tracking local vs remote reads
  • Instrument shuffle_reader.rs to call metrics callback during partition fetches
  • Add SessionConfigExt methods to pass metrics callback via session config
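
A rough sketch of how a local-vs-remote callback could look follows. The method signatures, the `CountingCallback` type, and the host-comparison helper `record_partition_fetch` are all assumptions for illustration; the PR description only names the trait and the `record_shuffle_read_local/remote` methods, not their parameters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed shape of the callback; the real trait in ballista-core may take
/// different parameters.
pub trait ShuffleReadMetricsCallback: Send + Sync {
    fn record_shuffle_read_local(&self, bytes: u64);
    fn record_shuffle_read_remote(&self, bytes: u64);
}

/// Hypothetical in-memory implementation that only sums bytes.
#[derive(Default)]
pub struct CountingCallback {
    local_bytes: AtomicU64,
    remote_bytes: AtomicU64,
}

impl CountingCallback {
    pub fn local_bytes(&self) -> u64 {
        self.local_bytes.load(Ordering::Relaxed)
    }
    pub fn remote_bytes(&self) -> u64 {
        self.remote_bytes.load(Ordering::Relaxed)
    }
}

impl ShuffleReadMetricsCallback for CountingCallback {
    fn record_shuffle_read_local(&self, bytes: u64) {
        self.local_bytes.fetch_add(bytes, Ordering::Relaxed);
    }
    fn record_shuffle_read_remote(&self, bytes: u64) {
        self.remote_bytes.fetch_add(bytes, Ordering::Relaxed);
    }
}

/// Hypothetical fetch hook: a read counts as local when the partition lives
/// on the same host as the reader, and as remote otherwise.
pub fn record_partition_fetch(
    callback: &dyn ShuffleReadMetricsCallback,
    partition_host: &str,
    reader_host: &str,
    bytes: u64,
) {
    if partition_host == reader_host {
        callback.record_shuffle_read_local(bytes);
    } else {
        callback.record_shuffle_read_remote(bytes);
    }
}
```

Atomic counters let the callback be shared across concurrent partition fetches without a lock, which is one plausible reason to pass it around as a trait object.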

3. Add metrics collector to SchedulerState and instrument executor and planning metrics

  • Add metrics_collector field to SchedulerState struct
  • Instrument record_planning_duration in submit_job
  • Instrument record_executor_registered/deregistered and set_active_executor_count
  • Update all SchedulerState constructors and call sites
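
As a sketch of the pattern these bullets describe (a collector held by the scheduler state and called around planning), the following uses simplified, hypothetical types. The real `SchedulerState` has more fields and generic parameters, and the actual `record_planning_duration` signature is not shown in this PR description.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Assumed subset of the scheduler collector trait.
pub trait SchedulerMetricsCollector: Send + Sync {
    fn record_planning_duration(&self, job_id: &str, duration: Duration);
}

/// Hypothetical collector that remembers what was recorded.
#[derive(Default)]
pub struct InMemoryCollector {
    planned: Mutex<Vec<(String, Duration)>>,
}

impl InMemoryCollector {
    pub fn planned_jobs(&self) -> Vec<String> {
        self.planned
            .lock()
            .unwrap()
            .iter()
            .map(|(id, _)| id.clone())
            .collect()
    }
}

impl SchedulerMetricsCollector for InMemoryCollector {
    fn record_planning_duration(&self, job_id: &str, duration: Duration) {
        self.planned.lock().unwrap().push((job_id.to_string(), duration));
    }
}

/// Simplified scheduler state that only carries the collector.
pub struct SchedulerState {
    metrics_collector: Arc<dyn SchedulerMetricsCollector>,
}

impl SchedulerState {
    pub fn new(metrics_collector: Arc<dyn SchedulerMetricsCollector>) -> Self {
        Self { metrics_collector }
    }

    /// Runs the planning closure and records how long it took.
    pub fn submit_job<T>(&self, job_id: &str, plan_job: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let planned = plan_job();
        self.metrics_collector
            .record_planning_duration(job_id, start.elapsed());
        planned
    }
}
```

Holding the collector as `Arc<dyn SchedulerMetricsCollector>` means every constructor must accept one, which matches the note above about updating all constructors and call sites.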

4. Add stage and task lifecycle metrics instrumentation to update_task_status flow

  • Instrument stage and task lifecycle metrics in the task status update flow

5. Add shuffle affinity metrics to scheduler task binding

  • Track shuffle affinity hits/misses when binding tasks to executors
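
One way to read "affinity hit/miss" is sketched below. The definition used here is an assumption, not taken from the PR: a hit is counted when the task is bound to an executor that already holds some of its shuffle input, and tasks with no preferred executors are not counted at all. `AffinityCounters` and `record_task_binding` are hypothetical names.

```rust
/// Hypothetical counters standing in for the
/// `record_task_shuffle_affinity_hit/miss` collector methods.
#[derive(Debug, Default, PartialEq)]
pub struct AffinityCounters {
    pub hits: u64,
    pub misses: u64,
}

/// Records the outcome of binding one task to `bound_executor`.
/// `preferred_executors` lists the executors that hold the task's shuffle
/// input (assumed to be known at binding time).
pub fn record_task_binding(
    counters: &mut AffinityCounters,
    preferred_executors: &[&str],
    bound_executor: &str,
) {
    // A task without shuffle input has no locality preference, so it is
    // neither a hit nor a miss in this sketch.
    if preferred_executors.is_empty() {
        return;
    }
    if preferred_executors.iter().any(|e| *e == bound_executor) {
        counters.hits += 1;
    } else {
        counters.misses += 1;
    }
}
```

With counters like these, the hit ratio `hits / (hits + misses)` gives a quick view of how often task binding preserves shuffle locality.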

6. Add actual task scheduling latency tracking

  • Add schedulable_time_millis field to TaskDescription to track when a task became schedulable (when its stage transitioned to running state)
  • Update all TaskDescription creation sites to pass RunningStage.stage_running_time
  • Calculate actual scheduling latency by computing the difference between current time and schedulable_time_millis
  • Enables accurate scheduler_task_scheduling_latency_ms metrics instead of the previous placeholder value of 0
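
The latency calculation described above is a simple subtraction of timestamps. A minimal sketch, assuming `schedulable_time_millis` is a `u64` of milliseconds since the Unix epoch (the actual field type is not stated here) and using a trimmed-down, hypothetical `TaskDescription`:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Trimmed-down stand-in for `TaskDescription`; only the new field is shown.
pub struct TaskDescription {
    /// When the task's stage transitioned to the running state (epoch ms).
    pub schedulable_time_millis: u64,
}

/// Current wall-clock time in milliseconds since the Unix epoch.
pub fn now_millis() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Scheduling latency: time between the task becoming schedulable and the
/// moment it is actually scheduled. Saturating subtraction guards against
/// clock skew producing a negative value.
pub fn scheduling_latency_ms(task: &TaskDescription, now_millis: u64) -> u64 {
    now_millis.saturating_sub(task.schedulable_time_millis)
}
```

Passing the current time in as a parameter keeps the calculation deterministic in tests; a call site would use something like `scheduling_latency_ms(&task, now_millis())`.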

Are there any user-facing changes?

Yes, this PR introduces new metrics that can be collected via the ExecutorMetricsCollector and SchedulerMetricsCollector traits:

Executor Metrics:

  • Shuffle read bytes, rows, and duration
  • Local vs remote shuffle reads

Scheduler Metrics:

  • Planning duration
  • Executor registration/deregistration events
  • Active executor count
  • Stage and task lifecycle events
  • Shuffle affinity hits/misses
  • Task scheduling latency

These metrics are exposed through the existing Prometheus metrics endpoint when using the PrometheusMetricsCollector implementation.

@phillipleblanc changed the title from "Spiceai 51 metrics" to "Add comprehensive metrics instrumentation for scheduler and executor" on Jan 22, 2026
@phillipleblanc self-assigned this on Jan 22, 2026
@phillipleblanc marked this pull request as ready for review on January 22, 2026 23:05
@phillipleblanc added the enhancement (New feature or request) label on Jan 22, 2026
@lukekim merged commit b04653b into spiceai-51 on Jan 23, 2026
31 checks passed
@lukekim deleted the spiceai-51-metrics branch on January 23, 2026 01:39
lukekim pushed a commit that referenced this pull request Jan 23, 2026
…10)

* Add shuffle read metrics extraction and QueryStageExecutor::plan() method

- Add public getter methods to PartitionStats (num_rows, num_batches, num_bytes)
- Extend QueryStageExecutor trait with plan() method to access underlying ExecutionPlan
- Add extract_shuffle_read_metrics() to walk plan tree and sum ShuffleReaderExec partition stats
- Record shuffle read metrics (bytes, rows, duration) after successful task execution in executor

* Add shuffle locality metrics to ExecutorMetricsCollector, SchedulerMetricsCollector, and ShuffleReaderExec

- Add record_shuffle_read_local/remote methods to ExecutorMetricsCollector trait
- Add record_task_shuffle_affinity_hit/miss methods to SchedulerMetricsCollector trait
- Add ShuffleReadMetricsCallback trait in ballista-core for tracking local vs remote reads
- Instrument shuffle_reader.rs to call metrics callback during partition fetches
- Add SessionConfigExt methods to pass metrics callback via session config

* Add metrics collector to SchedulerState and instrument executor and planning metrics

- Add metrics_collector field to SchedulerState struct
- Instrument record_planning_duration in submit_job
- Instrument record_executor_registered/deregistered and set_active_executor_count
- Update all SchedulerState constructors and call sites

* Add stage and task lifecycle metrics instrumentation to update_task_status flow

* Add shuffle affinity metrics to scheduler task binding

* Add actual task scheduling latency tracking

- Add schedulable_time_millis field to TaskDescription to track when a task became schedulable (when its stage transitioned to running state)
- Update all TaskDescription creation sites to pass RunningStage.stage_running_time
- Calculate actual scheduling latency in record_task_scheduled calls by computing the difference between current time and schedulable_time_millis
- This enables accurate scheduler_task_scheduling_latency_ms metrics instead of the previous placeholder value of 0

* fix lint
lukekim added a commit that referenced this pull request Jan 23, 2026
* feat: Store shuffles in object store (S3, Azure)

* Add comprehensive metrics instrumentation for scheduler and executor (#10)

* feat: add Vortex columnar format support for shuffle operations (#7)

* feat: add Vortex columnar format support for shuffle operations

- Introduced Vortex dependencies in Cargo.toml for columnar format handling.
- Updated Ballista configuration to support shuffle format selection between Arrow IPC and Vortex.
- Implemented Vortex shuffle reader and writer in execution plans.
- Enhanced shuffle operations to detect and handle Vortex files.
- Added utility functions for writing streams to disk in both Arrow IPC and Vortex formats.
- Created a new module for Vortex shuffle operations, including reading and writing logic.
- Added tests for Vortex write and read roundtrip functionality.

* Fix Clippy and lint

* Fix reading of Vortex files

* Fix lint

* Don't expose final stage

* Remove build-binary

---------

Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>