feat: Store shuffles in object store (S3, Azure) #9
Merged
…10)

* Add shuffle read metrics extraction and QueryStageExecutor::plan() method
  - Add public getter methods to PartitionStats (num_rows, num_batches, num_bytes)
  - Extend QueryStageExecutor trait with plan() method to access underlying ExecutionPlan
  - Add extract_shuffle_read_metrics() to walk plan tree and sum ShuffleReaderExec partition stats
  - Record shuffle read metrics (bytes, rows, duration) after successful task execution in executor
* Add shuffle locality metrics to ExecutorMetricsCollector, SchedulerMetricsCollector, and ShuffleReaderExec
  - Add record_shuffle_read_local/remote methods to ExecutorMetricsCollector trait
  - Add record_task_shuffle_affinity_hit/miss methods to SchedulerMetricsCollector trait
  - Add ShuffleReadMetricsCallback trait in ballista-core for tracking local vs remote reads
  - Instrument shuffle_reader.rs to call metrics callback during partition fetches
  - Add SessionConfigExt methods to pass metrics callback via session config
* Add metrics collector to SchedulerState and instrument executor and planning metrics
  - Add metrics_collector field to SchedulerState struct
  - Instrument record_planning_duration in submit_job
  - Instrument record_executor_registered/deregistered and set_active_executor_count
  - Update all SchedulerState constructors and call sites
* Add stage and task lifecycle metrics instrumentation to update_task_status flow
* Add shuffle affinity metrics to scheduler task binding
* Add actual task scheduling latency tracking
  - Add schedulable_time_millis field to TaskDescription to track when a task became schedulable (when its stage transitioned to running state)
  - Update all TaskDescription creation sites to pass RunningStage.stage_running_time
  - Calculate actual scheduling latency in record_task_scheduled calls by computing the difference between current time and schedulable_time_millis
  - This enables accurate scheduler_task_scheduling_latency_ms metrics instead of the previous placeholder value of 0
* fix lint
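
The plan-tree walk described for extract_shuffle_read_metrics() can be pictured with the minimal sketch below. It is not the code from this PR: PlanNode is a hypothetical stand-in for DataFusion's ExecutionPlan tree, ShuffleReadTotals is an invented result type, and PartitionStats here is a simplified local struct that only borrows the getter names listed in the commit message. Missing statistics are assumed to count as zero.

```rust
/// Simplified stand-in for Ballista's `PartitionStats`. Only the getter names
/// come from the commit message; the layout is an assumption.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct PartitionStats {
    num_rows: Option<u64>,
    num_batches: Option<u64>,
    num_bytes: Option<u64>,
}

impl PartitionStats {
    pub fn new(num_rows: Option<u64>, num_batches: Option<u64>, num_bytes: Option<u64>) -> Self {
        Self { num_rows, num_batches, num_bytes }
    }
    pub fn num_rows(&self) -> Option<u64> { self.num_rows }
    pub fn num_batches(&self) -> Option<u64> { self.num_batches }
    pub fn num_bytes(&self) -> Option<u64> { self.num_bytes }
}

/// Hypothetical plan tree: a node is either a shuffle reader holding
/// per-partition stats, or any other operator with children.
pub enum PlanNode {
    ShuffleReader { partitions: Vec<PartitionStats> },
    Other { children: Vec<PlanNode> },
}

/// Hypothetical result type holding the summed shuffle read statistics.
#[derive(Debug, Default, PartialEq)]
pub struct ShuffleReadTotals {
    pub rows: u64,
    pub batches: u64,
    pub bytes: u64,
}

/// Walk the whole plan and sum the stats of every shuffle reader found.
pub fn extract_shuffle_read_metrics(plan: &PlanNode) -> ShuffleReadTotals {
    let mut totals = ShuffleReadTotals::default();
    accumulate(plan, &mut totals);
    totals
}

fn accumulate(node: &PlanNode, totals: &mut ShuffleReadTotals) {
    match node {
        PlanNode::ShuffleReader { partitions } => {
            for p in partitions {
                // Assumption: an unknown statistic contributes zero.
                totals.rows += p.num_rows().unwrap_or(0);
                totals.batches += p.num_batches().unwrap_or(0);
                totals.bytes += p.num_bytes().unwrap_or(0);
            }
        }
        PlanNode::Other { children } => {
            // Shuffle readers can sit anywhere below, so recurse into every child.
            for child in children {
                accumulate(child, totals);
            }
        }
    }
}
```

Summing after task execution, rather than inside the reader, keeps the reader itself free of metrics plumbing; whether the PR makes the same trade-off is not stated in the commit message.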
* feat: add Vortex columnar format support for shuffle operations
  - Introduced Vortex dependencies in Cargo.toml for columnar format handling.
  - Updated Ballista configuration to support shuffle format selection between Arrow IPC and Vortex.
  - Implemented Vortex shuffle reader and writer in execution plans.
  - Enhanced shuffle operations to detect and handle Vortex files.
  - Added utility functions for writing streams to disk in both Arrow IPC and Vortex formats.
  - Created a new module for Vortex shuffle operations, including reading and writing logic.
  - Added tests for Vortex write and read roundtrip functionality.
* Fix Clippy and lint
* Fix reading of Vortex files
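
The commit message does not say how Vortex shuffle files are detected. One plausible approach, sketched below under stated assumptions, is to sniff the first bytes of a shuffle file and fall back to Arrow IPC when no Vortex marker is found. The ShuffleFormat enum, the detect_shuffle_format function, and the VORTEX_MAGIC value are all hypothetical names chosen for illustration; in particular, the 4-byte marker is an assumption about the Vortex file layout, and the PR may just as well detect files by extension or by configuration.

```rust
use std::io::{self, Read};

/// Hypothetical enum mirroring the two shuffle formats named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShuffleFormat {
    ArrowIpc,
    Vortex,
}

// ASSUMPTION: Vortex files begin with this 4-byte marker. Verify against the
// Vortex file format documentation before relying on it.
const VORTEX_MAGIC: &[u8; 4] = b"VTXF";

/// Read up to four leading bytes and classify the shuffle file.
/// Anything that does not start with the assumed Vortex marker, including
/// inputs shorter than four bytes, is treated as Arrow IPC.
pub fn detect_shuffle_format<R: Read>(mut reader: R) -> io::Result<ShuffleFormat> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until the
    // header is full or the input ends.
    while filled < header.len() {
        let n = reader.read(&mut header[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    if filled == header.len() && &header == VORTEX_MAGIC {
        Ok(ShuffleFormat::Vortex)
    } else {
        Ok(ShuffleFormat::ArrowIpc)
    }
}
```

Because the function accepts any Read, it works for a local std::fs::File as well as for a byte slice holding the first bytes of an object fetched from an object store. Defaulting to Arrow IPC keeps shuffle files written before the format option existed readable.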
phillipleblanc
approved these changes
Jan 23, 2026
No description provided.