feat: add Vortex columnar format support for shuffle operations #7

Merged: lukekim merged 4 commits into spiceai-51 from lukim/vortex on Jan 23, 2026

lukekim commented Jan 22, 2026

This pull request adds a configurable shuffle data format to Ballista: shuffle files can be written and read in either the default Arrow IPC format or the new Vortex columnar format. The Vortex format is available only when the vortex feature is enabled. The changes cover configuration management, dependency lists, shuffle writer and reader logic, and feature gating for Vortex support.

Configurable Shuffle Format Support

  • Added the BALLISTA_SHUFFLE_FORMAT configuration key and the ShuffleFormat enum so users can select between the arrow_ipc and vortex formats for shuffle data. This includes parsing logic and documentation in ballista/core/src/config.rs.
  • Updated the shuffle writer to write partitioned data in the configured format, abstracting over the Arrow IPC and Vortex writers with the new ShuffleFileWriter enum. File extensions and format selection are now dynamic.
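
As a rough illustration of the format selection described above, the following is a minimal sketch and not the code from this PR. Only the ShuffleFormat name, the arrow_ipc and vortex values, and Arrow IPC being the default come from the description; the variant names, the extension strings, and the `shuffle_file_name` helper are assumptions made for the example.

```rust
use std::str::FromStr;

/// Shuffle data format (sketch). Arrow IPC is the default, as stated in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ShuffleFormat {
    #[default]
    ArrowIpc,
    Vortex,
}

impl FromStr for ShuffleFormat {
    type Err = String;

    /// Parse the configuration value. Exact, case-sensitive matching is an
    /// assumption of this sketch.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "arrow_ipc" => Ok(ShuffleFormat::ArrowIpc),
            "vortex" => Ok(ShuffleFormat::Vortex),
            other => Err(format!(
                "unknown shuffle format '{other}', expected 'arrow_ipc' or 'vortex'"
            )),
        }
    }
}

impl ShuffleFormat {
    /// File extension used for shuffle files (hypothetical values).
    pub fn file_extension(self) -> &'static str {
        match self {
            ShuffleFormat::ArrowIpc => "arrow",
            ShuffleFormat::Vortex => "vortex",
        }
    }
}

/// Hypothetical helper: build a shuffle file name for an output partition.
pub fn shuffle_file_name(partition: usize, format: ShuffleFormat) -> String {
    format!("data-{partition}.{}", format.file_extension())
}
```

With a type like this, a writer can pick the file name and the concrete writer variant from one configuration value, for example `shuffle_file_name(3, "vortex".parse()?)`.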

Vortex Format Integration (Feature-Gated)

  • Added Vortex dependencies to Cargo.toml and gated related code with the vortex feature. This includes workspace and optional dependency management in ballista/core/Cargo.toml and the root Cargo.toml.
  • Implemented Vortex shuffle reader and writer modules, and updated module exports to conditionally include Vortex support.
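
The feature gating above can be pictured with Rust's conditional compilation attributes. This is a minimal sketch under assumptions: the function name `vortex_support` and the messages are hypothetical, and only the feature name vortex is taken from the description.

```rust
/// Compiled only when the crate is built with `--features vortex`.
/// In a real build this is where the Vortex reader/writer modules would live.
#[cfg(feature = "vortex")]
pub fn vortex_support() -> Result<&'static str, String> {
    Ok("Vortex shuffle reader and writer are compiled in")
}

/// Fallback compiled when the feature is disabled: callers get a clear
/// error instead of a missing symbol.
#[cfg(not(feature = "vortex"))]
pub fn vortex_support() -> Result<&'static str, String> {
    Err("Vortex shuffle format requires building with the `vortex` feature".to_string())
}
```

Exactly one of the two definitions exists in any given build, so code that calls `vortex_support()` compiles either way and only the result differs.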

Shuffle Reader Enhancements

  • Modified the shuffle reader to detect the file format by extension and select the appropriate reader (Arrow IPC or Vortex). Added error handling for unsupported formats and feature gating.
  • Updated tests and internal function names to reflect the separation of Arrow IPC and Vortex file reading logic.
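
Extension-based detection of this kind might look like the sketch below. It is illustrative only: the `DetectedFormat` type, the `detect_shuffle_format` function, and the `.arrow` and `.vortex` extension strings are assumptions, not names confirmed by this PR.

```rust
use std::path::Path;

/// Format inferred from a shuffle file's extension (sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedFormat {
    ArrowIpc,
    Vortex,
}

/// Pick a reader by looking at the file extension. Unknown or missing
/// extensions are reported as errors rather than guessed.
pub fn detect_shuffle_format(path: &Path) -> Result<DetectedFormat, String> {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("arrow") => Ok(DetectedFormat::ArrowIpc),
        Some("vortex") => Ok(DetectedFormat::Vortex),
        Some(other) => Err(format!("unsupported shuffle file extension: .{other}")),
        None => Err(format!("shuffle file has no extension: {}", path.display())),
    }
}
```

A reader built without Vortex support would additionally turn the `DetectedFormat::Vortex` case into an error that asks for the vortex feature.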

Documentation and Miscellaneous

  • Updated documentation/comments in shuffle writer to clarify format options and feature requirements.
  • Minor refactoring and cleanup in distributed query execution and config imports.

- Introduced Vortex dependencies in Cargo.toml for columnar format handling.
- Updated Ballista configuration to support shuffle format selection between Arrow IPC and Vortex.
- Implemented Vortex shuffle reader and writer in execution plans.
- Enhanced shuffle operations to detect and handle Vortex files.
- Added utility functions for writing streams to disk in both Arrow IPC and Vortex formats.
- Created a new module for Vortex shuffle operations, including reading and writing logic.
- Added tests for Vortex write and read roundtrip functionality.

github-actions (bot) removed the python label on Jan 22, 2026
lukekim changed the title from "Lukim/vortex" to "feat: add Vortex columnar format support for shuffle operations" on Jan 22, 2026
lukekim self-assigned this on Jan 22, 2026
lukekim added the enhancement (New feature or request) label on Jan 22, 2026
lukekim merged commit 42e655c into spiceai-51 on Jan 23, 2026
30 checks passed
lukekim deleted the lukim/vortex branch on January 23, 2026 02:05
lukekim added a commit that referenced this pull request Jan 23, 2026
* feat: add Vortex columnar format support for shuffle operations

- Introduced Vortex dependencies in Cargo.toml for columnar format handling.
- Updated Ballista configuration to support shuffle format selection between Arrow IPC and Vortex.
- Implemented Vortex shuffle reader and writer in execution plans.
- Enhanced shuffle operations to detect and handle Vortex files.
- Added utility functions for writing streams to disk in both Arrow IPC and Vortex formats.
- Created a new module for Vortex shuffle operations, including reading and writing logic.
- Added tests for Vortex write and read roundtrip functionality.

* Fix Clippy and lint

* Fix reading of Vortex files
lukekim added a commit that referenced this pull request Jan 23, 2026
* feat: Store shuffles in object store (S3, Azure)

* Add comprehensive metrics instrumentation for scheduler and executor (#10)

* Add shuffle read metrics extraction and QueryStageExecutor::plan() method

- Add public getter methods to PartitionStats (num_rows, num_batches, num_bytes)
- Extend QueryStageExecutor trait with plan() method to access underlying ExecutionPlan
- Add extract_shuffle_read_metrics() to walk plan tree and sum ShuffleReaderExec partition stats
- Record shuffle read metrics (bytes, rows, duration) after successful task execution in executor
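
A plan-tree walk of the kind described for extract_shuffle_read_metrics() could be sketched as follows. The `PlanNode` and `ShuffleReadTotals` types are simplified stand-ins invented for this example; the actual code operates on DataFusion ExecutionPlan nodes and Ballista's PartitionStats, whose exact shapes are not shown in this PR.

```rust
/// Simplified stand-in for per-partition stats (rows, batches, bytes).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ShuffleReadTotals {
    pub rows: u64,
    pub batches: u64,
    pub bytes: u64,
}

/// Toy plan tree: shuffle reader leaves carry per-partition stats,
/// every other operator only has children.
pub enum PlanNode {
    ShuffleReader { partitions: Vec<ShuffleReadTotals> },
    Other { children: Vec<PlanNode> },
}

/// Recursively walk the tree and sum stats over all shuffle reader nodes.
pub fn extract_shuffle_read_totals(node: &PlanNode) -> ShuffleReadTotals {
    let mut total = ShuffleReadTotals::default();
    let mut add = |s: &ShuffleReadTotals| {
        total.rows += s.rows;
        total.batches += s.batches;
        total.bytes += s.bytes;
    };
    match node {
        PlanNode::ShuffleReader { partitions } => partitions.iter().for_each(&mut add),
        PlanNode::Other { children } => {
            for child in children {
                add(&extract_shuffle_read_totals(child));
            }
        }
    }
    total
}
```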

* Add shuffle locality metrics to ExecutorMetricsCollector, SchedulerMetricsCollector, and ShuffleReaderExec

- Add record_shuffle_read_local/remote methods to ExecutorMetricsCollector trait
- Add record_task_shuffle_affinity_hit/miss methods to SchedulerMetricsCollector trait
- Add ShuffleReadMetricsCallback trait in ballista-core for tracking local vs remote reads
- Instrument shuffle_reader.rs to call metrics callback during partition fetches
- Add SessionConfigExt methods to pass metrics callback via session config

* Add metrics collector to SchedulerState and instrument executor and planning metrics

- Add metrics_collector field to SchedulerState struct
- Instrument record_planning_duration in submit_job
- Instrument record_executor_registered/deregistered and set_active_executor_count
- Update all SchedulerState constructors and call sites

* Add stage and task lifecycle metrics instrumentation to update_task_status flow

* Add shuffle affinity metrics to scheduler task binding

* Add actual task scheduling latency tracking

- Add schedulable_time_millis field to TaskDescription to track when a task became schedulable (when its stage transitioned to running state)
- Update all TaskDescription creation sites to pass RunningStage.stage_running_time
- Calculate actual scheduling latency in record_task_scheduled calls by computing the difference between current time and schedulable_time_millis
- This enables accurate scheduler_task_scheduling_latency_ms metrics instead of the previous placeholder value of 0
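
The latency calculation described above amounts to a subtraction of two millisecond timestamps. A minimal sketch, assuming both values are Unix-epoch milliseconds stored as u64; the function names here are hypothetical, and only schedulable_time_millis is a name from the commit message.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current wall-clock time in milliseconds since the Unix epoch.
pub fn current_time_millis() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0)
}

/// Scheduling latency: time between the task becoming schedulable and the
/// moment it is scheduled. Saturating subtraction avoids underflow if the
/// clocks disagree and the schedulable time appears to be in the future.
pub fn scheduling_latency_ms(now_millis: u64, schedulable_time_millis: u64) -> u64 {
    now_millis.saturating_sub(schedulable_time_millis)
}
```

A scheduler would call `scheduling_latency_ms(current_time_millis(), task.schedulable_time_millis)` at the point where it records the task as scheduled.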

* fix lint

* feat: add Vortex columnar format support for shuffle operations (#7)

* feat: add Vortex columnar format support for shuffle operations

- Introduced Vortex dependencies in Cargo.toml for columnar format handling.
- Updated Ballista configuration to support shuffle format selection between Arrow IPC and Vortex.
- Implemented Vortex shuffle reader and writer in execution plans.
- Enhanced shuffle operations to detect and handle Vortex files.
- Added utility functions for writing streams to disk in both Arrow IPC and Vortex formats.
- Created a new module for Vortex shuffle operations, including reading and writing logic.
- Added tests for Vortex write and read roundtrip functionality.

* Fix Clippy and lint

* Fix reading of Vortex files

* Fix lint

* Don't expose final stage

* Remove build-binary

---------

Co-authored-by: Phillip LeBlanc <phillip@leblanc.tech>

Labels

enhancement New feature or request

2 participants