-
Notifications
You must be signed in to change notification settings - Fork 9
refactor: refactor file select #88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the file selection system by replacing the previous static strategy pattern with a new grouping-based approach. Instead of filtering files individually, the system now groups files using configurable strategies and handles each group as a separate compaction unit.
- Replaces
UnifiedStrategy
andStaticFileStrategy
withCompactionStrategy
andFileGroup
- Introduces new grouping configuration options (min/max group size, file counts, grouping strategy)
- Updates the compaction planner to return multiple plans (one per file group) instead of a single plan
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
core/src/file_selection/mod.rs | Removes file filtering logic and delegates to strategy execution |
core/src/executor/mod.rs | Replaces InputFileScanTasks with FileGroup and removes runtime config |
core/src/executor/datafusion/mod.rs | Updates to use FileGroup and extract parallelism from the group |
core/src/executor/datafusion/datafusion_processor.rs | Removes runtime config dependency and uses parallelism parameters |
core/src/config/mod.rs | Adds new grouping strategy configuration and removes RuntimeConfig |
core/src/compaction/validator.rs | Updates to use FileGroup and parallelism parameters |
core/src/compaction/mod.rs | Major refactor to handle multiple plans and merge results |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
/// No grouping - files are processed individually | ||
Noop, | ||
/// Binpack grouping - files are grouped to optimize size distribution |
Copilot
AI
Sep 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for Noop
and BinPack
variants is unclear about their specific behavior and use cases. Consider adding more detailed explanations about when each strategy should be used and what 'optimize size distribution' means for BinPack.
/// No grouping - files are processed individually | |
Noop, | |
/// Binpack grouping - files are grouped to optimize size distribution | |
/// No grouping strategy. | |
/// | |
/// Files are processed individually without any grouping. This strategy is suitable | |
/// when you want to compact each file separately, for example, when files are already | |
/// close to the target size or when grouping is not beneficial due to data distribution. | |
Noop, | |
/// BinPack grouping strategy. | |
/// | |
/// Files are grouped together using a bin-packing algorithm to optimize the distribution | |
/// of file sizes within each group. The goal is to combine smaller files into groups that | |
/// approach the target file size, reducing the number of small files and improving storage | |
/// efficiency. This strategy is recommended when you have many small files and want to | |
/// minimize the number of output files while maximizing their size. |
Copilot uses AI. Check for mistakes.
// Extract parallelism before file_group is moved | ||
let executor_parallelism = file_group.executor_parallelism; | ||
let output_parallelism = file_group.output_parallelism; | ||
|
||
let datafusion_task_ctx = DataFusionTaskContext::builder()? | ||
.with_schema(schema.clone()) | ||
.with_input_data_files(input_file_scan_tasks) | ||
.with_input_data_files(file_group) |
Copilot
AI
Sep 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment explains why parallelism values are extracted before moving file_group
, but this pattern suggests a potential design issue. Consider restructuring so that parallelism values don't need to be extracted separately, or use Clone
if FileGroup
is meant to be copied frequently.
Copilot uses AI. Check for mistakes.
Buckets::exponential( | ||
50.0, 4.0, 10, // Start at 50ms, multiply each bucket by 4, up to 10 buckets | ||
), |
Copilot
AI
Sep 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exponential bucket configuration uses magic numbers (50.0, 4.0, 10) that should be defined as named constants. This makes the metrics configuration more maintainable and self-documenting.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM
This PR changes the design of core plan files. By introducing the concept of grouping, it enables selecting multiple compaction-plans via the planner—each plan represents a batch of files. Callers can choose for themselves how to handle the plans and ultimately submit them.
Additionally, the compaction structure also provides upper-layer interfaces to execute all tasks in parallel and submit them all at once.
Introduce new GroupingStrategy and GroupFilterStrategy to enhance the current capabilities of the planner, and refactor the relevant code for file-selector and compaction mod to support subsequent feature development.