Skip to content

Conversation

Li0k
Copy link
Collaborator

@Li0k Li0k commented Sep 18, 2025

This PR changes the design of core plan files. By introducing the concept of grouping, it enables selecting multiple compaction-plans via the planner—each plan represents a batch of files. Callers can choose for themselves how to handle the plans and ultimately submit them.
Additionally, the compaction structure also provides upper-layer interfaces to execute all tasks in parallel and submit them all at once.

#[derive(Debug)]
pub struct CompactionStrategy {
    file_filters: Vec<Box<dyn FileFilterStrategy>>,
    grouping: GroupingStrategyEnum,
    group_filters: Vec<Box<dyn GroupFilterStrategy>>,
}

Introduce new GroupingStrategy and GroupFilterStrategy to enhance the current capabilities of the planner, and refactor the relevant code for file-selector and compaction mod to support subsequent feature development.

@Li0k Li0k requested a review from Copilot September 22, 2025 10:19
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the file selection system by replacing the previous static strategy pattern with a new grouping-based approach. Instead of filtering files individually, the system now groups files using configurable strategies and handles each group as a separate compaction unit.

  • Replaces UnifiedStrategy and StaticFileStrategy with CompactionStrategy and FileGroup
  • Introduces new grouping configuration options (min/max group size, file counts, grouping strategy)
  • Updates the compaction planner to return multiple plans (one per file group) instead of a single plan

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
core/src/file_selection/mod.rs Removes file filtering logic and delegates to strategy execution
core/src/executor/mod.rs Replaces InputFileScanTasks with FileGroup and removes runtime config
core/src/executor/datafusion/mod.rs Updates to use FileGroup and extract parallelism from the group
core/src/executor/datafusion/datafusion_processor.rs Removes runtime config dependency and uses parallelism parameters
core/src/config/mod.rs Adds new grouping strategy configuration and removes RuntimeConfig
core/src/compaction/validator.rs Updates to use FileGroup and parallelism parameters
core/src/compaction/mod.rs Major refactor to handle multiple plans and merge results

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@Li0k Li0k requested a review from Copilot September 23, 2025 08:55
@Li0k Li0k marked this pull request as ready for review September 23, 2025 08:55
@Li0k Li0k requested review from chenzl25 and xxhZs September 23, 2025 08:55
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +47 to +49
/// No grouping - files are processed individually
Noop,
/// Binpack grouping - files are grouped to optimize size distribution
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for Noop and BinPack variants is unclear about their specific behavior and use cases. Consider adding more detailed explanations about when each strategy should be used and what 'optimize size distribution' means for BinPack.

Suggested change
/// No grouping - files are processed individually
Noop,
/// Binpack grouping - files are grouped to optimize size distribution
/// No grouping strategy.
///
/// Files are processed individually without any grouping. This strategy is suitable
/// when you want to compact each file separately, for example, when files are already
/// close to the target size or when grouping is not beneficial due to data distribution.
Noop,
/// BinPack grouping strategy.
///
/// Files are grouped together using a bin-packing algorithm to optimize the distribution
/// of file sizes within each group. The goal is to combine smaller files into groups that
/// approach the target file size, reducing the number of small files and improving storage
/// efficiency. This strategy is recommended when you have many small files and want to
/// minimize the number of output files while maximizing their size.

Copilot uses AI. Check for mistakes.

Comment on lines +70 to +76
// Extract parallelism before file_group is moved
let executor_parallelism = file_group.executor_parallelism;
let output_parallelism = file_group.output_parallelism;

let datafusion_task_ctx = DataFusionTaskContext::builder()?
.with_schema(schema.clone())
.with_input_data_files(input_file_scan_tasks)
.with_input_data_files(file_group)
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explains why parallelism values are extracted before moving file_group, but this pattern suggests a potential design issue. Consider restructuring so that parallelism values don't need to be extracted separately, or use Clone if FileGroup is meant to be copied frequently.

Copilot uses AI. Check for mistakes.

Comment on lines 96 to 98
Buckets::exponential(
50.0, 4.0, 10, // Start at 50ms, multiply each bucket by 4, up to 10 buckets
),
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exponential bucket configuration uses magic numbers (50.0, 4.0, 10) that should be defined as named constants. This makes the metrics configuration more maintainable and self-documenting.

Copilot uses AI. Check for mistakes.

Copy link
Collaborator

@chenzl25 chenzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants