[Data] refactor checkpointfilter with anti-join

### Description

Currently, each ReadTask initializes a checkpointer to filter rows whose primary key already exists in `checkpoint_id` (similar to a broadcast join). This is a typical anti-join pattern. I think that we refactor the checkpoint filter using an anti-join approach.

By adopting anti-join, the checkpoint filter may benefit from optimizer rules:
1.The optimizer can choose between broadcast join or distributed hash join based on dataset statistics, potentially resolving efficiency or memory issues (such as #60294).
2.Once the runtimefilter/dynamic optimizer is supported, the checkpoint filter may be pushed down to the source side.

However, currently the hash join is based on `ShuffleAggregation`, which remains a pipeline broker. This architecture blocks the streaming execution pipeline, as all data must be completely shuffled and materialized before downstream operators can begin processing, thereby preventing true pipelined execution.

ptal @wxwmd @owenowenisme 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] refactor checkpointfilter with anti-join #60764

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Data] refactor checkpointfilter with anti-join #60764

Description

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions