Skip to content

feat: add RDF bloom payload model#8354

Draft
discord9 wants to merge 2 commits into
GreptimeTeam:mainfrom
discord9:omos/rdf-bloom-common-query
Draft

feat: add RDF bloom payload model#8354
discord9 wants to merge 2 commits into
GreptimeTeam:mainfrom
discord9:omos/rdf-bloom-common-query

Conversation

@discord9

@discord9 discord9 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

This is PR 1a of the RDF Bloom stack. Follow-up fork-only stack PRs:

Depends on greptime-proto payload PR: GreptimeTeam/greptime-proto#325

Part of original umbrella draft PR: #8342

What's changed and what's your intention?

This PR introduces only the typed RDF Bloom payload model and serialization plumbing:

  • adds DynFilterPayload::JoinHashBloom as a typed remote dynamic filter payload variant
  • adds slim JoinHashBloomPayload model/proto conversion/validation
  • updates initial remote dynamic filter snapshot serde and payload-size accounting
  • updates greptime-proto dependency to PR feat: timestamp column support i64 #325
  • keeps the proto surface slim: no JoinHashKind, no BloomHashAlgorithm, and no serialized distinct_hash_count; df_seed0..3 remain for receiver-side DataFusion hash reconstruction

Bloom probing and HashTableLookupExpr encoding are intentionally split into follow-up PR1b/PR1c to keep this review bounded.

Validation run locally:

  • cargo fmt --all -- --check
  • git diff --check
  • CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p common-query request --lib

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions Bot added size/XXL docs-not-required This change does not impact docs. labels Jun 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions the project's DataFusion dependency to a new fork and introduces a join-hash Bloom filter payload model (JoinHashBloomPayload) along with its corresponding probe expression (JoinHashBloomProbeExpr) to optimize remote dynamic filter updates. It also adds robust encoding, decoding, and fallback mechanisms, supported by extensive unit tests. The code review feedback suggests restoring the deleted documentation comments for DynFilterUpdate and QueryRequest to preserve code clarity, and simplifying the ceiling division math in num_bits_ceil_bytes using a more standard integer arithmetic pattern.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +762 to 771
/// A remote dynamic filter update sent from a query coordinator to region servers.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
pub struct DynFilterUpdate {
/// Protocol version used by this update payload.
pub protocol_version: u32,
/// Internal query identifier that owns this dynamic filter lifecycle.
pub query_id: String,
/// Identifier of the dynamic filter within the query.
pub filter_id: String,
/// Monotonic update generation for this filter.
pub generation: u64,
/// Whether this update completes the dynamic filter stream.
pub is_complete: bool,
/// Serialized predicate payload carried by this update.
pub payload: DynFilterPayload,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The detailed documentation for DynFilterUpdate and its fields has been removed. This reduces code clarity and makes it harder for future developers to understand the purpose of each field. Please consider restoring the documentation.

/// A remote dynamic filter update sent from a query coordinator to region servers.
///
/// `generation` is monotonic within a `query_id`/`filter_id` pair and matches the
/// gRPC field name used by `RemoteDynFilterUpdate`. Receivers use it to ignore
/// stale updates while `is_complete` marks the final payload for the filter.
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Deserialize)]
pub struct DynFilterUpdate {
    /// Protocol version used by this update payload.
    pub protocol_version: u32,
    /// Internal query identifier that owns this dynamic filter lifecycle.
    pub query_id: String,
    /// Identifier of the dynamic filter within the query.
    pub filter_id: String,
    /// Monotonic update generation for this filter.
    pub generation: u64,
    /// Whether this update completes the dynamic filter stream.
    pub is_complete: bool,
    /// Serialized predicate payload carried by this update.
    pub payload: DynFilterPayload,
}

Comment on lines 792 to 798
/// The query request to be handled by the RegionServer (Datanode).
#[derive(Clone, Debug)]
pub struct QueryRequest {
/// The header of this request. Often to store some context of the query. None means all to defaults.
pub header: Option<RegionRequestHeader>,

/// The id of the region to be queried.
pub region_id: RegionId,

/// The form of the query: a logical plan.
pub plan: LogicalPlan,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The doc comments for the fields of QueryRequest have been removed. To maintain code clarity and help other developers understand the structure, please consider adding them back.

/// The query request to be handled by the RegionServer (Datanode).
#[derive(Clone, Debug)]
pub struct QueryRequest {
    /// The header of this request. Often to store some context of the query. None means all to defaults.
    pub header: Option<RegionRequestHeader>,
    /// The id of the region to be queried.
    pub region_id: RegionId,
    /// The form of the query: a logical plan.
    pub plan: LogicalPlan,
}

Comment on lines +611 to +613
fn num_bits_ceil_bytes(num_bits: usize) -> usize {
(num_bits / 8) + usize::from(!num_bits.is_multiple_of(8))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of num_bits_ceil_bytes is a bit obscure. For better readability and maintainability, consider using the more conventional integer arithmetic pattern (num_bits + 7) / 8 for ceiling division.

Suggested change
fn num_bits_ceil_bytes(num_bits: usize) -> usize {
(num_bits / 8) + usize::from(!num_bits.is_multiple_of(8))
}
fn num_bits_ceil_bytes(num_bits: usize) -> usize {
(num_bits + 7) / 8
}

discord9 added 2 commits June 30, 2026 11:29
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
@discord9 discord9 force-pushed the omos/rdf-bloom-common-query branch from 8aaf683 to 532e9f9 Compare June 30, 2026 04:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant