feat: kafka bounded datasource #5970
base: main
Conversation
Greptile Summary

This PR implements a bounded Kafka datasource that enables batch-style reads from Apache Kafka topics, addressing issue #4603. The implementation spans Rust (scan operator and streaming consumer) and Python (public API with comprehensive bound normalization; see the sketch after the summary fields below).

Key Changes:
Implementation Quality:
Issue Found:
Confidence Score: 4/5
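The bound normalization mentioned in the summary could look roughly like the sketch below. This is a hypothetical illustration, not the PR's actual code: the helper name `normalize_bound`, its signature, and the returned shape are assumptions; only the accepted input forms come from the PR description.

```python
# Hypothetical sketch, not the PR's implementation. It only illustrates the
# accepted bound forms ("earliest"/"latest", {partition: offset},
# {topic: {partition: offset}}) being normalized into one per-topic shape.
from collections.abc import Mapping
from typing import Union

Bound = Union[str, Mapping[int, int], Mapping[str, Mapping[int, int]]]

def normalize_bound(bound: Bound, topics: list[str]) -> dict[str, Union[str, dict[int, int]]]:
    """Return {topic: "earliest"/"latest"} or {topic: {partition: offset}}."""
    if isinstance(bound, str):
        if bound not in ("earliest", "latest"):
            raise ValueError(f"unsupported bound string: {bound!r}")
        # Sentinels apply to every topic; concrete offsets would be resolved
        # later against the broker's watermarks.
        return {topic: bound for topic in topics}
    if isinstance(bound, Mapping):
        if bound and all(isinstance(key, str) for key in bound):
            # Multi-topic form: {topic: {partition: offset}}.
            return {topic: dict(partitions) for topic, partitions in bound.items()}
        # Single-topic form: {partition: offset}; only meaningful with one topic.
        if len(topics) != 1:
            raise ValueError("per-partition bounds require exactly one topic")
        return {topics[0]: {int(p): int(o) for p, o in bound.items()}}
    raise TypeError(f"unsupported bound type: {type(bound).__name__}")
```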
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Python API
participant KafkaScanOperator
participant BaseConsumer
participant StreamConsumer
participant Kafka Broker
User->>Python API: read_kafka(bootstrap_servers, topics, start, end)
Python API->>Python API: Normalize bounds (timestamps, offsets)
Python API->>Python API: Validate topics, partitions, timeouts
Python API->>KafkaScanOperator: kafka_scan_bounded()
KafkaScanOperator->>BaseConsumer: create(config)
BaseConsumer->>Kafka Broker: connect
loop For each topic
KafkaScanOperator->>BaseConsumer: fetch_metadata(topic)
BaseConsumer->>Kafka Broker: metadata request
Kafka Broker-->>BaseConsumer: partition list
loop For each partition
KafkaScanOperator->>BaseConsumer: fetch_watermarks(topic, partition)
BaseConsumer->>Kafka Broker: watermark request
Kafka Broker-->>BaseConsumer: low/high offsets
alt timestamp_ms bound
KafkaScanOperator->>BaseConsumer: offsets_for_times(timestamp)
BaseConsumer->>Kafka Broker: offset lookup
Kafka Broker-->>BaseConsumer: resolved offset
end
KafkaScanOperator->>KafkaScanOperator: resolve_bound(start/end)
KafkaScanOperator->>KafkaScanOperator: create ScanTask
end
end
KafkaScanOperator-->>User: ScanTasks (DataFrame plan)
User->>User: .collect() / .show()
loop For each ScanTask
User->>StreamConsumer: create(config)
StreamConsumer->>StreamConsumer: assign(topic, partition)
StreamConsumer->>StreamConsumer: seek(start_offset)
loop While not reached end_offset or limit
StreamConsumer->>Kafka Broker: recv()
Kafka Broker-->>StreamConsumer: message
StreamConsumer->>StreamConsumer: check offset bounds
StreamConsumer->>StreamConsumer: build batch (chunk_size)
alt Limit reached or end_offset
StreamConsumer->>StreamConsumer: stop streaming
end
end
StreamConsumer-->>User: RecordBatch stream
end
User->>User: MicroPartition results
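The execution half of this diagram (one ScanTask per partition) is implemented in Rust on top of a streaming consumer. As a rough analogy only, the Python sketch below uses kafka-python (which the PR does not use) to show the same bounded, partition-scoped loop: assign a single partition, seek to the resolved start offset, and read until the end offset or an optional row limit is reached. The function name, the exclusive treatment of `end_offset`, and the tuple batch format are assumptions.

```python
# Illustrative analogy only: the PR's consumer is written in Rust (rdkafka).
# This mirrors the per-partition bounded read shown in the diagram above.
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

def read_partition_bounded(bootstrap_servers, topic, partition,
                           start_offset, end_offset, limit=None):
    tp = TopicPartition(topic, partition)
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        enable_auto_commit=False,
    )
    consumer.assign([tp])            # bind to exactly one partition
    consumer.seek(tp, start_offset)  # start at the resolved lower bound

    rows = []
    try:
        while True:
            records = consumer.poll(timeout_ms=1000).get(tp, [])
            if not records:
                break  # caught up to the end of the partition
            for record in records:
                if record.offset >= end_offset:  # end bound treated as exclusive here
                    return rows
                rows.append((record.offset, record.key, record.value))
                if limit is not None and len(rows) >= limit:
                    return rows  # limit pushdown: stop streaming early
    finally:
        consumer.close()
    return rows
```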
Additional Comments (1)
- src/daft-scan/src/kafka.rs, lines 189-191 (logic): `can_absorb_limit()` returns `false`, but the Kafka consumer in `scan_task.rs:555-625` implements limit pushdown by tracking `remaining` and stopping early. This should return `true` to enable query optimization.
12 files reviewed, 1 comment
Codecov Report

❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #5970 +/- ##
==========================================
+ Coverage 72.63% 72.78% +0.15%
==========================================
Files 970 972 +2
Lines 126562 127542 +980
==========================================
+ Hits 91924 92829 +905
- Misses 34638 34713 +75
Force-pushed 083a34d to 3174030, then 3174030 to e2db2d9.
@desmondcheongzx tagging you on this one!
Changes Made
Added a bounded Kafka batch read API via `daft.read_kafka`, supporting `start` / `end` bounds expressed as (see the usage sketch at the end of this description):

- `"earliest"` / `"latest"`
- `{partition: offset}` for single-topic reads
- `{topic: {partition: offset}}` for multi-topic reads

Related Issues
Closes #4603
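For reference, a minimal usage sketch of the API described above. The broker address, topic names, and offsets are placeholders; the keyword names (`bootstrap_servers`, `topics`, `start`, `end`) follow the PR description and review summary rather than a verified signature.

```python
import daft

# Placeholders throughout: broker address, topic names, and offsets are illustrative.

# Whole topic, from the earliest retained offset to the current end.
df = daft.read_kafka(
    bootstrap_servers="localhost:9092",
    topics=["events"],
    start="earliest",
    end="latest",
)

# Single topic with explicit per-partition offsets: {partition: offset}.
df = daft.read_kafka(
    bootstrap_servers="localhost:9092",
    topics=["events"],
    start={0: 1_000, 1: 500},
    end={0: 2_000, 1: 1_500},
)

# Multiple topics: {topic: {partition: offset}}.
df = daft.read_kafka(
    bootstrap_servers="localhost:9092",
    topics=["events", "audit"],
    start={"events": {0: 0}, "audit": {0: 0}},
    end="latest",
)

df.show()
```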