[WIP][feature][spark] Support streaming #7476
base: dev
Conversation
Lines 27 to 49 in 1bba723
Lines 87 to 97 in 1bba723
The PartitionReader is never closed in streaming mode.
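The lifecycle concern above can be sketched in plain Java. This is a minimal, illustrative mock, not Spark's or SeaTunnel's actual API: a reader that records whether `close()` was invoked, driven by a micro-batch loop that uses try-with-resources so the reader is always released when the batch ends.

```java
import java.util.ArrayList;
import java.util.List;

public class ReaderLifecycle {
    /** Illustrative stand-in for a Spark-style partition reader contract. */
    interface PartitionReader extends AutoCloseable {
        boolean next();
        String get();
        @Override void close();
    }

    static final class ListReader implements PartitionReader {
        private final List<String> rows;
        private int pos = -1;
        boolean closed = false;
        ListReader(List<String> rows) { this.rows = rows; }
        public boolean next() { return ++pos < rows.size(); }
        public String get() { return rows.get(pos); }
        public void close() { closed = true; }
    }

    /** One micro-batch: drain the reader, closing it when the batch ends. */
    static List<String> runBatch(ListReader reader) {
        List<String> out = new ArrayList<>();
        try (reader) {                 // guarantees close() even on failure
            while (reader.next()) out.add(reader.get());
        }
        return out;
    }

    public static void main(String[] args) {
        ListReader r = new ListReader(List.of("a", "b"));
        System.out.println(runBatch(r) + " closed=" + r.closed);
    }
}
```

If the per-batch close is skipped (the situation reported here), the reader's resources leak until the executor exits.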
Hi @CheneyYin, it seems that after a checkpoint it will be closed.
Yes. If the reader does not receive new data for a long time, Spark ends the current micro-batch. Spark's micro-batch mechanism does not fully meet the requirements of long-running streaming computation. First, creating a new reader for the next batch incurs some overhead. Second, the granularity of fault recovery is too coarse: the Spark micro-batch mechanism cannot restore the reader from the latest snapshot of the SeaTunnel reader.
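The first objection (per-batch reader construction) can be made concrete with a small sketch, all names illustrative: each micro-batch constructs a fresh reader, so setup work (opening connections, seeking to an offset, etc.) is repeated once per batch instead of once per query.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderChurn {
    static final AtomicInteger CREATED = new AtomicInteger();

    static final class Reader implements AutoCloseable {
        Reader() { CREATED.incrementAndGet(); } // e.g. open connection, seek to offset
        @Override public void close() { }
    }

    /** Micro-batch style: one new reader per batch. */
    static int runMicroBatches(int batches) {
        CREATED.set(0);
        for (int i = 0; i < batches; i++) {
            try (Reader r = new Reader()) { /* read this batch's slice */ }
        }
        return CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println("readers created for 5 micro-batches: " + runMicroBatches(5));
    }
}
```

A long-lived streaming reader would pay that construction cost once; the micro-batch loop pays it every batch.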
The checkpoint space looks like this:

./
├── commits
│   ├── ........
│   ├── 10
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── ........
├── metadata
├── offsets
│   ├── ........
│   ├── 10
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── 16
│   ├── ........
└── sources
    └── 0
        ├── ........
        ├── 11
        │   ├── 0
        │   │   └── 0.committed
        │   └── 1
        │       └── 0.committed
        ├── 12
        │   ├── 0
        │   │   └── 0.committed
        │   └── 1
        │       └── 0.committed
        ├── 13
        │   ├── 0
        │   │   └── 0.committed
        │   └── 1
        │       └── 0.committed
        ├── 14
        │   ├── 0
        │   │   └── 0.committed
        │   └── 1
        │       └── 0.committed
        ├── 15
        │   ├── 0
        │   │   └── 0.committed
        │   └── 1
        │       └── 0.committed
        └── ........
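The listing above can be read back mechanically. Spark writes an entry under `offsets/` when a batch starts and under `commits/` when it finishes, so the last fully committed batch is the highest id present in both directories (here offsets reach 16 but commits stop at 15). A hedged sketch, with the directory contents modeled as plain sets:

```java
import java.util.Set;

public class CheckpointScan {
    /** Highest batch id that appears in both offsets/ and commits/, or -1 if none. */
    static long lastCommittedBatch(Set<Long> offsets, Set<Long> commits) {
        return offsets.stream()
                .filter(commits::contains)
                .mapToLong(Long::longValue)
                .max()
                .orElse(-1L);
    }

    public static void main(String[] args) {
        // Mirrors the tree above: offsets up to 16, commits up to 15.
        Set<Long> offsets = Set.of(10L, 11L, 12L, 13L, 14L, 15L, 16L);
        Set<Long> commits = Set.of(10L, 11L, 12L, 13L, 14L, 15L);
        System.out.println(lastCommittedBatch(offsets, commits)); // prints 15
    }
}
```

On restart, batch 16 has an offset entry but no commit entry, so it is the batch that gets replayed.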
Yes, the micro-batch process cannot meet the requirements of streaming.
Yes, you're right.
I think this pattern would be more like Spark's continuous streaming mode, but it seems to completely lack fault tolerance.
It can ensure end-to-end at-least-once semantics. If the sink is idempotent when handling reprocessed data, it can ensure exactly-once. At present, Spark's continuous streaming execution mode is still experimental and guarantees at-least-once fault tolerance.
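The idempotence argument can be sketched briefly (names are illustrative, not SeaTunnel's sink API): under at-least-once delivery the same record may arrive more than once after a failure, but a sink that upserts by key converges to the same state regardless of replays, giving effectively exactly-once results.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentSink {
    private final Map<String, String> store = new HashMap<>();

    /** Upsert by key: replaying the same record leaves the state unchanged. */
    void write(String key, String value) {
        store.put(key, value);
    }

    Map<String, String> snapshot() {
        return Map.copyOf(store);
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.write("user-1", "v1");
        sink.write("user-1", "v1");   // replayed record after failure recovery
        sink.write("user-2", "v2");
        System.out.println(sink.snapshot().size()); // prints 2
    }
}
```

An append-only sink under the same replay would write the duplicate twice, which is why at-least-once alone is not exactly-once.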
❤️
- Support Spark Continuous Processing
Purpose of this pull request
Support streaming for the Spark engine.
Related:
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide
release-note