Flink: add snapshot expiration reset strategy #12639
@pvary @stevenzwu Could you please take some time to review whether this change is feasible? Thanks!
It might be only me, but @stevenzwu, @mxm: What are your thoughts?
@pvary Thank you very much for your response. Here is my admittedly rough thinking. In our daily operations, if we encounter such a scenario and intervene manually, the only way to recover is by modifying the source configuration to set the Moreover, manual intervention may not always be timely, potentially leading to even more data loss. This is why we came up with the idea of an automated recovery configuration, while retaining a default option that preserves the existing behavior.
To add further, recovering the job manually requires abandoning the previous state. If downstream jobs use windows, this also results in the loss of statistical data. To preserve the state data, we hope to have an automated recovery mechanism in place. This PR mainly aims to avoid manual intervention to restart the job for recovery.
@Guosmilesmile: I hope my answer was not too harsh; I did not intend to hurt. All I was trying to say is that, in my experience, correctness is the most important feature for a job. YMMV
Could you use
Peter, your feedback is very insightful and timely, and I am very glad to receive your response. I would love to communicate with you more; I just worry that my replies may not fully express my thoughts, so I would like to provide additional information. Modifying the UID can resolve my scenario, but it requires manual intervention each time, often in the middle of the night, which is why I came up with this PR. I apologize if my previous replies have caused any misunderstandings.
Good to hear!
Yes, you are right. Changing the UID can help me preserve the state. This PR is primarily aimed at providing an automatic recovery mechanism that avoids the need for manual intervention, while the configuration switch can still preserve the original behavior.
We encountered a scenario where, when using the Flink source to incrementally consume data from Iceberg, the snapshot referenced by lastSnapshotId had already been cleaned up. This can happen, for example, through Spark's expire_snapshots (CALL iceberg.system.expire_snapshots(table => 'default.my_table', older_than => TIMESTAMP '2025-03-25 00:00:00.000', retain_last => 1)), or when consumption is too slow and historical snapshots are expired before they are read.
In such scenarios, the Flink job restarts repeatedly because the snapshot ID stored in the state is no longer available, and it only recovers after someone manually restarts the job. We would like a mechanism that lets the job recover automatically in this situation, similar to how Kafka handles out-of-range offsets by automatically starting to consume from the earliest or latest available data.
This PR adds a configuration option called snapshot-expiration-reset-strategy. When the lastSnapshot is not an ancestor of the current snapshot, the option selects one of three behaviors, so that no manual restart is needed for recovery:
Default mode: Maintain the current behavior.
Earliest mode: Resume incremental consumption using the oldest available snapshot as the lastSnapshot.
Latest mode: Resume incremental consumption using the current latest snapshot as the lastSnapshot.
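
The three modes above can be sketched roughly as follows. This is only an illustration of the decision logic, not the code in this PR: the class name SnapshotResetSketch, the enum ResetStrategy, and the method resolveLastSnapshot are hypothetical, the table lineage is modeled as a plain list of retained snapshot IDs ordered oldest to newest, and it is assumed that the default mode fails the job as it does today.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from Iceberg or from this PR.
class SnapshotResetSketch {

    // Mirrors the three modes of snapshot-expiration-reset-strategy.
    enum ResetStrategy { DEFAULT, EARLIEST, LATEST }

    /**
     * Picks the snapshot ID to treat as lastSnapshot.
     *
     * @param lastConsumed snapshot ID restored from the Flink state
     * @param lineage      retained snapshot IDs, ordered oldest to newest (assumed model)
     * @param strategy     configured reset strategy
     */
    static long resolveLastSnapshot(long lastConsumed, List<Long> lineage, ResetStrategy strategy) {
        if (lineage.isEmpty()) {
            throw new IllegalArgumentException("Table has no snapshots");
        }
        // Normal case: the restored snapshot is still an ancestor of the current one.
        if (lineage.contains(lastConsumed)) {
            return lastConsumed;
        }
        switch (strategy) {
            case EARLIEST:
                // Resume from the oldest snapshot that still exists.
                return lineage.get(0);
            case LATEST:
                // Skip ahead to the current latest snapshot.
                return lineage.get(lineage.size() - 1);
            default:
                // Assumed current behavior: fail, which leads to the restart loop.
                throw new IllegalStateException(
                    "Snapshot " + lastConsumed + " is not an ancestor of the current snapshot");
        }
    }
}
```

With a retained lineage of 10, 20, 30 and an expired lastConsumed of 5, the earliest mode in this sketch resolves to 10 and the latest mode to 30, while the default mode throws. Both automatic modes trade correctness for availability: earliest may skip data that was in the expired snapshots, and latest additionally skips everything still retained.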