Description
We use Dataflow with Apache Beam to read events from Kinesis streams. Recently, we've spotted that in a case when one of the streams was not available in the middle of events processing (due to removal or problem with the credentials), the data watermark for this stream was still being updated.
Imagine scenario:
- Permissions allow to read from stream A
- Data is read from stream A
- Permissions are changed and don’t allow to read from stream A
- Watermark for stream A is progressing (but stream data is not read due to permissions issue)
- Permissions are fixed to read stream A
- Data is read from stream A but from the updated watermark
As a result, stream data between steps 3-5 is lost and the client doesn’t know that.
Additionally, it may be confusing from the Dataflow console perspective, as it suggests that events are still being read from the stream. It is hard to rely on the watermark as a source metric for alerting purposes as well.
Brief investigation suggests that maybe the KinesisReader.getWatermark() logic doesn’t consider the state of the stream i.e. is it available or not, and it treats the removed stream as a stream without traffic. Watermark calculation should be adjusted to take that information into account.
Imported from Jira BEAM-12406. Original Jira may contain additional context.
Reported by: mateuszratajocado.