The misjudgment of the WAL iterator status results in severe replication delays.

### Expected behavior
When no errors occur, the WAL iterator needs to be regenerated only after it has traversed all the WAL files it initially identified during its creation.

### Actual behavior
There are two scenarios in which the iterator incorrectly concludes that a file has been fully traversed before reaching its actual end, and returns "TryAgain". When traversal is interrupted inside a WAL file in this way, subsequent calls to `SeekToStartSequence` can incur significant delays. Our tracking indicates that in such cases `SeekToStartSequence` can take between 80 and 200 milliseconds, and `RestrictedRead` may be executed up to 100,000 times.

## case1: 
`current_last_seq_ == versions_->LastSequence()` is checked twice. External writes that land between the two checks can increase `LastSequence`, so the first check succeeds and the second one fails.

![20241231133901](https://github.com/user-attachments/assets/b8ab9e42-6d2a-4b39-b07d-76f971616f0a)
_Figure 1: double check in `NextImpl`_

![20241231131201](https://github.com/user-attachments/assets/a6f7922c-b508-455d-a3d3-741ebb1b0864)
_Figure 2: first check in `RestrictedRead`_
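The race in case1 can be sketched outside of RocksDB as follows. This is only an illustration under stated assumptions: `FakeVersionSet`, `Outcome`, `RacyCheck`, and `SnapshotCheck` are hypothetical names, not RocksDB code, and reading `LastSequence` once and reusing the value is one possible way to avoid the race, not necessarily the fix that was applied.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for versions_->LastSequence(); not the RocksDB class.
struct FakeVersionSet {
  std::atomic<uint64_t> last_sequence{0};
  uint64_t LastSequence() const { return last_sequence.load(); }
};

enum class Outcome { kKeepReading, kOk, kTryAgain };

// Racy shape: LastSequence() is read twice. `between_checks` simulates an
// external write that lands between the two reads.
template <typename Hook>
Outcome RacyCheck(uint64_t current_last_seq, FakeVersionSet& versions,
                  Hook between_checks) {
  // First check: the iterator believes it has caught up and stops reading.
  if (current_last_seq != versions.LastSequence()) {
    return Outcome::kKeepReading;
  }
  between_checks();
  // Second check: LastSequence may have moved, so this can fail even though
  // the first check passed a moment ago.
  return current_last_seq == versions.LastSequence() ? Outcome::kOk
                                                     : Outcome::kTryAgain;
}

// Assumed alternative: read LastSequence() once and use the same value for
// both decisions, so a concurrent write cannot split them.
template <typename Hook>
Outcome SnapshotCheck(uint64_t current_last_seq, FakeVersionSet& versions,
                      Hook between_checks) {
  const uint64_t snapshot = versions.LastSequence();
  if (current_last_seq != snapshot) {
    return Outcome::kKeepReading;
  }
  between_checks();
  // Same snapshot as above, so the result is consistent with the first check.
  return Outcome::kOk;
}
```

With the snapshot variant, a write that arrives after the iterator has caught up is simply picked up on the next call instead of forcing a new iterator and another `SeekToStartSequence`.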

After we addressed this issue, replication delay dropped significantly, though occasional delay spikes still occur.

![20241231131222](https://github.com/user-attachments/assets/327de95d-6e4f-4443-8a25-6b60bbb353a9)
_Figure 3: replication Pmax (red line: control group, orange line: experimental group)_


## case2:
`current_log_reader_->ReadRecord(record, &scratch_)` may return false via the `kEof` branch. In certain scenarios, hitting EOF does not mean that the file has truly reached its end. We observed this behavior in our custom log output, and it also explains the spikes seen in the experimental group in Figure 3.

Although we have not yet pinpointed the specific scenarios that lead to this false EOF, we can prevent the misjudgment by verifying whether a new live WAL file has actually been generated. Adding this check resolves the issue completely.
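A minimal sketch of that check, assuming the iterator can obtain the list of live WAL files and compare log numbers. `WalInfo`, `EofAction`, and `OnEof` are hypothetical names used for illustration, not RocksDB APIs:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of a live WAL file; not a RocksDB type.
struct WalInfo {
  uint64_t log_number;
};

enum class EofAction { kStayOnCurrentFile, kAdvanceToNextFile };

// Treat EOF on the current WAL as final only if a newer live WAL exists.
// Otherwise the EOF is assumed to be a false one and the iterator stays on
// the current file instead of reporting that it has been fully traversed.
inline EofAction OnEof(uint64_t current_log_number,
                       const std::vector<WalInfo>& live_wals) {
  for (const WalInfo& wal : live_wals) {
    if (wal.log_number > current_log_number) {
      return EofAction::kAdvanceToNextFile;
    }
  }
  return EofAction::kStayOnCurrentFile;
}
```

The idea is that a WAL file which is still the newest live file may keep growing, so an EOF seen on it is not enough evidence that the iterator has to be regenerated.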

![20241231141215](https://github.com/user-attachments/assets/b67899e9-c2a7-426a-bd5d-c40bb24d2796)
_Figure 4: replication Pmax (red line: control group, orange line: experimental group)_



The misjudgment of the WAL iterator status results in severe replication delays. #13260
