Which Github Action / Prow Jobs are flaking?
ci-etcd-robustness-main-arm64
Which tests are flaking?
Robustness test
Github Action / Prow Job link
Reason for failure (if possible)
The issue was investigated during the robustness tests meeting on March 26th.
Failed scenario: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640
unexpected differences between wal entries, diff:
  []model.EtcdRequest{
      ... // 922 identical elements
      {Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key2", Value: {Value: "925"}}}}}},
      {Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key7", Value: {Hash: 1602287042}}}}}},
      {
          ... // 2 identical fields
          LeaseRevoke: nil,
          Range: nil,
          Txn: &model.TxnRequest{
              Conditions: nil,
              OperationsOnSuccess: []model.EtcdOperation{
                  {
                      Type: "put-operation",
                      Range: {},
                      Put: model.PutOptions{
-                         Key: "key5",
+                         Key: "key7",
                          Value: model.ValueOrHash{
-                             Value: "",
+                             Value: "927",
-                             Hash: 876259569,
+                             Hash: 0,
                          },
                          LeaseID: 0,
                      },
                      Delete: {},
                  },
              },
              OperationsOnFailure: nil,
          },
          Defragment: nil,
          Compact: nil,
      },
  }
Things we confirmed:
- It's not a bug in etcd; the history is linearizable if we disable reading the WAL.
- The etcd-dump-log tool is able to read the WAL without a problem and returned commitIndex 2936.
- It's not an issue with comparing uncommitted entries, although the current logic doesn't support that case.
- Reading the WAL in the robustness tests fails with "Error occurred when reading WAL entries wal: slice bounds out of range", implying that entry 2448 is corrupted, even though etcd-dump-log worked.
- The ErrSliceOutOfRange error is swallowed.
From experience, I have frequently observed errors when reading the WAL in a 3-member cluster. I expect that a failpoint causing an etcd crash might disrupt writing the WAL, resulting in a corrupted state. This is still OK as long as it happens to just one member and the remaining two are fine.
Things to do:
- Don't swallow errors when reading the WAL file.
- Skip uncommitted entries by breaking out of the loop once entry.Index > hardState.Commit.
- Investigate why etcd-dump-log reads the WAL without a problem.
Anything else we need to know?
No response