Robustness test flake due to error when reading WAL #19674

Open
@serathius

Description

Which Github Action / Prow Jobs are flaking?

ci-etcd-robustness-main-arm64

Which tests are flaking?

Robustness test

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

Reason for failure (if possible)

Issue was investigated as part of the robustness tests meeting on March 26th.

Failed scenario: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-main-arm64/1904808574576496640

unexpected differences between wal entries, diff:
        	            	  []model.EtcdRequest{
        	            	  	... // 922 identical elements
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key2", Value: {Value: "925"}}}}}},
        	            	  	{Type: "txn", Txn: &{OperationsOnSuccess: {{Type: "put-operation", Put: {Key: "key7", Value: {Hash: 1602287042}}}}}},
        	            	  	{
        	            	  		... // 2 identical fields
        	            	  		LeaseRevoke: nil,
        	            	  		Range:       nil,
        	            	  		Txn: &model.TxnRequest{
        	            	  			Conditions: nil,
        	            	  			OperationsOnSuccess: []model.EtcdOperation{
        	            	  				{
        	            	  					Type:  "put-operation",
        	            	  					Range: {},
        	            	  					Put: model.PutOptions{
        	            	- 						Key: "key5",
        	            	+ 						Key: "key7",
        	            	  						Value: model.ValueOrHash{
        	            	- 							Value: "",
        	            	+ 							Value: "927",
        	            	- 							Hash:  876259569,
        	            	+ 							Hash:  0,
        	            	  						},
        	            	  						LeaseID: 0,
        	            	  					},
        	            	  					Delete: {},
        	            	  				},
        	            	  			},
        	            	  			OperationsOnFailure: nil,
        	            	  		},
        	            	  		Defragment: nil,
        	            	  		Compact:    nil,
        	            	  	},
        	            	  }

Things we confirmed:

  • It's not a bug in etcd; the history is linearizable if we disable reading the WAL.
  • etcd-dump-logs is able to read the WAL without a problem and returned commitIndex 2936.
  • It's not an issue with comparing uncommitted entries; however, the current logic doesn't support comparing them.
  • Reading the WAL in the robustness tests fails with "Error occurred when reading WAL entries wal: slice bounds out of range", implying that entry 2448 is corrupted, even though etcd-dump-logs worked.
  • The ErrSliceOutOfRange error is swallowed.

From experience, I have frequently observed errors when reading the WAL in a 3-member cluster. I expect that a failpoint crashing etcd might disrupt writing the WAL, resulting in a corrupted state. This is still OK as long as it happens to just one member and the remaining two are fine.
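
To make that expectation concrete, here is a minimal Go sketch of the tolerance policy; the names readMemberWAL and readClusterWALs are hypothetical and not the robustness-test code. The per-member read error stays visible, and the comparison is only aborted when more than one member's WAL is unreadable.

```go
package report

import (
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// readMemberWAL is a hypothetical wrapper around reading one member's WAL
// (see the sketch after the "Things to do" list below for a possible body).
func readMemberWAL(walDir string) ([]raftpb.Entry, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

// readClusterWALs reads every member's WAL but tolerates a single unreadable
// one, e.g. the member whose process was crashed by a failpoint mid-write.
func readClusterWALs(walDirs []string) (map[string][]raftpb.Entry, error) {
	entriesPerMember := map[string][]raftpb.Entry{}
	var failures []error
	for _, dir := range walDirs {
		entries, err := readMemberWAL(dir)
		if err != nil {
			// Keep the error visible instead of swallowing it.
			failures = append(failures, fmt.Errorf("%s: %w", dir, err))
			continue
		}
		entriesPerMember[dir] = entries
	}
	if len(failures) > 1 {
		return nil, fmt.Errorf("WAL unreadable on %d members: %v", len(failures), failures)
	}
	return entriesPerMember, nil
}
```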

Things to do:

  • Don't swallow errors when reading the WAL file.
  • Skip uncommitted entries by breaking the loop once entry.Index > hardState.Commit; both points are sketched after this list.
  • Investigate why etcd-dump-logs reads the WAL without a problem.
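
A minimal sketch of the first two items, assuming the WAL read API exposed by the etcd server module (wal.OpenForRead and ReadAll, with import paths as on the current main branch). This is a sketch of the approach, not the actual robustness-test implementation:

```go
package report

import (
	"fmt"

	"go.etcd.io/etcd/server/v3/storage/wal"
	"go.etcd.io/etcd/server/v3/storage/wal/walpb"
	"go.etcd.io/raft/v3/raftpb"
	"go.uber.org/zap"
)

// readCommittedEntries reads one member's WAL, propagating read errors and
// dropping uncommitted entries.
func readCommittedEntries(lg *zap.Logger, walDir string) ([]raftpb.Entry, error) {
	w, err := wal.OpenForRead(lg, walDir, walpb.Snapshot{})
	if err != nil {
		return nil, fmt.Errorf("open WAL: %w", err)
	}
	defer w.Close()

	// Fix 1: don't swallow ReadAll errors (e.g. wal.ErrSliceOutOfRange); return
	// them so the report shows why the WAL could not be read.
	_, hardState, entries, err := w.ReadAll()
	if err != nil {
		return nil, fmt.Errorf("read WAL entries: %w", err)
	}

	// Fix 2: skip uncommitted entries by breaking the loop once the entry index
	// passes the commit index recorded in the hard state.
	var committed []raftpb.Entry
	for _, entry := range entries {
		if entry.Index > hardState.Commit {
			break
		}
		committed = append(committed, entry)
	}
	return committed, nil
}
```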

Anything else we need to know?

No response
