Description
Hi,
We are using a VolumeSnapshotClass as below for Block volume snapshotting:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: oci-bv-snapshot-incremental
driver: blockvolume.csi.oraclecloud.com
parameters:
  backupType: incremental # No functional restore difference between full and incremental
deletionPolicy: Delete
This is integrated with CNPG for lower environment database volume snapshots.
Occasionally (every few weeks), we find these backups failing with the error:
DeadlineExceeded desc = Timed out waiting for backup to become available
It looks like this is being thrown by the oci-bv csi here:
Which uses a timeout of 45 seconds as defined here:
However, in practice a 45-second timeout is too tight. Looking at the logs, we see the following snapshot creation times in uk-london-1, measured from the com.oraclecloud.BlockVolumes.CreateVolumeBackup.begin state to the com.oraclecloud.BlockVolumes.CreateVolumeBackup.end state.
Over 9 samples: average: 37.4 seconds | min: 34 seconds | max: 41 seconds
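With a backupPollInterval of 5 seconds, the CSI driver can step just outside the permitted 45-second timeout (for example, a backup that finishes at 41 seconds would not be seen until the next poll, at or around the 45-second mark).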
https://github.com/oracle/oci-cloud-controller-manager/blob/master/pkg/oci/client/block_storage.go#L150C36-L150C60
I believe the solution for this would be to increase the available timeout to 60 seconds to align better with the expected response times from the API.
Thanks!