@adskyiproger adskyiproger commented Oct 8, 2025

Description

We are experiencing issues with backups on our staging v19 beta environment. The error doesn't affect production.

Error log example:

root@farajaland-v19-beta-staging:~# bash /root/backup.sh --passphrase=xxxxxxxxxxxxxx --ssh_user=backup --ssh_host=1.1.1.1 --ssh_port=22 --remote_dir=/home/backup/v19-beta-staging --replicas=1
Backing up PostgreSQL 'events' database
Creating a backup for SQLite
fetch https://dl-cdn.alpinelinux.org/alpine/v3.22/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.22/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ncurses-terminfo-base (6.5_p20250503-r0)
(2/4) Installing libncursesw (6.5_p20250503-r0)
(3/4) Installing readline (8.2.13-r1)
(4/4) Installing sqlite (3.49.2-r1)
Executing busybox-1.37.0-r18.trigger
OK: 10 MiB in 20 packages

Delete all currently existing snapshots

{"acknowledged":true}
Register backup folder as an Elasticsearch repository for backing up the search data


Backup Elasticsearch as a set of snapshot files into an elasticsearch sub folder

List indices for backup: events_birth,events_tennis-club-membership,events_death
List indices for backup: events_birth,events_tennis-club-membership,events_death
{ "error" : { "root_cause" : [ { "type" : "snapshot_name_already_in_use_exception", "reason" : "[ocrvs:snapshot_2025-10-08] Invalid snapshot name [snapshot_2025-10-08], snapshot with the same name already exists" } ], "type" : "snapshot_name_already_in_use_exception", "reason" : "[ocrvs:snapshot_2025-10-08] Invalid snapshot name [snapshot_2025-10-08], snapshot with the same name already exists" }, "status" : 400 }
Failed to backup Elasticsearch. Trying again in...

The root cause of the issue is an asynchronous call to the Elasticsearch endpoint: /_snapshot/ocrvs/snapshot_${LABEL:-$BACKUP_DATE}?wait_for_completion=true&pretty

Adding a wait between snapshot deletion and creation should address the issue, which shows up on staging because of its lower resources.
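
A minimal sketch of the ordering described above: delete, pause, then create. It is not the actual patch. `ES_URL`, `SNAPSHOT_DELETE_WAIT`, and the `recreate_snapshot` function are hypothetical names introduced for illustration, and deleting a single snapshot by name is an assumption about how the script clears old snapshots.

```shell
#!/usr/bin/env bash
# Sketch only. ES_URL and SNAPSHOT_DELETE_WAIT are assumed names, not taken from backup.sh.
ES_URL="${ES_URL:-http://localhost:9200}"
SNAPSHOT_DELETE_WAIT="${SNAPSHOT_DELETE_WAIT:-30}"
# Same naming scheme as the endpoint in the description; the date fallback is an assumption.
SNAPSHOT_NAME="snapshot_${LABEL:-${BACKUP_DATE:-$(date +%Y-%m-%d)}}"

recreate_snapshot() {
  # Ask Elasticsearch to delete the existing snapshot with this name.
  curl -s -X DELETE "$ES_URL/_snapshot/ocrvs/$SNAPSHOT_NAME"
  echo

  # Give the deletion time to finish before reusing the name.
  echo "Waiting for snapshots to be removed"
  sleep "$SNAPSHOT_DELETE_WAIT"

  # Create the new snapshot and block until it completes.
  curl -s -X PUT "$ES_URL/_snapshot/ocrvs/$SNAPSHOT_NAME?wait_for_completion=true&pretty"
}
```

A fixed pause is simple, but it either wastes time on fast hosts or is too short on slow ones, so the wait length is kept configurable here.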

Testing

The script was patched on the staging environment and executed:

root@farajaland-v19-beta-staging:~# bash /root/backup1.sh --passphrase=xxxxxxxxxx --ssh_user=backup --ssh_host=1.1.1.1 --ssh_port=22 --remote_dir=/home/backup/v19-beta-staging --replicas=1
Backing up PostgreSQL 'events' database

Delete all currently existing snapshots
{"acknowledged":true}

Register backup folder as an Elasticsearch repository for backing up the search data


Backup Elasticsearch as a set of snapshot files into an elasticsearch sub folder

List indices for backup: events_birth,events_tennis-club-membership,events_death
Snapshot state is SUCCESS
...

Checklist

  • I have linked the correct GitHub issue under "Development"
  • I have tested the changes locally, and written appropriate tests
  • I have tested beyond the happy path (e.g. edge cases, failure paths)
  • I have updated the changelog with this change (if applicable)
  • I have updated the GitHub issue status accordingly

Comment on lines +224 to +237
echo "Waiting for snapshots to be removed"
sleep 30
Contributor

Can we do better by waiting until the snapshot is deleted, using an API call?

Contributor Author

There is no way to validate the result of the execution.

We could check the filesystem, but that would be over-engineering.
It seems odd to me that this issue didn't happen before.
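
For reference, the polling idea raised in this thread could be sketched roughly as follows. This is an illustration, not code from the PR. It assumes that a GET on a snapshot that no longer exists returns HTTP 404; `ES_URL` and `wait_for_snapshot_deleted` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only. ES_URL is an assumed name; the 404-when-missing behaviour is an assumption
# that should be verified against the Elasticsearch version in use.
ES_URL="${ES_URL:-http://localhost:9200}"

# Usage: wait_for_snapshot_deleted <snapshot_name> [timeout_seconds] [interval_seconds]
# Returns 0 once the snapshot is gone, 1 if it is still present after the timeout.
wait_for_snapshot_deleted() {
  local name="$1" timeout="${2:-60}" interval="${3:-2}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    # Only the HTTP status code is inspected, so no JSON parser is needed.
    status=$(curl -s -o /dev/null -w '%{http_code}' "$ES_URL/_snapshot/ocrvs/$name")
    if [ "$status" = "404" ]; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```

The function returns a status instead of calling `exit`, so the caller decides whether a timeout should abort the backup or only log a warning.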

@adskyiproger adskyiproger deleted the fix-backup branch October 27, 2025 10:58
@adskyiproger adskyiproger restored the fix-backup branch October 27, 2025 10:59
@adskyiproger adskyiproger reopened this Oct 27, 2025
@greptile-apps greptile-apps bot left a comment

Greptile Overview

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| infrastructure/backups/backup.sh | 0/5 | Added timeout for snapshot deletion but includes critical bugs: exit 1 terminates script, jq not available in image, async deletions may fail |

1 file reviewed, 4 comments
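
Regarding the note that jq is not available in the image: one possible workaround is to read the snapshot state with grep and sed only. The sketch below is an assumption-laden illustration, not the merged code. `snapshot_state` is a hypothetical helper, and it assumes the snapshot API response contains a `"state"` field with an upper-case value such as SUCCESS.

```shell
#!/usr/bin/env bash
# Sketch only. snapshot_state is a hypothetical helper; the response shape is assumed.
# Reads a JSON response on stdin and prints the first "state" value, or nothing if absent.
snapshot_state() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"[A-Z_]*"' \
    | head -n 1 \
    | sed 's/.*"\([A-Z_]*\)"$/\1/'
}
```

A caller could pipe the create-snapshot response through it, for example `echo "Snapshot state is $(echo "$response" | snapshot_state)"`, where `$response` is a hypothetical variable holding the API output. Pattern matching on JSON is fragile, so this only suits simple, flat checks like this one.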

@opencrvs opencrvs deleted a comment from greptile-apps bot Oct 27, 2025
@adskyiproger adskyiproger changed the base branch from develop to release-v1.9.0 October 28, 2025 15:37
@adskyiproger adskyiproger merged commit 8b2b742 into release-v1.9.0 Oct 28, 2025
3 checks passed
@adskyiproger adskyiproger deleted the fix-backup branch October 28, 2025 15:42
4 participants