
PostgreSQL WAL directory consuming excessive storage and causing downtime #181

@sunu


Problem description

We're running a standard deployment of eoapi-k8s for IFRC Montandon with these configuration values:

    ingress:
      host: "montandon-eoapi-stage.ifrc.org"
      tls:
        enabled: false
    pgstacBootstrap:
      settings:
        envVars:
          LOAD_FIXTURES: "0"
          RUN_FOREVER: "1"
    postgrescluster:
      instances:
      - name: eoapi
        replicas: 1
        dataVolumeClaimSpec:
          accessModes:
          - "ReadWriteOnce"
          resources:
            requests:
              storage: "500Gi"
              # note: a volume claim only uses `storage`; cpu/memory here do not set pod resources
              cpu: "1024m"
              memory: "3048Mi"

We've hit a snag where the database service eventually runs out of disk space after ingesting data for a while. When this happens, it fails its health checks and becomes unresponsive, which brings down the raster service with it.
Looking into our staging instance, we found that the PostgreSQL WAL (Write-Ahead Log) directory is the culprit: it's eating up almost all of our storage. Here's what we're seeing:

    bash-4.4$ df -h /pgdata/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdd        493G  493G     0 100% /pgdata
    bash-4.4$ du -sh /pgdata/*
    16K	/pgdata/lost+found
    116M	/pgdata/pg16
    492G	/pgdata/pg16_wal
    12K	/pgdata/pgbackrest

The WAL directory (492G) is more than 4,000 times the size of the data directory (116M).
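To track this imbalance over time instead of eyeballing `du` output, here is a minimal sketch of a helper that prints the WAL size, the data size and their ratio. `dir_kib` and `wal_report` are hypothetical names, and the default paths assume the `/pgdata/pg16` and `/pgdata/pg16_wal` layout from the listing above. It only measures the symptom; it does not tell you why PostgreSQL is retaining the WAL.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of eoapi-k8s or Crunchy PGO): compare the
# disk usage of the WAL directory with that of the data directory.

# Print the disk usage of a directory in KiB (du reports allocated blocks).
dir_kib() {
  du -sk -- "$1" | awk '{print $1}'
}

# wal_report [DATA_DIR] [WAL_DIR]
# Prints "wal_kib data_kib ratio", where ratio is the integer WAL/data size.
# Defaults assume the /pgdata layout shown in this issue (PostgreSQL 16).
wal_report() {
  local data_dir="${1:-/pgdata/pg16}" wal_dir="${2:-/pgdata/pg16_wal}"
  local data_kib wal_kib
  data_kib=$(dir_kib "$data_dir")
  wal_kib=$(dir_kib "$wal_dir")
  # Avoid dividing by zero if the data directory reports no usage.
  if [ "$data_kib" -eq 0 ]; then data_kib=1; fi
  echo "$wal_kib $data_kib $(( wal_kib / data_kib ))"
}
```

Run inside the database container, `wal_report` with no arguments would print three numbers; a ratio in the thousands, as here, points at WAL retention rather than data growth.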

Expected Output

Ideally, the WAL directory should stay within a bounded size and should never be able to fill the volume and cause downtime.
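Since the outage is triggered by the volume reaching 100%, one stopgap until WAL growth itself is bounded is to alert well before that point. The following is a sketch under stated assumptions, not something eoapi-k8s or Crunchy PGO provides: `used_pct` and `check_disk` are hypothetical helpers built on `df -P`, and the 80% default threshold is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Hypothetical early-warning check: report when the filesystem holding the
# WAL crosses a usage threshold, before it reaches 100% and health checks fail.

# Read `df -P` output on stdin; print the Capacity (Use%) of the first
# filesystem row as a bare integer.
used_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# check_disk PATH [THRESHOLD]
# Returns 1 (with a warning on stderr) if PATH's filesystem is at or above
# THRESHOLD percent full; the default of 80 is only an example.
check_disk() {
  local path="$1" threshold="${2:-80}" pct
  pct=$(df -P -- "$path" | used_pct)
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARNING: $path is ${pct}% full (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "OK: $path is ${pct}% full"
}
```

For example, `check_disk /pgdata 85` prints an OK line or returns non-zero with a warning, which a cron job or monitoring agent could act on. This only buys time; it does not limit how much WAL PostgreSQL keeps.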

Environment Information

Crunchy Postgres Operator: v5.5.2
eoapi-k8s: v0.5.0

Labels: bug (Something isn't working)
Assignees: No one assigned