Skip to content

Blob purging is horribly slow #1443

Closed
@alxndrsn

Description

@alxndrsn

From prod, this query has terrible performance on some deployments:

DELETE FROM blobs
  WHERE id IN (
    SELECT b.id
      FROM blobs AS b
      LEFT JOIN client_audits          AS ca ON ca."blobId"    = b.id
      LEFT JOIN submission_attachments AS sa ON sa."blobId"    = b.id
      LEFT JOIN form_attachments       AS fa ON fa."blobId"    = b.id
      LEFT JOIN form_defs              AS fd ON fd."xlsBlobId" = b.id
    WHERE ca."blobId"    IS NULL
      AND sa."blobId"    IS NULL
      AND fa."blobId"    IS NULL
      AND fd."xlsBlobId" IS NULL
      AND b.s3_status = 'uploaded'
    LIMIT 1
)
  RETURNING blobs.id, blobs.sha

Call stack:

  • Blobs._purgeOneUnattachedUploaded()
  • Blobs.purgeUnattached()
  • purgeTask()
  • lib/bin/purge.js
  • purge.sh

Queries

  • what does the query plan look like on servers where this is slow?
    • A. it is much faster without the LIMIT clause (~100x)
    • A. it has many materialization steps
    • A. there are seq scans for all the joins, so perhaps indexes are missing on blobId columns?
    • A. perhaps index is missing on s3_status column?
  • how often are blobs de-duplicated across different types (e.g. client audits vs submission attachments vs form attachments vs form definitions)? It seems unlikely that there is significant space saved by merging all of these

Metadata

Metadata

Assignees

Labels

performancePerformance, benchmarking

Type

No type

Projects

Status

✅ done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions