Closed
Description
From prod, this query has terrible performance on some deployments:
DELETE FROM blobs
WHERE id IN (
SELECT b.id
FROM blobs AS b
LEFT JOIN client_audits AS ca ON ca."blobId" = b.id
LEFT JOIN submission_attachments AS sa ON sa."blobId" = b.id
LEFT JOIN form_attachments AS fa ON fa."blobId" = b.id
LEFT JOIN form_defs AS fd ON fd."xlsBlobId" = b.id
WHERE ca."blobId" IS NULL
AND sa."blobId" IS NULL
AND fa."blobId" IS NULL
AND fd."xlsBlobId" IS NULL
AND b.s3_status = 'uploaded'
LIMIT 1
)
RETURNING blobs.id, blobs.sha
Call stack:
Blobs._purgeOneUnattachedUploaded()
Blobs.purgeUnattached()
purgeTask()
lib/bin/purge.js
purge.sh
Queries
- what does the query plan look like on servers where this is slow?
- A. it is much faster without the
LIMIT
clause (~100x) - A. it has many materialization steps
- A. there are seq scans for all the joins, so perhaps indexes are missing on
blobId
columns? - A. perhaps index is missing on
s3_status
column?
- A. it is much faster without the
- how often are blobs de-duplicated across different types (e.g. client audits vs submission attachments vs form attachments vs form definitions)? It seems unlikely that there is significant space saved by merging all of these
Metadata
Metadata
Assignees
Type
Projects
Status
✅ done