Skip to content

[Issue] crt.sh SQL query does unnecessary work, caps results at 10k rows #1773

@lc

Description

@lc

The crt.sh SQL query introduced in #700 (replacing the simpler query from #666) copies the crt.sh web UI query verbatim. That query is designed to render a certificate browser as it computes:

  • x509_commonName
  • x509_notBefore
  • x509_notAfter
  • x509_serialNumber

& groups by certificate blob, then joins ct_log_entry for chronological ordering.

subfinder uses none of this; only reading NAME_VALUE.

This causes:

  • GROUP BY sub.CERTIFICATE on multi-KB DER blobs
  • 4x x509_* function calls per certificate group (CPU-intensive ASN.1 parsing)
  • Correlated LEFT JOIN LATERAL to ct_log_entry per group
  • ORDER BY on timestamp (unnecessary — subfinder deduplicates downstream)
  • LIMIT 10000 on raw certificate rows (not subdomains) as a band-aid for the query being too expensive

The fix from #700 found more subdomains than the original #666 query, but that was due to switching from certificate_identity to certificate_and_identities and using plainto_tsquery — not the
certificate metadata ceremony.

Suggested replacement:

  SELECT DISTINCT cai.NAME_VALUE
  FROM certificate_and_identities cai
  WHERE plainto_tsquery('certwatch', $1) @@ identities(cai.CERTIFICATE)
    AND cai.NAME_VALUE ILIKE ('%' || $1 || '%')

Eliminating the GROUP BY, x509 functions, correlated join should result in less CPU & memory pressure on crt.sh's public database.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions