
Add pca_components="auto" to pick the projection rank from the scree elbow #54

Merged
andrewkern merged 1 commit into main from feature/pca-auto-rank
May 21, 2026


@andrewkern
Member

Summary

Adds pca_components="auto": instead of hand-picking the PCA-init projection
width, locator resolves it to the genotype-PCA scree elbow of the
training data.

  • scree_elbow() (locator/pca.py) builds the explained-variance
    spectrum from the Gram-matrix eigenvalues -- the same on-device path as the
    PCA projection fit -- and returns the chord-distance elbow (the point of the
    spectrum farthest below the line joining its first and last components).
  • _resolve_pca_components() (locator/training.py) resolves "auto" to
    a concrete integer on the training split, once per run, and writes it
    back into the config so every fold and the saved model metadata share one
    rank.
  • pca_components still accepts None (off) or an explicit int; only
    "auto" is new, and the default stays None (PCA remains opt-in).
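
For illustration, the elbow rule from the first bullet can be sketched in NumPy. This is a minimal sketch, not the code in locator/pca.py: the names explained_variance_from_gram and chord_elbow are hypothetical, it runs on the CPU rather than on-device, and returning the 1-based index of the farthest-below component is an assumed convention.

```python
import numpy as np


def explained_variance_from_gram(genotypes: np.ndarray) -> np.ndarray:
    """Explained-variance ratios from Gram-matrix eigenvalues (hypothetical helper).

    genotypes: array of shape (n_samples, n_snps).
    """
    # Center each SNP column so the spectrum reflects variance around the mean.
    centered = genotypes - genotypes.mean(axis=0, keepdims=True)
    # Samples-by-samples Gram matrix; its nonzero eigenvalues match those of
    # the SNP-by-SNP scatter matrix, but it is much smaller when SNPs >> samples.
    gram = centered @ centered.T
    # eigvalsh returns ascending eigenvalues; reverse to get a scree curve.
    eigvals = np.linalg.eigvalsh(gram)[::-1]
    # Round-off can leave tiny negative values; clip them to zero.
    eigvals = np.clip(eigvals, 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        raise ValueError("genotype matrix has no variance")
    return eigvals / total


def chord_elbow(explained: np.ndarray) -> int:
    """1-based index of the point farthest below the first-to-last chord.

    Assumed convention: the returned value is used directly as the rank.
    """
    n = explained.size
    if n < 3:
        # With fewer than three points there is no interior point to pick.
        return n
    x = np.arange(n, dtype=float)
    # Straight line joining the first and last points of the spectrum.
    slope = (explained[-1] - explained[0]) / (n - 1)
    chord = explained[0] + slope * x
    # Positive gap means the spectrum lies below the chord at that component.
    gap = chord - explained
    return int(np.argmax(gap)) + 1
```

The sketch measures the vertical gap to the chord. Perpendicular distance to the same line differs only by a constant factor, so both select the same component, and rescaling either axis does not change the result.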

Why

A SNP-count sweep on the 10x WGS Actinemys data (10-fold CV) showed the
projection rank doesn't need to be guessed: the scree elbow is rank 6,
stable across 100k-1M SNPs, and rank 6 matched or beat a hand-picked 128 at
every SNP count (and was clearly better at 1-2M). "auto" makes the elbow the
standard way to size the projection once PCA-init is enabled.

Testing

tests/test_pca_init.py adds coverage for pca_components="auto" (resolves
to an int, sizes the projection layer) and for scree_elbow on synthetic
low-rank data. Full PCA suite (11) plus train/parallel sanity suites (18)
pass; lint and format clean.

…lbow

PCA-init previously required choosing the projection width by hand. Setting
pca_components="auto" now resolves the width to the genotype-PCA scree
elbow -- the chord-distance knee of the explained-variance spectrum --
computed on the training split. The resolved integer is written back to
the config, so every fold of a run and the saved model metadata share one
rank.
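
The resolve-once, write-back behaviour described above might look roughly like the following. This is a sketch under stated assumptions, not the body of _resolve_pca_components(): the config is assumed to be a plain dict, elbow_fn stands in for scree_elbow(), the function name is hypothetical, and rejecting non-positive or non-integer values is an added guess.

```python
from typing import Any, Callable, Optional


def resolve_pca_components_sketch(
    config: dict,
    train_genotypes: Any,
    elbow_fn: Callable[[Any], int],
) -> Optional[int]:
    """Resolve config["pca_components"] to None or a concrete int (hypothetical)."""
    value = config.get("pca_components")
    if value is None:
        # PCA-init is off; nothing to resolve.
        return None
    if isinstance(value, str):
        if value != "auto":
            raise ValueError(f"unsupported pca_components value: {value!r}")
        # Compute the elbow on the training split only.
        rank = int(elbow_fn(train_genotypes))
        # Write the integer back so later folds and the saved metadata
        # see the same rank and never recompute it.
        config["pca_components"] = rank
        return rank
    # Explicit integers pass through; bool is excluded because it subclasses int.
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError(f"pca_components must be None, 'auto', or a positive int: {value!r}")
    return value
```

Because the resolved integer replaces "auto" in the config, a second call on the same config takes the explicit-int path and does not call elbow_fn again.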

scree_elbow() builds the spectrum from the Gram-matrix eigenvalues, the
same on-device path as the PCA projection fit. On the 10x WGS Actinemys
data the elbow is rank 6, stable across 100k-1M SNPs, and a k-fold sweep
found rank 6 matches or beats a hand-picked 128.
@andrewkern andrewkern merged commit 0e68955 into main May 21, 2026
4 checks passed
@andrewkern andrewkern deleted the feature/pca-auto-rank branch May 21, 2026 04:34