Add pca_components="auto" to pick the projection rank from the scree elbow by andrewkern · Pull Request #54 · kr-colab/ReLocator

andrewkern · 2026-05-21T00:44:11Z

Summary

Adds pca_components="auto": instead of hand-picking the PCA-init projection
width, locator resolves it to the genotype-PCA scree elbow of the
training data.

scree_elbow() (locator/pca.py) builds the explained-variance
spectrum from the Gram-matrix eigenvalues -- the same on-device path as the
PCA projection fit -- and returns the chord-distance elbow (the point of the
spectrum farthest below the line joining its first and last components).

_resolve_pca_components() (locator/training.py) resolves "auto" to
a concrete integer on the training split, once per run, and writes it
back into the config so every fold and the saved model metadata share one
rank.

pca_components still accepts None (off) or an explicit int; only
"auto" is new, and the default stays None (PCA remains opt-in).
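
To illustrate the chord-distance elbow described above, here is a minimal NumPy sketch. It is not the code in locator/pca.py: the function names, the max_components cap, and the CPU-only NumPy path are assumptions made for illustration. The sketch also picks a convention the PR text does not state, reporting the elbow as a 1-based component number. It measures the vertical gap below the chord, which selects the same point as the perpendicular distance because the two differ only by a constant factor.

```python
import numpy as np


def explained_variance_from_gram(genotypes: np.ndarray, max_components: int = 64) -> np.ndarray:
    """Explained-variance ratios from the eigenvalues of the sample Gram matrix.

    Hypothetical helper: `genotypes` is (n_samples, n_snps); names are illustrative.
    """
    # Center each SNP column so the Gram matrix reflects variance around the mean.
    centered = genotypes - genotypes.mean(axis=0, keepdims=True)
    # n_samples x n_samples Gram matrix; small when SNPs far outnumber samples.
    gram = centered @ centered.T
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.linalg.eigvalsh(gram)[::-1]
    # Guard against tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    ratios = eigvals / eigvals.sum()
    return ratios[:max_components]


def chord_elbow(spectrum: np.ndarray) -> int:
    """1-based position of the point farthest below the first-to-last chord."""
    k = len(spectrum)
    if k < 3:
        # With fewer than three points there is no interior point to pick.
        return k
    x = np.arange(k, dtype=float)
    # Straight line joining the first and last points of the spectrum.
    chord = spectrum[0] + (spectrum[-1] - spectrum[0]) * x / (k - 1)
    # Vertical gap between the chord and the spectrum (positive = below the chord).
    gap = chord - spectrum
    # Assumed convention: report the elbow as a 1-based component number.
    return int(np.argmax(gap)) + 1


if __name__ == "__main__":
    # Synthetic low-rank data: a few strong latent axes plus noise.
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(50, 3)) * 10.0
    loadings = rng.normal(size=(3, 2000))
    data = scores @ loadings + rng.normal(size=(50, 2000))
    print(chord_elbow(explained_variance_from_gram(data)))
```

Because the elbow is an argmax over the gap, rescaling either axis does not move it, so the result does not depend on how many SNPs contribute to the total variance, only on the shape of the spectrum.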

Why

A SNP-count sweep on the 10x WGS Actinemys data (10-fold CV) showed the
projection rank doesn't need to be guessed: the scree elbow is rank 6,
stable across 100k-1M SNPs, and rank 6 matched or beat a hand-picked 128 at
every SNP count (and was clearly better at 1-2M). "auto" makes that elbow the
automatic way to size the projection, with no rank to hand-tune.

Testing

tests/test_pca_init.py adds coverage for pca_components="auto" (it resolves
to an int and sizes the projection layer) and for scree_elbow on synthetic
low-rank data. The full PCA suite (11 tests) plus the train/parallel sanity
suites (18 tests) pass; lint and format checks are clean.
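
The "resolves to an int" behaviour that these tests cover can be pictured with a small sketch. This is not the code in locator/training.py: the dict-based config, the function signature, and the injected elbow_fn callable are all hypothetical, chosen so the example runs without the package. It shows only the contract stated in the summary: None stays off, an explicit int passes through, and "auto" is replaced in the config by a concrete integer so later folds reuse it.

```python
from typing import Callable, Optional

import numpy as np


def resolve_pca_components(
    config: dict,
    train_genotypes: np.ndarray,
    elbow_fn: Callable[[np.ndarray], int],
) -> Optional[int]:
    """Turn config["pca_components"] into None or a concrete int, in place.

    Hypothetical stand-in for the resolver described in the PR summary.
    """
    setting = config.get("pca_components")
    if setting is None:
        # PCA-init stays off; nothing to resolve.
        return None
    if setting == "auto":
        # Compute the elbow on the training split, then write it back so
        # every later fold and the saved metadata see the same integer.
        rank = int(elbow_fn(train_genotypes))
        config["pca_components"] = rank
        return rank
    # Reject bools explicitly because bool is a subclass of int in Python.
    if isinstance(setting, bool) or not isinstance(setting, int) or setting < 1:
        raise ValueError(f"pca_components must be None, 'auto', or a positive int; got {setting!r}")
    return setting
```

Writing the integer back means a second call sees an explicit int and skips the elbow computation, which is one simple way to get the "once per run" behaviour the summary describes.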

…lbow PCA-init previously required choosing the projection width by hand. Setting pca_components="auto" now resolves the width to the genotype-PCA scree elbow -- the chord-distance knee of the explained-variance spectrum -- computed on the training split. The resolved integer is written back to the config, so every fold of a run and the saved model metadata share one rank. scree_elbow() builds the spectrum from the Gram-matrix eigenvalues, the same on-device path as the PCA projection fit. On the 10x WGS Actinemys data the elbow is rank 6, stable across 100k-1M SNPs, and a k-fold sweep found rank 6 matches or beats a hand-picked 128.

andrewkern merged commit 0e68955 into main May 21, 2026
4 checks passed

andrewkern deleted the feature/pca-auto-rank branch May 21, 2026 04:34

andrewkern merged 1 commit into main from feature/pca-auto-rank
