Add pca_components="auto" to pick the projection rank from the scree elbow #54
Merged
…lbow PCA-init previously required choosing the projection width by hand. Setting pca_components="auto" now resolves the width to the genotype-PCA scree elbow -- the chord-distance knee of the explained-variance spectrum -- computed on the training split. The resolved integer is written back to the config, so every fold of a run and the saved model metadata share one rank. scree_elbow() builds the spectrum from the Gram-matrix eigenvalues, the same on-device path as the PCA projection fit. On the 10x WGS Actinemys data the elbow is rank 6, stable across 100k-1M SNPs, and a k-fold sweep found rank 6 matches or beats a hand-picked 128.
Summary
Adds pca_components="auto": instead of hand-picking the PCA-init projection
width, locator resolves it to the genotype-PCA scree elbow of the
training data.
scree_elbow() (locator/pca.py) builds the explained-variance
spectrum from the Gram-matrix eigenvalues -- the same on-device path as the
PCA projection fit -- and returns the chord-distance elbow (the point of the
spectrum farthest below the line joining its first and last components).
_resolve_pca_components() (locator/training.py) resolves "auto" to
a concrete integer on the training split, once per run, and writes it
back into the config so every fold and the saved model metadata share one
rank.
pca_components still accepts None (off) or an explicit int; only
"auto" is new, and the default stays None (PCA remains opt-in).
Why
A SNP-count sweep on the 10x WGS Actinemys data (10-fold CV) showed the
projection rank doesn't need to be guessed: the scree elbow is rank 6,
stable across 100k-1M SNPs, and rank 6 matched or beat a hand-picked 128 at
every SNP count (and was clearly better at 1-2M). "auto" makes that the
default way to size the projection.
Testing
tests/test_pca_init.py adds coverage for pca_components="auto" (resolves
to an int, sizes the projection layer) and for scree_elbow on synthetic
low-rank data. Full PCA suite (11) plus train/parallel sanity suites (18)
pass; lint and format clean.
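For readers who want to see the two mechanisms from the Summary in miniature, here is a rough pure-Python sketch. It is not the code in locator/pca.py or locator/training.py. The names chord_elbow and resolve_pca_components_sketch, the plain dict used as a config, and the 1-based rank convention are all assumptions made for illustration. The Gram-matrix eigenvalue step is omitted, so the sketch takes an already computed explained-variance spectrum as input. It measures the vertical gap below the chord, which picks the same point as the perpendicular distance because the chord is fixed.

```python
def chord_elbow(spectrum):
    """Return the 1-based position of the chord-distance elbow (assumed convention).

    `spectrum` is a sequence of explained variances sorted in descending order.
    """
    values = [float(v) for v in spectrum]
    n = len(values)
    if n < 3:
        # No interior point exists, so there is nothing to choose between.
        return n
    first, last = values[0], values[-1]
    best_index, best_gap = 0, 0.0
    for i, v in enumerate(values):
        # Height of the line joining the first and last components at position i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - v  # positive when the spectrum lies below the chord
        if gap > best_gap:
            best_index, best_gap = i, gap
    # If no point lies below the chord, this falls back to rank 1.
    return best_index + 1


def resolve_pca_components_sketch(config, train_spectrum, elbow_fn=chord_elbow):
    """Resolve config["pca_components"] in place and return the resolved value.

    None keeps PCA-init off, an explicit int is used unchanged, and "auto" is
    replaced by the elbow rank so that later folds reuse the same integer.
    """
    value = config.get("pca_components")
    if value is None:
        return None
    if value == "auto":
        value = int(elbow_fn(train_spectrum))
        config["pca_components"] = value  # written back once per run
        return value
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError("pca_components must be None, 'auto', or a positive int")
    return value


# Hypothetical usage with a made-up spectrum (not the Actinemys data).
config = {"pca_components": "auto"}
toy_spectrum = [10.0, 6.0, 3.0, 0.3, 0.2, 0.1]
rank = resolve_pca_components_sketch(config, toy_spectrum)
print(rank, config)
```

Because the resolved integer replaces "auto" in the config, a second call with the same config returns that integer without recomputing the elbow, which is the property that lets every fold share one rank.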