
Commit 7bbc582

Merge pull request #55 from kr-colab/docs/pca-projection-guide
Add a documentation guide for the PCA model
2 parents 0e68955 + bfd57ec commit 7bbc582

2 files changed

Lines changed: 153 additions & 0 deletions

docs/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,7 @@ Locator is a deep learning-based tool for predicting geographic coordinates from
    usage
    cli
    ensemble_guide
+   pca_guide
    parallel_analysis_guide
    plotting_guide
    na_handling_guide
@@ -24,6 +25,7 @@ Quick Links
 * :doc:`usage` - Basic and advanced usage guide
 * :doc:`cli` - Command-line interface
 * :doc:`ensemble_guide` - Ensemble models and k-fold cross-validation
+* :doc:`pca_guide` - The PCA model for data with many SNPs
 * :doc:`parallel_analysis_guide` - Multi-GPU parallel analysis guide
 * :doc:`plotting_guide` - Visualization and plotting guide
 * :doc:`na_handling_guide` - Guide for handling missing coordinates

docs/source/pca_guide.rst

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
The PCA Model
=============

Locator's networks struggle when the genotype data has many more SNPs than
samples -- for example, whole-genome data with hundreds of thousands or
millions of SNPs but only a few hundred individuals. The first layer of the
network grows with the number of SNPs, so with millions of SNPs it has
hundreds of millions of values to learn. That is far too many for a few
hundred samples: the network memorizes the training samples instead of
learning a real pattern, and its predictions on new samples get *worse* as
more SNPs are added.

The PCA model fixes this. Before the network sees the genotypes, Locator runs
a PCA and keeps only a handful of components. The network then learns from
those few components instead of from millions of raw SNPs, so it stays small
and accurate no matter how many SNPs you give it.

.. contents:: Table of Contents
   :local:
   :depth: 2

When to use it
--------------

Turn the PCA model on when you have many more SNPs than samples -- roughly,
whole-genome data with 100,000 or more SNPs and a few hundred samples. With
fewer SNPs the plain network works well and PCA is not needed. The PCA model's
job is to keep accuracy steady as the SNP count grows into the millions, where
a plain network gets worse.

It works with normal training, holdouts, k-fold and leave-one-out
cross-validation, and ensembles.

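
A quick calculation shows why the SNP count matters. The sketch below assumes
the first layer is a plain fully connected layer, so its weight count is the
number of inputs times the layer width. The width of 256 matches the example
configuration in this guide; the SNP count and the 64 components are
illustrative values, and ``first_layer_weights`` is a helper written for this
example, not part of Locator.

.. code-block:: python

    def first_layer_weights(n_inputs: int, width: int) -> int:
        """Weights in a dense layer mapping n_inputs features to width units.

        Biases are ignored; this is a rough size estimate only.
        """
        return n_inputs * width

    width = 256  # same width as the example config in this guide

    # Dense layer fed one million raw SNPs (illustrative count).
    raw = first_layer_weights(1_000_000, width)

    # Dense layer fed 64 principal components instead (illustrative count).
    reduced = first_layer_weights(64, width)

    print(f"raw SNPs:       {raw:,} weights")
    print(f"PCA components: {reduced:,} weights")

With only a few hundred samples to learn from, the second figure is far
easier to fit than the first.
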
Basic usage
-----------

Let Locator choose the size (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Set ``pca_components`` to ``"auto"`` and Locator decides how many components to
keep, based on the data:

.. code-block:: python

    from locator import Locator

    config = {
        "out": "wgs_analysis",
        "batch_size": 32,
        "width": 256,
        "nlayers": 8,
        "max_epochs": 500,
        "patience": 100,
        "pca_components": "auto",  # let Locator pick the number of components
    }

    locator = Locator(config)
    # Load the genotype matrix and the matching sample IDs from a Zarr store.
    genotypes, samples = locator.load_genotypes(zarr="genotypes.zarr")
    locator.train(genotypes=genotypes, samples=samples)

Locator prints the number it chose at the start of training and stores it, so
every fold of a run uses the same number.

Set the size yourself
~~~~~~~~~~~~~~~~~~~~~~

Pass a number instead to keep exactly that many components:

.. code-block:: python

    config["pca_components"] = 64

Leaving ``pca_components`` out, or setting it to ``None``, turns the PCA model
off. That is the default.

How it works
------------

When ``pca_components`` is set, the genotypes pass through an extra PCA step
before the rest of the network::

    genotype data -> PCA step (keeps a few components) -> rest of the network

A few details:

* **The PCA is run on the training samples only.** Samples held out for
  testing are never used to build it, so cross-validation stays fair.
* **The PCA step starts as an exact PCA.** It is set up to reproduce the PCA
  result exactly, and training is then allowed to adjust it. The rest of the
  network starts from random values, as usual.
* **Training happens in two stages.** In the first stage the PCA step is held
  fixed while the rest of the network learns. In the second stage the PCA step
  is allowed to change too, more slowly, so it can adjust to better predict
  location. Set ``pca_finetune`` to ``False`` to skip the second stage and
  keep the PCA step fixed throughout.

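
The first two points can be illustrated with plain NumPy. This is a minimal
sketch of the general idea, not Locator's implementation: the toy genotype
matrix, the train/test split, and the choice of 8 components are all
assumptions made for the example. It fits a PCA on the training rows only,
writes the projection as a linear layer ``x @ W + b``, and reuses that same
layer for the held-out rows.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy genotype matrix, shape (samples, SNPs), with 0/1/2 allele counts.
    genotypes = rng.integers(0, 3, size=(60, 500)).astype(float)
    train, test = genotypes[:50], genotypes[50:]

    k = 8  # number of components to keep (illustrative)

    # Fit the PCA on the training samples only.
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)

    # A linear layer y = x @ W + b that reproduces the PCA projection exactly.
    W = vt[:k].T      # shape (SNPs, k)
    b = -mean @ W     # shape (k,), folds the centering into the bias

    train_scores = train @ W + b
    test_scores = test @ W + b  # held-out samples reuse the training fit

Because the projection is just a weight matrix and a bias, a network can start
from these values and later adjust them by gradient descent, which is what the
fine-tuning stage described above allows.
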
Choosing how many components to keep
------------------------------------

With ``pca_components="auto"``, Locator looks at how much variation each
component captures. The first few components capture a lot; after that, each
one adds little. Locator keeps components up to the point where the curve
levels off -- the natural cut-off in the data. This is usually a small number,
and in practice it predicts just as well as a much larger hand-picked number.

To see this cut-off yourself before choosing a number:

.. code-block:: python

    from locator.pca import scree_elbow

    # training genotypes, shape (samples, SNPs)
    n_components = scree_elbow(train_genotypes)

The number of components cannot be larger than the number of training samples
or the number of SNPs; a larger value raises a clear error.

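
For intuition about what "levels off" means, here is one common elbow
heuristic written in NumPy. It is an assumption for illustration only:
``scree_elbow`` may use a different rule, and ``explained_variance_ratio`` and
``elbow_index`` are names invented for this sketch. The rule draws a straight
line from the first to the last point of the variance curve and picks the
point that falls farthest below it.

.. code-block:: python

    import numpy as np

    def explained_variance_ratio(x):
        """Fraction of total variance captured by each principal component."""
        centered = x - x.mean(axis=0)
        singular_values = np.linalg.svd(centered, compute_uv=False)
        variances = singular_values ** 2
        return variances / variances.sum()

    def elbow_index(ratios):
        """Components to keep under a max-distance-below-the-chord rule."""
        ratios = np.asarray(ratios, dtype=float)
        n = len(ratios)
        if n < 3:
            return n  # too few points to have an elbow
        positions = np.arange(n)
        # Straight line from the first to the last point of the curve.
        chord = ratios[0] + (ratios[-1] - ratios[0]) * positions / (n - 1)
        # The index of the first "flat" point equals the number kept before it.
        return max(1, int(np.argmax(chord - ratios)))

    # Toy data: 40 samples, 300 SNP-like features, 3 strong hidden factors.
    rng = np.random.default_rng(1)
    latent = rng.normal(size=(40, 3))
    loadings = rng.normal(size=(3, 300))
    toy = latent @ loadings + 0.1 * rng.normal(size=(40, 300))

    ratios = explained_variance_ratio(toy)
    print(len(ratios), elbow_index(ratios))

The curve has one value per component, and there are at most
``min(samples, SNPs)`` components, so a rule of this kind can never ask for
more components than the limit stated above.
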
Settings
--------

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Setting
     - Default
     - What it does
   * - ``pca_components``
     - ``None``
     - Turns the PCA model on or off: ``None`` is off, a number keeps that
       many components, and ``"auto"`` lets Locator choose.
   * - ``pca_finetune``
     - ``True``
     - Whether the second training stage adjusts the PCA step. ``False`` keeps
       the PCA step fixed the whole time.
   * - ``pca_finetune_lr``
     - ``1e-4``
     - How fast the PCA step is allowed to change in the second stage.

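
As a sketch, the three settings can be combined in one configuration like
this. Only keys documented in this guide are used. The value ``1e-5`` is an
illustrative choice, assuming ``pca_finetune_lr`` behaves like an ordinary
learning rate, so a smaller value means slower changes.

.. code-block:: python

    config = {
        "out": "wgs_analysis",
        "pca_components": "auto",  # or an integer such as 64; None turns PCA off
        "pca_finetune": True,      # run the second stage that adjusts the PCA step
        "pca_finetune_lr": 1e-5,   # illustrative: slower than the 1e-4 default
    }
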
When you cannot use it
----------------------

The PCA model does not work with:

* **Bootstrap or jackknife runs.** These resample or reorder the SNPs on every
  replicate, and the PCA step needs a fixed set of SNPs. Passing ``site_order``
  together with ``pca_components`` raises an error.
* **Windowed analysis.** Each window uses its own set of SNPs, so windowed
  runs reject ``pca_components``.

For these, leave ``pca_components`` off.
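
If you build run settings in a script, a small check of your own can catch
these combinations before a long job starts. The function below is a
hypothetical helper, not part of Locator's API, and its argument names are
assumptions; Locator performs its own checks and raises its own errors.

.. code-block:: python

    def check_pca_compatible(pca_components, site_order=None, windowed=False):
        """Hypothetical pre-flight check mirroring the two restrictions above."""
        if pca_components is None:
            return  # PCA model is off, so nothing to check
        if site_order is not None:
            raise ValueError(
                "pca_components cannot be combined with site_order "
                "(bootstrap or jackknife runs)"
            )
        if windowed:
            raise ValueError("pca_components cannot be used in windowed analysis")

    check_pca_compatible("auto")                    # fine: plain training run
    check_pca_compatible(None, site_order=[2, 0, 1])  # fine: PCA model is off
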
