| 1 | +The PCA Model |
| 2 | +============= |
| 3 | + |
| 4 | +Locator's networks struggle when the genotype data has many more SNPs than |
| 5 | +samples -- for example, whole-genome data with hundreds of thousands or |
| 6 | +millions of SNPs but only a few hundred individuals. The first layer of the |
| 7 | +network grows with the number of SNPs, so with millions of SNPs it has |
| 8 | +hundreds of millions of values to learn. That is far too many for a few |
| 9 | +hundred samples: the network memorizes the training samples instead of |
| 10 | +learning a real pattern, and its predictions on new samples get *worse* as |
| 11 | +more SNPs are added. |
| 12 | + |
| 13 | +The PCA model fixes this. Before the network sees the genotypes, Locator runs |
| 14 | +a PCA and keeps only a handful of components. The network then learns from |
| 15 | +those few components instead of from millions of raw SNPs, so it stays small |
| 16 | +and accurate no matter how many SNPs you give it. |
| 17 | + |
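To make the scale concrete, the sketch below counts first-layer weights with
and without the PCA step. It assumes a plain dense first layer with ``width``
units and ignores bias terms; ``first_layer_weights`` is a hypothetical helper
and the numbers are illustrative arithmetic, not measurements from Locator.

.. code-block:: python

    def first_layer_weights(n_inputs, width):
        """Weights in a dense layer mapping n_inputs values to width units (no bias)."""
        return n_inputs * width


    n_snps = 1_000_000   # whole-genome scale
    width = 256          # same width as the usage example in this page
    n_components = 64    # illustrative number of PCA components

    plain = first_layer_weights(n_snps, width)
    with_pca = first_layer_weights(n_components, width)

    print(f"plain network: {plain:,} first-layer weights")
    print(f"after PCA:     {with_pca:,} first-layer weights")

The PCA step itself still holds one value per SNP per component, but those
values start from the PCA solution rather than from random values.
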
| 18 | +.. contents:: Table of Contents |
| 19 | +   :local: |
| 20 | +   :depth: 2 |
| 21 | + |
| 22 | +When to use it |
| 23 | +-------------- |
| 24 | + |
| 25 | +Turn the PCA model on when you have many more SNPs than samples -- roughly, |
| 26 | +whole-genome data with 100,000 or more SNPs and a few hundred samples. With |
| 27 | +fewer SNPs the plain network works well and PCA is not needed. The PCA model's |
| 28 | +job is to keep accuracy steady as the SNP count grows into the millions, where |
| 29 | +a plain network gets worse. |
| 30 | + |
| 31 | +It works with normal training, holdouts, k-fold and leave-one-out |
| 32 | +cross-validation, and ensembles. |
| 33 | + |
| 34 | +Basic usage |
| 35 | +----------- |
| 36 | + |
| 37 | +Let Locator choose the size (recommended) |
| 38 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 39 | + |
| 40 | +Set ``pca_components`` to ``"auto"`` and Locator decides how many components to |
| 41 | +keep, based on the data: |
| 42 | + |
| 43 | +.. code-block:: python |
| 44 | + |
| 45 | +    from locator import Locator |
| 46 | + |
| 47 | +    config = { |
| 48 | +        "out": "wgs_analysis", |
| 49 | +        "batch_size": 32, |
| 50 | +        "width": 256, |
| 51 | +        "nlayers": 8, |
| 52 | +        "max_epochs": 500, |
| 53 | +        "patience": 100, |
| 54 | +        "pca_components": "auto", |
| 55 | +    } |
| 56 | + |
| 57 | +    locator = Locator(config) |
| 58 | +    genotypes, samples = locator.load_genotypes(zarr="genotypes.zarr") |
| 59 | +    locator.train(genotypes=genotypes, samples=samples) |
| 60 | + |
| 61 | +Locator prints the number it chose at the start of training and stores it, so |
| 62 | +every fold of a run uses the same number. |
| 63 | + |
| 64 | +Set the size yourself |
| 65 | +~~~~~~~~~~~~~~~~~~~~~~ |
| 66 | + |
| 67 | +Pass a number instead to keep exactly that many components: |
| 68 | + |
| 69 | +.. code-block:: python |
| 70 | + |
| 71 | +    config["pca_components"] = 64 |
| 72 | + |
| 73 | +Leaving ``pca_components`` out (or setting it to ``None``) turns the PCA model off. |
| 74 | +That is the default. |
| 75 | + |
| 76 | +How it works |
| 77 | +------------ |
| 78 | + |
| 79 | +When ``pca_components`` is set, the genotypes pass through an extra PCA step |
| 80 | +before the rest of the network:: |
| 81 | + |
| 82 | +    genotype data -> PCA step (keeps a few components) -> rest of the network |
| 83 | + |
| 84 | +A few details: |
| 85 | + |
| 86 | +* **The PCA is run on the training samples only.** Samples held out for |
| 87 | +  testing are never used to build it, so cross-validation stays fair. |
| 88 | +* **The PCA step starts as an exact PCA.** It is set up to reproduce the PCA |
| 89 | +  result exactly, and training is then allowed to adjust it. The rest of the |
| 90 | +  network starts from random values, as usual. |
| 91 | +* **Training happens in two stages.** In the first stage the PCA step is held |
| 92 | +  fixed while the rest of the network learns. In the second stage the PCA step |
| 93 | +  is allowed to change too, more slowly, so it can adjust to better predict |
| 94 | +  location. Set ``pca_finetune`` to ``False`` to skip the second stage and |
| 95 | +  keep the PCA step fixed throughout. |
| 96 | + |
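The three points above can be illustrated with a small NumPy sketch. This is
not Locator's implementation: it assumes a toy model with a linear PCA step
followed by a single linear layer standing in for the rest of the network,
trained by plain gradient descent on a squared-error loss. All names
(``projection``, ``head``, ``head_lr``, ``finetune_lr``) and the random data
are hypothetical.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: 40 samples x 500 SNPs (0/1/2 genotype calls), 2-D locations.
    genotypes = rng.integers(0, 3, size=(40, 500)).astype(float)
    locations = rng.normal(size=(40, 2))
    train_x, test_x = genotypes[:30], genotypes[30:]
    train_y = locations[:30]

    # 1. Fit the PCA on the training samples only.
    n_components = 5
    mean = train_x.mean(axis=0)
    _, _, vt = np.linalg.svd(train_x - mean, full_matrices=False)

    # 2. Start the PCA step as an exact PCA: a linear map whose columns are
    #    the top principal axes. The head starts from random values.
    projection = vt[:n_components].T.copy()   # shape (n_snps, n_components)
    pca_init = projection.copy()
    head = rng.normal(scale=0.01, size=(n_components, 2))


    def loss_and_grads(x, y, projection, head):
        """Squared-error loss and its gradients for the two weight matrices."""
        centered = x - mean                   # always the training mean
        scores = centered @ projection
        error = scores @ head - y
        loss = np.mean(np.sum(error**2, axis=1))
        grad_head = 2.0 / len(x) * scores.T @ error
        grad_projection = 2.0 / len(x) * centered.T @ (error @ head.T)
        return loss, grad_head, grad_projection


    head_lr = 0.01
    finetune_lr = 1e-4   # slower rate for the PCA step in stage 2

    initial_loss, _, _ = loss_and_grads(train_x, train_y, projection, head)

    # 3a. Stage 1: the PCA step is held fixed, only the head learns.
    for _ in range(200):
        _, grad_head, _ = loss_and_grads(train_x, train_y, projection, head)
        head -= head_lr * grad_head
    stage1_loss, _, _ = loss_and_grads(train_x, train_y, projection, head)
    unchanged_after_stage1 = np.array_equal(projection, pca_init)

    # 3b. Stage 2: the PCA step is allowed to change too, more slowly.
    for _ in range(200):
        _, grad_head, grad_projection = loss_and_grads(
            train_x, train_y, projection, head
        )
        head -= head_lr * grad_head
        projection -= finetune_lr * grad_projection

    # Held-out samples are projected with the training mean and weights.
    test_scores = (test_x - mean) @ projection

In this sketch, skipping the second loop corresponds to ``pca_finetune=False``,
and ``finetune_lr`` plays the role of ``pca_finetune_lr``.
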
| 97 | +Choosing how many components to keep |
| 98 | +------------------------------------ |
| 99 | + |
| 100 | +With ``pca_components="auto"``, Locator looks at how much variation each |
| 101 | +component captures. The first few components capture a lot; after that, each |
| 102 | +one adds little. Locator keeps components up to the point where the curve |
| 103 | +levels off -- the natural cut-off in the data. This is usually a small number, |
| 104 | +and in practice it predicts just as well as a much larger hand-picked number. |
| 105 | + |
| 106 | +To see this cut-off yourself before choosing a number: |
| 107 | + |
| 108 | +.. code-block:: python |
| 109 | + |
| 110 | +    from locator.pca import scree_elbow |
| 111 | + |
| 112 | +    # train_genotypes: the training genotypes, shape (samples, SNPs) |
| 113 | +    n_components = scree_elbow(train_genotypes) |
| 114 | + |
| 115 | +The number of components cannot be larger than the number of training samples |
| 116 | +or the number of SNPs; a larger value raises a clear error. |
| 117 | + |
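For intuition about the levelling-off rule and the size limit, here is a
generic sketch. It is not how ``scree_elbow`` is implemented, which this page
does not describe: it uses one common heuristic (the point where the variance
curve sags furthest below the straight line joining its ends), and both
``elbow_components`` and ``check_components`` are hypothetical helpers.

.. code-block:: python

    import numpy as np


    def elbow_components(explained_variance):
        """Return how many components to keep, up to and including the elbow."""
        variance = np.asarray(explained_variance, dtype=float)
        positions = np.arange(len(variance))
        # Straight line (chord) from the first point of the curve to the last.
        slope = (variance[-1] - variance[0]) / (len(variance) - 1)
        chord = variance[0] + slope * positions
        # The elbow is where the curve sags furthest below the chord.
        return int(np.argmax(chord - variance)) + 1


    def check_components(n_components, n_train_samples, n_snps):
        """Raise ValueError if n_components exceeds what the data can support."""
        limit = min(n_train_samples, n_snps)
        if n_components > limit:
            raise ValueError(
                f"n_components={n_components} exceeds min(samples, SNPs)={limit}"
            )
        return n_components


    # Made-up variances: a steep drop over the first few, then a flat tail.
    variance = [10.0, 6.0, 3.0, 1.0, 0.9, 0.8, 0.7, 0.6]
    kept = elbow_components(variance)
    check_components(kept, n_train_samples=300, n_snps=1_000_000)
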
| 118 | +Settings |
| 119 | +-------- |
| 120 | + |
| 121 | +.. list-table:: |
| 122 | +   :header-rows: 1 |
| 123 | +   :widths: 25 15 60 |
| 124 | + |
| 125 | +   * - Setting |
| 126 | +     - Default |
| 127 | +     - What it does |
| 128 | +   * - ``pca_components`` |
| 129 | +     - ``None`` |
| 130 | +     - Turns the PCA model on or off: ``None`` is off, a number keeps that |
| 131 | +       many components, and ``"auto"`` lets Locator choose. |
| 132 | +   * - ``pca_finetune`` |
| 133 | +     - ``True`` |
| 134 | +     - Whether the second training stage adjusts the PCA step. ``False`` keeps |
| 135 | +       the PCA step fixed the whole time. |
| 136 | +   * - ``pca_finetune_lr`` |
| 137 | +     - ``1e-4`` |
| 138 | +     - How fast the PCA step is allowed to change in the second stage. |
| 139 | + |
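As a sketch, assuming the two fine-tuning settings are read from the same
config dictionary as ``pca_components`` (the table does not say so
explicitly), a configuration that spells out all three might look like this.
The values for ``pca_finetune`` and ``pca_finetune_lr`` are the defaults
listed above.

.. code-block:: python

    config = {
        "out": "wgs_analysis",
        "pca_components": "auto",   # let Locator choose the number of components
        "pca_finetune": True,       # run the second training stage
        "pca_finetune_lr": 1e-4,    # how fast the PCA step may change in stage 2
    }
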
| 140 | +When you cannot use it |
| 141 | +---------------------- |
| 142 | + |
| 143 | +The PCA model does not work with: |
| 144 | + |
| 145 | +* **Bootstrap or jackknife runs.** These resample or reorder the SNPs on every |
| 146 | +  replicate, and the PCA step needs a fixed set of SNPs. Passing ``site_order`` |
| 147 | +  together with ``pca_components`` raises an error. |
| 148 | +* **Windowed analysis.** Each window uses its own set of SNPs, so windowed |
| 149 | +  runs reject ``pca_components``. |
| 150 | + |
| 151 | +For these, leave ``pca_components`` off. |