Add pyani subsample command #135

Open
@widdowquinn

Description

We could have a pyani subsample subcommand that can populate a new input directory of genomes. This would be useful when the total number of available genomes for analysis is large.

The kind of structure that the command could take would be along the lines of:

  • pyani subsample - basic subcommand
  • -n --num_genomes - total number of genomes
  • --balance_classes - if not set, the genomes are selected randomly from those in the input directory; if set, then an attempt is made to balance each class.
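A minimal sketch of how the options listed above could be wired up with `argparse` subparsers. This is not pyani's actual parser code: the function name `build_parser` is hypothetical, only the two options named above are included, and input/output directory arguments are left out because they are not specified here.

```python
import argparse


def build_parser():
    """Build a toy parser with a `subsample` subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyani")
    subparsers = parser.add_subparsers(dest="command")

    # Hypothetical subcommand; option names are taken from the list above
    sub = subparsers.add_parser(
        "subsample", help="populate a new input directory with a subsample"
    )
    sub.add_argument(
        "-n",
        "--num_genomes",
        type=int,
        required=True,
        help="total number of genomes to subsample",
    )
    sub.add_argument(
        "--balance_classes",
        action="store_true",
        help="attempt to balance classes instead of sampling at random",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["subsample", "-n", "50"])
    print(args.command, args.num_genomes, args.balance_classes)
```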

The way balancing might work is as follows: say there are 200 genomes, and you want to subsample 50. If there are two classes with 100 members each, we'd want 25 from each - a random sampling within each class would be fine. But if there are two classes with 190 and 10 members, we could only balance up to 20 genomes (10 from the group with 10, 10 from the group with 190) - so we'd either have to return all 50 and warn that the outcome was unbalanced, or return only the 20 genomes we can balance (10 randomly selected from each class). So we might want another argument:

  • --enforce_balance - which enforces equal numbers from each class. So if there are $k$ classes where the smallest class has $m$ members, the total number of genomes subsampled is $k \times m$.
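To make the $k \times m$ arithmetic concrete, here is a small sketch applied to the two 200-genome scenarios described above. The function name `enforced_balance_total` and the dictionary layout (class label mapped to class size) are assumptions made for illustration, not part of pyani.

```python
def enforced_balance_total(class_sizes):
    """Return k * m: number of classes times the smallest class size.

    class_sizes: dict mapping a class label to its number of genomes
    (hypothetical layout, for illustration only).
    """
    k = len(class_sizes)
    m = min(class_sizes.values())
    return k * m


# Two classes of 100: up to 100 per class could be balanced
even = enforced_balance_total({"A": 100, "B": 100})

# Classes of 190 and 10: the small class limits us to 10 per class
skewed = enforced_balance_total({"A": 190, "B": 10})

print(even, skewed)
```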

This would provide three ways of getting a subsample of size $n$ from the original set:

  1. randomly subsample $n$ genomes (and hope you cover all your classes)
  2. make a best effort to balance classes, recognising that some classes may be poorly represented; this could be implemented as sampling without replacement from each class in turn until we reach $n$. To avoid systematic bias (and to avoid having to require $n > k$ where there are $k$ classes) we should shuffle the order of class-sampling at each round.
  3. enforce balancing: select the nearest multiple of $k$ ($pk$) to $n$ which is less than or equal to $k \times m$ (where $m$ is the smallest class size), and randomly subsample $p$ genomes within each class.
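The three approaches above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the function names are hypothetical, classes are assumed to arrive as a dict mapping a class label to a list of genome identifiers (how pyani would obtain class labels is not specified here), and "nearest multiple" in option 3 is interpreted as rounding $n / k$ to the nearest integer before capping at $m$.

```python
import random


def subsample_random(classes, n, rng):
    """Option 1: ignore classes and draw n genomes at random."""
    all_genomes = [g for members in classes.values() for g in members]
    return rng.sample(all_genomes, n)


def subsample_best_effort(classes, n, rng):
    """Option 2: take one genome per class per round until n are chosen."""
    # Shuffled copy of each class, so pop() yields a random unused member
    pools = {
        label: rng.sample(members, len(members))
        for label, members in classes.items()
    }
    chosen = []
    while len(chosen) < n and any(pools.values()):
        # Only classes with genomes left take part in this round
        order = [label for label, pool in pools.items() if pool]
        rng.shuffle(order)  # new class order each round to avoid bias
        for label in order:
            if len(chosen) == n:
                break
            chosen.append(pools[label].pop())
    return chosen


def subsample_enforced(classes, n, rng):
    """Option 3: draw the same number p of genomes from every class."""
    k = len(classes)
    m = min(len(members) for members in classes.values())
    # Round n / k to the nearest integer (halves up), then cap at m
    p = min(m, (2 * n + k) // (2 * k))
    chosen = []
    for members in classes.values():
        chosen.extend(rng.sample(members, p))
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    skewed = {
        "A": [f"A{i}" for i in range(190)],
        "B": [f"B{i}" for i in range(10)],
    }
    for func in (subsample_random, subsample_best_effort, subsample_enforced):
        picked = func(skewed, 50, rng)
        n_small = sum(g.startswith("B") for g in picked)
        print(func.__name__, len(picked), "from class B:", n_small)
```

With the skewed 190/10 example, the best-effort sampler exhausts the small class after ten rounds and fills the remainder from the large class, while the enforced sampler returns fewer genomes than requested. Either case is where a warning to the user would be appropriate.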

Labels

  • enhancement - something we'd like pyani to do that it doesn't already
  • interface - issues related to how the user tells pyani to do something
  • method - the issue relates to how results are calculated
