SelectIntegrationFeatures - ordering of features deemed variable in more than tie.val many datasets

Hi! Thanks for this nice package and your activity here!

I have a question regarding `SelectIntegrationFeatures`:

The [reference](https://satijalab.org/seurat/reference/selectintegrationfeatures) reads
> Choose the features to use when integrating multiple datasets. This function ranks features by the number of datasets they are deemed variable in, breaking ties by the median variable feature rank across datasets. It returns the top scoring features by this ranking.

matching the description on pages e3 and e4 of [your paper](https://www.cell.com/cell/pdf/S0092-8674(19)30559-8.pdf).

When I wanted to check the code for `SelectIntegrationFeatures`, I got the impression that the following is done:
1. Compute variable features per dataset [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L2977C5-L2977C5)
2. Sort genes by number-of-datasets-variable [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L2981)
3. Choose the threshold number (`tie.val`) of number-of-datasets-variable [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L2985) 
4. Select all "safe" genes (`features`) which have number-of-datasets-variable > `tie.val` [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L2986)
5. Order all of these "safe" genes by median rank [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L2998)
6. Compute median rank for genes that have number-of-datasets-variable == `tie.val` [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L3001)
7. Use the top median rank features from 6. to fill up the "safe" genes up to `nfeatures` [here](https://github.com/satijalab/seurat/blob/41d19a8a55350bff444340d6ae7d7e03417d4173/R/integration.R#L3010)
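
To make sure I am reading the steps the way I think I am, here is a minimal Python sketch of the procedure as I listed it. It is not the Seurat R code: the function names and the toy input are made up for illustration, and I am assuming that a gene's rank is its 1-based position in each dataset's variable-feature list, with the median taken only over the datasets in which the gene is variable.

```python
from collections import Counter
from statistics import median


def median_rank(gene, var_features):
    # Assumption: rank = 1-based position in a dataset's variable-feature list,
    # with the median taken only over datasets where the gene is variable.
    return median(feats.index(gene) + 1 for feats in var_features if gene in feats)


def select_features_as_described(var_features, nfeatures):
    """var_features: one list of unique genes per dataset, most variable first."""
    # Step 2: count the datasets each gene is variable in, sort by that count.
    counts = Counter(g for feats in var_features for g in feats)
    by_count = sorted(counts, key=lambda g: -counts[g])
    # Step 3: the count of the nfeatures-th gene is the tie threshold.
    tie_val = counts[by_count[min(nfeatures, len(by_count)) - 1]]
    # Steps 4 and 5: "safe" genes exceed the threshold; order them by median rank only.
    safe = [g for g in by_count if counts[g] > tie_val]
    safe.sort(key=lambda g: median_rank(g, var_features))
    # Steps 6 and 7: genes at the threshold fill the remaining slots by median rank.
    tied = [g for g in by_count if counts[g] == tie_val]
    tied.sort(key=lambda g: median_rank(g, var_features))
    return safe + tied[: nfeatures - len(safe)]


# Hypothetical toy input: three datasets, four variable genes each.
toy = [
    ["B", "A", "C", "D"],
    ["B", "A", "E", "C"],
    ["F", "C", "G", "A"],
]
selected = select_features_as_described(toy, nfeatures=4)
print(selected)  # ['B', 'A', 'C', 'F']
```

In this toy input, `B` is variable in only two datasets but ends up ahead of `A` and `C`, which are variable in all three, because the "safe" genes are sorted by median rank alone.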

This does indeed, as the documentation says, return the top scoring features by this ranking.
However, if I have laid this out correctly, the "safe" genes are not ordered by number-of-datasets-variable first with median rank breaking ties; they are ordered by median rank alone.
I am not sure whether users care about this ordering as long as the top `nfeatures` genes are selected, but to me it was unexpected.
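
As a small illustration of the two orderings, here is a Python sketch with made-up values (the gene names, counts, and median ranks are hypothetical, not taken from any real dataset). It sorts the same three genes once by median rank only, which is what I believe the code does, and once by number-of-datasets-variable first with median rank as the tie-breaker, which is how I read the documentation.

```python
# Hypothetical toy values: gene -> (number of datasets it is variable in, median rank)
safe = {"A": (3, 2), "C": (3, 3), "B": (2, 1)}

# Ordering by median rank only (my reading of the current code).
by_median_only = sorted(safe, key=lambda g: safe[g][1])

# Ordering by dataset count (descending), then median rank (my reading of the docs).
by_count_then_median = sorted(safe, key=lambda g: (-safe[g][0], safe[g][1]))

print(by_median_only)        # ['B', 'A', 'C']
print(by_count_then_median)  # ['A', 'C', 'B']
```

The selected set is the same in both cases; only the position of `B` within it changes.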

Since I am not particularly competent in R, I would like to ask:
Is this observation correct?

Many thanks!
