Description
The section "Identification of marker genes" from the paper here

Looks like they do something a bit involved. They z-score the TPM matrix (per gene) to get `z_tpm_ng`. Then they compute the (unnormalized consensus) usage matrix `alpha_nk` and fit the model

`z_tpm_ng ~ beta_kg * alpha_nk`

using OLS regression and interpret `beta_kg` as the association between gene `g` and program `k`. By using z-scored TPMs `z_tpm_ng`, they say that `beta_kg` can then be interpreted as

> by how many standard deviations the expression of gene [g] should increase for an additional count of usage being attributed to GEP k. We regress against z-scored expression values rather than the un-normalized expression values so that the coefficients will be comparable between genes expressed on different scales
Otherwise I've noticed that highly expressed genes (MALAT1, ribosomal genes, etc.) get priority in pretty much all the programs...
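
For reference, here is a minimal sketch of how I read that procedure, not the paper's actual code. The function name `marker_gene_scores`, the dense NumPy inputs, the lack of an intercept term (the model as written above has none), and the handling of zero-variance genes are all my own assumptions.

```python
import numpy as np


def marker_gene_scores(tpm, usage):
    """Regress per-gene z-scored TPM on program usage with OLS.

    tpm:   (n_cells, n_genes) TPM matrix.
    usage: (n_cells, n_programs) unnormalized usage matrix (alpha_nk).
    Returns beta with shape (n_programs, n_genes), i.e. beta_kg.
    """
    tpm = np.asarray(tpm, dtype=float)
    usage = np.asarray(usage, dtype=float)

    # Z-score each gene (column) across cells.
    mean = tpm.mean(axis=0)
    std = tpm.std(axis=0)
    # Assumption: genes with zero variance keep a z-score of 0
    # instead of producing a division by zero.
    std[std == 0] = 1.0
    z_tpm = (tpm - mean) / std

    # Solve z_tpm ~ usage @ beta for all genes at once (least squares).
    # Assumption: no intercept column is added.
    beta, *_ = np.linalg.lstsq(usage, z_tpm, rcond=None)
    return beta
```

Ranking genes within a program would then be something like `np.argsort(-beta[k])`. Because each gene is z-scored first, rescaling a gene's TPM by a positive constant leaves its coefficients unchanged, which is what should keep highly expressed genes from dominating every program.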