Review on using correlation as standard metric #253
@XingerTang good that you are questioning the metrics! Which metric is “best” depends on the downstream use case. I am not at a computer so I can’t check, but are you keeping 9 in the vectors when calculating correlations? They should be omitted because they denote a missing value, not an actual value.

One typical downstream use case of imputed genotypes is genomic prediction, where we aim to maximise cor(est, true), with est a vector of estimated genetic/breeding values and true a vector of true genetic/breeding values. Here we want to get the ranking of individuals as right as possible so that follow-up decisions will be as correct as possible. Assume that est = Ma and true = Qb, with M a matrix of marker genotypes and a their effects, and Q a matrix of QTL genotypes and b their effects. We don’t know Q or b in real life, but the idea is that there is some correlation between M and Q, due to shared haplotype structure, giving cor(Ma, Qb) > 0. We often have incomplete M, which is where M_imp comes in, so we are actually looking at cor(M_imp a, Qb).

Given this setting, which metric would you recommend? There are a couple of publications that have looked into this:
Currently, we are using correlation as the standard metric to evaluate the accuracy of the output in all accuracy tests and some examples. But sometimes it may not provide the information that best describes the accuracy of the result, especially for discrete outputs.
So far, the covariance is calculated via
1. A more accurate prediction does not necessarily get a higher correlation value
- real = [0, 1, 1, 0], output = [0, 0, 1, 1]: two loci correct (loci 1, 3), two loci wrong (loci 2, 4); the corresponding corr value is 0.
- real = [0, 1, 1, 0], output = [0, 9, 1, 1]: two loci correct (loci 1, 3), one locus wrong (locus 4), and one locus uncalled (locus 2); the corresponding corr value is 0.61958555.
- real = [0, 1, 1, 0], output = [0, 1, 1, 1]: three loci correct (loci 1, 2, 3), one locus wrong (locus 4); the corresponding corr value is 0.57735027.
- real = [0, 1, 1, 0], output = [9, 1, 1, 1]: two loci correct (loci 2, 3), one locus wrong (locus 4), and one locus uncalled (locus 1); the corresponding corr value is -0.57735027.
- real = [0, 1, 1, 0], output = [0, 1, 1, 9]: three loci correct (loci 1, 2, 3), one locus uncalled (locus 4); the corresponding corr value is -0.48189987.

I'm going to rank the 5 cases with a rule that assigns each locus a score of +1, 0, or -1.
From highest to lowest, we have:
But if we rank them by the corr value, we have:
We can see that:
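The five correlation values listed above can be reproduced with a short sketch. It assumes NumPy's `np.corrcoef` with the 9 left in the vector, which is what the quoted numbers imply. The scoring function is also an assumption: it reads the +1, 0, -1 rule as +1 for a correct locus, 0 for an uncalled locus, and -1 for a wrong locus.

```python
import numpy as np

REAL = [0, 1, 1, 0]
OUTPUTS = [
    [0, 0, 1, 1],
    [0, 9, 1, 1],
    [0, 1, 1, 1],
    [9, 1, 1, 1],
    [0, 1, 1, 9],
]


def corr(real, output):
    """Pearson correlation with 9 kept as if it were a genotype value."""
    return float(np.corrcoef(real, output)[0, 1])


def score(real, output, missing=9):
    """ASSUMED rule: +1 per correct locus, 0 per uncalled locus, -1 per wrong locus."""
    total = 0
    for r, o in zip(real, output):
        if o == missing:
            continue  # uncalled loci contribute 0
        total += 1 if r == o else -1
    return total


corrs = [corr(REAL, o) for o in OUTPUTS]
scores = [score(REAL, o) for o in OUTPUTS]
for o, c, s in zip(OUTPUTS, corrs, scores):
    print(o, round(c, 8), s)
```

Under this assumed rule, the last case (three loci correct, one uncalled) gets the highest score of the five while its correlation is negative.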
This situation does not change with longer arrays.
- real = [0, 1, 1, 0, 1, 1], output = [0, 1, 1, 9, 1, 1]: five loci correct (loci 1, 2, 3, 5, 6), one locus uncalled (locus 4); the corresponding corr value is -0.53608771.
- real = [0, 1, 1, 0, 1, 1], output = [0, 1, 1, 1, 1, 1]: five loci correct (loci 1, 2, 3, 5, 6), one locus wrong (locus 4); the corresponding corr value is 0.63245553.
- real = [0, 1, 1, 0, 1, 1], output = [0, 1, 1, 0, 1, 0]: five loci correct (loci 1, 2, 3, 4, 5), one locus wrong (locus 6); the corresponding corr value is 0.70710678.

If we rank by the score defined above, we have
If we rank by the corr value,
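For the six-locus cases, the two rankings can be compared with a sketch like the following. As before, the per-locus rule (+1 correct, 0 uncalled, -1 wrong) is an assumed reading of the scoring rule, the dictionary labels are made up for this example, and the correlation keeps the 9 in the vector to match the values quoted above.

```python
import numpy as np

real = [0, 1, 1, 0, 1, 1]
outputs = {  # labels are illustrative only
    "one uncalled (locus 4)": [0, 1, 1, 9, 1, 1],
    "one wrong (locus 4)": [0, 1, 1, 1, 1, 1],
    "one wrong (locus 6)": [0, 1, 1, 0, 1, 0],
}


def corr(output):
    """Pearson correlation against `real`, with 9 kept in the vector."""
    return float(np.corrcoef(real, output)[0, 1])


def score(output, missing=9):
    """ASSUMED rule: +1 correct, 0 uncalled, -1 wrong, summed over loci."""
    return sum(0 if o == missing else (1 if r == o else -1)
               for r, o in zip(real, output))


corrs = {name: corr(out) for name, out in outputs.items()}
scores = {name: score(out) for name, out in outputs.items()}

# Sort labels from best to worst under each metric.
rank_by_score = sorted(outputs, key=scores.get, reverse=True)
rank_by_corr = sorted(outputs, key=corrs.get, reverse=True)
print(rank_by_score)
print(rank_by_corr)
```

With these inputs the uncalled case comes first under the assumed score and last under the correlation, which is the same mismatch as in the four-locus cases.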
2. For very large datasets, the difference in correlation does not reflect the size of the difference in the actual output
For one example dataset, we have two outputs, each evaluated by both the absolute difference between the output and the true values and the correlation value.
Output 1:
Output 2:
The difference in the average correlation is much smaller than the difference in the abs diff.
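The same effect can be illustrated on simulated data. The sketch below does not use the dataset or outputs from this post: the genotype coding (0, 1, 2), the number of loci, the error rates, and the helper `corrupt` are all assumptions chosen only to show how the two metrics can move on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LOCI = 100_000  # assumed size, chosen only to make the example "large"

# Simulated true genotypes coded 0, 1, 2 (an assumption for this sketch).
true = rng.integers(0, 3, size=N_LOCI)


def corrupt(true, error_rate, rng):
    """Return a copy of `true` with roughly `error_rate` of loci set to a wrong value."""
    out = true.copy()
    wrong = rng.random(true.size) < error_rate
    shift = rng.integers(1, 3, size=true.size)  # 1 or 2, so the new value always differs
    out[wrong] = (true[wrong] + shift[wrong]) % 3
    return out


def mean_abs_diff(true, output):
    return float(np.mean(np.abs(output - true)))


def corr(true, output):
    return float(np.corrcoef(true, output)[0, 1])


output_a = corrupt(true, 0.01, rng)  # about 1% of loci wrong
output_b = corrupt(true, 0.02, rng)  # about 2% of loci wrong

mad_a, mad_b = mean_abs_diff(true, output_a), mean_abs_diff(true, output_b)
corr_a, corr_b = corr(true, output_a), corr(true, output_b)
print("mean abs diff:", mad_a, mad_b)
print("correlation:  ", corr_a, corr_b)
```

In this setup the mean absolute difference should roughly double between the two outputs, while the correlation should change only in the second decimal place.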
Conclusion
We may need to change our accuracy evaluation metric.