|
1 | 1 | ## Introduction |
2 | 2 |
|
3 | | -New technologies have vastly improved data collection, generating a deluge of information across different disciplines. |
4 | | -This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns. |
5 | | -Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971]. |
6 | | -Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109]. |
7 | | -Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976]. |
8 | | -The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas. |
9 | | -Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research. |
| 3 | +The deluge of data generated by new technologies has opened up new opportunities for addressing unanswered scientific questions. |
| 4 | +To take advantage of these data, we need efficient tools that can identify multiple types of underlying patterns. |
| 5 | +Correlation analysis is a key statistical technique for understanding relationships between variables [@pmid:21310971]. |
| 6 | +Correlation coefficients are often used to measure similarity between pairs of objects, such as genes [@pmid:27479844] or lifestyle factors [@doi:10.1073/pnas.1217269109], and are employed in exploratory data mining techniques like clustering and community detection. |
| 7 | +Furthermore, they are used in supervised tasks like feature selection, which can improve prediction accuracy [@pmid:27006077; @pmid:33729976]. |
| 8 | +The Pearson correlation coefficient is widely used in many different application domains and scientific areas. |
| 9 | +Therefore, even small improvements to this technique could have a big impact on industry and research. |
10 | 10 |
|
11 | 11 |
|
12 | | -In transcriptomics, many analyses start with estimating the correlation between genes. |
13 | | -More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540]. |
14 | | -The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573]. |
15 | | -Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003]. |
16 | | -In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342]. |
17 | | -These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks. |
18 | | -Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field. |
| 12 | +In transcriptomics, many analyses begin by estimating the correlation between genes. |
| 13 | +This correlation can be used to suggest gene function [@pmid:21241896], discover common and cell lineage-specific regulatory networks [@pmid:25915600], and uncover important interactions in a living organism [@pmid:21606319; @pmid:16968540]. |
| 14 | +Large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573]. |
| 15 | +Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships have become increasingly important in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], including in the field of polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003]. |
| 16 | +Recent approaches have combined disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes that directly affect diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342]. |
| 17 | +These core genes are not identified by standard statistical methods but are believed to form highly interconnected, disease-relevant regulatory networks. |
| 18 | +Therefore, advanced correlation coefficients could be applied across many areas of biology, including the prioritization of candidate drug targets in precision medicine. |
19 | 19 |
|
20 | 20 |
|
21 | | -The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly. |
22 | | -However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships. |
23 | | -Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. |
24 | | -MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. |
25 | | -However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. |
26 | | -Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855]. |
27 | | -We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. |
28 | | -Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables. |
| 21 | +The Pearson and Spearman correlation coefficients are widely used because they can quickly reveal linear or monotonic relationships. |
| 22 | +However, they may miss more complex yet critical patterns. |
| 23 | +To capture nonlinear relationships, researchers have proposed metrics such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. |
| 24 | +MIC has been applied successfully across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077], but its computational complexity makes it impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. |
| 25 | +We previously developed a clustering method that outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. |
| 26 | +Here we introduce the Clustermatch Correlation Coefficient (CCC), a not-only-linear coefficient that works for both quantitative and qualitative variables. |
29 | 27 | CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time. |
30 | | -CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships. |
31 | | -We also provide an efficient CCC implementation that is highly parallelizable, allowing to speed up computation across variable pairs with millions of objects or conditions. |
| 28 | +We provide an efficient CCC implementation that is highly parallelizable, allowing for faster computation across variable pairs with millions of objects or conditions. |
32 | 29 | To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776]. |
33 | 30 | CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients. |
34 | 31 | For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples. |
35 | | -We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute. |
| 32 | +We also found that CCC behaves similarly to MIC in several cases, although it is much faster to compute. |
36 | 33 | Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259]. |
37 | | -Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing to analyze large and heterogeneous repositories. |
| 34 | +Additionally, its ability to efficiently handle numerical and categorical features reduces preprocessing steps and makes it suitable for analyzing large and heterogeneous repositories. |
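
The claim in this paragraph that Pearson and Spearman may miss complex patterns can be illustrated with a small sketch. It is not part of the manuscript and does not implement CCC; the helper names (`pearson`, `ranks`, `spearman`) and the toy data are assumptions chosen for illustration, and only the Python standard library is used. On a symmetric grid, a quadratic relationship is fully deterministic yet neither linear nor monotonic, so both coefficients come out close to zero.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): a symmetric grid on [-1, 1].
x = [i / 50 for i in range(-50, 51)]
y_linear = [2 * v + 1 for v in x]   # linear and monotonic
y_quadratic = [v ** 2 for v in x]   # deterministic, but not monotonic

# Both coefficients are approximately 1 for the linear relationship ...
print("linear:   ", pearson(x, y_linear), spearman(x, y_linear))
# ... and approximately 0 for the quadratic one.
print("quadratic:", pearson(x, y_quadratic), spearman(x, y_quadratic))
```

The quadratic case is deliberately idealized (noise-free and symmetric around zero); with real expression data the coefficients would rarely be exactly zero, but the same blind spot applies to non-monotonic patterns.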