Draft
13 changes: 5 additions & 8 deletions content/01.abstract.md
@@ -1,10 +1,7 @@
## Abstract {.page_break_before}

Correlation coefficients are widely used to identify patterns in data that may be of particular interest.
In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes.
Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.
CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients.
Gene pairs highly ranked by CCC were enriched for interactions in integrated networks built from protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC could detect functional relationships that linear-only methods missed.
CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
This paper presents the Clustermatch Correlation Coefficient (CCC), an efficient and not-only-linear correlation coefficient based on machine learning models, to identify linear and nonlinear patterns in transcriptomics data.
We aim to determine whether CCC can detect meaningful linear and nonlinear relationships in gene expression data, including those missed by linear-only correlation coefficients, and whether gene pairs ranked highly by CCC are enriched for interactions in integrated networks.
When applied to human gene expression data, CCC identifies robust linear relationships and nonlinear patterns associated with sex differences.
Our results suggest that CCC can detect functional relationships not captured by linear-only methods.
CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can be applied to genome-scale data and other domains across different data types.
47 changes: 23 additions & 24 deletions content/02.introduction.md
@@ -1,37 +1,36 @@
## Introduction

New technologies have vastly improved data collection, generating a deluge of information across different disciplines.
This large amount of data provides new opportunities to address unanswered scientific questions, provided we have efficient tools capable of identifying multiple types of underlying patterns.
Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971].
Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109].
Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976].
The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas.
Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research.
The increasing availability of data has opened up new possibilities for scientific exploration.
To take advantage of this, we need efficient tools to identify multiple types of relationships between variables.
Correlation analysis is a useful statistical technique to uncover such relationships [@pmid:21310971].
Correlation coefficients are often used in data mining techniques, such as clustering or community detection, to calculate the similarity between two objects, like genes [@pmid:27479844] or lifestyle factors related to diseases [@doi:10.1073/pnas.1217269109].
They are also used in supervised tasks, like feature selection, to boost prediction accuracy [@pmid:27006077; @pmid:33729976].
The Pearson correlation coefficient is widely used across many application domains and scientific disciplines.
Therefore, even small improvements in this technique can have a huge impact on industry and research.


In transcriptomics, many analyses start with estimating the correlation between genes.
More sophisticated approaches built on correlation analysis can suggest gene function [@pmid:21241896], aid in discovering common and cell lineage-specific regulatory networks [@pmid:25915600], and capture important interactions in a living organism that can uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540].
The analysis of large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573].
Since the introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098], gene-gene relationships are playing an increasingly important role in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], even in specific fields such as polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003].
In this context, recent approaches combine disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes directly affecting diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342].
These core genes are not captured by standard statistical methods but are believed to be part of highly-interconnected, disease-relevant regulatory networks.
Therefore, advanced correlation coefficients could immediately find wide applications across many areas of biology, including the prioritization of candidate drug targets in the precision medicine field.
In transcriptomics, correlation analysis is used to estimate the relationship between genes.
This approach has been used to suggest gene function [@pmid:21241896], discover common and cell lineage-specific regulatory networks [@pmid:25915600], and uncover molecular mechanisms in other species [@pmid:21606319; @pmid:16968540].
Large RNA-seq datasets [@pmid:32913098; @pmid:34844637] can also be used to reveal complex transcriptional mechanisms underlying human diseases [@pmid:27479844; @pmid:31121115; @pmid:30668570; @pmid:32424349; @pmid:34475573].
The introduction of the omnigenic model of complex traits [@pmid:28622505; @pmid:31051098] has highlighted the importance of gene-gene relationships in genetic studies of human diseases [@pmid:34845454; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342; @doi:10.1038/s41588-021-00913-z], including those related to polygenic risk scores [@doi:10.1016/j.ajhg.2021.07.003].
Combining disease-associated genes from genome-wide association studies (GWAS) with gene co-expression networks to prioritize "core" genes has become a popular approach to find genes that directly affect diseases [@doi:10.1186/s13040-020-00216-9; @doi:10.1101/2021.07.05.450786; @doi:10.1101/2021.10.21.21265342].
These core genes are not identified by standard statistical methods, but are believed to be part of highly-interconnected, disease-relevant regulatory networks.
Therefore, advanced correlation coefficients have potential applications across many areas of biology, including the identification of candidate drug targets in precision medicine.


The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly.
However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships.
Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505].
MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077].
However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001].
Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855].
We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899].
Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables.
The Pearson and Spearman correlation coefficients are widely used to measure the strength of linear or monotonic relationships between two variables, and can be computed quickly.
However, they may miss complex, yet critical nonlinear relationships.
To capture these, novel correlation coefficients have been proposed, such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505].
MIC, in particular, has been applied successfully across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077], but its computational complexity makes it impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001].
For example, it can take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855].
We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899].
Here, we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables.
CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time.
CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships.
CCC provides flexibility to detect specific types of patterns, while providing safe defaults to capture general relationships.
We also provide an efficient CCC implementation that is highly parallelizable, which speeds up computation across variable pairs with millions of objects or conditions.
To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776].
CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients.
For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples.
We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute.
Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259].
Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing to analyze large and heterogeneous repositories.
Furthermore, its ability to efficiently handle different data types (including numerical and categorical features) reduces preprocessing steps and makes it attractive for analyzing large and heterogeneous repositories.
60 changes: 27 additions & 33 deletions content/04.05.results_intro.md
@@ -9,36 +9,30 @@ Each panel shows the correlation value using Pearson ($p$), Spearman ($s$) and C
Vertical and horizontal red lines show how CCC clustered data points using $x$ and $y$.
](images/intro/relationships.svg "Different types of relationships in data"){#fig:datasets_rel width="100%"}

The CCC provides a similarity measure between any pair of variables, either with numerical or categorical values.
The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**.
In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters).
Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1.
Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo).


We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns.
In the first row of Figure @fig:datasets_rel, we examine the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation).
This kind of simulated data, recently revisited with the "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233], is used as a reminder of the importance of going beyond simple statistics, where either undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone.


Anscombe I contains a noisy but clear linear pattern, similar to Anscombe III where the linearity is perfect besides one outlier.
In these two examples, CCC separates data points using two clusters (one red line for each variable $x$ and $y$), yielding 1.0 and thus indicating a strong relationship.
Anscombe II seems to follow a partially quadratic relationship interpreted as linear by Pearson and Spearman.
In contrast, for this potentially undersampled quadratic pattern, CCC yields a lower yet non-zero value of 0.34, reflecting a more complex relationship than a linear pattern.
Anscombe IV shows a vertical line of data points where $x$ values are almost constant except for one outlier.
This outlier does not influence CCC as it does for Pearson or Spearman.
Thus $c=0.00$ (the minimum value) correctly indicates no association for this variable pair because, besides the outlier, for a single value of $x$ there are ten different values for $y$.
This pair of variables does not fit the CCC assumption: the two clusters formed with $x$ (approximately separated by $x=13$) do not match the three clusters formed with $y$.
The Pearson's correlation coefficient is the same across all these Anscombe's examples ($p=0.82$), whereas Spearman is 0.50 or greater.
These simulated datasets show that both Pearson and Spearman are powerful in detecting linear patterns.
However, any deviation in this assumption (like nonlinear relationships or outliers) affects their robustness.


We simulated additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273].
For the random/independent pair of variables, all coefficients correctly agree with a value close to zero.
The non-coexistence pattern, captured by all coefficients, represents a case where one gene ($x$) might be expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene).
For the other two examples (quadratic and two-lines), Pearson and Spearman do not capture the nonlinear pattern between variables $x$ and $y$.
These patterns also show how CCC uses different degrees of complexity to capture the relationships.
For the quadratic pattern, for example, CCC separates $x$ into more clusters (four in this case) to reach the maximum ARI.
The two-lines example shows two embedded linear relationships with different slopes, which neither Pearson nor Spearman detect ($p=-0.12$ and $s=0.05$, respectively).
Here, CCC increases the complexity of the model by using eight clusters for $x$ and six for $y$, resulting in $c=0.31$.
The CCC provides a similarity measure between any pair of variables, with numerical or categorical values.
It assumes that if two variables describing the same data points are related, then the clusterings of those data points obtained from each variable should match.
To separate numerical data into clusters, CCC uses quantiles (e.g., the median separates the data into two clusters).
The CCC is defined as the maximum adjusted Rand index (ARI) between the clusterings generated from each variable, ranging from 0 to 1.
Details of the CCC algorithm can be found in the Methods section.
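
The idea described above (quantile-based clusterings of each variable, then the maximum ARI between them) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `max_k` parameter name and its default, the way cut points are chosen, and the clipping of negative ARI values to zero are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def quantile_partition(values, k):
    """Assign each value to one of at most k clusters using quantile cut points.

    Equal values always land in the same cluster, so fewer than k clusters
    may be produced when the data contain many ties.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(i * n) // k] for i in range(1, k)]
    return [bisect_right(cuts, v) for v in values]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same objects."""
    n = len(labels_a)
    if n < 2:
        return 0.0
    sum_joint = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(pairs(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / pairs(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g., both clusterings have a single cluster);
        # treated as "no agreement" in this sketch (an assumption).
        return 0.0
    return (sum_joint - expected) / (max_index - expected)


def ccc_sketch(x, y, max_k=10):
    """Maximum ARI over all pairs of quantile clusterings of x and y.

    `max_k` (hypothetical name) caps the number of clusters per variable.
    Negative ARI values are clipped to 0 so the result lies in [0, 1].
    """
    best = 0.0
    partitions_y = [quantile_partition(y, k) for k in range(2, max_k + 1)]
    for kx in range(2, max_k + 1):
        part_x = quantile_partition(x, kx)
        for part_y in partitions_y:
            best = max(best, adjusted_rand_index(part_x, part_y))
    return best


if __name__ == "__main__":
    x = list(range(-5, 6))
    print("linear:   ", ccc_sketch(x, [2 * v + 1 for v in x]))
    print("quadratic:", ccc_sketch(x, [v * v for v in x]))
```

In this sketch a perfectly linear pair reaches 1.0 because the two-cluster (median) splits of both variables coincide, while a symmetric quadratic pair still gets a clearly non-zero score once $x$ is split into more than two clusters.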


We examined the behavior of the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients on different simulated data patterns.
The first row of Figure @fig:datasets_rel shows the classic Anscombe's quartet [@doi:10.1080/00031305.1973.10478966], which comprises four synthetic datasets with different patterns but the same data statistics (mean, standard deviation and Pearson's correlation).
The "Datasaurus" [@url:http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html; @doi:10.1145/3025453.3025912; @doi:10.1111/dsji.12233] is a reminder of the importance of going beyond simple statistics, as undesirable patterns (such as outliers) or desirable ones (such as biologically meaningful nonlinear relationships) can be masked by summary statistics alone.


The Anscombe datasets illustrate the behavior of CCC.
In Anscombe I, with a noisy but clear linear pattern, CCC yields 1.0, indicating a strong relationship.
In Anscombe II, which follows a partially quadratic relationship, CCC yields 0.34, reflecting a more complex relationship than a linear pattern.
For Anscombe IV, with a vertical line of data points, CCC yields 0.00, correctly indicating no association.
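Pearson's correlation coefficient is the same across all these Anscombe examples ($p=0.82$), whereas Spearman is 0.50 or greater; both are powerful in detecting linear patterns, but any deviation from this assumption (such as nonlinear relationships or outliers) affects their robustness.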
CCC is more suitable for detecting nonlinear patterns and is robust to outliers.
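
The Pearson and Spearman values for the quartet can be checked directly. The sketch below uses the standard published values of Anscombe's quartet and plain-Python implementations of both coefficients (Spearman is computed as Pearson on average ranks). The variable and function names are illustrative, and the figure in the manuscript may render the data slightly differently.

```python
from math import sqrt

# Standard published values of Anscombe's quartet (datasets I-III share x).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    for name, (x, y) in ANSCOMBE.items():
        print(f"Anscombe {name}: p={pearson(x, y):.2f}, s={spearman(x, y):.2f}")
```

Run as a script, this prints $p=0.82$ for all four datasets, while Spearman varies between datasets and equals 0.50 for Anscombe IV, where a single outlier drives both coefficients.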


Simulations of additional types of relationships (Figure @fig:datasets_rel, second row), including some previously described from gene expression data [@doi:10.1126/science.1205438; @doi:10.3389/fgene.2019.01410; @doi:10.1091/mbc.9.12.3273], showed that for random/independent variables, all coefficients correctly agreed with a value close to zero.
The non-coexistence pattern, captured by all coefficients, represented a case where one gene ($x$) is expressed while the other one ($y$) is inhibited, highlighting a potentially strong biological relationship (such as a microRNA negatively regulating another gene).
Pearson and Spearman did not capture the nonlinear patterns between variables $x$ and $y$ in the quadratic and two-lines examples, whereas CCC captured them by using different degrees of complexity (numbers of clusters).
For the quadratic pattern, CCC separated $x$ into four clusters to reach the maximum ARI.
In the two-lines example, CCC used eight clusters for $x$ and six for $y$, resulting in $c=0.31$, while Pearson and Spearman gave $p=-0.12$ and $s=0.05$, respectively.