\part{Results}
\label{pa:results}
\chapter{Ranklust as a tool for researchers}
\section{The general workflow of a Ranklust analysis}
\begin{figure}[H]
\includegraphics[scale=0.8]{ranklust_workflow}
\caption{General workflow of using Ranklust to rank clusters}
\label{fig:ranklust-workflow}
\end{figure}
This workflow (figure \ref{fig:ranklust-workflow}) shows the actions the user
can take in order to rank and score network clusters. To demonstrate Ranklust's
ability to prioritize network biomarkers, the workflow in the figure was
followed through these steps:
\begin{enumerate}
\item Launched Cytoscape
\item Imported undirected PPI network from iRefWeb that had proteins converted to genes
\item Cluster with or without weights - without
\item Clustered the network with \gls{mcl}, inflation set to 1.8 and iterations to 200
\item Rank with or without weights - with
\item Imported weights from the \gls{golden} (DisGeNET and \gls{dragon} scores combined)
\item Ranked clusters with PageRankWithPriors, alpha set to 0.3 and max iterations to 30
\item Visualized the cluster ranks
\item Exported the node table for further analysis outside of Cytoscape
\end{enumerate}
\section{Detailed walk-through of using Ranklust}
To go into more detail, this is the startup screen (figure \ref{fig:startup})
that the user is met with after launching Cytoscape. This assumes that
clusterMaker2 with the Ranklust contribution is already installed in Cytoscape.
\begin{figure}[H]
\includegraphics[width=15cm]{1-startup}
\caption{Startup screen greeting the user at Cytoscape startup}
\label{fig:startup}
\end{figure}
This screen (figure \ref{fig:import}) opens after opting to import a network and
choosing a network file. The file used here has a header naming the source and
target node columns, a\_alias and b\_alias. The column names do not have to be
exactly a\_alias and b\_alias, but there should be some header information
indicating a source and target gene/node. Having no header in the data imported
into Cytoscape is not a problem, but it requires the user to explicitly specify
this in the "Advanced Options" menu by clearing the "Use first line as column
names" option.
\begin{figure}[H]
\includegraphics[width=15cm]{2-import}
\caption{Importing a network into Cytoscape}
\label{fig:import}
\end{figure}
It is important to tell Cytoscape which columns are the source and target
columns, as shown in figure \ref{fig:nodes}. In this thesis, approved gene
symbols from HGNC have been the primary key identifier in both the source and
the target column.
\begin{figure}[H]
\includegraphics[width=15cm]{3-nodes}
\caption{Choosing source and target nodes of the interactions in the network}
\label{fig:nodes}
\end{figure}
This is how Cytoscape looks after the network has been imported and the standard
layout algorithm has finished (figure \ref{fig:imported-network}). The layout
runs automatically after clicking "OK" in the previous screen, provided that
everything there was set correctly.
\begin{figure}[H]
\includegraphics[width=15cm]{4-imported-network}
\caption{Cytoscape view after network import has finished}
\label{fig:imported-network}
\end{figure}
If the user wishes to weight nodes for clustering or for ranking, this can be
done with the "Import Columns From Table" function
(figure \ref{fig:import-table}). To add prior scores to the iRefWeb network that
was created, a file with two columns was imported. The first column, "geneName",
contained the gene symbol and was the primary key for matching each score with
the right row in the Cytoscape tables. The second column, "score", contained
scores between 0 and 1.
\begin{figure}[H]
\includegraphics[width=15cm]{5-import-table}
\caption{Adding prior scores to the network}
\label{fig:import-table}
\end{figure}
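As an illustration, a score file of this kind could look as follows. This is
only a sketch: it assumes a tab-separated layout (shown here with spaces), the
gene names are placeholders, and the scores are made up for the example rather
than taken from the \gls{golden}.
\begin{Verbatim}[fontsize=\scriptsize]
geneName    score
GENE_A      0.82
GENE_B      0.15
GENE_C      0.00
\end{Verbatim}
Whatever the actual layout, the values in the "geneName" column must match the
node names in the network exactly, since that column is used as the key.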
To cluster the network, the user has to access the \textit{Apps} menu in the
toolbar at the top of Cytoscape. The clusterMaker2 plugin has three rows in this
menu (figure \ref{fig:cluster}). The \textit{clusterMaker} row is where the
clustering algorithms are located. The \textit{clusterMaker Ranking} row holds
Ranklust's cluster ranking algorithms \gls{maa}, \gls{mam}, \gls{pr},
\gls{prwp} and \gls{hits}. Finally, the \textit{clusterMaker Visualizations} row
is where the different visualizations in clusterMaker2 can be opened. This row
is also where the option to visualize Ranklust's ranked clusters resides.
\begin{figure}[H]
\includegraphics[width=15cm]{6-cluster}
\caption{Choosing the MCL cluster algorithm in clusterMaker2}
\label{fig:cluster}
\end{figure}
This is the view after clusterMaker2 has run the \textit{MCL cluster} method on
the current network (figure \ref{fig:done-cluster}). As seen in the menu on the
left side, a new network has been listed. This network only appears if the user
sets the "Create new clustered network" option in the clustering parameter
dialog that appears after choosing a clustering algorithm from the clusterMaker2
menu.
\begin{figure}[H]
\includegraphics[width=15cm]{7-done-cluster}
\caption{Network view of the clustered network created by MCL - note the
added column in the Node Table}
\label{fig:done-cluster}
\end{figure}
The cluster ranking algorithm menu is shown in figure \ref{fig:choose-ranking}.
\begin{figure}[H]
\includegraphics[width=15cm]{8-choose-ranking}
\caption{Choosing the PageRankWithPriors (PRWP) ranking algorithm to rank
the clusters in the network}
\label{fig:choose-ranking}
\end{figure}
The "Create rank from the PageRankWithPriors algorithm" (\gls{prwp}) option was
chosen as the ranking algorithm (figure \ref{fig:pagerank}). The user can
combine node and edge attributes through simple selection. Selecting no
attributes will cause \gls{prwp} to not calculate scores at all. If the user
wants to rank the clusters with PageRank but without attributes, the "Create
rank from the PageRank algorithm" (\gls{pr}) option should be used instead.
\begin{figure}[H]
\includegraphics[width=15cm]{9-pagerank}
\caption{Setting the PRWP parameters before executing the algorithm}
\label{fig:pagerank}
\end{figure}
If the user is interested in visualizing the ranks, it is possible to click the
"Show results from ranking clusters" option in the visualization menu of
clusterMaker2 (figure \ref{fig:show-results}). No menu appears after that.
Instead, Cytoscape opens a loading dialog similar to the one shown when
clustering and ranking algorithms are started. When the loading dialog has
finished, the results panel for the ranked clusters is shown.
\begin{figure}[H]
\includegraphics[width=15cm]{10-show-results}
\caption{Choosing to visualize the results after the PRWP algorithm has
finished}
\label{fig:show-results}
\end{figure}
The results panel displays the ranked clusters from top to bottom in order of
descending score. The title of each results panel is formatted as follows:
\begin{Verbatim}[fontsize=\scriptsize]
[<clustering algorithm>]{<ranking algorithm>}(<network name>)
\end{Verbatim}
So with MCL clustering, \gls{prwp} ranking and the
"mitab\_lite\_4100\_final.tsv--clustered" network, this is what the title
becomes:
\begin{Verbatim}[fontsize=\scriptsize]
[mcl]{PRWP}(mitab_lite_4100_final.tsv--clustered)
\end{Verbatim}
This title can be seen in figure \ref{fig:result-colors}. The coloring of the
nodes has been discussed earlier.
\begin{figure}[H]
\includegraphics[width=15cm]{11-result-colors}
\caption{The visualization of the ranked clusters is finished. Both the
colored nodes in the network and the results panel on the right side are
displayed to the user}
\label{fig:result-colors}
\end{figure}
Figure \ref{fig:rank-selection} shows how the nodes of a cluster change color
when that cluster is selected from the results panel.
\begin{figure}[H]
\includegraphics[width=15cm]{12-rank-selection}
\caption{Change the color of the nodes in a cluster when the cluster is
selected from the results panel}
\label{fig:rank-selection}
\end{figure}
\section{Design}
The class relations in Ranklust's ranking algorithms and ranking results panel
are described below in simple UML diagrams. The goal of this design was to keep
the implementation of the ranking algorithms and the panel as close as possible
to the existing clustering algorithms and results panel.
\begin{figure}[H]
\includegraphics[width=\textwidth]{ranklust-algorithm}
\caption{UML diagram for Ranklust's class relations for the ranking
algorithms}
\label{fig:rank-alg}
\end{figure}
The ranking algorithm relations (figure \ref{fig:rank-alg}) show how the
classes are connected and what they offer. The role of the Context class is not
self-explanatory: it is responsible for the GUI component that represents each
specific ranking algorithm. Each algorithm in Ranklust has its own instance of
the following classes:
\begin{itemize}
\item AlgorithmTaskFactory
\item Context
\item Algorithm
\end{itemize}
The rest of the classes are not unique to any of the ranking algorithms.
\begin{figure}[H]
\includegraphics[width=\textwidth]{ranklust-panel}
\caption{UML diagram for Ranklust's class relations for the ranking results
panel}
\label{fig:rank-panel}
\end{figure}
The ranking panel relations (figure \ref{fig:rank-panel}) show how the classes
are connected to the ranking panel. The "Create" and "Destroy" task factories
exist to create a new ranking results panel or to destroy all existing ones.
They are two separate classes so that they appear in the Cytoscape menu as two
separate options. Both classes return a RankingPanelTask object, where the
arguments sent to its constructor define whether the task should create or
destroy a panel. An advantage of having two separate classes responsible for
creating and destroying panels is the ability to make each of them unavailable
independently. For example, the user cannot select "Destroy All Cluster Results
Panels" in the "clusterMaker Visualizations" menu if no ranking results panels
exist.
\chapter{Prioritizing network biomarkers in prostate cancer through graph analysis}
\section{Network generation for the cross-validation and benchmarking of Ranklust}
After querying iRefWeb for the PPI network with the query displayed earlier
(figure \ref{fig:irefweb}), the resulting network consisted of 109276
interactions. After filtering the same network through the protein-to-gene
mapping constructed from HGNC, the final network consisted of 9500 nodes (genes)
and 43706 edges (interactions). At this point, all of the nodes and edges were
undirected and unweighted. Converting from proteins to genes perturbed the
network by removing 60\% of its edges. Clustering the network removed a further
69.8\% of the edges in the HGNC-filtered network, which brings the total edge
removal to 87.9\% relative to the unfiltered iRefWeb network.
The creation of the \gls{golden} through combining DisGeNET and \gls{dragon}
resulted in a file with two columns, the first representing a gene and the
second representing the score of that gene. Each row therefore contains a unique
gene and the score it received.
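The edge-removal percentages for the network can be checked with a short
calculation. This is only a sketch: it assumes that all three percentages count
removed edges, and the number of edges left after clustering is not stated in
the text but is back-calculated here from the 69.8\% figure.
\[ 1 - \frac{43706}{109276} \approx 1 - 0.400 = 0.600 \]
\[ 43706 \times (1 - 0.698) \approx 13199 \]
\[ 1 - \frac{13199}{109276} \approx 1 - 0.121 = 0.879 \]
Under these assumptions, the 69.8\% and 87.9\% figures describe the same
clustered network, measured against the filtered and the unfiltered network
respectively.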
\section{Which parameters for clustering were used and why}
\begin{table}[H]
\centering
\begin{tabular}{| l | p{2cm} | p{2cm} | p{2cm} | p{2cm} | p{2cm} |}
\hline
\textbf{Inflation} & \textbf{Clusters} & \textbf{Avg. cluster size} &
\textbf{Max. cluster size} & \textbf{Min. cluster size} &
\textbf{Modularity} \\
\hline
1.6 & 1068 & 8.88 & 968 & 2 & 0.367 \\
1.8 & 1400 & 6.60 & 660 & 2 & 0.307 \\
2.0 & 1599 & 5.68 & 405 & 2 & 0.269 \\
2.5 & 2053 & 4.20 & 179 & 2 & 0.223 \\
3.0 & 2210 & 3.75 & 122 & 2 & 0.199 \\
\hline
\end{tabular}
\caption{MCL clustering parameter and statistic results}
\label{tab:mcl-inflation}
\end{table}
The modularity of the clustered networks gives an indication of how well the
clustering went. Modularity is given as a score from 0 to 1. A score closer to
1 is preferable, as this indicates that the clusters created are well separated
from the other clusters in the network. The preferred score would be around 0.8,
but this network has undergone a good amount of perturbation through the
protein-to-gene process. Modularity is not the only indicator of how well a
network was clustered, hence the choice of setting the inflation value in
\gls{mcl} to 1.8 rather than 1.6. When a lower inflation value is set,
\gls{mcl} does not separate edges between nodes as vigorously and, as a direct
consequence, the modularity goes up. Taking the other attributes in
table \ref{tab:mcl-inflation} into consideration, 1.8 seemed like the best
inflation value. An inflation value of 1.8 has also been reported to work well
for large protein-protein networks constructed from high-throughput data with a
large amount of alterations\cite{mcl-inflation}.
The number of iterations used for \gls{mcl} ended up being 200. It started out
at 1000, but the results converged somewhere between 170 and 200 iterations, so
it was decreased from 1000 to 200 to reduce the time spent in the
\gls{pipeline}.
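One way to read the cluster-size columns in table \ref{tab:mcl-inflation} is to
multiply the number of clusters by the average cluster size, which approximates
how many genes ended up in a listed cluster. The sketch below assumes that the
average is taken over exactly the clusters counted in the table. Since the
averages are rounded to two decimals, the products are approximate.
\[ 1068 \times 8.88 \approx 9484, \quad 1400 \times 6.60 = 9240, \quad
2210 \times 3.75 \approx 8288 \]
Compared with the 9500 genes in the network, roughly 16, 260 and 1212 genes
would then fall outside the listed clusters at inflation values of 1.6, 1.8 and
3.0, respectively. If that reading is correct, a higher inflation value not only
gives smaller clusters, but also leaves more genes outside any cluster of at
least two genes.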
\section{Cross-validation reveals a trend towards highest rank clusters having
prostate cancer relevance}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{cv_dist_total_filtered_prwp}
\caption{Distribution of combined averages of genes, which had their scores
removed by cross-validation, ranked by PRWP}
\label{fig:irefweb-prwp}
\end{sidewaysfigure}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{cv_dist_total_filtered_maa}
\caption{Distribution of combined averages of genes, which had their scores
removed by cross-validation, ranked by MAA}
\label{fig:irefweb-maa}
\end{sidewaysfigure}
The plots in figure \ref{fig:irefweb-prwp} and figure \ref{fig:irefweb-maa} are
derived from the 10 random cross-validation runs ranked with \gls{prwp} and
\gls{maa}. The x-axis represents the cluster ranks, each of which is shown as
a blue dot in the scatter plot. Each rank consists of a single cluster, which
has an arbitrary number of genes greater than one. The y-axis value is the
result of two steps. The first step was to go through each of the 10
cross-validated results from ranking the iRefWeb network with \gls{prwp} and
\gls{maa}. In each of the cross-validated results, an average was calculated for
each cluster by dividing the number of genes that had their prior score removed
by the cross-validation by the total number of genes in the cluster. Each
cluster would at this point have a value representing its fraction of
cross-validated genes. The second step completes the values for the y-axis by
adding together the averages from each of the 10 cross-validated results found
by \gls{prwp} and \gls{maa}. To filter out the uninteresting results, all zero
values on the y-axis were removed.
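The two steps behind the y-axis values can be written compactly. The notation is
introduced here only for illustration: $C$ is a cluster, $R_k$ is the set of
genes that had their prior score removed in cross-validation run $k$, and $y(C)$
is the value plotted on the y-axis. How clusters are matched across the 10 runs
is not specified in the text, so the sketch simply indexes the plotted point by
$C$.
\[ a_k(C) = \frac{|C \cap R_k|}{|C|}, \qquad y(C) = \sum_{k=1}^{10} a_k(C) \]
Since each $a_k(C)$ lies between 0 and 1, $y(C)$ lies between 0 and 10. Clusters
with $y(C) = 0$ are the ones filtered out of the plots.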
The analysis of which clusters contained the largest combined average of genes
that had their prior score removed by cross-validation shows that the topmost
ranked clusters had the highest combined average. This indicates that ranking
clusters with \gls{prwp} and \gls{maa} tends to place the larger part of the
population of prostate cancer biomarkers at the top of the cluster ranks, and
the smaller part at the bottom.
Executing a cross-validation on the iRefWeb network with \gls{prwp} and
\gls{maa} rankings had two purposes. The first was to show that every gene that
had its prior score removed by the cross-validation is found in the results of
the cluster ranking and identified as a candidate biomarker. The second was to
show that the distribution of the combined average in the clusters correlates
with the rank they obtained through Ranklust's use of \gls{prwp} and \gls{maa}.
\chapter{Benchmarking Ranklust against text mined, manually knowledge curated and experimental test data}
\section{Retrieving the test data from the DISEASE database}
Ranklust is benchmarked against three sources of data from a single database
called DISEASE\cite{jensen}. This database can be queried for diseases or genes,
but a query is limited to a single disease or gene name. No API was found to
access the database directly, so to retrieve the gene names related to prostate
cancer, whole files were downloaded. There were three files, one for each type
of research put into retrieving the data: text mined, manually curated knowledge
and experimental data. Each of these files was populated with genes and their
relation to different diseases, so they had to be filtered to only contain genes
with information about prostate cancer. This was done by only keeping entries
that contained "prostate cancer" in the column indicating which disease the
specific gene was related to.
Figures \ref{fig:txt-iref-prwp}, \ref{fig:txt-iref-maa},
\ref{fig:know-iref-prwp}, \ref{fig:know-iref-maa}, \ref{fig:exp-iref-prwp} and
\ref{fig:exp-iref-maa} show plots of the benchmarking of Ranklust against the
DISEASE database. In the plots for each of these three files, every cluster is
split into two parts, blue and orange. The blue dots in the scatter plot
represent the genes in a cluster that have prior scores. The orange dots
represent the genes in the same cluster, and thus at the same rank on the
x-axis, that do not have prior scores. The blue dots are also referred to as
prostate cancer genes and the orange dots as prostate candidate cancer genes.
The latter have no prior score, but in some cases they are related to other
prostate cancer genes to such a degree that they are plausible candidate
biomarkers for prostate cancer. As with the cross-validation plots, the zero
values in the plots have been removed.
\section{Z-scores for text mined genes in clusters}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{prwp_txt_split}
\caption{Average z-score in a cluster ranked by PRWP}
\label{fig:txt-iref-prwp}
\end{sidewaysfigure}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{maa_txt_split}
\caption{Average z-score in a cluster ranked by MAA}
\label{fig:txt-iref-maa}
\end{sidewaysfigure}
The text mined scores are represented by a z-score. The z-score of a gene in the
text mined data from the DISEASE database is derived from a co-occurrence
score, which increased when a gene and a disease were mentioned together, but
decreased when they were mentioned with multiple other genes or diseases.
This co-occurrence score was later converted to z-scores to be more robust to
changes in the size of the text corpus in the DISEASE database\cite{jensen}.
The average z-score of a cluster, which is the average of the z-scores of the
genes in the cluster, thereby becomes a benchmark for how high the cluster
should be ranked in terms of being a relevant network biomarker for prostate
cancer.
Since the plots split each cluster into two parts, the genes with priors and the
genes without priors, the expected outcome, should the ranking algorithms
perform as intended, is that the blue dots descend and the orange dots ascend
when looking at a linear regression fit going from the topmost ranked cluster to
the lowest.
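Written out, with notation that is only introduced here for illustration, the
two values plotted for a cluster $C$ are the averages over its genes with
priors, $P(C)$, and over its genes without priors, $N(C)$, where $z_g$ is the
text mined z-score of gene $g$:
\[ \bar{z}_{\mathrm{blue}}(C) = \frac{1}{|P(C)|} \sum_{g \in P(C)} z_g, \qquad
\bar{z}_{\mathrm{orange}}(C) = \frac{1}{|N(C)|} \sum_{g \in N(C)} z_g \]
This is a sketch of the calculation as described. How genes that are absent from
the text mined data are treated, for example counted as zero or left out of the
average, is not stated in the text.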
\subsection{PRWP benchmarked with text mined genes}
For \gls{prwp} (figure \ref{fig:txt-iref-prwp}), the prostate cancer genes are
descending in z-values from the topmost ranked cluster to the lowest, which
supports \gls{prwp}'s suitability for ranking clusters as candidate biomarkers.
The prostate candidate cancer biomarkers are ascending in z-value from the
topmost to the lowest ranked cluster. High z-values would support the claim
that Ranklust has found actual prostate candidate cancer biomarkers. However,
low z-values do not contradict it. The only conclusion to draw from low
z-values is that these genes have not been studied enough to be mentioned on
their own as related to prostate cancer in scientific papers.
\subsection{MAA benchmarked with text mined genes}
For \gls{maa} (figure \ref{fig:txt-iref-maa}), the prostate cancer genes show
the same distinct descent in z-values from the topmost to the lowest cluster
ranks. The difference between \gls{prwp} and \gls{maa} is that \gls{maa} has a
more distinct ascending linear regression fit for the z-values of the prostate
candidate cancer genes.
\section{Manually knowledge curated genes in a cluster}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{prwp_know_split}
\caption{Average distribution of curated knowledge mined genes in clusters
ranked by PRWP.}
\label{fig:know-iref-prwp}
\end{sidewaysfigure}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{maa_know_split}
\caption{Average distribution of curated knowledge mined genes in clusters
ranked by MAA.}
\label{fig:know-iref-maa}
\end{sidewaysfigure}
This data had no score except for a confidence score in the \gls{jensen}
database. Every gene in the manually knowledge curated file had a confidence
score of 5 stars, due to being manually curated by researchers\cite{jensen}.
Therefore, the knowledge score of a cluster is an average over its genes, based
on whether each gene occurs in the knowledge curated part of the \gls{jensen}
database or not. The knowledge curated data is thus only based on occurrence,
and not on a specific value, in contrast to the text mined and experimentally
mined genes in the database. The clusters are split in two, in the same way as
in the benchmark with the text mined genes: blue for the genes in the cluster
that have prior scores and orange for the genes that do not.
Another trait of the knowledge curated data is the small number of entries it
has in the \gls{jensen} database. Text mined data can be seen as the
high-throughput way of retrieving relevant data from papers, while the
knowledge curated data is curated manually by researchers. This is why the
entries for knowledge curated data are so sparse compared to the text mined
data.
For the manually knowledge curated genes to indicate valuable rankings of the
clusters, the genes with prior scores should have a descending trend from the
topmost ranked cluster to the lowest. If the genes without prior scores had
a clear ascending trend from the topmost to the lowest ranked cluster, it would
directly contradict the assumption that the ranking algorithms are able to rank
genes without prior scores in a reasonable way if they are related to genes with
priors. A higher frequency of manually knowledge curated genes would increase
the validity of these trends, should they occur, especially if they are subtle.
\subsection{PRWP benchmarked by manually knowledge curated genes}
For \gls{prwp} (figure \ref{fig:know-iref-prwp}), the genes with prior scores in
a cluster have a clear descending trend from the topmost to the lowest ranked
cluster. This supports the validity of \gls{prwp} being able to rank prior
scored genes correctly. The genes without prior scores in a cluster ascend to
a low degree, but the linear fit showing this ascending trend is not a good fit
for the scores, with an \gls{rsquared} value of only 0.002.
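For reference, and assuming that \gls{rsquared} denotes the usual coefficient of
determination of the linear regression fit, its value for observed cluster
scores $y_i$, fitted values $\hat{y}_i$ and mean score $\bar{y}$ is
\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]
Under that assumption, a value of 0.002 means that the fitted line explains
about 0.2\% of the variance in the cluster scores, which is why the ascending
trend is not treated as meaningful.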
\subsection{MAA benchmarked by manually knowledge curated genes}
For \gls{maa} (figure \ref{fig:know-iref-maa}), the genes with prior scores in
a cluster have the same trend as in \gls{prwp}. For the genes without prior
scores in a cluster, the trend is also the same as in \gls{prwp}, but to a
higher degree. The fit for the ascending average of manually knowledge curated
genes among the genes without prior scores has an \gls{rsquared} value of 0.33,
which is considerably higher than the corresponding value for \gls{prwp}.
\section{Distribution of p-values for experimental genes in clusters}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{prwp_exp_split}
\caption{Average distribution of p-values in clusters ranked by PRWP.}
\label{fig:exp-iref-prwp}
\end{sidewaysfigure}
\begin{sidewaysfigure}
\includegraphics[scale=0.63]{maa_exp_split}
\caption{Average distribution of p-values in clusters ranked by MAA.}
\label{fig:exp-iref-maa}
\end{sidewaysfigure}
All of the experimental genes are the result of genome-wide association studies
(GWAS). Each gene in this test data is scored by the p-value of "the most
statistically significant SNP within the block."\cite{distild} in experiments
related to prostate cancer. The \gls{jensen} database has experimental data from
both the Catalogue of Somatic Mutations in Cancer (COSMIC) and DistiLD, but for
prostate cancer, only experimental data from DistiLD was available.
The score for each cluster is calculated as the average p-value of the genes
in the cluster. In contrast to the previous plots, the trend required to
validate \gls{prwp} and \gls{maa} as cluster ranking algorithms for prioritizing
network biomarkers in prostate cancer is an ascending trend for both the genes
with and the genes without prior scores, when going from the topmost to the
lowest ranked cluster. An ascending score for the genes with prior scores
indicates that the highly ranked clusters have low p-values, which is evidence
against the null hypothesis. Here, the null hypothesis would be that a gene has
no relevance to prostate cancer. The \gls{golden} is based on \gls{dragon} and
DisGeNET scores. The DisGeNET scores are based on the supporting evidence for
the association between a gene and a disease being true\cite{disgenet}. Based on
this, it is reasonable to expect \gls{prwp} and \gls{maa} to demonstrate the
previously defined trend.
The experimental data shares one trait with the manually knowledge curated
data: it is very sparse. As with the previous plots, the genes with prior scores
in a cluster are colored blue, and the ones without prior scores are colored
orange.
\subsection{PRWP benchmarked by experimentally mined genes}
For \gls{prwp} (figure \ref{fig:exp-iref-prwp}), there is a descending p-value
trend for the genes with prior scores in a cluster and an ascending p-value
trend for the genes without prior scores in a cluster. For the genes without
prior scores, this is the result needed to validate Ranklust as a tool for
prioritizing network biomarkers. For the genes with prior scores, the descending
trend seems to be caused by a single cluster, and the \gls{rsquared} value for
the fit of the linear regression is not high.
\subsection{MAA benchmarked by experimentally mined genes}
For \gls{maa} (figure \ref{fig:exp-iref-maa}), the trends are the same as with
\gls{prwp}, but more noticeable. For the genes with prior scores, the descending
p-value trend towards the lower ranked clusters is steeper than the one in the
plot for \gls{prwp}. However, the \gls{rsquared} score is very low and, as in
the \gls{prwp} plot, a single cluster seems to be responsible for the
descending trend in p-values.
\chapter{Comparison of PRWP and MAA to prostate cancer relevant genes}
Using \gls{prwp} and \gls{maa}, 6 final lists of prioritized network biomarkers
for prostate cancer will be presented. These 6 lists are the result of
3 \gls{prwp} and 3 \gls{maa} rankings of the clusters from the whole filtered
iRefWeb network with gene names, scored with the full \gls{golden}. Each of the
3 ranking results from both algorithms uses the same three sets of data:
\gls{movember} from the Movember Prostate Cancer Project, COSMIC HGNC
genes\cite{cosmic-download} and a gene signature of 157
genes\cite{psa-overtreatment}, labeled in this assignment as "lethal prostate
cancer genes".
\section{Using Ranklust to rank the clusters of a network}
The workflow of using Ranklust to create the ranks that the test data is
compared to is exactly as explained earlier in both the methods chapter and the
workflow part of this results chapter.
\subsection{Step 1 - create the network and fill it with prior scores}
The network was downloaded from iRefWeb. It consisted of protein interactions,
which were used to create an unweighted and undirected PPI network. The proteins
were then filtered by their corresponding gene names taken from HGNC. Most of
the proteins in the network did not have a corresponding gene name that HGNC
could match, so over half of the interactions were removed.
The \gls{golden} was created as explained earlier, with prostate cancer data
from DisGeNET and \gls{dragon}. The network was uploaded into Cytoscape and
populated with the prior scores.
\subsection{Step 2 - Clustering the network}
The Markov Cluster algorithm, MCL, was used to cluster the iRefWeb network with
an inflation parameter of 1.8 and 200 iterations. This process took about 15
hours on the High-Performance Computing (HPC) instance Invitro at the University
of Oslo. It used a total of 64 cores, which averaged at a 90-95\% load for the
majority of the time the cluster algorithm ran.
\subsection{Step 3 - Ranking the clusters of the network}
\gls{prwp} and \gls{maa} were run on the network, using only the node priors
given by the \gls{golden}. The alpha value for \gls{prwp} was set to 0.3, with
a maximum of 30 iterations.
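To make the role of the two parameters concrete, the following is a minimal
sketch of a PageRank-style power iteration with priors on an undirected graph.
It assumes that \gls{prwp} follows the standard PageRank-with-priors
formulation and that alpha is the probability of jumping back to the priors.
Neither assumption is confirmed here, the function and variable names are
hypothetical, and the step that combines node scores into a cluster score is
not covered.

```python
def pagerank_with_priors(adjacency, priors, alpha=0.3, max_iter=30, tol=1e-9):
    """Score nodes by a random walk that restarts at the prior distribution.

    adjacency: dict mapping node -> set of neighbours (must be symmetric).
    priors:    dict mapping node -> non-negative prior score (missing means 0).
    alpha:     ASSUMED to be the probability of jumping back to the priors.
    """
    nodes = list(adjacency)
    total = sum(priors.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one node needs a positive prior")
    # Normalise the priors so that they form a probability distribution.
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    score = dict(prior)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # Each neighbour m passes on an equal share of its current score.
            walk = sum(score[m] / len(adjacency[m]) for m in adjacency[n])
            new[n] = (1.0 - alpha) * walk + alpha * prior[n]
        delta = sum(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break  # converged before reaching max_iter
    return score


# Toy path graph A - B - C with a prior on A only (illustrative values).
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_scores = pagerank_with_priors(toy_graph, {"A": 1.0})
```

In this toy graph the node carrying the prior ends up with a higher score than
the node at the far end of the path, which is the intended effect of the
priors.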
\subsection{Step 4 - Exporting cluster ranks from Cytoscape and comparison with test data}
The node table with the results from \gls{prwp} and \gls{maa} was exported to
CSV files and cleaned up with Python scripts. The scripts grouped the rows into
clusters, recording which genes were in which cluster, the rank of the cluster
and which genes in the cluster had a prior score.
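The grouping step could look roughly like the sketch below. It is not the
script used for the results. The column names (\texttt{name}, \texttt{cluster},
\texttt{rank}, \texttt{prior}), the cluster id and the prior values are
assumptions made for illustration, since the layout of the exported node table
is not given here.

```python
import csv
import io
from collections import defaultdict


def read_clusters(csv_text, name_col="name", cluster_col="cluster",
                  rank_col="rank", prior_col="prior"):
    """Group node-table rows into clusters (column names are hypothetical)."""
    clusters = defaultdict(
        lambda: {"rank": None, "with_prior": set(), "without_prior": set()}
    )
    for row in csv.DictReader(io.StringIO(csv_text)):
        cluster_id = row[cluster_col].strip()
        if not cluster_id:
            continue  # node was not assigned to any cluster
        entry = clusters[cluster_id]
        entry["rank"] = int(row[rank_col])
        prior = row[prior_col].strip()
        # An empty or zero prior is treated as "no prior score".
        has_prior = bool(prior) and float(prior) > 0
        key = "with_prior" if has_prior else "without_prior"
        entry[key].add(row[name_col].strip())
    return dict(clusters)


# Toy export; the cluster id "c1" and the prior values are made up.
toy_csv = (
    "name,cluster,rank,prior\n"
    "TNFRSF11B,c1,13,0.8\n"
    "THBS1,c1,13,\n"
    "VEGFA,c1,13,0.5\n"
    "GENE_X,,,\n"
)
toy_clusters = read_clusters(toy_csv)
```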
The test data is in the form of a single column text file where each row
contains the name of a gene. By traversing the ranked list of clusters, four
categories were made from these test data genes.
The first two were named ``test biomarkers'' and ``test candidates''. Both list
the genes from the cluster that are contained in the test data set of genes.
The ``test biomarkers'' are genes that both have a prior score and are contained
in the test data set of genes. The ``test candidates'' are also in the test data
set of genes, but they do not have prior scores.
The next two categories are ``remaining biomarkers'' and ``remaining
candidates''. Neither of them lists genes that are in the test data set of
genes. The difference is that ``remaining biomarkers'' have prior scores, and
``remaining candidates'' have no prior scores.
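The four categories amount to two set intersections and two set differences.
The sketch below shows this for a single cluster. The function name and the
dictionary keys are illustrative, and the test set is a toy example rather than
one of the real test data sets.

```python
def categorize(with_prior, without_prior, test_genes):
    """Split the genes of one cluster into the four categories."""
    with_prior = set(with_prior)
    without_prior = set(without_prior)
    test_genes = set(test_genes)
    return {
        # In the test data set, with / without a prior score.
        "test_biomarkers": sorted(with_prior & test_genes),
        "test_candidates": sorted(without_prior & test_genes),
        # Not in the test data set, with / without a prior score.
        "remaining_biomarkers": sorted(with_prior - test_genes),
        "remaining_candidates": sorted(without_prior - test_genes),
    }


# Cluster with F2R and F2RL1 (prior scores) and PRTN3 (no prior score),
# checked against a toy test set that contains F2R.
toy_row = categorize({"F2R", "F2RL1"}, {"PRTN3"}, {"F2R", "SMAD4"})
```

Here \texttt{toy\_row} places F2R under test biomarkers, F2RL1 under remaining
biomarkers and PRTN3 under remaining candidates, while the test candidates
column stays empty.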
The genes of each cluster were split into these four categories, represented as
four columns in the following tables (tables: \ref{tab:prwp-movember},
\ref{tab:maa-movember}, \ref{tab:prwp-cosmic}, \ref{tab:maa-cosmic},
\ref{tab:prwp-lethal}, \ref{tab:maa-lethal}). The first column gives the
rank of the cluster from either \gls{prwp} or \gls{maa}, depending on the table.
For each of these tests, the top 10 clusters are displayed, counting only
clusters that have at least one gene in either the ``test biomarkers'' or the
``test candidates'' column.
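The selection of rows for each table can be sketched as a walk down the ranked
list that keeps the first ten clusters sharing at least one gene with the test
data set. The function name and the toy input are illustrative. The input
format, a list of (rank, genes) pairs, is an assumption.

```python
def top_hits(ranked_clusters, test_genes, limit=10):
    """Return the best-ranked clusters that contain at least one test gene.

    ranked_clusters: iterable of (rank, genes) pairs, rank 1 being the best.
    """
    test_genes = set(test_genes)
    hits = []
    for rank, genes in sorted(ranked_clusters, key=lambda item: item[0]):
        if test_genes & set(genes):
            hits.append((rank, sorted(genes)))
            if len(hits) == limit:
                break  # enough rows for the table
    return hits


# Toy input: three ranked clusters and a two-gene test set.
toy_ranked = [
    (11, {"BIRC3", "BIRC2"}),
    (1, {"LZTS1", "CDC25C"}),
    (6, {"TAGLN", "RNF14", "TNFAIP3"}),
]
toy_hits = top_hits(toy_ranked, {"TNFAIP3", "BIRC3"})
```

In the toy input the cluster at rank 1 contains no test gene, so it is skipped
and the clusters at ranks 6 and 11 are kept, in that order. This is why the
ranks in the tables are not consecutive.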
As a comment to each table, the genes in the topmost ranked cluster of that
table are analyzed. Each gene is assigned its functional classification
according to PANTHERDB and \gls{jensen}
database\cite{pantherdb,panther,disgenet}.
\section{Top 10 clusters from each ranking algorithm in Ranklust}
These two tables of top 10 ranked clusters
(\ref{tab:top10-prwp}, \ref{tab:top10-maa}) are not tested or benchmarked. They
are the direct result of what Ranklust produced as ranked clusters. The network
is the iRefWeb network and the priors used to score the nodes are from the
\gls{golden}. The tests that identify which genes fall in which category are
produced by applying the Movember Prostate Cancer Project data, COSMIC and the
lethal prostate cancer gene signature data to filter this network.
\begin{table}[H]
\begin{tabular}{l l l l}
\textbf{Cluster rank} & \textbf{Cluster number} & \textbf{Cluster score} & \textbf{Genes} \\
\hline
1 & 1136 & 1.0 & LZTS1, CDC25C \\
2 & 690 & 0.918902810704 & SOX9, SCX, CREB3L4 \\
3 & 1004 & 0.758524309337 & MMP14, MMP13 \\
4 & 1364 & 0.69307698666 & SSTR2, SSTR3 \\
5 & 1227 & 0.689177108028 & MLNR, GHRHR \\
6 & 721 & 0.671675886965 & TAGLN, RNF14, TNFAIP3 \\
7 & 1110 & 0.666772639835 & ADAMTS5, TIMP3 \\
8 & 1359 & 0.636059086342 & GSTA1, GSTA2 \\
9 & 527 & 0.585248909289 & KLK14, KLK5, SPINK5, CAMP \\
10 & 1143 & 0.576075890388 & LYPLA2, ITGA4 \\
\hline
\end{tabular}
    \caption{Top 10 clusters from ranking the iRefWeb network with the golden
    standard priors and PRWP ranking algorithm - total number of ranked
    clusters: 1340}
\label{tab:top10-prwp}
\end{table}
\begin{table}[H]
\begin{tabular}{l l l l}
\textbf{Cluster rank} & \textbf{Cluster number} & \textbf{Cluster score} & \textbf{Genes} \\
\hline
1 & 721 & 1.0 & TAGLN, RNF14, TNFAIP3 \\
2 & 1110 & 1.0 & ADAMTS5, TIMP3 \\
3 & 1364 & 1.0 & SSTR2, SSTR3 \\
4 & 1305 & 0.768245431105 & CDK5R1, LMTK2 \\
5 & 1136 & 0.725502913802 & LZTS1, CDC25C \\
6 & 1359 & 0.725005246913 & GSTA1, GSTA2 \\
7 & 1004 & 0.720010369387 & MMP14, MMP13 \\
8 & 680 & 0.666666666667 & PRTN3, F2R, F2RL1 \\
9 & 690 & 0.666666666667 & SOX9, SCX, CREB3L4 \\
10 & 719 & 0.666666666667 & AGER, RHOA, GMIP \\
\hline
\end{tabular}
    \caption{Top 10 clusters from ranking the iRefWeb network with the golden
    standard priors and MAA ranking algorithm - total number of ranked
    clusters: 440}
\label{tab:top10-maa}
\end{table}
\section{Prostate cancer genes manually curated from the Movember Prostate Cancer Group}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
13 & TNFRSF11B & THBS1 & VEGFA & - \\
\hline
19 & - & F12 & MMP12 & - \\
\hline
25 & F2R & - & F2RL1 & PRTN3 \\
\hline
28 & - & RRM2 & MAGEA3,MAGEA1 & DNM1L,PGAM5,SCG3 \\
\hline
29 & CEACAM1 & - & - & CLEC4M \\
\hline
31 & ALOX15B & - & - & ERAL1 \\
\hline
46 & SMAD4 & - & - & ZMIZ1 \\
\hline
48 & STAT6 & - & ACSL3,IFI16 & TRIM56,TMEM173,SLC39A14 \\
\hline
53 & RNASEL & IQGAP1 & - & GSPT1,NPHS2 \\
\hline
55 & DPP4 & - & VIP,GHRH,ADCYAP1 & PYY,AVPR1A,GCG,GIP,TAC1,FAP,NPPB \\
\hline
\end{tabular}
    \caption{iRefWeb network ranked with PRWP and Movember data - matched 254
    test genes from the Movember data set out of 271 possible}
\label{tab:prwp-movember}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with PRWP and tested with Movember data}
(table: \ref{tab:prwp-movember})
\begin{itemize}
\item TNFRSF11B
\begin{itemize}
\item PantherDB subfamily - Tumor necrosis factor receptor
superfamily, member 11b
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 4.2)
\end{itemize}
\item THBS1
\begin{itemize}
\item PantherDB subfamily - Thrombospondin-1
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 5.0)
\end{itemize}
\item VEGFA
\begin{itemize}
\item PantherDB subfamily - Vascular endothelial growth factor A
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 7.5)
\end{itemize}
\end{itemize}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
8 & F2R & - & F2RL1 & PRTN3 \\
\hline
12 & TNFRSF11B & THBS1 & VEGFA & - \\
\hline
14 & STAT6 & - & ACSL3,IFI16 & TRIM56,TMEM173,SLC39A14 \\
\hline
20 & - & F12 & MMP12 & - \\
\hline
25 & SMAD4 & - & - & ZMIZ1 \\
\hline
26 & CD44 & - & - & SCYL3 \\
\hline
35 & CEACAM1 & - & - & CLEC4M \\
\hline
57 & BIRC5 & - & - & KCNJ6 \\
\hline
58 & ALOX15B & - & - & ERAL1 \\
\hline
69 & - & CRIP2 & UXT,RELA,ALPL & NR1H4,NME5 \\
\hline
\end{tabular}
    \caption{iRefWeb network ranked with MAA and Movember data - matched 172
    test genes from the Movember data set out of 271 possible}
\label{tab:maa-movember}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with MAA and tested with Movember data}
(table: \ref{tab:maa-movember})
\begin{itemize}
\item F2R
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 1
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 3.3)
\end{itemize}
\item F2RL1
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 2
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 2.4)
\end{itemize}
\item PRTN3
\begin{itemize}
\item PantherDB subfamily - Myeloblastin
\item \gls{jensen} relation - Relation to several diseases, cancer
is not among them
\end{itemize}
\end{itemize}
\section{Curated prostate cancer genes from the COSMIC database}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
6 & TNFAIP3 & - & RNF14,TAGLN & - \\
\hline
11 & BIRC3 & - & - & BIRC2 \\
\hline
12 & CXCR4 & - & - & CXCL14 \\
\hline
17 & - & DAXX & TGFBR3,ACVR2A & TCTEX1D4 \\
\hline
18 & RHOA & - & AGER & GMIP \\
\hline
24 & - & ELN & EFEMP2,SOD3,FBLN5 & - \\
\hline
37 & AR & KMT2A & MAK,TSPY1 & PKLR,HHAT \\
\hline
39 & TOP1 & - & - & RCVRN \\
\hline
44 & WIF1 & - & - & WNT11 \\
\hline
46 & SMAD4 & - & - & ZMIZ1 \\
\hline
\end{tabular}
\caption{iRefWeb network ranked with PRWP and COSMIC data - matched 423 test
genes from the COSMIC data set out of 580 possible}
\label{tab:prwp-cosmic}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with PRWP and tested with COSMIC data}
(table: \ref{tab:prwp-cosmic})
\begin{itemize}
\item TNFAIP3
\begin{itemize}
\item PantherDB subfamily - Tumor necrosis factor alpha-induced
protein 3
\item \gls{jensen} relation - Relation to several diseases, among them
cancer (z-score of 3.3)
\end{itemize}
\item RNF14
\begin{itemize}
\item PantherDB subfamily - Ubiquitin-protein ligase
\item \gls{jensen} relation - Relation to several diseases, among them
specifically prostate cancer (z-score of 3.6)
\end{itemize}
\item TAGLN
\begin{itemize}
\item PantherDB subfamily - Transgelin
\item \gls{jensen} relation - Relation to several diseases, among them
cancer (z-score of 4.2)
\end{itemize}
\end{itemize}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
1 & TNFAIP3 & - & RNF14,TAGLN & - \\
\hline
10 & RHOA & - & AGER & GMIP \\
\hline
13 & AR & KMT2A & MAK,TSPY1 & PKLR,HHAT \\
\hline
14 & STAT6,ACSL3 & - & IFI16 & TRIM56,TMEM173,SLC39A14 \\
\hline
15 & - & DAXX & TGFBR3,ACVR2A & TCTEX1D4 \\
\hline
17 & BIRC3 & - & - & BIRC2 \\
\hline
19 & - & SUFU & PIAS1 & - \\
\hline
25 & SMAD4 & - & - & ZMIZ1 \\
\hline
29 & SET & - & - & TAF1C \\
\hline
37 & TOP1 & - & - & RCVRN \\
\hline
\end{tabular}
\caption{iRefWeb network ranked with MAA and COSMIC data - matched 277 test
genes from the COSMIC data set out of 580 possible}
\label{tab:maa-cosmic}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with MAA and tested with COSMIC data}
(table: \ref{tab:maa-cosmic})
\begin{itemize}
\item TNFAIP3
\begin{itemize}
\item PantherDB subfamily - Tumor necrosis factor alpha-induced
protein 3
\item \gls{jensen} relation - Relation to several diseases, among them
cancer (z-score of 3.3)
\end{itemize}
\item RNF14
\begin{itemize}
\item PantherDB subfamily - Ubiquitin-protein ligase
            \item \gls{jensen} relation - Relation to several diseases, among them
                specifically prostate cancer (z-score of 3.6)
\end{itemize}
\item TAGLN
\begin{itemize}
\item PantherDB subfamily - Transgelin
\item \gls{jensen} relation - Relation to several diseases, among them
cancer (z-score of 4.2)
\end{itemize}
\end{itemize}
\section{Proven prostate cancer genes that resulted in lethal outcome for the patient}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
25 & F2R & - & F2RL1 & PRTN3 \\
\hline
28 & - & RRM2 & MAGEA3, MAGEA1 & DNM1L, PGAM5, SCG3 \\
\hline
31 & ALOX15B & - & - & ERAL1 \\
\hline
55 & DPP4 & - & VIP, GHRH, ADCYAP1 & PYY, AVPR1A, GCG, GIP, TAC1, FAP, NPPB \\
\hline
79 & BIRC5 & - & - & KCNJ6 \\
\hline
88 & SERPINA3 & - & KLK4 & CTRC, GZMM, SGCD \\
\hline
91 & - & CRIP2 & UXT, RELA, ALPL & NR1H4, NME5 \\
\hline
92 & CCNB1 & UBE2C & - & UBE3D \\
\hline
114 & JAG1 & - & - & NEURL1, CD46 \\
\hline
120 & - & CYB5A & CYP17A1, CYP3A4, CYP3A5, CYP2E1 & CYP4F2, CYP4A11 \\
\hline
\end{tabular}
    \caption{iRefWeb network ranked with PRWP and lethal prostate cancer data
    - matched 99 test genes from the lethal prostate cancer data set out of 157
    possible}
\label{tab:prwp-lethal}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with PRWP and tested with lethal prostate cancer data}
(table: \ref{tab:prwp-lethal})
\begin{itemize}
\item F2R
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 1
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 3.3)
\end{itemize}
\item F2RL1
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 2
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 2.4)
\end{itemize}
\item PRTN3
\begin{itemize}
\item PantherDB subfamily - Myeloblastin
\item \gls{jensen} relation - Relation to several diseases, cancer
is not among them
\end{itemize}
\end{itemize}
\begin{sidewaystable}
\begin{tabular}{|l|l|l|l|l|}
\hline
\textbf{Rank}
& \textbf{Test biomarkers}
& \textbf{Test candidates}
& \textbf{Remaining biomarkers}
& \textbf{Remaining candidates} \\
\hline
8 & F2R & - & F2RL1 & PRTN3 \\
\hline
57 & BIRC5 & - & - & KCNJ6 \\
\hline
58 & ALOX15B & - & - & ERAL1 \\
\hline
69 & - & CRIP2 & UXT,RELA,ALPL & NR1H4,NME5 \\
\hline
79 & - & RRM2 & MAGEA3,MAGEA1 & DNM1L,PGAM5,SCG3 \\
\hline
92 & CCNB1 & UBE2C & - & UBE3D \\
\hline
105 & JAG1 & - & - & NEURL1,CD46 \\
\hline
106 & DPP4 & - & VIP,GHRH,ADCYAP1 & PYY,AVPR1A,GCG,GIP,TAC1,FAP,NPPB \\
\hline
109 & SERPINA3 & - & KLK4 & CTRC,GZMM,SGCD \\
\hline
112 & - & CYB5A & CYP17A1,CYP3A4,CYP3A5,CYP2E1 & CYP4F2,CYP4A11 \\
\hline
\end{tabular}
    \caption{iRefWeb network ranked with MAA and lethal prostate cancer data
    - matched 66 test genes from the lethal prostate cancer data set out of 157
    possible}
\label{tab:maa-lethal}
\end{sidewaystable}
\textbf{Top ranked cluster ranked with MAA and tested with lethal prostate cancer data}
(table: \ref{tab:maa-lethal})
\begin{itemize}
\item F2R
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 1
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 3.3)
\end{itemize}
\item F2RL1
\begin{itemize}
\item PantherDB subfamily - Proteinase-activated receptor 2
\item \gls{jensen} relation - Relation to several diseases, among
them cancer (z-score of 2.4)
\end{itemize}
\item PRTN3