Commit 736a29f

committed
update netbooks
1 parent 39a84f2 commit 736a29f

18 files changed, +2621 −666 lines

netbooks/Welcome_to_netBooks.ipynb

Lines changed: 4 additions & 4 deletions
@@ -28,7 +28,7 @@
2828
"#### Run netbooks on the server\n",
2929
"When you log in to netbooks, you will be given an anonymous login token which can be seen in the URL (http://netbooks.networkmedicine.org/user/token/). Each user gets a dedicated space on disk and enough access to CPU and RAM to be able to run the original tutorials. You can change the parameters in each tutorial to further explore the use cases, and you can even create a new notebook.\n",
3030
" \n",
31-
"However, since token are temporary user IDs, the working space and all the files will be deleted as soon as you logout from netbooks or after your session has been idle for 1 hour. Therefore, please don't import work on netbooks as the user files are not persistent on disk. netbooks is meant for learning and exploring application cases of network biology and to promote reproducibile analyses by providing containerized software tools. Please check how to run netbooks locally or consider [Google colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true) for persistent work spaces.\n",
31+
"However, since token are temporary user IDs, the working space and all the files will be deleted as soon as you logout from netbooks or after your session has been idle for 2 hours. Therefore, please don't import work on netbooks as the user files are not persistent on disk. netbooks is meant for learning and exploring application cases of network biology and to promote reproducibile analyses by providing containerized software tools. Please check how to run netbooks locally or consider [Google colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true) for persistent work spaces.\n",
3232
"\n",
3333
"#### Run netbooks locally\n",
3434
"To run the netbooks on your local machine, you can clone the netbooks [GitHub repository](https://github.com/netZoo/netbooks) and install the dependencies required in the beginning of each tutorial. We've also provided links to a public AWS S3 bucket to download all the data needed to run the analysis. These files can be downloaded using the file URLs in the netbook as follows: `curl -O urlToFile`\n",
@@ -38,14 +38,14 @@
3838
"\n",
3939
"- **Vignettes** are brief code samples that demonstrate how the methods can be used, their inputs, and how to interpret their output\n",
4040
"\n",
41-
"- **Case studies and published studies** are investigatations that provide biological or methodlogical insights. Published studies netbooks allow to reproduce the numerical results and figures of published papers.\n",
41+
"- **Case studies and published studies** are investigations that provide biological or methodlogical insights. Published studies netbooks allow to reproduce the numerical results and figures of published papers.\n",
4242
"\n",
4343
"### Requirements\n",
4444
"Netbooks works best on Google Chrome. Some network visualization features are only available through Google Chrome.\n",
4545
"\n",
4646
"### Issues and suggestions\n",
4747
"\n",
48-
"To open an issue, please use netbooks's [GitHub repository](https://github.com/netZoo/netbooks/issues).\n",
48+
"To open an issue, please use netbooks' [GitHub repository](https://github.com/netZoo/netbooks/issues).\n",
4949
"\n",
5050
"### Notebook catalog\n",
5151
"\n",
@@ -66,7 +66,7 @@
6666
" \n",
6767
" - [Using CONDOR for community detection in bipartite graphs](netZooR/CONDOR.ipynb)\n",
6868
" \n",
69-
" - [Using netZooR to analyze the regulatory processes of obesity in colon cancer](netZooR/netZooR_tutorial_coloncancer.ipynb)\n",
69+
" - [Using netZooR to find associations betwen colon cancer and obesity](netZooR/netZooR_tutorial_coloncancer.ipynb)\n",
7070
" \n",
7171
" - [YARN: Robust Multi-Tissue RNA-Seq Preprocessing and Normalization](netZooR/yarn.ipynb)\n",
7272
" \n",

netbooks/netZooPy/Building_a_regulation_prior_network.ipynb

Lines changed: 70 additions & 39 deletions
@@ -14,7 +14,7 @@
1414
"cell_type": "markdown",
1515
"metadata": {},
1616
"source": [
17-
"## Introduction\n",
17+
"# 1. Introduction\n",
1818
"\n",
1919
"Several Network Zoo [netzoo](netzoo.github.io) tools require a regulation prior network ($W_0$) to use it as an initial estimate and a starting point for the inference of the final network $W$. Regulation prior networks are based on Transcription Factor Binding Sites (TFBS) detected in the promoter regions of target genes.\n",
2020
"\n",
@@ -28,7 +28,7 @@
2828
"\n",
2929
"- Finally, we will derive the regulation prior network as a discrete binary network and a continuous network by using several derivations. We are particularly interested in the continuous derivations because [as shown previously](Controlling_The_Variance_Of_PANDA_Networks.ipynb) binary priors induce a strong bias on the final network.\n",
3030
"\n",
31-
"Some parts of this notebook are intended for demonstration purposes only, because running the whole pipeline takes about a week on a 36 core machine. The variable `precomputed` can be set to 1 to avoid executing the code on the server of the time-consuming part."
31+
"Some parts of this notebook are intended for demonstration purposes only, because running the whole pipeline takes about a week on a 36 core machine. This pipeline can be ran on the server or locally by setting the `runserver` variable to 1 to avoid executing the code on the server of the time-consuming part."
3232
]
3333
},
3434
{
@@ -37,7 +37,40 @@
3737
"metadata": {},
3838
"outputs": [],
3939
"source": [
40-
"# precomputed=1"
40+
"runserver=1"
41+
]
42+
},
43+
{
44+
"cell_type": "markdown",
45+
"metadata": {},
46+
"source": [
47+
"We also need to set the paths to the files on the server, if this notebook is ran on the server."
48+
]
49+
},
50+
{
51+
"cell_type": "code",
52+
"execution_count": null,
53+
"metadata": {},
54+
"outputs": [],
55+
"source": [
56+
"if runserver==1:\n",
57+
" ppath='/opt/data/netZooPy/regPrior/'"
58+
]
59+
},
60+
{
61+
"cell_type": "markdown",
62+
"metadata": {},
63+
"source": [
64+
" The variable `precomputed` can further speed up the analysis by using precomputed data at diffrent points of the analysis and can be set to 1."
65+
]
66+
},
67+
{
68+
"cell_type": "code",
69+
"execution_count": null,
70+
"metadata": {},
71+
"outputs": [],
72+
"source": [
73+
"precomputed=0"
4174
]
4275
},
4376
{
@@ -61,7 +94,6 @@
6194
"metadata": {},
6295
"outputs": [],
6396
"source": [
64-
"precomputed=0\n",
6597
"iterlimit=50"
6698
]
6799
},
@@ -78,20 +110,20 @@
78110
"metadata": {},
79111
"outputs": [],
80112
"source": [
81-
"import matplotlib.pyplot as plt\n",
82-
"import numpy as np\n",
113+
"import matplotlib.pyplot as plt # For plotting\n",
114+
"import numpy as np \n",
83115
"import os\n",
84-
"from Bio import SeqIO # to run Biopython\n",
85-
"import pandas as pd\n",
86-
"import multiprocessing # to run FIMO in parallel\n",
87-
"from functools import partial"
116+
"from Bio import SeqIO # To run Biopython\n",
117+
"import pandas as pd # To read input data\n",
118+
"import multiprocessing # To run FIMO in parallel\n",
119+
"from functools import partial # For parallel computing"
88120
]
89121
},
90122
{
91123
"cell_type": "markdown",
92124
"metadata": {},
93125
"source": [
94-
"## Extracting the sequences of promoter regions\n",
126+
"# 2. Extracting the sequences of promoter regions\n",
95127
"\n",
96128
"First, you need the sequences of human genes from the latest builld (hg38). The sequences can be downloaded from the UCSC website https://genome.ucsc.edu/cgi-bin/hgGateway. When you download the gene sequences, you have the option to pick the start nucleotide in relation to the Transcription Start Site (TSS) and the end nucleotide relative to the Transcription End Site (TSE). Here, we chose gene sequences that start at TSS-1000 base pairs and end at TSE+1000 basepairs. In total, there are 38723 gene sequences.\n",
97129
"\n",
@@ -123,7 +155,7 @@
123155
"outputs": [],
124156
"source": [
125157
"# read conversion file\n",
126-
"geneCorr = pd.read_csv('/opt/data/netZooPy/regPrior/hg38_Tss_coordinates.csv',sep='\\t')"
158+
"geneCorr = pd.read_csv(ppath+'hg38_Tss_coordinates.csv',sep='\\t')"
127159
]
128160
},
129161
{
@@ -140,7 +172,7 @@
140172
"outputs": [],
141173
"source": [
142174
"k=0 # iteration counter\n",
143-
"input_file ='/opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tse1000.fasta'\n",
175+
"input_file =ppath+'hg38_sequence_Tss1000_Tse1000.fasta'\n",
144176
"output_file='../data/hg38_sequence_Tss1000_Tss1000.fasta'\n",
145177
"if precomputed==0:\n",
146178
" fasta_sequences = SeqIO.parse(open(input_file),'fasta')\n",
@@ -188,7 +220,7 @@
188220
"cell_type": "markdown",
189221
"metadata": {},
190222
"source": [
191-
"## Getting and cleaning position weight matrices (PWMs)\n",
223+
"# 3. Getting and cleaning position weight matrices (PWMs)\n",
192224
"The second step is to collect the PWMs for each TF which characterize their DNA binding motifs. The PWMs will allow us afterwards to scan the sequences obtained earlier for TFBS using the FIMO<sup>1</sup> software. \n",
193225
"\n",
194226
"We collected PWMs from [the companion website](http://humantfs.ccbr.utoronto.ca/download.php) to Lambert et al.,<sup>2</sup> which correspond to the database [CIS-BP](http://cisbp.ccbr.utoronto.ca/) 1.94d. In total, there are PWMs for 1149 TFs.\n",
@@ -223,15 +255,15 @@
223255
"outputs": [],
224256
"source": [
225257
"if precomputed==0:\n",
226-
" os.chdir('/opt/data/netZooPy/regPrior/PWMs')\n",
258+
" os.chdir(ppath+'PWMs')\n",
227259
" # convert CIS-BP matrices to meme format\n",
228260
" for file in os.listdir():\n",
229261
" df=pd.read_csv(file,sep='\\t')\n",
230262
" df=df.iloc[:,1:]\n",
231263
" os.chdir(pathos)\n",
232264
" os.chdir('../data/convPWMs')\n",
233265
" df.to_csv(file, header=False, index=False, sep='\\t')\n",
234-
" os.chdir('/opt/data/netZooPy/regPrior/PWMs')\n",
266+
" os.chdir(ppath+'PWMs')\n",
235267
"\n",
236268
" # call meme suite meme2mat, some files were not analyzed because some nucleotide positions summed to zero\n",
237269
" os.chdir(pathos)\n",
@@ -265,7 +297,7 @@
265297
"source": [
266298
"if precomputed==0:\n",
267299
" # read TF motif table to select the \"best\" motif per TF\n",
268-
" tf =pd.read_csv(\"/opt/data/netZooPy/regPrior/Human_TF_MotifList_v_1.01.csv\",dtype=str)\n",
300+
" tf =pd.read_csv(ppath+\"Human_TF_MotifList_v_1.01.csv\",dtype=str)\n",
269301
" indTF =np.in1d(tf.iloc[:,6],finalTfList)\n",
270302
" tff =tf\n",
271303
" tf =tf.iloc[indTF,:]\n",
@@ -306,7 +338,7 @@
306338
"source": [
307339
"if precomputed==0:\n",
308340
" # put pwms in the same file\n",
309-
" os.chdir('/opt/data/netZooPy/regPrior/convPWM')\n",
341+
" os.chdir(ppath+'convPWM')\n",
310342
" finalMeme = 'MEME version 4\\n\\nALPHABET= ACGT\\n\\nstrands: + -\\n\\nBackground letter frequencies (from uniform background):\\nA 0.25000 C 0.25000 G 0.25000 T 0.25000 \\n\\n'\n",
311343
" k=0\n",
312344
" finalTFName=[]\n",
@@ -337,7 +369,7 @@
337369
"cell_type": "markdown",
338370
"metadata": {},
339371
"source": [
340-
"## Building the regulation prior network\n",
372+
"# 4. Building the regulation prior network\n",
341373
"\n",
342374
"In this step, we will scan the promoter sequences for TF binding motifs using FIMO. Our final network will determine the interactions between 1,149 TFs and 38,723 genes. First, we need to extract gene names and the number of TFs."
343375
]
@@ -349,11 +381,11 @@
349381
"outputs": [],
350382
"source": [
351383
"tfNames=[]\n",
352-
"with open(\"/opt/data/netZooPy/regPrior/tfNames.txt\", \"r\") as f:\n",
384+
"with open(ppath+\"tfNames.txt\", \"r\") as f:\n",
353385
" for line in f:\n",
354386
" tfNames.append(str(line.strip()))\n",
355387
"\n",
356-
"input_file = '/opt/data/netZooPy/regPrior/hg38_sequence_Tss1000_Tss1000.fasta'\n",
388+
"input_file = ppath+'hg38_sequence_Tss1000_Tss1000.fasta'\n",
357389
"geneNames = []\n",
358390
"fasta_sequences = SeqIO.parse(open(input_file),'fasta')\n",
359391
"for fasta in fasta_sequences:\n",
@@ -379,7 +411,7 @@
379411
"\n",
380412
"The following continuous models are structurally simialr with vairations in the parameters.\n",
381413
"\n",
382-
"### 1. Garcia-Alonso model\n",
414+
"## 4.1. Garcia-Alonso model\n",
383415
"This model has been applied on ChIP-seq data<sup>5</sup> to compute the strength of regulation $s$ between a gene $g$ and a TF $t$ using the following equation\n",
384416
"\n",
385417
"$\n",
@@ -425,7 +457,7 @@
425457
"cell_type": "markdown",
426458
"metadata": {},
427459
"source": [
428-
"### 2. Ouyang model\n",
460+
"## 4.2. Ouyang model\n",
429461
"The previous model is a modification of the Ouyang model<sup>6</sup> which has two main differences:\n",
430462
"\n",
431463
"- A scaling parameter that was set in the exponential to 5000bp for all TFs except E2f1 that was set to 500bp. In the previous model, the parameter was replaced by the median of all binding sites, therfore we will keep this modification for our study.\n",
@@ -461,7 +493,7 @@
461493
"cell_type": "markdown",
462494
"metadata": {},
463495
"source": [
464-
"### 3. RP model\n",
496+
"## 4.3. RP model\n",
465497
"\n",
466498
"The Regulatory Potential (RP) model<sup>7</sup> assumes a decay function as the distance from the TSS increases.\n",
467499
"\n",
@@ -500,7 +532,7 @@
500532
"cell_type": "markdown",
501533
"metadata": {},
502534
"source": [
503-
"## Computing the motif networks\n",
535+
"# 5. Computing motif networks\n",
504536
"\n",
505537
"Now, let's define a parallel loop to call FIMO on our sequences and compute two binary network and three continuous networks. First, we need to switch our local working directory where we have read and write rights."
506538
]
@@ -583,7 +615,6 @@
583615
"metadata": {},
584616
"outputs": [],
585617
"source": [
586-
"computeNetworks=1 # can be set to zero to skip the computation and go to the next section\n",
587618
"if iterlimit > -1:\n",
588619
" nTFs=iterlimit # Number of TFs is set ot the iteration limit defined earlier\n",
589620
"numPool=2 # the number of parallel workers"
@@ -595,7 +626,7 @@
595626
"metadata": {},
596627
"outputs": [],
597628
"source": [
598-
"if computeNetworks==1:\n",
629+
"if precomputed==0:\n",
599630
" # p-value binary network\n",
600631
" pool = multiprocessing.Pool(numPool)\n",
601632
" res = pool.map(partial(parallelFIMO,tfNames=tfNames,geneNames=geneNames,pqval=0), range(nTFs))\n",
@@ -631,8 +662,8 @@
631662
"cell_type": "markdown",
632663
"metadata": {},
633664
"source": [
634-
"## Processing the final network\n",
635-
"### Binary networks\n",
665+
"# 6. Processing the final network\n",
666+
"## 6.1. Binary networks\n",
636667
"We computed two binary networks: the first one is and FDR-corrected p-value network thresholded at 0.05 and the second is a p-value network thresholded at $1e^{-5}$. In other words, if the significance of binding determined by FIMO scan is less than a certain significance threshold, we will assign a binding event and the edge weight will be set to 1, otherwise the edge will be set to 0."
637668
]
638669
},
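The thresholding rule just described, sketched on a toy significance matrix (the values are made up for illustration; the real matrices are read from disk):

```python
import pandas as pd

# Toy FIMO significance matrix (TFs x genes); made-up values.
regmat = pd.DataFrame([[1e-6, 0.2], [0.03, 1e-7]],
                      index=["TF1", "TF2"], columns=["geneA", "geneB"])
tresh = 1e-5  # same variable name the notebook uses
binary = (regmat < tresh).astype(int)  # 1 = binding event, 0 = no edge
```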
@@ -642,7 +673,7 @@
642673
"metadata": {},
643674
"outputs": [],
644675
"source": [
645-
"regmatqval=pd.read_csv('/opt/data/netZooPy/regPrior/regMatQval005.csv',header=0,index_col=0)\n",
676+
"regmatqval=pd.read_csv(ppath+'regMatQval005.csv',header=0,index_col=0)\n",
646677
"tresh=0.05"
647678
]
648679
},
@@ -663,7 +694,7 @@
663694
"metadata": {},
664695
"outputs": [],
665696
"source": [
666-
"regmatpval=pd.read_csv('/opt/data/netZooPy/regPrior/regMatPval1e3.csv',header=0,index_col=0)\n",
697+
"regmatpval=pd.read_csv(ppath+'regMatPval1e3.csv',header=0,index_col=0)\n",
667698
"tresh=1e-5"
668699
]
669700
},
@@ -703,10 +734,10 @@
703734
"cell_type": "markdown",
704735
"metadata": {},
705736
"source": [
706-
"### Scaling the continuous networks\n",
737+
"## 6.2. Scaling the continuous networks\n",
707738
"We will first start by exploring the edge distribution that each continuous network has and scale them in order to be able to use them for network reconstruction methods.\n",
708739
"\n",
709-
"### 1. Garcia-Alonso model\n",
740+
"### 6.2.1. Garcia-Alonso model\n",
710741
"\n",
711742
"We start by loading the network that we computed to explore basic statistics."
712743
]
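One common way to bring such a network onto a comparable scale is min-max scaling to [0, 1]; this is a sketch under that assumption, not necessarily the exact scheme used later in the notebook:

```python
import pandas as pd

# Toy continuous prior (made-up values) scaled to [0, 1] by min-max.
cont = pd.DataFrame([[0.0, 2.0], [4.0, 8.0]],
                    index=["TF1", "TF2"], columns=["geneA", "geneB"])
lo, hi = cont.values.min(), cont.values.max()
scaled = (cont - lo) / (hi - lo)  # global min-max over all edges
```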
@@ -717,7 +748,7 @@
717748
"metadata": {},
718749
"outputs": [],
719750
"source": [
720-
"cont1=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont1.csv',header=0,index_col=0)\n",
751+
"cont1=pd.read_csv(ppath+'regMatCont1.csv',header=0,index_col=0)\n",
721752
"\n",
722753
"print('the maximum value is ',np.max(cont1.max()))\n",
723754
"print('the minimum value is ',np.min(cont1.min()))\n",
@@ -801,7 +832,7 @@
801832
"cell_type": "markdown",
802833
"metadata": {},
803834
"source": [
804-
"### 2. Oyuang model\n",
835+
"### 6.2.2. Oyuang model\n",
805836
"We do the same for the second model. Matrix density is identical to the first one, however, the edge weights are more dispersed because we added a significance term to each distance. Edge values are between 0 and 4624 and matrix density is 91% which is equal to the density of the previous model because the only difference was the addition of a significance term."
806837
]
807838
},
@@ -811,7 +842,7 @@
811842
"metadata": {},
812843
"outputs": [],
813844
"source": [
814-
"cont2=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont2.csv',header=0,index_col=0)\n",
845+
"cont2=pd.read_csv(ppath+'regMatCont2.csv',header=0,index_col=0)\n",
815846
"\n",
816847
"print('the maximum value is ',np.max(cont2.max()))\n",
817848
"print('the minimum value is ',np.min(cont2.min()))\n",
@@ -858,7 +889,7 @@
858889
"cell_type": "markdown",
859890
"metadata": {},
860891
"source": [
861-
"### 3. RP model\n",
892+
"### 6.2.3. RP model\n",
862893
"The RP model edge weight distribution varies between 0 and 7513. The weighted adjacency matrix has the same density than the two other matrices (91%). When we look at the equation of the RP model, we see that when the motif is exactly at the TSS or after the TSS, the distance $d$ to the TSS was set to 0. In the RP model equation, a distance 0 gives an edge weight of 0.6."
863894
]
864895
},
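The decay just described can be sketched numerically, assuming the regulatory-potential form exp(-(0.5 + 4d/d0)), which indeed gives about 0.6 at d = 0; `rp_weight` and the 100 kb range parameter d0 are illustrative assumptions:

```python
import numpy as np

def rp_weight(d, d0=1e5):
    """Regulatory-potential-style decay of a motif's contribution with
    its distance d (bp) from the TSS. Assumed form exp(-(0.5 + 4*d/d0));
    d0 = 100 kb is an illustrative choice of range parameter."""
    return float(np.exp(-(0.5 + 4.0 * d / d0)))
```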
@@ -868,7 +899,7 @@
868899
"metadata": {},
869900
"outputs": [],
870901
"source": [
871-
"cont3=pd.read_csv('/opt/data/netZooPy/regPrior/regMatCont3.csv',header=0,index_col=0)\n",
902+
"cont3=pd.read_csv(ppath+'regMatCont3.csv',header=0,index_col=0)\n",
872903
"\n",
873904
"print('the maximum value is ',np.max(cont3.max()))\n",
874905
"print('the minimum value is ',np.min(cont3.min()))\n",
@@ -915,7 +946,7 @@
915946
"cell_type": "markdown",
916947
"metadata": {},
917948
"source": [
918-
"## References\n",
949+
"# References\n",
919950
"\n",
920951
"1- Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. \"FIMO: scanning for occurrences of a given motif.\" Bioinformatics 27.7 (2011): 1017-1018.\n",
921952
"\n",
