netZoo
diff --git a/netbooks/netZooR/ApplicationwithTBdataset.ipynb b/netbooks/netZooR/ApplicationwithTBdataset.ipynb
@@ -179,10 +179,10 @@
    "source": [
     "## 1.2. Data Sources\n",
     "\n",
-    "PANDA<sup>1</sup> builds a gene regulatory network by integrating three sources of data: 1) TF motif data, 2) TF PPI network , and 3) gene expression data.\n",
+    "PANDA<sup>1</sup> builds a gene regulatory network by integrating three sources of data: 1) TF motif data, 2) a TF PPI network, and 3) gene expression data.\n",
     "\n",
     "### Motif data\n",
-    "An example specie-sepcific PANDA-ready transcription factor binding motif data is included in the netZooR package, which are derived from motif scan and motif info files located on https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resourcesby.\n",
+    "An example of species-specific, PANDA-ready transcription factor binding motif data is included in the netZooR package. It is derived from the motif scan and motif info files located at https://sites.google.com/a/channing.harvard.edu/kimberlyglass/tools/resourcesby. Motif data is a data frame with three columns: 1) TF (source node), 2) Gene (target node), and 3) weight, a binary (0/1) value that indicates the presence of a TF motif in the promoter region of the target gene.\n",
     "\n",
     "### PPI\n",
     "This package includes a function `source.PPI` to build a Protein-Protein Interaction (PPI) network through the STRING database given a list of proteins of interest. The [STRINGdb](http://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) package is already loaded when netZooR is loaded."
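The three-column motif format described above can be illustrated with a toy data frame. This is a minimal sketch in base R: the TF and gene names are hypothetical and are not taken from `chip.txt`, and the column names `TF`, `Gene`, and `weight` simply follow the description in the text.

```r
# Toy motif prior with hypothetical TF and gene names (not from chip.txt).
motif_toy <- data.frame(
  TF     = c("TF1", "TF1", "TF2"),       # source node
  Gene   = c("geneA", "geneB", "geneA"), # target node
  weight = c(1, 0, 1),                   # 1 = motif present in promoter, 0 = absent
  stringsAsFactors = FALSE
)

# Check the expected shape: three columns and strictly binary weights.
is_valid_motif <- ncol(motif_toy) == 3 && all(motif_toy$weight %in% c(0, 1))
print(is_valid_motif)
```

The same kind of check can be run on a real motif table after reading it with `read.table`, assuming it follows the three-column layout.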
@@ -201,7 +201,15 @@
     "motif <- read.table(motif_file_path, sep=\"\\t\")\n",
     "# create a data frame with the TF column\n",
     "TF  <- data.frame(motif[,1])\n",
-    "PPI <- source.PPI(TF, STRING.version=\"10\", species.index=83332, score_threshold=0)"
+    "PPI <- source.PPI(TF, STRING.version=\"10\", species.index=83332, score_threshold=0)\n",
+    "PPI"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "PPI data has three columns: 1) source node (TF), 2) target node (TF), and 3) weight, a value between 0 and 1 that indicates the strength of the connection between these two TFs."
    ]
   },
   {
@@ -218,7 +226,7 @@
    },
    "source": [
     "We will use TB example datasets that are integrated in netZooR package.\n",
-    "In this application, we will build a case and control network using 2 gene expression dataset, one transcription factor binding motifs dataset, and one protein-protein interaction datasets from the netZooR package or they can be fetched through AWS.\n",
+    "In this application, we will build a case and a control network using two gene expression datasets, one transcription factor binding motif dataset, and one protein-protein interaction dataset from the netZooR package. These data can also be fetched through AWS.\n",
     "\n",
     "Using the data in the package, we need to specify the file path of these files as follows:"
    ]
@@ -233,7 +241,7 @@
     "treated_expression_file_path <- system.file(\"extdata\", \"expr4.txt\", package = \"netZooR\", mustWork = TRUE)\n",
     "control_expression_file_path <- system.file(\"extdata\", \"expr10.txt\", package = \"netZooR\", mustWork = TRUE)\n",
     "motif_file_path <- system.file(\"extdata\", \"chip.txt\", package = \"netZooR\", mustWork = TRUE)\n",
-    "ppi_file_path <- system.file(\"extdata\", \"ppi.txt\", package = \"netZooR\", mustWork = TRUE)"
+    "ppi_file_path   <- system.file(\"extdata\", \"ppi.txt\", package = \"netZooR\", mustWork = TRUE)"
    ]
   },
   {
@@ -242,7 +250,7 @@
     "lines_to_next_cell": 0
    },
    "source": [
-    "or, they can be downloaded to working directory from AWS."
+    "Alternatively, they can be downloaded to the working directory from AWS."
    ]
   },
   {
@@ -252,9 +260,13 @@
    "outputs": [],
    "source": [
     "if (runserver==0){\n",
+    "    # case gene expression\n",
     "    system(\"curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/expr4.txt\")\n",
+    "    # control gene expression\n",
     "    system(\"curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/expr10.txt\")\n",
+    "    # motif data\n",
     "    system(\"curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/chip.txt\")\n",
+    "    # PPI data\n",
     "    system(\"curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/example_datasets/ppi.txt\")\n",
     "}"
    ]
@@ -277,7 +289,7 @@
     "treated_expression_file_path <- paste0(ppath,\"expr4.txt\")\n",
     "control_expression_file_path <- paste0(ppath,\"expr10.txt\")\n",
     "motif_file_path <- paste0(ppath,\"chip.txt\")\n",
-    "ppi_file_path <- paste0(ppath,\"ppi.txt\")"
+    "ppi_file_path   <- paste0(ppath,\"ppi.txt\")"
    ]
   },
   {
@@ -288,9 +300,9 @@
    "source": [
     "# 3. PANDA algorithm\n",
     "\n",
-    "Then, we assign the file paths defined previously in the PANDA call to \"expression dataset\", \"motif dataset\", and \"PPI\" dataset. Then we set option `rm_missing` to `TRUE` to remove TFs and genes that are not present in all three inputs.\n",
+    "We pass the file paths defined previously to the `expr_file`, `motif_file`, and `ppi_file` arguments of the PANDA call. We also set the option `remove_missing` to `TRUE` to remove TFs and genes that are not present in all three inputs.\n",
     "\n",
-    "We do this operation for both case and control networks. "
+    "We do this operation for both the case and control networks, starting with the case network:"
    ]
   },
   {
@@ -304,7 +316,22 @@
    },
    "outputs": [],
    "source": [
-    "treated_all_panda_result <- panda.py(expr_file = treated_expression_file_path, motif_file = motif_file_path, ppi_file= ppi_file_path,modeProcess=\"legacy\",  remove_missing = TRUE )\n",
+    "treated_all_panda_result <- panda.py(expr_file = treated_expression_file_path, motif_file = motif_file_path, ppi_file= ppi_file_path,modeProcess=\"legacy\",  remove_missing = TRUE )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, the control network:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "control_all_panda_result <- panda.py(expr_file = control_expression_file_path,motif_file = motif_file_path, ppi_file= ppi_file_path,modeProcess=\"legacy\",  remove_missing = TRUE )"
    ]
   },
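The idea behind removing TFs and genes that are not shared across the inputs can be sketched in base R. This is only an illustration under stated assumptions: the toy inputs and all names are hypothetical, genes are matched between the expression and motif data, and TFs are matched between the motif and PPI data. Whether netZooR applies exactly this rule internally is an assumption.

```r
# Toy inputs with hypothetical names (not the TB example data).
expr_genes <- c("geneA", "geneB", "geneC")
motif_toy  <- data.frame(TF = c("TF1", "TF2", "TF3"),
                         Gene = c("geneA", "geneB", "geneD"),
                         weight = 1,
                         stringsAsFactors = FALSE)
ppi_toy    <- data.frame(TF1 = c("TF1", "TF2"),
                         TF2 = c("TF2", "TF4"),
                         weight = c(0.9, 0.4),
                         stringsAsFactors = FALSE)

# Genes kept: present in both the expression data and the motif prior.
keep_genes <- intersect(expr_genes, motif_toy$Gene)

# TFs kept: present in both the motif prior and the PPI network.
keep_tfs <- intersect(motif_toy$TF, union(ppi_toy$TF1, ppi_toy$TF2))

# Filter each prior down to the shared TFs and genes.
motif_filtered <- motif_toy[motif_toy$TF %in% keep_tfs &
                            motif_toy$Gene %in% keep_genes, ]
ppi_filtered   <- ppi_toy[ppi_toy$TF1 %in% keep_tfs &
                          ppi_toy$TF2 %in% keep_tfs, ]

print(motif_filtered)
print(ppi_filtered)
```

In this toy case, `TF3`, `TF4`, `geneC`, and `geneD` each appear in only one input, so the edges that involve them are dropped.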
@@ -314,9 +341,9 @@
     "lines_to_next_cell": 0
    },
    "source": [
-    "Vector `treated_all_panda_result` and vector `control_all_panda_result` below are large lists with three elements: the entire PANDA network, indegree (\"to\" nodes) nodes and score, outdegree (\"from\" nodes) nodes and score. Use `$panda`,`$indegree` and `$outdegree` to access each list item resepctively.\n",
+    "The results `treated_all_panda_result` and `control_all_panda_result` are large lists with three elements: the entire PANDA network, the gene targeting scores (node indegree), and the TF targeting scores (node outdegree). Use `$panda`, `$indegree`, and `$outdegree` to access each list item, respectively.\n",
     "\n",
-    "Use `$panda`to access the entire PANDA network."
+    "We can use `$panda` to access the entire PANDA network."
    ]
   },
   {
@@ -326,15 +353,25 @@
    "outputs": [],
    "source": [
     "treated_net <- treated_all_panda_result$panda\n",
-    "control_net <- control_all_panda_result$panda"
+    "control_net <- control_all_panda_result$panda\n",
+    "treated_net"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The PANDA network is a data frame with four columns: a source column (TFs), a target column (genes), a binary motif column that is identical to the input motif network, and a score column (`Score`) that holds the edge weight in the PANDA network."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## PANDA Cytoscape Plotting\n",
-    "Cytoscape is an interactivity network visualization tool highly recommanded to explore the PANDA network. Before using this function `plot.panda.in.cytoscape`, please install and launch Cytoscape (3.6.1 or greater) and keep it running whenever using."
+    "Cytoscape is an interactive network visualization tool that is highly recommended for exploring the PANDA network. Before using the function `plot.panda.in.cytoscape`, please install and launch Cytoscape (3.6.1 or greater) and keep it running whenever you use this function.\n",
+    "\n",
+    "Before calling this function, we need to reduce the network size by selecting the top 1000 edges in the PANDA network by edge weight."
    ]
   },
   {
@@ -345,7 +382,6 @@
    },
    "outputs": [],
    "source": [
-    "# select top 1000 edges in PANDA network by edge weight.\n",
     "panda.net <- head(treated_net[order(treated_net$Score,decreasing = TRUE),], 1000)"
    ]
   },
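Selecting the top edges by weight can be tried on a toy network first. The sketch below uses base R only. The data frame is hypothetical, and its column names (`TF`, `Gene`, `Motif`, `Score`) are an assumption that mirrors the `$Score` accessor used in the notebook code.

```r
# Toy PANDA-like edge list with hypothetical names and weights.
edges_toy <- data.frame(
  TF    = c("TF1", "TF1", "TF2", "TF2"),
  Gene  = c("geneA", "geneB", "geneA", "geneB"),
  Motif = c(1, 0, 1, 0),
  Score = c(0.5, 2.1, -0.3, 1.4),
  stringsAsFactors = FALSE
)

# Rank the edges of the network by their own weight, largest first,
# and keep the first top_k rows.
top_k     <- 2
top_edges <- head(edges_toy[order(edges_toy$Score, decreasing = TRUE), ], top_k)
print(top_edges)
```

Ordering a network by its own `Score` column keeps the ranking and the selected rows consistent with each other.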
@@ -371,7 +407,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "On netbooks, we can use the visNetwork library to plot the largest 100 edges of the graph. We need to prepare the data in the required format."
+    "On the netbooks server, we can use the visNetwork library to plot the 100 largest edges of the graph. We need to prepare the data in the required format."
    ]
   },
   {
@@ -399,7 +435,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Then, we can call visNetwork."
+    "Then, we can call visNetwork on the newly constructed data frame. TFs are yellow triangles, genes are blue circles, positive edges are colored in green, and negative edges in red."
    ]
   },
   {
@@ -421,7 +457,7 @@
    "metadata": {},
    "source": [
     "# 4. LIONESS Algorithm \n",
-    "How to run LIONESS is mostly idential with method how to run PANDA in this package, unless the return values of `lioness.py()` is a data frame where first two columns represent TFs (regulators) and Genes (targets) while the rest columns represent each sample. each cell filled with estimated score calculated by LIONESS."
+    "LIONESS reconstructs single-sample networks for each gene expression sample from an aggregate network such as PANDA. LIONESS uses the same arguments as PANDA. In this example, we will run the LIONESS algorithm for the first two samples. If we don't specify the `start_sample` and `end_sample` arguments, LIONESS will generate networks for all samples."
    ]
   },
   {
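The single-sample idea can be sketched with the LIONESS interpolation, usually written as e(q) = N * (e(all) - e(all minus q)) + e(all minus q), where N is the number of samples. The sketch below is not what `lioness.py()` runs: it swaps in Pearson correlation as the aggregate network instead of PANDA so that it stays self-contained in base R, and the expression matrix is random toy data.

```r
# Toy expression matrix: 4 hypothetical genes x 6 samples of random values.
set.seed(1)
expr_toy <- matrix(rnorm(4 * 6), nrow = 4,
                   dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))
n_samples <- ncol(expr_toy)

# Aggregate network on all samples (Pearson correlation stands in for PANDA).
agg_all <- cor(t(expr_toy))

# Network for sample q: rebuild the aggregate without q, then interpolate.
lioness_net <- function(q) {
  agg_minus_q <- cor(t(expr_toy[, -q, drop = FALSE]))
  n_samples * (agg_all - agg_minus_q) + agg_minus_q
}

# Networks for the first two samples only, as in the notebook example.
sample_nets <- lapply(1:2, lioness_net)
print(dim(sample_nets[[1]]))
```

Restricting the loop to `1:2` plays the same role as the `start_sample` and `end_sample` arguments in the notebook.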
@@ -435,18 +471,25 @@
    },
    "outputs": [],
    "source": [
-    "# Run LIONESS algorithm for the first two samples\n",
-    "# removing start_sample and end_sample arguments to generate whole LIONESS network with all samples.\n",
-    "control_lioness_result <- lioness.py(expr_file = control_expression_file_path,motif_file = motif_file_path, ppi_file= ppi_file_path,modeProcess=\"legacy\",  remove_missing = TRUE, start_sample=1, end_sample=2)"
+    "control_lioness_result <- lioness.py(expr_file = control_expression_file_path,motif_file = motif_file_path, ppi_file= ppi_file_path,modeProcess=\"legacy\",  remove_missing = TRUE, start_sample=1, end_sample=2)\n",
+    "control_lioness_result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output of `lioness.py()` is a data frame in which the first two columns represent TFs (regulators) and genes (targets), and each remaining column represents one sample. Each cell holds the edge weight estimated by LIONESS."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# 5. CONDOR Algorithm and plotting\n",
-    "PANDA network can simply be converted into condor.object by `panda.to.condor.object(panda.net, threshold)`\n",
-    "Defaults option  `threshold` is the average of [median weight of non-prior edges] and [median weight of prior edges], all weights mentioned previous are transformationed with formula `w'=ln(e^w+1)` before calculating the median and average. But all the edges selected will remain the orginal weights calculated by PANDA."
+    "CONDOR detects communities in gene regulatory networks, like those built by PANDA. However, a few processing steps are needed to make the network compliant with the CONDOR format.\n",
+    "PANDA networks can be converted into a condor.object with `panda.to.condor.object(panda.net, threshold)`.\n",
+    "The default `threshold` is the average of [median weight of non-prior edges] and [median weight of prior edges]. All weights are transformed with the formula `w'=ln(e^w+1)` before calculating the medians and their average, which makes all edge weights positive for CONDOR. The selected edges keep the original weights calculated by PANDA."
    ]
   },
   {
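The default threshold described above can be worked through on a toy network. This is a sketch of the description in the text, not the netZooR implementation. The data frame and its column names are hypothetical, and comparing the transformed weights against the threshold to select edges is an assumption.

```r
# Toy PANDA-like network: Motif == 1 marks prior edges, 0 marks non-prior edges.
net_toy <- data.frame(
  TF    = c("TF1", "TF1", "TF2", "TF2"),
  Gene  = c("geneA", "geneB", "geneA", "geneB"),
  Motif = c(1, 0, 1, 0),
  Score = c(2.0, -1.0, 1.0, -3.0),
  stringsAsFactors = FALSE
)

# Transformation from the text: w' = ln(e^w + 1), which is always positive.
transform_weight <- function(w) log(exp(w) + 1)
w_t <- transform_weight(net_toy$Score)

# Threshold: average of the median transformed weight of prior edges
# and the median transformed weight of non-prior edges.
threshold <- mean(c(median(w_t[net_toy$Motif == 1]),
                    median(w_t[net_toy$Motif == 0])))

# Assumed selection rule: keep edges whose transformed weight exceeds the
# threshold, while retaining the original PANDA weights in the Score column.
kept <- net_toy[w_t > threshold, ]
print(threshold)
print(kept)
```

In the notebook itself, `threshold = 0` is passed explicitly, so this default is not used there.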
@@ -458,13 +501,29 @@
     "treated_condor_object <- panda.to.condor.object(treated_net, threshold = 0)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, CONDOR can be called on the converted object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "treated_condor_object <- condor.cluster(treated_condor_object, project = FALSE)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
     "lines_to_next_cell": 0
    },
    "source": [
-    "The communities structure can be plotted by igraph."
+    "The result of CONDOR is a community assignment for each node of the network. The community structure can be plotted with igraph."
    ]
   },
   {
@@ -473,19 +532,25 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "treated_condor_object <-condor.cluster(treated_condor_object,project = FALSE)\n",
     "treated_color_num <- max(treated_condor_object$red.memb$com)\n",
-    "treated_color <- viridis(treated_color_num, alpha = 1, begin = 0, end = 1, direction = 1, option = \"D\")\n",
+    "treated_color     <- viridis(treated_color_num, alpha = 1, begin = 0, end = 1, direction = 1, option = \"D\")\n",
     "condor.plot.communities(treated_condor_object, color_list=treated_color, point.size=0.04, xlab=\"Genes\", ylab=\"TFs\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This plot shows that CONDOR estimates the TB network to have three distinct communities."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "# 6. ALPACA Algorithm\n",
     "\n",
-    "ALPACA community structure can also be generated from two PANDA network by `panda.to.alpaca`"
+    "ALPACA compares two networks by detecting differences in their community structure. For example, ALPACA can be called on two PANDA networks. The function `panda.to.alpaca` links both methods."
    ]
   },
   {
@@ -501,9 +566,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## More tutorials\n",
+    "The result list `alpaca` contains two slots. The first is a community assignment for each node, and the second is a modularity score for each node, which indicates the contribution of that node to the modularity of the community it belongs to."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# More tutorials\n",
     "\n",
-    "Browse with `browseVignettes(\"netZooR\")` locally or check this link for [cloud notebooks](http://netbooks.networkmedicine.org/).\n",
+    "Browse with `browseVignettes(\"netZooR\")` locally or check [this link for cloud notebooks](http://netbooks.networkmedicine.org/).\n",
     "\n",
     "## Note\n",
     "If there is an error like `Error in fetch(key) : lazy-load database.rdb' is corrupt` when accessing the help pages of functions in this package after it has been loaded, it is [a limitation of base R](https://github.com/r-lib/devtools/issues/1660) that has not been solved yet. Restarting the R session and reloading the package will help.\n"
@@ -523,9 +595,7 @@
     "\n",
     "4- Padi, Megha, and John Quackenbush. \"Detecting phenotype-driven transitions in regulatory network structure.\" NPJ systems biology and applications 4.1 (2018): 1-12.\n",
     "\n",
-    "5- Kuijjer, Marieke Lydia, et al. \"Cancer subtype identification using somatic mutation data.\" British journal of cancer 118.11 (2018): 1492-1501.\n",
-    "\n",
-    "6- Schlauch, Daniel, et al. \"Estimating drivers of cell state transitions using gene regulatory network models.\" BMC systems biology 11.1 (2017): 1-10."
+    "5- Kuijjer, Marieke Lydia, et al. \"Cancer subtype identification using somatic mutation data.\" British journal of cancer 118.11 (2018): 1492-1501."
    ]
   }
  ],