Merge documentation branch

gaddamshreya1 · gaddamshreya1 · commit beaf1767de1f · 2021-11-10T19:27:15.000-08:00
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 [![PyPI version](https://badge.fury.io/py/tangram-sc.svg)](https://badge.fury.io/py/tangram-sc)
 
-Tangram is a Python package, written in [PyTorch](https://pytorch.org/) and based on [scanpy](https://scanpy.readthedocs.io/en/stable/), for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. The single-cell dataset and the spatial dataset should be collected from the same anatomical region/tissue type, ideally from a biological replicate, and need to share a set of genes. Tangram aligns the single-cell data in space by fitting gene expression on the shared genes. The best way to familiarize yourself with Tangram is to check out [our tutorial](https://github.com/broadinstitute/Tangram/blob/master/tangram_tutorial.ipynb). [![colab tutorial](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SVLUIZR6Da6VUyvX_2RkgVxbPn8f62ge?usp=sharing)
+Tangram is a Python package, written in [PyTorch](https://pytorch.org/) and based on [scanpy](https://scanpy.readthedocs.io/en/stable/), for mapping single-cell (or single-nucleus) gene expression data onto spatial gene expression data. The single-cell dataset and the spatial dataset should be collected from the same anatomical region/tissue type, ideally from a biological replicate, and need to share a set of genes. Tangram aligns the single-cell data in space by fitting gene expression on the shared genes. The best way to familiarize yourself with Tangram is to check out [our tutorial](https://github.com/broadinstitute/Tangram/blob/master/example/1_tutorial_tangram.ipynb). [![colab tutorial](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1gDmtiRN45OwCMu4n6l1uygQ_jIGe7NgJ)
 
 ![Tangram_overview](https://raw.githubusercontent.com/broadinstitute/Tangram/master/figures/tangram_overview.png)
 Tangram has been tested on various types of transcriptomic data (10Xv3, Smart-seq2 and SHARE-seq for single cell data; MERFISH, Visium, Slide-seq, smFISH and STARmap as spatial data). In our [preprint](https://www.biorxiv.org/content/10.1101/2020.08.29.272831v1), we used Tangram to reveal spatial maps of cell types and gene expression at single cell resolution in the adult mouse brain. More recently, we have applied our method to different tissue types including human lung, human kidney developmental mouse brain and metastatic breast cancer.
@@ -21,11 +21,16 @@ Tangram has been tested on various types of transcriptomic data (10Xv3, Smart-se
 
 To install Tangram, make sure you have [PyTorch](https://pytorch.org/) and [scanpy](https://scanpy.readthedocs.io/en/stable/) installed. If you need more details on the dependences, look at the `environment.yml` file. 
 
+* set up a conda environment for Tangram:
+```
+    conda env create -f environment.yml
+```
 * install tangram-sc from shell:
 ```
+    conda activate tangram-env
     pip install tangram-sc
 ```
-* import tangram
+* To start using Tangram, import it in your Jupyter notebooks and/or scripts:
 ```
     import tangram as tg
 ```
@@ -52,7 +57,7 @@ The returned AnnData,`ad_map`, is a cell-by-voxel structure where `ad_map.X[i, j
 
 The returned `ad_ge` is a voxel-by-gene AnnData, similar to spatial data `ad_sp`, but where gene expression has been projected from the single cells. This allows to extend gene throughput, or correct for dropouts, if the single cells have higher quality (or more genes) than single cell data. It can also be used to transfer cell types onto space. 
 
-For more details on how to use Tangram check out [our tutorial](https://github.com/broadinstitute/Tangram/blob/master/tangram_tutorial.ipynb). [![colab tutorial](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SVLUIZR6Da6VUyvX_2RkgVxbPn8f62ge?usp=sharing)
+For more details on how to use Tangram check out [our tutorial](https://github.com/broadinstitute/Tangram/blob/master/example/1_tutorial_tangram.ipynb). [![colab tutorial](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SVLUIZR6Da6VUyvX_2RkgVxbPn8f62ge?usp=sharing)
 
 ***
 
@@ -111,9 +116,6 @@ You do not need to segment cells in your histology for mapping on spatial transc
 #### I run out of memory when I map: what should I do?
 Reduce your spatial data in various parts and map each single part. If that is not sufficient, you will need to downsample your single cell data as well.
 
-#### How to use Tangram with Squidpy?
-For tutorial, please reference the example [here](https://github.com/broadinstitute/Tangram/blob/master/tutorial_sq_tangram.ipynb). For environment setup, please use squidpy=1.1.0 and reference this [yml file](https://github.com/broadinstitute/Tangram/blob/master/environment.yml).
-
 ***
 ## How to cite Tangram
 Tangram has been released in the following publication
@@ -127,6 +129,7 @@ If you have questions, please contact the authors of the method:
 PyPI maintainer:
 - Tommaso Biancalani - <biancalt@gene.com>
 - Ziqing Lu - <luz21@gene.com>
+- Shreya Gaddam - <gaddams@gene.com>
 
 The artwork has been curated by:
-- Anna Hupalowska <ahupalow@broadinstitute.org>
+- Anna Hupalowska <ahupalow@broadinstitute.org>
diff --git a/docs/source/classes/tangram.plot_utils.plot_cell_annotation_sc.rst b/docs/source/classes/tangram.plot_utils.plot_cell_annotation_sc.rst
@@ -0,0 +1,6 @@
+tangram.plot\_utils.plot\_cell\_annotation\_sc
+==============================================
+
+.. currentmodule:: tangram.plot_utils
+
+.. autofunction:: plot_cell_annotation_sc
diff --git a/docs/source/classes/tangram.plot_utils.plot_genes_sc.rst b/docs/source/classes/tangram.plot_utils.plot_genes_sc.rst
@@ -0,0 +1,6 @@
+tangram.plot\_utils.plot\_genes\_sc
+===================================
+
+.. currentmodule:: tangram.plot_utils
+
+.. autofunction:: plot_genes_sc
diff --git a/docs/source/classes/tangram.plot_utils.rst b/docs/source/classes/tangram.plot_utils.rst
@@ -27,10 +27,14 @@
     
     plot_cell_annotation
     
+    plot_cell_annotation_sc
+    
     plot_gene_sparsity
     
     plot_genes
     
+    plot_genes_sc
+    
     plot_test_scores
     
     plot_training_scores
diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
@@ -17,8 +17,13 @@ Cell Level
 **************************
 To install Tangram, make sure you have `PyTorch <https://pytorch.org/>`_ and `scanpy <https://scanpy.readthedocs.io/en/stable/>`_ installed. If you need more details on the dependences, look at the `environment.yml <https://github.com/broadinstitute/Tangram/blob/master/environment.yml>`_ file. 
 
+Create a conda environment for Tangram::
+
+    conda env create --file environment.yml
+
 Install tangram-sc from shell::
-    
+
+    conda activate tangram-env
     pip install tangram-sc
     
 Import tangram::
diff --git a/docs/source/news.rst b/docs/source/news.rst
@@ -3,4 +3,8 @@ Tangram News
 
 - On Jan 28th 2021, Sten Linnarsson gave a `talk <https://www.youtube.com/watch?v=0mxIe2AsSKs>`_ at the WWNDev Forum and demostrated their mappings of the developmental mouse brain using Tangram.
 
-- On Mar 9th 2021, Nicholas Eagles wrote a `blog post <http://research.libd.org/rstatsclub/2021/03/09/lessons-learned-applying-tangram-on-visium-data/#.YPsZphNKhb->`_ about applying Tangram on Visium data.
+- On Mar 9th 2021, Nicholas Eagles wrote a `blog post <http://research.libd.org/rstatsclub/2021/03/09/lessons-learned-applying-tangram-on-visium-data/#.YPsZphNKhb->`_ about applying Tangram on Visium data.
+
+- The Tangram method has been used by our colleagues at Harvard and the Broad Institute to map cell types in the developmental mouse brain; see Fig. 2 of `Nature (2021) <https://www.nature.com/articles/s41586-021-03670-5>`_.
+
+- Tangram is now officially a part of `Squidpy <https://squidpy.readthedocs.io/en/stable/index.html>`_
diff --git a/docs/source/working.rst b/docs/source/working.rst
@@ -2,8 +2,8 @@ Tangram Under the Hood
 ===========================
 
 Tangram instantiates a `Mapper` object passing the following arguments:
-* _S_: single cell matrix with shape cell-by-gene. Note that genes is the number of training genes.
-* _G_: spatial data matrix with shape voxels-by-genes. Voxel can contain multiple cells.
+| *S*: single-cell matrix with shape cells-by-genes. Note that the number of genes is the number of training genes.
+| *G*: spatial data matrix with shape voxels-by-genes. A voxel can contain multiple cells.
 
 Then, Tangram searches for a mapping matrix *M*, with shape voxels-by-cells, where the element *M\_ij* signifies the probability of cell *i* of being in spot *j*. Tangram computes the matrix *M* by minimizing the following loss:
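
Review note: the description of *M* above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions, not the `Mapper` implementation: it assumes *M* is stored cells-by-voxels (as the README describes `ad_map`), that each row is turned into probabilities with a softmax over voxels, and that expression is projected as `M.T @ S`. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_voxels, n_genes = 5, 3, 4

# S: single-cell matrix, cells-by-genes (random stand-in data).
S = rng.random((n_cells, n_genes))

# Unconstrained scores, one per (cell, voxel) pair. In a real fit these
# would be the learned parameters; here they are random.
logits = rng.normal(size=(n_cells, n_voxels))

# Assumption: a softmax across voxels turns each cell's scores into a
# probability distribution over voxels.
shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
M = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Projected expression, voxels-by-genes, comparable in shape to G.
G_pred = M.T @ S
```

Each row of `M` sums to 1, so `M[i, j]` can be read as the probability that cell `i` sits in voxel `j`.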
 
diff --git a/environment.yml b/environment.yml
@@ -1,5 +1,6 @@
+name: tangram-env
 dependencies:
-  - python=3.8.5
+  - python>=3.8.5
   - pip=20.2.2
   - pytorch=1.4.0
   - scipy=1.5.2
diff --git a/tangram/mapping_utils.py b/tangram/mapping_utils.py
@@ -78,21 +78,18 @@ def pp_adatas(adata_sc, adata_sp, genes=None):
     )
 
     # Calculate uniform density prior as 1/number_of_spots
-    rna_count_per_spot = adata_sp.X.sum(axis=1)
     adata_sp.obs["uniform_density"] = np.ones(adata_sp.X.shape[0]) / adata_sp.X.shape[0]
     logging.info(
         f"uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata."
     )
 
     # Calculate rna_count_based density prior as % of rna molecule count
-    rna_count_per_spot = adata_sp.X.sum(axis=1)
-    adata_sp.obs["rna_count_based_density"] = rna_count_per_spot / np.sum(
-        rna_count_per_spot
-    )
+    rna_count_per_spot = np.array(adata_sp.X.sum(axis=1)).squeeze()
+    adata_sp.obs["rna_count_based_density"] = rna_count_per_spot / np.sum(rna_count_per_spot)
     logging.info(
         f"rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata."
     )
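
Review note: the `np.array(...).squeeze()` wrapper matters because `X.sum(axis=1)` returns a 2-D `np.matrix` of shape `(n_spots, 1)` when `X` is a SciPy sparse matrix, while a dense array gives a 1-D result. The sketch below reproduces both priors on toy data. The helper name `density_priors` is hypothetical, and `np.asmatrix` stands in for the sparse case so the example needs only NumPy.

```python
import numpy as np


def density_priors(X):
    """Return (uniform, rna_count_based) density priors for a spots-by-genes matrix."""
    n_spots = X.shape[0]
    # Uniform prior: every spot gets 1 / number_of_spots.
    uniform = np.ones(n_spots) / n_spots
    # np.array(...).squeeze() flattens an (n_spots, 1) np.matrix to shape
    # (n_spots,) and leaves an already 1-D result unchanged.
    rna_count_per_spot = np.array(X.sum(axis=1)).squeeze()
    rna_count_based = rna_count_per_spot / np.sum(rna_count_per_spot)
    return uniform, rna_count_based


counts = np.array([[1, 0, 3], [0, 2, 2], [4, 4, 0]])

# Dense input: sum(axis=1) is already 1-D.
dense_uniform, dense_rna = density_priors(counts)

# np.matrix input: sum(axis=1) has shape (3, 1), as with a sparse matrix.
matrix_uniform, matrix_rna = density_priors(np.asmatrix(counts))
```

Both calls yield 1-D vectors of length `n_spots`, which is the shape an `obs` column expects.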
-
+        
 
 def adata_to_cluster_expression(adata, cluster_label, scale=True, add_density=True):
     """
diff --git a/tangram/plot_utils.py b/tangram/plot_utils.py
@@ -172,20 +172,41 @@ def construct_obs_plot(df_plot, adata, perc=0, suffix=None):
     adata.obs = pd.concat([adata.obs, df_plot], axis=1)
 
 
-def plot_cell_annotation_sc(adata_sp, annotation_list, perc=0):
-
+def plot_cell_annotation_sc(
+    adata_sp, 
+    annotation_list, 
+    x="x", 
+    y="y", 
+    spot_size=None, 
+    scale_factor=0.1, 
+    perc=0,
+    ax=None
+):
+        
     # remove previous df_plot in obs
     adata_sp.obs.drop(annotation_list, inplace=True, errors="ignore", axis=1)
 
     # construct df_plot
     df = adata_sp.obsm["tangram_ct_pred"][annotation_list]
     construct_obs_plot(df, adata_sp, perc=perc)
-
+    
+    # non-Visium data
+    if 'spatial' not in adata_sp.obsm.keys():
+        # add spatial coordinates to obsm of spatial data
+        coords = [[cx, cy] for cx, cy in zip(adata_sp.obs[x].values, adata_sp.obs[y].values)]
+        adata_sp.obsm['spatial'] = np.array(coords)
+    
+    if 'spatial' not in adata_sp.uns.keys() and spot_size is None and scale_factor is None:
+        raise ValueError("Spot Size and Scale Factor cannot be None when adata_sp.uns['spatial'] does not exist")
+    
+    #REVIEW
+    if 'spatial' in adata_sp.uns.keys() and spot_size is not None and scale_factor is not None:
+        raise ValueError("Spot Size and Scale Factor should be None when adata_sp.uns['spatial'] exists")
+    
     sc.pl.spatial(
-        adata_sp, color=annotation_list, cmap="viridis", show=False, frameon=False,
+        adata_sp, color=annotation_list, cmap="viridis", show=False, frameon=False, spot_size=spot_size, scale_factor=scale_factor, ax=ax
     )
 
-    # remove df_plot in obs
     adata_sp.obs.drop(annotation_list, inplace=True, errors="ignore", axis=1)
 
 
@@ -289,7 +310,18 @@ def plot_cell_annotation(
         fig.suptitle(annotation)
 
 
-def plot_genes_sc(genes, adata_measured, adata_predicted, cmap="inferno", perc=0):
+def plot_genes_sc(
+    genes, 
+    adata_measured, 
+    adata_predicted,
+    x="x",
+    y="y",
+    spot_size=None, 
+    scale_factor=0.1, 
+    cmap="inferno", 
+    perc=0,
+    return_figure=False
+):
 
     # remove df_plot in obs
     adata_measured.obs.drop(
@@ -350,11 +382,24 @@ def plot_genes_sc(genes, adata_measured, adata_predicted, cmap="inferno", perc=0
 
     fig = plt.figure(figsize=(7, len(genes) * 3.5))
     gs = GridSpec(len(genes), 2, figure=fig)
+    
+    # non-Visium data
+    if 'spatial' not in adata_measured.obsm.keys():
+        # add spatial coordinates to obsm of spatial data
+        coords = [[cx, cy] for cx, cy in zip(adata_measured.obs[x].values, adata_measured.obs[y].values)]
+        adata_measured.obsm['spatial'] = np.array(coords)
+        coords = [[cx, cy] for cx, cy in zip(adata_predicted.obs[x].values, adata_predicted.obs[y].values)]
+        adata_predicted.obsm['spatial'] = np.array(coords)
+
+    if ("spatial" not in adata_measured.uns.keys()) and (spot_size is None and scale_factor is None):
+        raise ValueError("Spot Size and Scale Factor cannot be None when adata_measured.uns['spatial'] does not exist")
+        
     for ix, gene in enumerate(genes):
-
         ax_m = fig.add_subplot(gs[ix, 0])
         sc.pl.spatial(
             adata_measured,
+            spot_size=spot_size,
+            scale_factor=scale_factor,
             color=["{} (measured)".format(gene)],
             frameon=False,
             ax=ax_m,
@@ -364,13 +409,15 @@ def plot_genes_sc(genes, adata_measured, adata_predicted, cmap="inferno", perc=0
         ax_p = fig.add_subplot(gs[ix, 1])
         sc.pl.spatial(
             adata_predicted,
+            spot_size=spot_size,
+            scale_factor=scale_factor,
             color=["{} (predicted)".format(gene)],
             frameon=False,
             ax=ax_p,
             show=False,
             cmap=cmap,
         )
-
+        
     #     sc.pl.spatial(adata_measured, color=['{} (measured)'.format(gene) for gene in genes], frameon=False)
     #     sc.pl.spatial(adata_predicted, color=['{} (predicted)'.format(gene) for gene in genes], frameon=False)
 
@@ -387,6 +434,8 @@ def plot_genes_sc(genes, adata_measured, adata_predicted, cmap="inferno", perc=0
         errors="ignore",
         axis=1,
     )
+    if return_figure:
+        return fig
 
 
 def plot_genes(
@@ -631,8 +680,7 @@ def plot_auc(df_all_genes, test_genes=None):
     textstr = 'auc_score={}'.format(np.round(metric_dict['auc_score'], 3))
     props = dict(boxstyle='round', facecolor='wheat', alpha=0.3)
     # place a text box in upper left in axes coords
-    plt.text(0.03, 0.1, textstr, fontsize=11,
-    verticalalignment='top', bbox=props);
+    plt.text(0.03, 0.1, textstr, fontsize=11, verticalalignment='top', bbox=props)
 
     
 # Colors used in the manuscript for deterministic assignment.
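
Review note: the non-Visium handling added to `plot_cell_annotation_sc` and `plot_genes_sc` can be exercised in isolation. The sketch below is not part of the commit. `spatial_coords` and `check_spot_args` are hypothetical helpers that mirror the coordinate stacking and the two argument checks in `plot_cell_annotation_sc`, and a plain dict stands in for `adata.obs`.

```python
import numpy as np


def spatial_coords(obs, x="x", y="y"):
    """Stack two coordinate columns into an (n_spots, 2) array."""
    return np.column_stack([np.asarray(obs[x]), np.asarray(obs[y])])


def check_spot_args(has_uns_spatial, spot_size, scale_factor):
    """Mirror the argument checks from the hunk (hypothetical helper)."""
    if not has_uns_spatial and spot_size is None and scale_factor is None:
        raise ValueError(
            "spot_size and scale_factor cannot both be None without uns['spatial']"
        )
    if has_uns_spatial and spot_size is not None and scale_factor is not None:
        raise ValueError(
            "spot_size and scale_factor should be None when uns['spatial'] exists"
        )


# A dict of columns stands in for adata.obs; a DataFrame indexes the same way.
obs = {"x": [0.0, 1.5, 3.0], "y": [2.0, 2.5, 4.0]}
coords = spatial_coords(obs)

check_spot_args(False, spot_size=50, scale_factor=0.1)   # non-Visium, accepted
check_spot_args(True, spot_size=None, scale_factor=0.1)  # Visium defaults, accepted
```

Because `scale_factor` defaults to `0.1`, the first check fires only when a caller passes `scale_factor=None` explicitly together with `spot_size=None`.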
diff --git a/tangram/utils.py b/tangram/utils.py
@@ -73,13 +73,13 @@ def get_matched_genes(prior_genes_names, sn_genes_names, excluded_genes=None):
         prior_genes_names (sequence): List of gene names in the spatial data.
         sn_genes_names (sequence): List of gene names in the single nuclei data.
         excluded_genes (sequence): Optional. List of genes to be excluded. These genes are excluded even if present in both datasets.
-            If None, no genes are excluded. Default is None.
+        If None, no genes are excluded. Default is None.
 
     Returns:
         A tuple (mask_prior_indices, mask_sn_indices, selected_genes), with:
-            mask_prior_indices (list): List of indices for the selected genes in 'prior_genes_names'.
-            mask_sn_indices (list): List of indices for the selected genes in 'sn_genes_names'.
-            selected_genes (list): List of names of the selected genes.
+        mask_prior_indices (list): List of indices for the selected genes in 'prior_genes_names'.
+        mask_sn_indices (list): List of indices for the selected genes in 'sn_genes_names'.
+        selected_genes (list): List of names of the selected genes.
         For each i, selected_genes[i] = prior_genes_names[mask_prior_indices[i]] = sn_genes_names[mask_sn_indices[i].
     """
     prior_genes_names = np.array(prior_genes_names)
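
Review note: the invariant stated in the docstring, `selected_genes[i] = prior_genes_names[mask_prior_indices[i]] = sn_genes_names[mask_sn_indices[i]]`, can be checked with a short sketch. This is not the function's actual body (the hunk does not show it). `matched_genes_sketch` is a hypothetical stand-in built on `np.intersect1d`, which returns the shared names in sorted order.

```python
import numpy as np


def matched_genes_sketch(prior_genes_names, sn_genes_names, excluded_genes=None):
    """Return (mask_prior_indices, mask_sn_indices, selected_genes) for shared genes."""
    prior = np.array(prior_genes_names)
    sn = np.array(sn_genes_names)
    # Shared names plus the index of each name's first occurrence in both inputs.
    selected, mask_prior, mask_sn = np.intersect1d(prior, sn, return_indices=True)
    if excluded_genes is not None:
        # Drop excluded genes even when they are present in both datasets.
        keep = ~np.isin(selected, list(excluded_genes))
        selected, mask_prior, mask_sn = selected[keep], mask_prior[keep], mask_sn[keep]
    return mask_prior.tolist(), mask_sn.tolist(), selected.tolist()


spatial_genes = ["Gad1", "Slc17a7", "Pvalb", "Sst"]
nuclei_genes = ["Sst", "Gad1", "Vip", "Pvalb"]
mask_prior, mask_sn, selected = matched_genes_sketch(
    spatial_genes, nuclei_genes, excluded_genes=["Pvalb"]
)

# The docstring's invariant holds for every selected gene.
for i, gene in enumerate(selected):
    assert spatial_genes[mask_prior[i]] == gene == nuclei_genes[mask_sn[i]]
```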
@@ -115,8 +115,8 @@ def one_hot_encoding(l, keep_aggregate=False):
 
     Returns:
         A DataFrame with a column for each unique value in the sequence and a one-hot-encoding, and an additional
-            column with the input list if 'keep_aggregate' is True.
-            The number of rows are equal to len(l).
+        column with the input list if 'keep_aggregate' is True.
+        The number of rows are equal to len(l).
     """
     df_enriched = pd.DataFrame({"cl": l})
     for i in l.unique():
@@ -137,7 +137,7 @@ def project_cell_annotations(
         adata_sp (AnnData): spatial data used to save the mapping result.
         annotation (str): Optional. Cell annotations matrix with shape (number_cells, number_annotations). Default is 'cell_type'.
         threshold (float): Optional. Valid for using with adata_map.obs['F_out'] from 'constrained' mode mapping. 
-                           Cell's probability below this threshold will be dropped. Default is 0.5.
+        Cell's probability below this threshold will be dropped. Default is 0.5.
     Returns:
         None.
         Update spatial Anndata by creating `obsm` `tangram_ct_pred` field with a dataframe with spatial prediction for each annotation (number_spots, number_annotations) 
@@ -797,10 +797,10 @@ def df_to_cell_types(df, cell_types):
 
     Args:
         df (DataFrame): Columns correspond to cell types.  Each row in the DataFrame corresponds to a voxel and
-            specifies the known number of cells in that voxel for each cell type (int).
-            The additional column 'centroids' specifies the coordinates of the cells in the voxel (sequence of (x,y) pairs).
+        specifies the known number of cells in that voxel for each cell type (int).
+        The additional column 'centroids' specifies the coordinates of the cells in the voxel (sequence of (x,y) pairs).
         cell_types (sequence): Sequence of cell type names to be considered for deconvolution.
-            Columns in 'df' not included in 'cell_types' are ignored for assignment.
+        Columns in 'df' not included in 'cell_types' are ignored for assignment.
 
     Returns:
         A dictionary <cell type name> -> <list of (x,y) coordinates for the cell type>
diff --git a/tangram_tutorial.ipynb b/tangram_tutorial.ipynb
diff --git a/tutorial_sq_tangram.ipynb b/tutorial_sq_tangram.ipynb