
Batch correction tutorial for single cell #5717

Draft · wants to merge 7 commits into base: main

Conversation

MarisaJL (Collaborator)

Batch Correction/Integration tutorial for Scanpy and Seurat

TODO

  • Hands on sections for Scanpy
  • Add more background/explanation on what batch correction/integration does
  • Write last section comparing before and after batch correction


If we simply run these different batches or combined datasets through a clustering pipeline such as Scanpy or Seurat, we might not get useful results. Clustering prioritises the genes that show the biggest differences in expression and uses these to identify groups of cells that share similar expression patterns. When we have different experimental batches or have combined multiple studies, these big differences might relate more to the differences between batches or datasets than to the biological differences we're interested in, such as cell type. Our clusters could therefore end up representing different batches or datasets, rather than anything more useful.
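The way a batch offset can swamp the biological signal can be sketched with a toy simulation (an illustration only, not real scRNA-seq data): one gene, two batches of identical cells, and a purely technical offset added to the second batch.

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Toy example: one gene measured in two batches. Both batches contain the
# same cells biologically, but batch B carries a technical offset (+4),
# e.g. a different capture efficiency.
batch_a = [random.gauss(5, 1) for _ in range(100)]
batch_b = [random.gauss(5, 1) + 4 for _ in range(100)]  # +4 is pure batch effect

combined = batch_a + batch_b

within_batch = mean([pvariance(batch_a), pvariance(batch_b)])
across_batches = pvariance(combined)

# The technical offset inflates this gene's variance in the combined data,
# so any variance-driven step (highly-variable-gene selection, PCA) would
# rank it highly even though nothing biological distinguishes the cells.
print(f"average within-batch variance: {within_batch:.2f}")
print(f"combined variance:             {across_batches:.2f}")
```

The combined variance is several times the within-batch variance, which is exactly why unrcorrected clustering tends to group cells by batch.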

In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, we will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.

Suggested change
In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, we will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.
In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, you will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.


# Scanpy or Seurat?

Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often the first step in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.

Suggested change
Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often the first step in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.
Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often a key goal in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.

@nomadscientist (Collaborator) commented Feb 13, 2025

Kind of a general comment that there's a lot of switching between 'we' and 'you' forms in the introduction bit, which is a bit jarring, but relatively minor!

I will move onto the Seurat section as I see the Scanpy section is pending

> > <solution-title></solution-title>
> >
> > 1. The dataset that we're using comes from a study that compared different single cell techniques. The `Method` column tells us which technique was used on each cell.
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used - batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique, so it seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.

Suggested change
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used - batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique, so it seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used. Batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique. It seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.

> 1. {% tool [Seurat Data Management](toolshed.g2.bx.psu.edu/repos/iuc/seurat_data/seurat_data/5.0+galaxy0) %} with the following parameters:
> - *"Method used"*: `Inspect Seurat Object`
> - *"Display information about"*: `Cell Metadata`
>

Suggested change
>
> 2. {% tool [**Count** occurrences of each record](Count1) %} with the following parameters:
> - *"Count occurrences of values in column(s)"*: `c11` This is the 11th column in your table, which contains the `Method` metadata
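Conceptually, the Count tool just tallies the distinct values in the chosen column. A plain-Python sketch of the same idea, using a hypothetical miniature metadata table (the cell IDs and column layout here are invented for illustration; in the real table the `Method` values sit in column `c11`):

```python
from collections import Counter

# Hypothetical miniature cell metadata table: each row is one cell, and the
# last field records which technique ("Method") produced it.
rows = [
    ("AAACCTG-1", "pbmc1", "Drop-seq"),
    ("AAACGGG-1", "pbmc1", "Drop-seq"),
    ("AAAGATG-1", "pbmc1", "Chromium_v2"),
    ("AAATGCC-1", "pbmc1", "Smart-seq2"),
    ("AACCATG-1", "pbmc1", "Chromium_v2"),
]

# Tally the values in the Method column (index 2 in this toy table).
method_counts = Counter(row[2] for row in rows)

for method, n in sorted(method_counts.items()):
    print(method, n)
```

The tool's output is the same kind of value/count table, which is how you confirm how many cells each batch contributed.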

[Screenshot 2025-02-13 at 15:15:45] This is nice for actually showing what's in that column - otherwise users only see a list of pbmc1.

[Screenshot 2025-02-13 at 15:16:14]


> <comment-title></comment-title>
>
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from or which experimental group it was in. Sometimes, this metadata will include lots of useful details such as demographic information about human donors that can help us to better understand our results.

Suggested change
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from or which experimental group it was in. Sometimes, this metadata will include lots of useful details such as demographic information about human donors that can help us to better understand our results.
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from, or which experimental group it was in. Sometimes, this metadata will include lots of useful details, such as demographic information about human donors. Such information can help us to better understand our results.


The terms batch correction and integration can be used somewhat interchangeably, because they both refer to the same process of looking for shared cell subpopulations across groups. The same tools are used in the same way for both procedures, so you could use the workflow described in this tutorial to perform integration as well as batch correction.

The only difference is that we tend to talk about batch correction when we are working with groups produced in a single study (e.g. different experimental batches), while we would say integration when we're combining separate datasets from multiple studies.

This last point is SUPER important - which is which? Perhaps move earlier, or change the title of this section to "Batch correction or integration?"


Splitting the batches into separate layers could help to address some of the technical differences between them because of the separate preprocessing, but we'll have to wait for the results to see if this has been enough to eliminate these differences.

If you want to understand the impact of splitting and preprocessing the batches separately, then you could skip ahead to the next section, **Clustering with Seurat** and compare your results to those shown in this tutorial.

I didn't quite understand this. What did you mean?

> > <solution-title></solution-title>
> >
> > 1. We can see that there are now 9 layers in our SeuratObject.
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the counts.Drop-seq layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in thier names.

Suggested change
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the counts.Drop-seq layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in thier names.
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the `counts.Drop-seq` layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in their names.
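The layer-splitting step described above can be sketched in plain Python (a conceptual stand-in, not Seurat's actual data structures; the cell IDs are invented): grouping cells by their `Method` value produces one named `counts.*` layer per batch.

```python
from collections import defaultdict

# Hypothetical stand-in for the counts layer: one entry per cell, with the
# Method recorded in the cell metadata alongside it.
cells = [
    ("cell_1", "Drop-seq"),
    ("cell_2", "Chromium_v2_A"),
    ("cell_3", "Drop-seq"),
    ("cell_4", "Chromium_v2_B"),
]

# Splitting by Method yields one layer per batch, named after the value in
# the metadata column (counts.Drop-seq, counts.Chromium_v2_A, ...).
layers = defaultdict(list)
for cell_id, method in cells:
    layers[f"counts.{method}"].append(cell_id)

print(sorted(layers))
```

Each resulting layer is then preprocessed independently, which is what lets the integration step later reconcile the batches.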

We'll follow the default Seurat pipeline here, except that we'll use `30` PCs to build the neighborhood graph and cluster with a resolution of `2` as these were the parameters used in [the original Seurat version of this tutorial](https://satijalab.org/seurat/articles/seurat5_integration). We'll also give our clusters and UMAP more recognisable names as we'll be running these tools again later, after batch correction.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.

Suggested change
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing %} tool.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!

Suggested change
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
> If you use `SCTransform` for preprocessing, then you'll need to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!

>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
>
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

Suggested change
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform`, then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
>
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

Suggested change
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% icon tool %} **Seurat Integrate** tool.

We'll follow the default Seurat pipeline here, except that we'll use `30` PCs to build the neighborhood graph and cluster with a resolution of `2` as these were the parameters used in [the original Seurat version of this tutorial](https://satijalab.org/seurat/articles/seurat5_integration). We'll also give our clusters and UMAP more recognisable names as we'll be running these tools again later, after batch correction.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.

Suggested change
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% icon tool %} **Seurat Preprocessing** tool.

> >
> > We will use the output from `RunPCA` in the following section when we perform batch correction.
> >
> > If you're already very familiar with the Seurat clustering pipeline and you just want to try using the {% tool Seurat Integrate %} tools, then you can skip ahead to the **Clustering after Integration** step now.

Suggested change
> > If you're already very familiar with the Seurat clustering pipeline and you just want to try using the {% tool Seurat Integrate %} tools, then you can skip ahead to the **Clustering after Integration** step now.
> > If you're already familiar with the Seurat clustering pipeline and you just want to try using the {% icon tool %} **Seurat Integrate** tools, then you can skip ahead to the **Clustering after Integration** step now.


Do these 'skips' work for running through the tutorial? Or is that assuming they won't actually be using Galaxy while reading the tutorial?

> - *"Algorithm for modularity optimization"*: `1. Original Louvain`
> - *"Name for output clusters"*: `unintegrated_clusters`
>
> > <comment-title> short description </comment-title>
@nomadscientist Feb 13, 2025

Suggested change
> > <comment-title> short description </comment-title>
> > <warning-title> short description </warning-title>

> > <comment-title> short description </comment-title>
> >
> > Make sure that you change the default name for the clusters to `unintegrated_clusters`!
> {: .comment}

Suggested change
> {: .comment}
> {: .warning}

> - In *"Advanced Options"*:
> - *"Name for dimensional reduction"*: `umap.unintegrated`
>
> > <comment-title> short description </comment-title>

Suggested change
> > <comment-title> short description </comment-title>
> > <warning-title> short description </warning-title>

> > <comment-title> short description </comment-title>
> >
> > Make sure that you change the default name for the UMAP results to `umap.unintegrated`!
> {: .comment}

Suggested change
> {: .comment}
> {: .warning}

>
{: .hands_on}

Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified and then colour this plot in by `Method` to see if that might be influencing our results.

Suggested change
Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified and then colour this plot in by `Method` to see if that might be influencing our results.
Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified. Then, we will colour this plot in by `Method` to see if that might be influencing our results.

>
{: .hands_on}

![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A. cluster B. Method")

Suggested change
![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A. cluster B. Method")
![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A: cluster, and B: Method")


# Clustering without Batch Correction

We suspect that batch correction will be needed because of the different technologies used to construct this dataset, but we'll try clustering without any correction first. This will confirm whether batch correction is truly needed on the basis of `Method`. Comparing the results we get now with those we'll get after batch correction should also help us to understand what batch correction is doing to our single cell data.
@nomadscientist Feb 13, 2025

Maybe something more like "First, we need to determine whether we even need to perform batch correction. Therefore, we will try clustering without correction...."


> <comment-title> short description </comment-title>
>
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using one of the other methods to see how it affects the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!
@nomadscientist Feb 13, 2025

Suggested change
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using one of the other methods to see how it affects the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using different methods to see how they affect the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!

> - *"Integration method to use"*: `CCA Integration`
> - *"Name for new dimensional reduction"*: `integrated.cca`
>
> > <comment-title> short description </comment-title>

Suggested change
> > <comment-title> short description </comment-title>
> > <comment-title> Remember the name </comment-title>

>
{: .hands_on}

It's good practice to rejoin our layers now, so that those separate layers or batches will end up back in the same layer. We don't actually need to do this now as it won't affect the clustering results, but it is important if we want to perform downstream analyses such as Differential Expression analysis.

Suggested change
It's good practice to rejoin our layers now, so that those separate layers or batches will end up back in the same layer. We don't actually need to do this now as it won't affect the clustering results, but it is important if we want to perform downstream analyses such as Differential Expression analysis.
It's good practice to rejoin our layers now, so that those separate layers/batches will end up together. We don't actually need to do this now (as it won't affect the clustering results), but it is important if we want to perform downstream analyses such as Differential Expression analysis.
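The rejoining step is the mirror image of the earlier split. A conceptual sketch (continuing the hypothetical layer dictionary from before, not Seurat's actual `JoinLayers` implementation): the per-batch `counts.*` layers are concatenated back into one collection so downstream tools see a single matrix again.

```python
# Hypothetical per-batch layers left over from the split-by-Method step.
layers = {
    "counts.Drop-seq": ["cell_1", "cell_3"],
    "counts.Chromium_v2": ["cell_2", "cell_4"],
}

# Rejoining simply concatenates the layers back into one; iterating in a
# fixed (sorted) order keeps the result reproducible.
joined = []
for name in sorted(layers):
    joined.extend(layers[name])

print(joined)
```

After rejoining, every cell is back in one layer, which is what differential-expression tools expect.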

> > <solution-title></solution-title>
> >
> > 1. The first plot shows 25 clusters (remember that Seurat starts from cluster 0!). Although the high resolution means we still have plenty of clusters, the batch correction has reduced the number. The clusters also look less fragmented than they did before.
> > 2. When we colour in the plot by `Method` we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that are all one colour and there aren't any big patches of colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.

Suggested change
> > 2. When we colour in the plot by `Method` we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that are all one colour and there aren't any big patches of colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.
> > 2. When we colour in the plot by `Method`, we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that contain only one colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.




> <comment-title></comment-title>

I think this is quite important, so probably not a comment. Could show the steps via bullets, i.e. "You would then...."

>
>If you look back at the cell metadata table we created at the beginning of this tutorial, you'll see there is an annotation called `CellType`. We can colour in our UMAPs using this annotation instead of the `Method`. If our clusters make biological cell sense, we should see that these cell types are clumped together because cells of the same type should be close to each other.
>
> If the cell types are all blended together across the entire UMAP (as with our `Method` plot) then this would be a sign that something has gone wrong. When we are performing batch correction or integration, there is a risk that we could over-integrate the data, eliminating the biological differences we're interested in alongside the technical differences we wanted to remove.
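One simple way to make "well mixed" concrete (an illustrative check sketched here, not a tool from the tutorial) is to ask, for each cell, whether its nearest neighbour in the embedding comes from the same batch. Well-mixed batches give a low same-batch fraction; a fraction near 1.0 means the batches still separate. A toy 1-D version with invented coordinates:

```python
def same_batch_fraction(cells):
    """Fraction of cells whose nearest neighbour comes from the same batch.

    cells: list of (coordinate, batch_label) pairs in a 1-D embedding.
    """
    same = 0
    for i, (x, batch) in enumerate(cells):
        nearest = min(
            (c for j, c in enumerate(cells) if j != i),
            key=lambda c: abs(c[0] - x),
        )
        same += nearest[1] == batch
    return same / len(cells)

mixed = [(0.1, "A"), (0.2, "B"), (0.3, "A"), (0.4, "B")]        # interleaved
separated = [(0.1, "A"), (0.2, "A"), (5.1, "B"), (5.2, "B")]    # batch clusters

print(same_batch_fraction(mixed))
print(same_batch_fraction(separated))
```

Real diagnostics (kBET, iLISI and similar) are more sophisticated versions of this idea, applied to the k-nearest-neighbour graph of the full embedding.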

For training purposes, it would be really valuable to give an example of over-integrated data - you don't have to make them calculate it, but adding images as a comment or detail or something else would be really helpful.


In this tutorial, we've learned how to perform batch correction or integration when analysing single cell data with either the Scanpy or Seurat pipelines. If you want to learn more about these pipelines then you might want to try analysing a slightly trickier dataset in the [Scanpy]({% link topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md %}) or [Seurat]({% link topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.md %}) case study tutorials.

This tutorial is part of the https://singlecell.usegalaxy.eu portal ({% cite tekman2020single %}).

I would remove this line about the data portal, personally.
