
Batch correction tutorial for single cell #5717

Draft · wants to merge 7 commits into base: main

Conversation

MarisaJL (Collaborator)

Batch Correction/Integration tutorial for Scanpy and Seurat

TODO

  • Hands on sections for Scanpy
  • Add more background/explanation on what batch correction/integration does
  • Write last section comparing before and after batch correction


If we simply run these different batches or combined datasets through a clustering pipeline such as Scanpy or Seurat, we might not get useful results. Clustering prioritises the genes that show the biggest differences in expression and uses these to identify groups of cells that share similar expression patterns. When we have different experimental batches or have combined multiple studies, these big differences might relate more to the differences between batches or datasets than to the biological differences we're interested in, such as cell type. Our clusters could therefore end up representing different batches or datasets, rather than anything more useful.
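The way a batch offset can swamp the biological signal can be sketched with a toy simulation (an illustration only, not real scRNA-seq data): one gene, two batches of identical cells, and a purely technical offset added to the second batch.

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Toy example: one gene measured in two batches. Both batches contain the
# same cells biologically, but batch B carries a technical offset (+4),
# e.g. a different capture efficiency.
batch_a = [random.gauss(5, 1) for _ in range(100)]
batch_b = [random.gauss(5, 1) + 4 for _ in range(100)]  # +4 is pure batch effect

combined = batch_a + batch_b

within_batch = mean([pvariance(batch_a), pvariance(batch_b)])
across_batches = pvariance(combined)

# The technical offset inflates this gene's variance in the combined data,
# so any variance-driven step (highly-variable-gene selection, PCA) would
# rank it highly even though nothing biological distinguishes the cells.
print(f"average within-batch variance: {within_batch:.2f}")
print(f"combined variance:             {across_batches:.2f}")
```

The combined variance is several times the within-batch variance, which is exactly why unrcorrected clustering tends to group cells by batch.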

In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, we will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.

Suggested change
In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, we will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.
In order to look beyond these technical differences, we can perform batch correction or integration. Both the Scanpy and Seurat pipelines include tools that can be used to correct for differences between experimental batches or to integrate datasets - we actually use the same tools to do both. In this tutorial, you will learn how to use these tools in either the Scanpy or Seurat pipeline - you can choose which one you would like to use.


# Scanpy or Seurat?

Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often the first step in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.

Suggested change
Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often the first step in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.
Scanpy and Seurat are two of the most commonly used pipelines (sets of tools) for analysing single cell data. Both pipelines have all the tools required to perform clustering, which identifies groups of cells that share similar expression profiles. Clustering is often a key goal in single cell analysis because it makes our data easier to interpret. Clusters represent groups of cells that are expressing the same genes, which often correspond to specific cell types or states. Our goal is to identify biologically relevant clusters that will help us to better understand our data.

@nomadscientist (Collaborator) commented Feb 13, 2025

Kind of a general comment that there's a lot of switching between 'we' and 'you' forms in the introduction bit, which is a bit jarring, but relatively minor!

I will move onto the Seurat section as I see the Scanpy section is pending

> > <solution-title></solution-title>
> >
> > 1. The dataset that we're using comes from a study that compared different single cell techniques. The `Method` column tells us which technique was used on each cell.
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used - batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique, so it seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.

Suggested change
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used - batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique, so it seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.
> > 2. Each experimental technique can be considered as its own experimental batch. Each of these batches was processed independently, which by itself can be enough to require batch correction, even if the same experimental protocol is used. Batches can vary simply because they were processed at different times or by different people in the same lab! In this case, we have an even stronger reason to believe that these batches will differ - we know that each batch was produced using a different technique. It seems likely that we'll need to perform batch correction. We would consider this to be batch correction rather than integration because these data all came from the same original study.

> 1. {% tool [Seurat Data Management](toolshed.g2.bx.psu.edu/repos/iuc/seurat_data/seurat_data/5.0+galaxy0) %} with the following parameters:
> - *"Method used"*: `Inspect Seurat Object`
> - *"Display information about"*: `Cell Metadata`
>

Suggested change
>
> 2. {% tool [**Count** occurrences of each record](Count1) %} with the following parameters:
> - *"Count occurrences of values in column(s)"*: `c11` This is the 11th column in your table, which contains the `Method` metadata
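Conceptually, the Count tool just tallies the distinct values in the chosen column. A plain-Python sketch of the same idea, using a hypothetical miniature metadata table (the cell IDs and column layout here are invented for illustration; in the real table the `Method` values sit in column `c11`):

```python
from collections import Counter

# Hypothetical miniature cell metadata table: each row is one cell, and the
# last field records which technique ("Method") produced it.
rows = [
    ("AAACCTG-1", "pbmc1", "Drop-seq"),
    ("AAACGGG-1", "pbmc1", "Drop-seq"),
    ("AAAGATG-1", "pbmc1", "Chromium_v2"),
    ("AAATGCC-1", "pbmc1", "Smart-seq2"),
    ("AACCATG-1", "pbmc1", "Chromium_v2"),
]

# Tally the values in the Method column (index 2 in this toy table).
method_counts = Counter(row[2] for row in rows)

for method, n in sorted(method_counts.items()):
    print(method, n)
```

The tool's output is the same kind of value/count table, which is how you confirm how many cells each batch contributed.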

[Screenshot 2025-02-13 at 15:15:45] This is nice for actually showing what's in that column - otherwise users only see a list of pbmc1.

[Screenshot 2025-02-13 at 15:16:14]


> <comment-title></comment-title>
>
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from or which experimental group it was in. Sometimes, this metadata will include lots of useful details such as demographic information about human donors that can help us to better understand our results.

Suggested change
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from or which experimental group it was in. Sometimes, this metadata will include lots of useful details such as demographic information about human donors that can help us to better understand our results.
> The cell metadata is any information about the cells that the original authors have included in the SeuratObject. As well as the cell barcode or identifier for each individual cell, the metadata will usually include information such as which donor or sample the cell came from, or which experimental group it was in. Sometimes, this metadata will include lots of useful details, such as demographic information about human donors. Such information can help us to better understand our results.


The terms batch correction and integration can be used somewhat interchangeably, because they both refer to the same process of looking for shared cell subpopulations across groups. The same tools are used in the same way for both procedures, so you could use the workflow described in this tutorial to perform integration as well as batch correction.

The only difference is that we tend to talk about batch correction when we are working with groups produced in a single study (e.g. different experimental batches), while we would say integration when we're combining separate datasets from multiple studies.

This last point is SUPER important - which is which? Perhaps move earlier, or change the title of this section to "Batch correction or integration?"


Splitting the batches into separate layers could help to address some of the technical differences between them because of the separate preprocessing, but we'll have to wait for the results to see if this has been enough to eliminate these differences.

If you want to understand the impact of splitting and preprocessing the batches separately, then you could skip ahead to the next section, **Clustering with Seurat** and compare your results to those shown in this tutorial.

I didn't quite understand this. What did you mean?

> > <solution-title></solution-title>
> >
> > 1. We can see that there are now 9 layers in our SeuratObject.
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the counts.Drop-seq layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in thier names.

Suggested change
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the counts.Drop-seq layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in thier names.
> > 2. We started out with one layer of raw data, called `counts`. That layer has now been split up according to `Method`. We now have nine `counts` layers. Each layer represents one of the batches named in the `Method` column of the cell metadata. We can see the names of the methods in the layer names. For example, the `counts.Drop-seq` layer contains the raw counts produced using the Drop-seq technique. Seven different methods were used in this study, but one of them was applied to three different batches - you should be able to see three layers with `Chromium_v2` in their names.
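The layer-splitting step described above can be sketched in plain Python (a conceptual stand-in, not Seurat's actual data structures; the cell IDs are invented): grouping cells by their `Method` value produces one named `counts.*` layer per batch.

```python
from collections import defaultdict

# Hypothetical stand-in for the counts layer: one entry per cell, with the
# Method recorded in the cell metadata alongside it.
cells = [
    ("cell_1", "Drop-seq"),
    ("cell_2", "Chromium_v2_A"),
    ("cell_3", "Drop-seq"),
    ("cell_4", "Chromium_v2_B"),
]

# Splitting by Method yields one layer per batch, named after the value in
# the metadata column (counts.Drop-seq, counts.Chromium_v2_A, ...).
layers = defaultdict(list)
for cell_id, method in cells:
    layers[f"counts.{method}"].append(cell_id)

print(sorted(layers))
```

Each resulting layer is then preprocessed independently, which is what lets the integration step later reconcile the batches.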

We'll follow the default Seurat pipeline here, except that we'll use `30` PCs to build the neighborhood graph and cluster with a resolution of `2` as these were the parameters used in [the original Seurat version of this tutorial](https://satijalab.org/seurat/articles/seurat5_integration). We'll also give our clusters and UMAP more recognisable names as we'll be running these tools again later, after batch correction.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.

Suggested change
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing %} tool.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!

Suggested change
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
> If you use `SCTransform` for preprocessing, then you'll need to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!

>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
>
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

Suggested change
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform`, then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

>
> If you use `SCTransform` for preprocessing then you'll need to click the button to choose `Yes` for `Use SCT as Normalization Method` when you run `IntegrateLayers`. The `SCTransform` normalises the data in its own way, so we just need to let the tool know what to expect!
>
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.

Suggested change
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% tool Seurat Integrate %} tool.
> The next step after identifying clusters would usually be to look for marker genes that are differentially expressed between clusters. If you perform integration/batch correction after using `SCTransform` then you will need to run the `PrepSCTFindMarkers` function before using tools such as `FindMarkers`. You'll find this in the {% icon tool %} **Seurat Integrate** tool.

We'll follow the default Seurat pipeline here, except that we'll use `30` PCs to build the neighborhood graph and cluster with a resolution of `2` as these were the parameters used in [the original Seurat version of this tutorial](https://satijalab.org/seurat/articles/seurat5_integration). We'll also give our clusters and UMAP more recognisable names as we'll be running these tools again later, after batch correction.

> <comment-title></comment-title>
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.

Suggested change
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% tool Seurat Preprocessing} tool.
> Seurat has another option for preprocessing - rather than use the three separate functions presented below, you can use a single function called `SCTransform` to preform normalisation, identification of variable genes, and scaling all in one go. You will find this option on Galaxy's {% icon tool %} **Seurat Preprocessing** tool.

> >
> > We will use the output from `RunPCA` in the following section when we perform batch correction.
> >
> > If you're already very familiar with the Seurat clustering pipeline and you just want to try using the {% tool Seurat Integrate %} tools, then you can skip ahead to the **Clustering after Integration** step now.

Suggested change
> > If you're already very familiar with the Seurat clustering pipeline and you just want to try using the {% tool Seurat Integrate %} tools, then you can skip ahead to the **Clustering after Integration** step now.
> > If you're already familiar with the Seurat clustering pipeline and you just want to try using the {% icon tool %} **Seurat Integrate** tools, then you can skip ahead to the **Clustering after Integration** step now.


Do these 'skips' work for running through the tutorial? Or is that assuming they won't actually be using Galaxy while reading the tutorial?

> - *"Algorithm for modularity optimization"*: `1. Original Louvain`
> - *"Name for output clusters"*: `unintegrated_clusters`
>
> > <comment-title> short description </comment-title>
@nomadscientist Feb 13, 2025

Suggested change
> > <comment-title> short description </comment-title>
> > <warning-title> short description </warning-title>

> > <comment-title> short description </comment-title>
> >
> > Make sure that you change the default name for the clusters to `unintegrated_clusters`!
> {: .comment}

Suggested change
> {: .comment}
> {: .warning}

> - In *"Advanced Options"*:
> - *"Name for dimensional reduction"*: `umap.unintegrated`
>
> > <comment-title> short description </comment-title>

Suggested change
> > <comment-title> short description </comment-title>
> > <warning-title> short description </warning-title>

> > <comment-title> short description </comment-title>
> >
> > Make sure that you change the default name for the UMAP results to `umap.unintegrated`!
> {: .comment}

Suggested change
> {: .comment}
> {: .warning}

>
{: .hands_on}

Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified and then colour this plot in by `Method` to see if that might be influencing our results.

Suggested change
Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified and then colour this plot in by `Method` to see if that might be influencing our results.
Now let's take a look at our results. We'll first plot a UMAP showing the clusters we've just identified. Then, we will colour this plot in by `Method` to see if that might be influencing our results.

>
{: .hands_on}

![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A. cluster B. Method")

Suggested change
![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A. cluster B. Method")
![Two UMAP plots showing many small and fragmented clusters of cells. Image A is coloured into 48 clusters. Image B shows many clusters as a single colour of cells analysed with the same method.](../../images/scrna_batch_correction/UMAP_Before_Seurat.png "UMAP before batch correction integration coloured by A: cluster, and B: Method")


# Clustering without Batch Correction

We suspect that batch correction will be needed because of the different technologies used to construct this dataset, but we'll try clustering without any correction first. This will confirm whether batch correction is truly needed on the basis of `Method`. Comparing the results we get now with those we'll get after batch correction should also help us to understand what batch correction is doing to our single cell data.
@nomadscientist Feb 13, 2025

Maybe something more like "First, we need to determine whether we even need to perform batch correction. Therefore, we will try clustering without correction...."


> <comment-title> short description </comment-title>
>
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using one of the other methods to see how it affects the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!
@nomadscientist Feb 13, 2025

Suggested change
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using one of the other methods to see how it affects the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!
> {% tool Seurat Integrate %} provides several integration methods, which all perform the integration or batch correction in their own way. You might want to experiment by using different methods to see how they affect the results. When you are working on your own data, it can be a good idea to try a few different integration methods to see which one produces the best results. The best integration or batch correction would be the one that eliminates the most of the technical differences between datasets or batches while producing biologically meaningful results. If we end up with completely unexpected results rather than clusters that match up well with known cell types, then we know that something has gone wrong!

> - *"Integration method to use"*: `CCA Integration`
> - *"Name for new dimensional reduction"*: `integrated.cca`
>
> > <comment-title> short description </comment-title>

Suggested change
> > <comment-title> short description </comment-title>
> > <comment-title> Remember the name </comment-title>

>
{: .hands_on}

It's good practice to rejoin our layers now, so that those separate layers or batches will end up back in the same layer. We don't actually need to do this now as it won't affect the clustering results, but it is important if we want to perform downstream analyses such as Differential Expression analysis.

Suggested change
It's good practice to rejoin our layers now, so that those separate layers or batches will end up back in the same layer. We don't actually need to do this now as it won't affect the clustering results, but it is important if we want to perform downstream analyses such as Differential Expression analysis.
It's good practice to rejoin our layers now, so that those separate layers/batches will end up together. We don't actually need to do this now (as it won't affect the clustering results), but it is important if we want to perform downstream analyses such as Differential Expression analysis.
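The rejoining step is the mirror image of the earlier split. A conceptual sketch (continuing the hypothetical layer dictionary from before, not Seurat's actual `JoinLayers` implementation): the per-batch `counts.*` layers are concatenated back into one collection so downstream tools see a single matrix again.

```python
# Hypothetical per-batch layers left over from the split-by-Method step.
layers = {
    "counts.Drop-seq": ["cell_1", "cell_3"],
    "counts.Chromium_v2": ["cell_2", "cell_4"],
}

# Rejoining simply concatenates the layers back into one; iterating in a
# fixed (sorted) order keeps the result reproducible.
joined = []
for name in sorted(layers):
    joined.extend(layers[name])

print(joined)
```

After rejoining, every cell is back in one layer, which is what differential-expression tools expect.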

> > <solution-title></solution-title>
> >
> > 1. The first plot shows 25 clusters (remember that Seurat starts from cluster 0!). Although the high resolution means we still have plenty of clusters, the batch correction has reduced the number. The clusters also look less fragmented than they did before.
> > 2. When we colour in the plot by `Method` we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that are all one colour and there aren't any big patches of colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.

Suggested change
> > 2. When we colour in the plot by `Method` we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that are all one colour and there aren't any big patches of colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.
> > 2. When we colour in the plot by `Method`, we can see that all the colours are mixed together across all of the clusters. We don't have any clusters that contain only one colour. The batch correction has successfully removed the differences between the batches so that they're no longer dominating the results.




> <comment-title></comment-title>

I think this is quite important, so probably not a comment. Could show the steps via bullets, i.e. "You would then...."

>
>If you look back at the cell metadata table we created at the beginning of this tutorial, you'll see there is an annotation called `CellType`. We can colour in our UMAPs using this annotation instead of the `Method`. If our clusters make biological cell sense, we should see that these cell types are clumped together because cells of the same type should be close to each other.
>
> If the cell types are all blended together across the entire UMAP (as with our `Method` plot) then this would be a sign that something has gone wrong. When we are performing batch correction or integration, there is a risk that we could over-integrate the data, eliminating the biological differences we're interested in alongside the technical differences we wanted to remove.
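One simple way to make "well mixed" concrete (an illustrative check sketched here, not a tool from the tutorial) is to ask, for each cell, whether its nearest neighbour in the embedding comes from the same batch. Well-mixed batches give a low same-batch fraction; a fraction near 1.0 means the batches still separate. A toy 1-D version with invented coordinates:

```python
def same_batch_fraction(cells):
    """Fraction of cells whose nearest neighbour comes from the same batch.

    cells: list of (coordinate, batch_label) pairs in a 1-D embedding.
    """
    same = 0
    for i, (x, batch) in enumerate(cells):
        nearest = min(
            (c for j, c in enumerate(cells) if j != i),
            key=lambda c: abs(c[0] - x),
        )
        same += nearest[1] == batch
    return same / len(cells)

mixed = [(0.1, "A"), (0.2, "B"), (0.3, "A"), (0.4, "B")]        # interleaved
separated = [(0.1, "A"), (0.2, "A"), (5.1, "B"), (5.2, "B")]    # batch clusters

print(same_batch_fraction(mixed))
print(same_batch_fraction(separated))
```

Real diagnostics (kBET, iLISI and similar) are more sophisticated versions of this idea, applied to the k-nearest-neighbour graph of the full embedding.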

For training purposes, it would be really valuable to give an example of over-integrated data - you don't have to make them calculate it, but adding images as a comment or detail or something else would be really helpful.


In this tutorial, we've learned how to perform batch correction or integration when analysing single cell data with either the Scanpy or Seurat pipelines. If you want to learn more about these pipelines then you might want to try analysing a slightly trickier dataset in the [Scanpy]({% link topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md %}) or [Seurat]({% link topics/single-cell/tutorials/scrna-case_FilterPlotandExplore_SeuratTools/tutorial.md %}) case study tutorials.

This tutorial is part of the https://singlecell.usegalaxy.eu portal ({% cite tekman2020single %}).

I would remove this line about the data portal, personally.
