Add automated cell type annotation with CellTypist to Clustering 3K PBMCs with Scanpy tutorial and Vitessce visualization #6786

Open
dianichj wants to merge 10 commits into galaxyproject:main from dianichj:update-tutorial-clustering-scanpy-w/celltypist-vitessce

Member

dianichj commented Apr 13, 2026

This PR adds automated cell type annotation as an alternative approach to the manual
annotation section of the "Clustering 3K PBMCs with Scanpy" tutorial, using a
Choose Your Own Analysis (CYOA) format so users can choose between:

  • Manual annotation using canonical marker genes (existing content, unchanged)
  • Automated annotation using CellTypist (new content)

Changes

  • Added CYOA section in `# Cell type annotation` with two paths: Manual and CellTypist
  • Added two CellTypist hands-on sections:
    • Train from AnnData (using the louvain column as training labels)
    • Cached model (using pre-trained models from the server)
  • Added CellTypist dotplot output image

Known issues

  • The CellTypist data manager for the Cached model option currently throws a FileNotFoundError. This needs to
    be fixed by the server admins before this path can be fully tested.
  • The tutorial needs to be re-tested once the data manager issue is resolved.

TODO in this PR

  • Add Vitessce visualization section for the CellTypist output
  • Re-test CellTypist Cached model once data manager is fixed

Related tool

https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu/repos/iuc/celltypist/celltypist/1.7.1+galaxy0

dianichj marked this pull request as ready for review on April 14, 2026 10:19
Comment thread topics/single-cell/tutorials/scrna-scanpy-pbmc3k/tutorial.md
Comment thread topics/single-cell/tutorials/scrna-scanpy-pbmc3k/tutorial.md Outdated
>
> 2. Rename the `vitessce.json` output to `Vitessce config - clusters`
>
> 3. Click on the {% icon galaxy-eye %} (**View data**) icon of the `Vitessce config - clusters` dataset to explore the clusters interactively in Vitessce

Member

Would it be possible to add a small GIF showing how to explore interactively?
There are some examples here: https://vitessce.github.io/easy_vitessce/ and here: https://vitessce.io/examples/
If some functionality is missing, we will add it.

Member Author

I am not sure whether the GTN supports GIFs. Is it supported? Thank you!! @shiltemann

> {: .solution}
{: .question}

> <hands-on-title>Explore clusters interactively with Vitessce</hands-on-title>

Member

pavanvidem commented Apr 15, 2026

If there is not much to explore interactively in this plot, I would move the first interactive plotting to the "Visualization of expression of the marker genes" step.

> 1. {% tool [CellTypist](toolshed.g2.bx.psu.edu/repos/iuc/celltypist/celltypist/1.7.1+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input AnnData file"*: `3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test, annotation`
> - *"Select model from"*: `History`
> - *"Select a models or train a model from history"*: `Train a model on an existing AnnData and use it`

Member

Why do you need training here? Can you use a reference model to annotate the cells?

Member Author

Both options are possible; the user may be exploring new cell types. The dataset from the tutorial contains very well-known immune cell types, but training could be useful for users who are studying tissue with special characteristics. Maybe it does not need to be a whole hands-on section, but it would be useful to mention it. What do you think?

Member

Then please make the training aspect into a comment or a question. It is confusing to see both options.

> 1. {% tool [CellTypist](toolshed.g2.bx.psu.edu/repos/iuc/celltypist/celltypist/1.7.1+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input AnnData file"*: `3k PBMC with only HVG, after scaling, PCA, KNN graph, UMAP, clustering, marker genes with Wilcoxon test`
> - *"Select model from"*: `Cached`
> - *"Choose CellTypist model"*: `immune sub-populations combined from 20 tissues of 18 studies (v2)`

Member

pavanvidem commented Apr 15, 2026

Can you please check again which model is the best fit here and guide the users a bit?

Member Author

dianichj commented Apr 15, 2026

This is the one that will yield the most detailed results, IMHO, but I can also run the same model without sub-populations (populations only). The others do not apply: one is COVID-related and the other is from human fetus. This sample dataset is from immune cells, so the most appropriate choice is an immune cell model.

Member Author

[Attached image: Galaxy71- on dataset 41_ Dotplot PNG]

@pavanvidem This other model gives a less detailed result! For example, the monocyte subtypes are not distinguished.

Member

It looks better. Which one is it? BTW, the covid models also have data from healthy samples.

Member Author

It looks better but is not necessarily the best fit. The immune subtype tissue model is the best fit because it is the only one that correctly separates Classical from Non-classical monocytes, corresponding to the CD14+ and FCGR3A+ clusters in the tutorial. Also, it does not assign ILCs with relatively high confidence; ILCs are not expected to be a prominent population in PBMCs. I wouldn't recommend the COVID model in this case, even though it includes healthy controls. I prefer to keep the current image :)
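
A minimal sketch of how this kind of model comparison could be checked outside Galaxy, assuming the cluster names and each model's predicted labels have been exported as parallel lists (for example from an AnnData `obs` table). The function names and the toy data below are hypothetical and are not taken from the tutorial or from CellTypist; they only illustrate the criterion "does the model give the CD14+ and FCGR3A+ clusters different majority labels?".

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(clusters, predicted_labels):
    """Return {cluster: (most common predicted label, fraction of its cells)}."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, predicted_labels):
        counts[cluster][label] += 1
    summary = {}
    for cluster, counter in counts.items():
        label, n_cells = counter.most_common(1)[0]
        summary[cluster] = (label, n_cells / sum(counter.values()))
    return summary


def separates(summary, cluster_a, cluster_b):
    """True if the two clusters receive different majority labels."""
    return summary[cluster_a][0] != summary[cluster_b][0]


# Hypothetical toy data: one cluster name and one predicted label per cell.
clusters = ["CD14+", "CD14+", "CD14+", "FCGR3A+", "FCGR3A+", "B"]
detailed_labels = [
    "Classical monocytes",
    "Classical monocytes",
    "Non-classical monocytes",
    "Non-classical monocytes",
    "Non-classical monocytes",
    "B cells",
]
coarse_labels = ["Monocytes"] * 5 + ["B cells"]

detailed_summary = majority_label_per_cluster(clusters, detailed_labels)
coarse_summary = majority_label_per_cluster(clusters, coarse_labels)

print(separates(detailed_summary, "CD14+", "FCGR3A+"))
print(separates(coarse_summary, "CD14+", "FCGR3A+"))
```

With this toy input, the detailed labels split the two monocyte clusters while the coarse labels do not, which mirrors the argument for keeping the sub-population model.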
