Cami2-assembly-tutorial #3293

Open: wants to merge 19 commits into base: main

Contributor

@PlushZ PlushZ commented Mar 27, 2022

This PR adds a tutorial about my Master's project, "Reproducing Critical Assessment of Metagenome Interpretation assembly challenge on marine dataset with Galaxy", to training.galaxyproject.

@PlushZ PlushZ requested a review from a team as a code owner March 27, 2022 09:28
Member

@hexylena hexylena left a comment

Welcome @PlushZ ! I've made a number of comments on the formatting of the tutorial to help it conform to GTN standards for tutorials.

This is super cool to see! I always thought the assembly challenges were a good fit for Galaxy and for reproducing results there.

@bebatut bebatut force-pushed the cami2-assembly branch 2 times, most recently from 410b355 to bb3608b Compare December 5, 2022 15:19
>
> ```text
> SampleID URL
> Long read sample 0 https://frl.publisso.de/data/frl:6425521/marine/long_read/marmgCAMI2_sample_0_reads.tar.gz
Member

These links do not work: each one points to a tar.gz archive with subfolders inside. We will put the data in the shared data library.
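
To illustrate the point about nested archives, here is a minimal sketch that lists the files inside a tar.gz using Python's standard `tarfile` module. The archive is built in memory and its folder layout (`sample_0/reads/demo.fq`) is a hypothetical stand-in, not the actual structure of the CAMI2 downloads.

```python
import io
import tarfile


def list_archive_files(archive_bytes: bytes) -> list[str]:
    """Return the paths of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive() -> bytes:
    """Build a tiny in-memory tar.gz with a nested folder (hypothetical layout)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        payload = b"@read1\nACGT\n+\nIIII\n"
        info = tarfile.TarInfo(name="sample_0/reads/demo.fq")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# The read file sits two folders deep, so the archive URL is not a direct link to it.
files = list_archive_files(build_demo_archive())
print(files)
```

Listing the members this way shows which inner paths would have to be extracted before the reads can be used as individual datasets.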


Based on {% cite meyer2022critical %}, {% cite Meyer2021 %} and {% cite Meyer2021_tutorial %}, we can compare tools on a set of metrics, both to select one for an analysis and, here, to run the challenge:

Tool | Genome fraction (%) | Mismatches per 100 kbp | Misassemblies | NGA50 | Strain recall | Strain precision
Member

It would be good to add a short explanation of the different columns before the table.

ABySS ({% cite jackman2017abyss %}) | Very accurate mean fraction (<1% divergence) | | The fewest misassemblies | 100% precision. The highest strain precision (100%) for the unique genome
Ray ({% cite Boisvert2012 %}) | | | | | | 100% precision. The highest strain precision (100%) for the unique genome
A-STAR | A-STAR excelled in terms of genome fraction on marine and strain madness data sets. A-STAR improved the genome fraction to 44.1% on the marine dataset. On marine common genomes, A-STAR (26.7%) achieved the highest genome fractions. For unique genome, A-STAR provided the most complete assemblies (55.3% genome fraction). A-STAR partially recovered 102 (78%) of 131 16S gold standard sequences. | More mismatches than others: 773/100 kb | More misassemblies than others | | 2nd highest: 7.5% recall | 2nd highest: 69.4% precision
OPERA-MS [25] | There were selected 50 unique, public genomes present as a single contig in the gold standard and with annotated 16S sequences. The hybrid assembler OPERA-MS recovered one of the most complete 16S sequences (mean recovered gene fraction 47.1%). For the unique genome, OPERA-MS has an exceptional average NGA50 (187,083, 75% of the gold standard NGA50). | | The most contiguous assemblies were provided by the hybrid assembler OPERA-MS for the marine data, with an average NGA50 of 28,244 across genomes. |
Member

Please add the correct citation for the tool.

Gold Standard Assembly (GSA) | 76.9 | 0 | 0 | 682,777 | 54.9% (upper bound) | 100
ABySS ({% cite jackman2017abyss %}) | Very accurate mean fraction (<1% divergence) | | The fewest misassemblies | 100% precision. The highest strain precision (100%) for the unique genome
Ray ({% cite Boisvert2012 %}) | | | | | | 100% precision. The highest strain precision (100%) for the unique genome
A-STAR | A-STAR excelled in terms of genome fraction on marine and strain madness data sets. A-STAR improved the genome fraction to 44.1% on the marine dataset. On marine common genomes, A-STAR (26.7%) achieved the highest genome fractions. For unique genome, A-STAR provided the most complete assemblies (55.3% genome fraction). A-STAR partially recovered 102 (78%) of 131 16S gold standard sequences. | More mismatches than others: 773/100 kb | More misassemblies than others | | 2nd highest: 7.5% recall | 2nd highest: 69.4% precision
Member

Could you simplify the content of the cells in this table and the ones in the detail box?

Contributor Author

merge?


Using the described metrics, the different tools were evaluated in the CAMI paper, and the results were aggregated in tables (Supplementary Tables 3-7) from {%cite tutorialMeyer2021%}.

These tables also show ranking scores of the tools for every statistic, as well as overall ranking scores. The overall ranking score for each dataset is computed as the sum of all ranking scores across metrics. The average ranking score across both datasets is calculated as a weighted average of the rankings for the two datasets. We created [a table showing all ranking results from the previous tables](https://docs.google.com/spreadsheets/d/e/2PACX-1vQgJr3J-IyVy9IkXS9W-RZcV83Tr6f7RusG_97QwgpW2dFdCXUMroROIhy8gKjPcUgISFXW9NQwOzzK/pubhtml?gid=455354696).
Member

Could you add the table directly in the tutorial? Thanks
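
The aggregation described in the excerpt above (per-metric ranking scores summed per dataset, then a weighted average across datasets) could be sketched roughly as follows. This is only an illustration under stated assumptions: the tool names, ranks, and dataset weights are hypothetical placeholders rather than CAMI results, and it assumes a lower ranking score is better.

```python
from collections import defaultdict


def overall_ranking_scores(ranks_by_metric):
    """Sum each tool's per-metric ranking scores into one overall score."""
    totals = defaultdict(int)
    for ranks in ranks_by_metric.values():
        for tool, rank in ranks.items():
            totals[tool] += rank
    return dict(totals)


def weighted_average_scores(scores_by_dataset, weights):
    """Weighted average of overall scores for tools present in every dataset."""
    datasets = list(scores_by_dataset)
    shared_tools = set.intersection(*(set(scores_by_dataset[d]) for d in datasets))
    total_weight = sum(weights[d] for d in datasets)
    return {
        tool: sum(weights[d] * scores_by_dataset[d][tool] for d in datasets) / total_weight
        for tool in shared_tools
    }


# Hypothetical per-metric ranks (1 = best); these are NOT values from the paper.
marine = overall_ranking_scores({
    "genome_fraction": {"tool_a": 1, "tool_b": 2, "tool_c": 3},
    "misassemblies": {"tool_a": 3, "tool_b": 1, "tool_c": 2},
    "nga50": {"tool_a": 1, "tool_b": 3, "tool_c": 2},
})
strain_madness = overall_ranking_scores({
    "genome_fraction": {"tool_a": 2, "tool_b": 1},
    "misassemblies": {"tool_a": 2, "tool_b": 1},
})

# Hypothetical weights; the source does not state how the datasets are weighted.
average = weighted_average_scores(
    {"marine": marine, "strain_madness": strain_madness},
    {"marine": 3, "strain_madness": 1},
)
ordered = sorted(average, key=average.get)  # lowest (best) score first
print(marine, strain_madness, average, ordered)
```

Only tools evaluated on every dataset get an average here; that is a design choice of this sketch, and a tutorial table could equally show per-dataset scores side by side.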

Comment on lines 639 to 677
**Marine** dataset

_With tool versions_
1. HipMer
2. metaSPAdes_v3.13.1
3. metaSPAdes_v3.13.0
4. ABySS
5. Ray-Meta
6. Megahit_v1.1.2
7. SPAdes_v3.14-dev

_Without tool versions_
1. HipMer
2. **metaSPAdes**
3. **ABySS**
4. **Ray-Meta**
5. **Megahit**

**Strain madness** dataset
_With tool versions_
1. HipMer
2. Megahit_v1.1.2
3. SPAdes_v3.14-dev
4. OPERA-MS
5. Megahit_V1.2.7

_Without tool versions_
1. HipMer
2. **Megahit**
3. **SPAdes**
4. **OPERA**

**Plant-associated** dataset

The Supplementary Tables contain no dedicated ranking tables for the plant-associated dataset. However, {%cite Meyer2021%} includes information on tool performance on this dataset, from which we created a priority list of tools for the plant-associated dataset.

1. (Meta)HipMer
2. **(meta)Flye**
3. **(meta)SPAdes**
Member

It would be nice to show this as a table, perhaps.

2. **(meta)Flye**
3. **(meta)SPAdes**

Since this tutorial focuses on the marine dataset, it is reasonable to reproduce the CAMI2 assembly challenge using the better-performing assemblers: HipMer, metaSPAdes, ABySS, Ray-Meta, and Megahit. As we know from our [comparison of Galaxy and CAMI2 analysis](https://docs.google.com/spreadsheets/d/e/2PACX-1vQgJr3J-IyVy9IkXS9W-RZcV83Tr6f7RusG_97QwgpW2dFdCXUMroROIhy8gKjPcUgISFXW9NQwOzzK/pubhtml), metaSPAdes, ABySS, and Megahit are available in Galaxy, while Ray-Meta and HipMer are not.
Member

Please reference a table above.

3 participants