Cami2-assembly-tutorial #3293
base: main
Welcome @PlushZ! I've made a number of comments on the formatting of the tutorial to help it conform to GTN standards for tutorials.
This is super cool to see! I always thought the assembly challenges were a good fit for Galaxy and for reproducing results there.
Co-authored-by: Helena <[email protected]>
410b355 to bb3608b
bb3608b to c0e352f
>
> ```text
> SampleID URL
> Long read sample 0 https://frl.publisso.de/data/frl:6425521/marine/long_read/marmgCAMI2_sample_0_reads.tar.gz
> ```
These links do not work: each one points to a tar.gz archive with subfolders in it. We will put the data in the shared data library.
Based on {% cite meyer2022critical %}, {% cite Meyer2021 %} and {% cite Meyer2021_tutorial %}, we can compare tools on a set of metrics to select the one to use for an analysis but also here to run the challenge:

Tool | Genome fraction (%) | Mismatches per 100 kbp | Misassemblies | NGA50 | Strain recall | Strain precision
It would be good to add a short explanation of the different columns before the table.
ABySS ({% cite jackman2017abyss %}) | Very accurate mean fraction (<1% divergence) | | The fewest misassemblies | 100% precision. The highest strain precision (100%) for the unique genome
Ray ({% cite Boisvert2012 %}) | | | | | | 100% precision. The highest strain precision (100%) for the unique genome
A-STAR | A-STAR excelled in terms of genome fraction on marine and strain madness data sets. A-STAR improved the genome fraction to 44.1% on the marine dataset. On marine common genomes, A-STAR (26.7%) achieved the highest genome fractions. For unique genome, A-STAR provided the most complete assemblies (55.3% genome fraction). A-STAR partially recovered 102 (78%) of 131 16S gold standard sequences. | More mismatches than others: 773/100 kb | More misassemblies than others | | 2nd highest: 7.5% recall | 2nd highest: 69.4% precision
OPERA-MS [25] | There were selected 50 unique, public genomes present as a single contig in the gold standard and with annotated 16S sequences. The hybrid assembler OPERA-MS recovered one of the most complete 16S sequences (mean recovered gene fraction 47.1%). For the unique genome, OPERA-MS has an exceptional average NGA50 (187,083, 75% of the gold standard NGA50). | | The most contiguous assemblies were provided by the hybrid assembler OPERA-MS for the marine data, with an average NGA50 of 28,244 across genomes. |
Please add the correct citation for the tool.
Gold Standard Assembly (GSA) | 76.9 | 0 | 0 | 682,777 | 54.9% (upper bound) | 100
ABySS ({% cite jackman2017abyss %}) | Very accurate mean fraction (<1% divergence) | | The fewest misassemblies | 100% precision. The highest strain precision (100%) for the unique genome
Ray ({% cite Boisvert2012 %}) | | | | | | 100% precision. The highest strain precision (100%) for the unique genome
A-STAR | A-STAR excelled in terms of genome fraction on marine and strain madness data sets. A-STAR improved the genome fraction to 44.1% on the marine dataset. On marine common genomes, A-STAR (26.7%) achieved the highest genome fractions. For unique genome, A-STAR provided the most complete assemblies (55.3% genome fraction). A-STAR partially recovered 102 (78%) of 131 16S gold standard sequences. | More mismatches than others: 773/100 kb | More misassemblies than others | | 2nd highest: 7.5% recall | 2nd highest: 69.4% precision
Could you simplify the content of the cells in this table and the ones in the detail box?
merge?
Using the described metrics, the different tools were evaluated in the CAMI paper and aggregated in tables (Supplementary Tables 3-7) from {%cite tutorialMeyer2021%}.

In these tables there are also ranking scores of the tools shown for every statistic as well as overall ranking scores. Overall, ranking scores for every dataset are computed as a sum of all ranking scores across metrics. The average ranking score of both datasets are calculated as weighted average sum of ranking for both datasets. We created [a table showing all ranking results from previous tables](https://docs.google.com/spreadsheets/d/e/2PACX-1vQgJr3J-IyVy9IkXS9W-RZcV83Tr6f7RusG_97QwgpW2dFdCXUMroROIhy8gKjPcUgISFXW9NQwOzzK/pubhtml?gid=455354696)
Could you add the table directly to the tutorial? Thanks.
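The ranking aggregation described in the hunk above (per-dataset overall score as a sum of per-metric ranking scores, then a weighted average across datasets) could be sketched as follows. This is only an illustration under assumptions: the tool names, metric names, scores, and dataset weights are hypothetical placeholders, not values from the CAMI supplementary tables, and the sketch assumes that a lower ranking score is better.

```python
from typing import Dict

# tool -> metric -> ranking score for ONE dataset (hypothetical values).
RankTable = Dict[str, Dict[str, int]]


def overall_scores(per_metric: RankTable) -> Dict[str, int]:
    """Sum each tool's ranking scores across all metrics of one dataset."""
    return {tool: sum(ranks.values()) for tool, ranks in per_metric.items()}


def weighted_average(
    scores_by_dataset: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted average of overall scores for tools present in every dataset.

    The weights are an assumption: the text does not say how they are chosen.
    """
    common = set.intersection(*(set(s) for s in scores_by_dataset.values()))
    total_weight = sum(weights[d] for d in scores_by_dataset)
    return {
        tool: sum(weights[d] * scores_by_dataset[d][tool] for d in scores_by_dataset)
        / total_weight
        for tool in sorted(common)
    }


# Hypothetical example data, NOT taken from the CAMI tables.
marine = {
    "tool_a": {"genome_fraction": 1, "mismatches": 2, "misassemblies": 1},
    "tool_b": {"genome_fraction": 2, "mismatches": 1, "misassemblies": 3},
}
strain_madness = {
    "tool_a": {"genome_fraction": 2, "mismatches": 2, "misassemblies": 1},
    "tool_b": {"genome_fraction": 1, "mismatches": 1, "misassemblies": 2},
}

per_dataset = {
    "marine": overall_scores(marine),
    "strain_madness": overall_scores(strain_madness),
}
# Equal weights are assumed here purely for illustration.
average = weighted_average(per_dataset, {"marine": 1.0, "strain_madness": 1.0})
# Order tools from best to worst, assuming a lower score is better.
ranking = sorted(average, key=average.get)
```

Restricting the average to tools present in every dataset is a design choice of this sketch; tools evaluated on only one dataset would need separate handling.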
**Marine** dataset

_With tool versions_
1. HipMer
2. metaSPAdes_v3.13.1
3. metaSPAdes_v3.13.0
4. ABySS
5. Ray-Meta
6. Megahit_v1.1.2
7. SPAdes_v3.14-dev

_Without tool versions_
1. HipMer
2. **metaSPAdes**
3. **ABySS**
4. **Ray-Meta**
5. **Megahit**

**Strain madness** dataset
_With tool versions_
1. HipMer
2. Megahit_v1.1.2
3. SPAdes_v3.14-dev
4. OPERA-MS
5. Megahit_V1.2.7

_Without tool versions_
1. HipMer
2. **Megahit**
3. **SPAdes**
4. **OPERA**

**Plant-associated** dataset

There are no certain ranking tables among Supplementary tables for plant-associated dataset. However, in {%cite Meyer2021%} there is information related to tools performance on plant-associated dataset. We created the priority list of tools for the plant-associated dataset.

1. (Meta)HipMer
2. **(meta)Flye**
3. **(meta)SPAdes**
It would be nice to show that as a table, maybe.
2. **(meta)Flye**
3. **(meta)SPAdes**

Since in this tutorial we have decided to focus on marine dataset it would be reasonable to reproduce CAMI2 assembly challenge using HipMer, metaSPAdes, ABySS, Ray-Meta, Megahit assemblers which performed better. As we know from our [comparison Galaxy and CAMI2 analysis](https://docs.google.com/spreadsheets/d/e/2PACX-1vQgJr3J-IyVy9IkXS9W-RZcV83Tr6f7RusG_97QwgpW2dFdCXUMroROIhy8gKjPcUgISFXW9NQwOzzK/pubhtml), metaSPAdes, ABySS, Megahit tools are available in Galaxy while Ray-Meta and HipMer are not.
Reference to a table above
79daae4 to 2e0b90c
This PR adds a tutorial about my Master's project, "Reproducing Critical Assessment of Metagenome Interpretation assembly challenge on marine dataset with Galaxy", to training.galaxyproject