Description
Is your feature request related to a problem? Please describe.
Yes. Many public health laboratories have limited experience working with Candida auris in the lab (identifying the species via wet-lab techniques, MALDI-TOF, colony morphology, etc.), which may lead to sequencing of samples that are not pure cultures/isolates.
One example: mixed culture of C. auris and C. parapsilosis. We recently looked at a sample that had roughly 36.6% reads assigned to C. auris and 57.7% of reads assigned to C. parapsilosis. A de novo assembly of the FASTQs from this sample resulted in a genome size of roughly 25 Mbases, which also shows evidence of a mixed isolate.
MycoSNP currently does not check for read-level contamination AFAIK. Additionally, with consensus/reference-based assembly, it may be difficult to identify mixed/contaminated samples from the current MycoSNP QC outputs. The sample mentioned above did fail MycoSNP QC due to a low GC content of 40.3%, but the QC outputs did not make it obvious that the cause was a mixed culture.
kraken2 (along with a proper database & fine-tuned parameters) could be used to screen the reads for potential contamination and ensure that the reads going into assembly are indeed from C. auris alone.
Describe the solution you'd like
- Add a step early in the workflow that runs the FASTQs through kraken2
- Check the kraken2 report for a high percentage of reads (e.g. >80%) assigned to Candida auris and a low-to-zero percentage of reads (e.g. <10%) assigned to another Candida species or other contaminating species
- Provide the kraken2 report file as an output from the workflow
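The check described in the list above could be sketched roughly as follows. This is only an illustration, not MycoSNP code: the function name `screen_k2_report` is hypothetical, the 80%/10% defaults are just the example thresholds from this request, and it assumes the standard tab-delimited kraken2 report layout (percent, clade reads, taxon reads, rank code, taxid, name). C. auris is matched by NCBI taxid 498019, as it appears in the reports further down, rather than by name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: screen a kraken2 report for a dominant target species.
# PASS when the target species (rank code "S") holds at least MIN_TARGET percent
# of reads and no other species-level taxon reaches MAX_OTHER percent.
# Usage: screen_k2_report <report> [target_taxid] [min_target_pct] [max_other_pct]
screen_k2_report() {
    local report="$1"
    local taxid="${2:-498019}"      # [Candida] auris in the reports shown in this issue
    local min_target="${3:-80}"     # example threshold from this request
    local max_other="${4:-10}"      # example threshold from this request
    awk -F'\t' -v taxid="$taxid" -v min_t="$min_target" -v max_o="$max_other" '
        $4 == "S" {
            name = $6
            sub(/^ +/, "", name)    # kraken2 indents names to show tree depth
            if ($5 + 0 == taxid + 0) {
                target_pct = $1 + 0
            } else if ($1 + 0 > other_pct) {
                other_pct = $1 + 0
                other_name = name
            }
        }
        END {
            status = (target_pct >= min_t && other_pct < max_o) ? "PASS" : "FAIL"
            printf "%s\ttarget=%.2f\ttop_other=%.2f\t%s\n", status, target_pct, other_pct, (other_name == "" ? "none" : other_name)
            exit (status == "PASS" ? 0 : 1)
        }
    ' "$report"
}
```

Calling `screen_k2_report mixed-sample.EuPathDB48.report.out` prints one tab-separated summary line and returns a non-zero exit status on failure, so a workflow step could gate on it. With these example thresholds, a sample at 36.59% C. auris and 57.68% C. parapsilosis would fail, while one at 90.84% C. auris with no other species above 1% would pass.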
Describe alternatives you've considered
None.
Additional context
I had a bad time using the standard Kraken2 databases built from RefSeq sequences. They do not appear to include any C. auris assemblies, and few other Candida species are present. Nearly all reads from my test FASTQs were left unclassified by kraken2.
I had good luck with the pre-built k2 database called "EuPathDB48" for Eukaryotic pathogens found here: https://benlangmead.github.io/aws-indexes/k2#:~:text=.txt-,EuPathDB48,-3
If you visit this link and press CTRL+F for "Candida", you can see all Candida species present in the database: https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt
The downside of this database is its size: it required 34 GB of RAM to run and would be cumbersome for users to download, so it is not practical for routine screening of FASTQs.
One alternative idea is to create a custom and small kraken2 database, potentially hosted on Azure cloud storage, Zenodo, or some other archival service, that is built using high quality Candida auris (and other Candida spp.) reference genomes and could be used to identify contamination between Candida species as well as other common contaminants (human? others?)
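Building such a small custom database might look roughly like the sketch below. It uses the `kraken2-build` subcommands for custom databases (`--download-taxonomy`, `--add-to-library`, `--build`). The function name `build_custom_k2_db`, the `KRAKEN2_BUILD` and `THREADS` variables, and the FASTA file names are hypothetical. It also assumes each reference FASTA header either is an NCBI accession or carries an explicit `kraken:taxid|<taxid>` tag, which kraken2 needs in order to place the sequence in the taxonomy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a small custom kraken2 database from reference FASTAs.
# Usage: build_custom_k2_db <db_dir> <ref1.fasta> [ref2.fasta ...]
build_custom_k2_db() {
    local db="$1"; shift
    local k2build="${KRAKEN2_BUILD:-kraken2-build}"   # overridable, e.g. "echo" for a dry run
    local threads="${THREADS:-4}"
    local fasta

    # Fetch the NCBI taxonomy that the database is built against.
    "$k2build" --download-taxonomy --db "$db" || return 1

    # Add each reference genome (C. auris, other Candida spp., ...) to the library.
    for fasta in "$@"; do
        "$k2build" --add-to-library "$fasta" --db "$db" || return 1
    done

    # Build the hash table and taxonomy files (hash.k2d, opts.k2d, taxo.k2d).
    "$k2build" --build --threads "$threads" --db "$db"
}
```

Running `KRAKEN2_BUILD=echo build_custom_k2_db k2-db-candida Cauris.fasta Cparapsilosis.fasta` prints the commands without executing them. If human reads are a concern, `kraken2-build --download-library human --db <db_dir>` could be added before the build step, at the cost of a much larger database.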
Example usage & results
$ ls k2-db-EuPathDB48/
EuPathDB48_Contents.txt database150mers.kmer_distrib database250mers.kmer_distrib database50mers.kmer_distrib hash.k2d k2_eupathdb48_20201113.tar.gz seqid2taxid.map
database100mers.kmer_distrib database200mers.kmer_distrib database300mers.kmer_distrib database75mers.kmer_distrib inspect.txt opts.k2d taxo.k2d
# launch staphb kraken2 v2.1.2 docker image; fastq files and EuPathDB database files are in PWD
$ docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data -ti staphb/kraken2:2.1.2-no-db
# mostly standard parameters
$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
--paired mixed-sample*.fastq.gz \
--output mixed-sample.k2-EuPathDB48.out \
--report mixed-sample.EuPathDB48.report.out
$ kraken2 --db k2-db-EuPathDB48/ --threads 8 --gzip-compressed \
--paired good-Cauris-sample*.fastq.gz \
--output good-Cauris-sample.k2-EuPathDB48.out \
--report good-Cauris-sample.EuPathDB48.report.out
$ head -n 30 mixed-sample.EuPathDB48.report.out
4.11 119260 119260 U 0 unclassified
95.89 2785280 0 R 1 root
95.89 2785280 0 R1 131567 cellular organisms
95.89 2785280 1935 D 2759 Eukaryota
95.72 2780203 0 D1 33154 Opisthokonta
95.72 2780203 386 K 4751 Fungi
95.70 2779695 282 K1 451864 Dikarya
95.65 2778276 5 P 4890 Ascomycota
95.64 2777974 25 P1 716545 saccharomyceta
95.38 2770257 0 P2 147537 Saccharomycotina
95.38 2770257 0 C 4891 Saccharomycetes
95.38 2770257 119 O 4892 Saccharomycetales
57.73 1676934 0 F 766764 Debaryomycetaceae
57.73 1676934 0 F1 1535325 Candida/Lodderomyces clade
57.73 1676934 41 G 1535326 Candida
57.68 1675417 0 S 5480 Candida parapsilosis
57.68 1675417 1675417 S1 578454 Candida parapsilosis CDC317
0.04 1294 35 S 5476 Candida albicans
0.04 1163 1163 S1 237561 Candida albicans SC5314
0.00 96 96 S1 294748 Candida albicans WO-1
0.01 182 0 S 5482 Candida tropicalis
0.01 182 182 S1 294747 Candida tropicalis MYA-3404
37.63 1093076 0 F 27319 Metschnikowiaceae
37.63 1093076 517 G 36910 Clavispora
37.43 1087135 1289 G1 1540022 Clavispora/Candida clade
36.59 1062774 1062774 S 498019 [Candida] auris
0.48 13932 13932 S 45357 [Candida] haemulonis
0.31 9140 9140 S 1231522 [Candida] duobushaemulonis
0.19 5424 0 S 36911 Clavispora lusitaniae
0.19 5424 5424 S1 306902 Clavispora lusitaniae ATCC 42720
$ head -n 30 good-Cauris-sample.EuPathDB48.report.out
7.12 107209 107209 U 0 unclassified
92.88 1398000 0 R 1 root
92.88 1398000 0 R1 131567 cellular organisms
92.88 1398000 884 D 2759 Eukaryota
92.71 1395407 0 D1 33154 Opisthokonta
92.71 1395407 0 K 4751 Fungi
92.70 1395389 1 K1 451864 Dikarya
92.70 1395371 3 P 4890 Ascomycota
92.69 1395139 26 P1 716545 saccharomyceta
92.52 1392677 0 P2 147537 Saccharomycotina
92.52 1392677 0 C 4891 Saccharomycetes
92.52 1392677 90 O 4892 Saccharomycetales
92.46 1391641 0 F 27319 Metschnikowiaceae
92.46 1391641 556 G 36910 Clavispora
92.13 1386684 1478 G1 1540022 Clavispora/Candida clade
90.84 1367342 1367342 S 498019 [Candida] auris
0.73 10992 10992 S 45357 [Candida] haemulonis
0.46 6872 6872 S 1231522 [Candida] duobushaemulonis
0.29 4401 0 S 36911 Clavispora lusitaniae
0.29 4401 4401 S1 306902 Clavispora lusitaniae ATCC 42720
0.06 872 0 F 766764 Debaryomycetaceae
0.06 872 0 F1 1535325 Candida/Lodderomyces clade
0.06 872 1 G 1535326 Candida
0.03 499 7 S 5476 Candida albicans
0.03 492 492 S1 237561 Candida albicans SC5314
0.02 348 0 S 5480 Candida parapsilosis
0.02 348 348 S1 578454 Candida parapsilosis CDC317
0.00 24 0 S 5482 Candida tropicalis
0.00 24 24 S1 294747 Candida tropicalis MYA-3404
0.00 58 0 F 4893 Saccharomycetaceae