Hello all. I have a query about the FASTQ files available via the ENA/SRA: are they indeed the raw files? Or have these sequences already been subject to primer trimming at their starts, perhaps as part of the Illumina de-multiplexing?
Using two examples:
- SRR3359939 Mock-community_09-12S
- SRR3359940 Mock-community_09-CytB
I have downloaded the FASTQ files from https://www.ebi.ac.uk/ena/data/view/PRJNA313432
12S
After merging overlapping reads with Flash and primer trimming, I find the most common sequence in the SRR3359939 aka Mock-community_09-12S to be this Abramis brama sequence (which in reference sequences is flanked by the primers):
ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
However, note the following:
$ cat raw_data/SRR3359939_1.fastq.gz | gunzip \
| grep -c ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
13731
$ cat raw_data/SRR3359939_1.fastq.gz | gunzip \
| grep -c ^ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
13731
Using the reverse complement GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT and looking at the reverse reads, we have the same pattern:
$ cat raw_data/SRR3359939_2.fastq.gz | gunzip \
| grep -c GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT
13499
$ cat raw_data/SRR3359939_2.fastq.gz | gunzip \
| grep -c ^GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT
13496
i.e. essentially all of the reads with a perfect match start with the marker (all 13731 of the forward reads, and 13496 of the 13499 reverse reads), so the opening forward primer has apparently been removed from the forward FASTQ, and the reverse primer removed from the start of the reverse FASTQ reads.
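The grep counts above scan every line of the FASTQ file, including the header and quality lines. A slightly stricter variant of the same check could look like the sketch below, which only inspects the sequence line of each record. It assumes plain four-line FASTQ records; the function names `revcomp` and `count_hits` are made up for this illustration, and `revcomp` handles A/C/G/T only (no IUPAC ambiguity codes).

```shell
# Reverse complement of a DNA string (plain A/C/G/T only, no IUPAC codes).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Read four-line FASTQ text on stdin and print two numbers:
# reads containing the sequence $1, and reads starting with it.
# Only the sequence line (line 2 of each record) is examined.
count_hits() {
    awk -v seq="$1" '
        NR % 4 == 2 {
            pos = index($0, seq)      # 0 if absent, 1 if at the very start
            if (pos > 0) any++
            if (pos == 1) start++
        }
        END { printf "%d %d\n", any, start }
    '
}

# Possible usage (needs the downloaded files, so left commented out):
# gunzip -c raw_data/SRR3359939_1.fastq.gz | count_hits "$marker"
# gunzip -c raw_data/SRR3359939_2.fastq.gz | count_hits "$(revcomp "$marker")"
```

Here `$marker` stands for the Abramis brama 12S sequence quoted above. If the two numbers printed for a file are equal, every perfect match sits at the very start of its read.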
CytB
Looking at the SRR3359940 Mock-community_09-CytB raw data, it does appear the primer sequences have been removed here too (although as you note, with the longer product we do not expect as much read-through, and so the primers will rarely if ever appear after the product).
This time after merging overlaps (and without doing any primer trimming), I find this perfect Abramis brama match to be the most common sequence (which is flanked by the CytB primers):
CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACCATCTAACATTTCAACACTATGAAACTTCGGATCCCTCCTAGGATTATGTTTAATTACCCAAATCCTCACGGGATTATTTCTAGCCATACACTACACCTCTGATATCTCCACCGCATTTTCATCAGTAACCCACATCTGCCGAGACGTTAACTACGGCTGACTTATTCGAAACTTACATGCTAATGGAGCATCATTCTTCTTTATCTGCCTTTATATACATATTGCACGAGGCCTATACTACGGGTCATATCTTTACAAAGAAACCTGAAATATTGGCGTAGTCCTATTTCTTCTAGTTATAATAACAGCCTTCGTCGGCTACGTACTTCCAT
Again looking at the raw reads, and checking for the first 80bp:
$ cat raw_data/SRR3359940_1.fastq.gz | gunzip | grep -c CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACC
20106
$ cat raw_data/SRR3359940_1.fastq.gz | gunzip | grep -c ^CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACC
20106
Lots of matches, and all at the very start of the forward read. And looking at the RC of the final 80bp, the same - lots of perfect matches all at the very start of the reverse read:
$ cat raw_data/SRR3359940_2.fastq.gz | gunzip | grep -c ATGGAAGTACGTAGCCGACGAAGGCTGTTATTATAACTAGAAGAAATAGGACTACGCCAATATTTCAGGTTTCTTTGTAA
21869
$ cat raw_data/SRR3359940_2.fastq.gz | gunzip | grep -c ^ATGGAAGTACGTAGCCGACGAAGGCTGTTATTATAACTAGAAGAAATAGGACTACGCCAATATTTCAGGTTTCTTTGTAA
21869
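For reference, the two 80bp probes used in these grep commands can be derived from the merged sequence rather than copied by hand. This is only a sketch with made-up helper names (`first_n`, `last_n_rc`), and it assumes the sequence contains upper-case A/C/G/T only:

```shell
# First N bases of a sequence: the probe for the start of the forward reads.
first_n() {
    printf '%s\n' "$1" | cut -c "1-$2"
}

# Reverse complement of the last N bases: the probe for the start of the
# reverse reads. Reversing first means the last N bases become the first N.
last_n_rc() {
    printf '%s\n' "$1" | rev | cut -c "1-$2" | tr 'ACGT' 'TGCA'
}

# Possible usage, with $merged holding the CytB sequence quoted above:
# first_n "$merged" 80
# last_n_rc "$merged" 80
```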
Work through
This does not contradict https://github.com/HullUni-bioinformatics/Haenfling_et_al_2016/blob/master/12S/12S.ipynb which says:
The 12S amplicon sequenced here is only 106bp long. Readlength used in the MiSeq run was 2x300bp. Our reads are thus longer than our amplicon and we so expect to find primer/adapter sequences in our reads that need to be removed as part of the raw data processing.
Specifically, forward reads are expected to contain the reverse complement of the reverse primer plus the reverse Illumina adapter (FA501 - FA508), and reverse reads will contain reverse complements of the forward primers and adapters (RB701 - RB712).
Now, to CytB. Quoting https://github.com/HullUni-bioinformatics/Haenfling_et_al_2016/blob/master/CytB/CytB.ipynb
The amplicon is expected to be > 400 bp long. With a readlength of 300 bp we don't expect to see any primer sequences, so it's not necessary to provide the Primer sequence for the trimming algorithm.
Again, no contradiction. Rather, the work through seems to assume that the primers at the start of the forward and reverse reads have already been removed?
Note those crude counts from grep are broadly consistent with the paper: Table S4 lists 13454 reads for the 12S Abramis brama, and Table S5 lists 22141 reads for the CytB Abramis brama.
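To put a number on how close the grep counts are to the published tables, the relative differences can be computed with a small helper. This is just a sketch; `pct_diff` is a made-up name, and pairing each grep count with a table entry is my own choice rather than anything from the paper.

```shell
# Signed percentage difference of an observed count relative to a reference count.
pct_diff() {
    awk -v obs="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", (obs - ref) * 100 / ref }'
}

# 12S: forward and reverse grep counts vs Table S4 (13454 reads)
pct_diff 13731 13454
pct_diff 13499 13454

# CytB: forward and reverse grep counts vs Table S5 (22141 reads)
pct_diff 20106 22141
pct_diff 21869 22141
```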