Primer trimming the raw FASTQ files from the SRA/ENA

Hello all. I have a query about the FASTQ files available via the ENA/SRA: are they indeed the *raw* files? Or have these sequences _already_ been subject to primer trimming at their starts, perhaps as part of the Illumina de-multiplexing?

Using two examples:
* SRR3359939	Mock-community_09-12S
* SRR3359940	Mock-community_09-CytB

I have downloaded the FASTQ files from https://www.ebi.ac.uk/ena/data/view/PRJNA313432

# 12S

After merging overlapping reads with Flash and primer trimming, I find that the most common sequence in SRR3359939 (aka Mock-community_09-12S) is this *Abramis brama* sequence (which in reference sequences is flanked by the primers):

``ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC``

However, note the following:

```
$ cat raw_data/SRR3359939_1.fastq.gz | gunzip \
| grep -c ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
13731

$ cat raw_data/SRR3359939_1.fastq.gz | gunzip \
| grep -c ^ACTATGCTCAGCCGTAAACCCAGACGTCCAACTACAATTAGACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
13731
```
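As a cross-check on the two `grep -c` calls above, here is a minimal sketch that restricts the search to the sequence line of each record, so header and quality lines cannot contribute even in principle. It assumes plain 4-line FASTQ records (no wrapped sequences); `count_matches` is a hypothetical helper name, not part of any existing tool.

```shell
# count_matches FILE.fastq.gz SEQUENCE
# Hypothetical helper: report how many reads contain SEQUENCE anywhere,
# and how many have it starting at the first base of the read.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4.
count_matches() {
    fastq_gz=$1
    query=$2
    gunzip -c "$fastq_gz" | awk -v s="$query" '
        NR % 4 == 2 {
            pos = index($0, s)          # 0 if absent, 1 if at the read start
            if (pos > 0)  anywhere++
            if (pos == 1) at_start++
        }
        END { printf "anywhere=%d at_start=%d\n", anywhere, at_start }
    '
}
```

It would be called as `count_matches raw_data/SRR3359939_1.fastq.gz <sequence>`; if both numbers agree, every perfect match sits at the very start of its read.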

Looking at the reverse reads using the reverse complement ``GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT``, we see the same pattern:

```
$ cat raw_data/SRR3359939_2.fastq.gz | gunzip \
| grep -c GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT
13499

$ cat raw_data/SRR3359939_2.fastq.gz | gunzip \
| grep -c ^GGGGGTCTAAGGCACCGTCAGGTCCTTTGGGTTTCAAGCTAATGCTCGTAGTACCCGGGCGGACGTCTAATTGTAGTTGGACGTCTGGGTTTACGGCTGAGCATAGT
13496
```
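For anyone wanting to reproduce the reverse-complement string used above, one way is with standard command-line tools. This is only a sketch: `revcomp` is a hypothetical helper name, and it handles only unambiguous A/C/G/T bases (IUPAC ambiguity codes are passed through unchanged).

```shell
# revcomp SEQUENCE
# Hypothetical helper: reverse the string, then complement each base.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}
```

For example, `revcomp ACTATGC` prints `GCATAGT`. Given the forward 12S sequence, it should give back the string searched for in the reverse reads.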

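i.e. All of the perfect matches in the forward reads, and all but 3 of the 13499 in the reverse reads, start at the very beginning of the read with the marker. So the opening forward primer has apparently been removed from the forward FASTQ reads, and the reverse primer removed from the start of the reverse FASTQ reads.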

# CytB

Looking at the SRR3359940 Mock-community_09-CytB raw data, it does appear the primer sequences have been removed here too (although as you note, with the longer product we do not expect as much read-through, so a primer will rarely if ever appear _after_ the product).

This time after merging overlaps (and without doing any primer trimming), I find this perfect *Abramis brama* match to be the most common sequence (which is flanked by the CytB primers):

``CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACCATCTAACATTTCAACACTATGAAACTTCGGATCCCTCCTAGGATTATGTTTAATTACCCAAATCCTCACGGGATTATTTCTAGCCATACACTACACCTCTGATATCTCCACCGCATTTTCATCAGTAACCCACATCTGCCGAGACGTTAACTACGGCTGACTTATTCGAAACTTACATGCTAATGGAGCATCATTCTTCTTTATCTGCCTTTATATACATATTGCACGAGGCCTATACTACGGGTCATATCTTTACAAAGAAACCTGAAATATTGGCGTAGTCCTATTTCTTCTAGTTATAATAACAGCCTTCGTCGGCTACGTACTTCCAT``

Again looking at the raw reads, and checking for the first 80 bp:

```
$ cat raw_data/SRR3359940_1.fastq.gz | gunzip | grep -c CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACC
20106
$ cat raw_data/SRR3359940_1.fastq.gz | gunzip | grep -c ^CAGGAACTAATGGCAAGCCTACGAAAAACCCACCCACTAATAAAAATCGCTAATGACGCACTAGTCGACCTCCCAACACC
20106
```

Lots of matches, all at the very start of the forward read. Looking at the reverse complement (RC) of the final 80 bp gives the same result: lots of perfect matches, all at the very start of the reverse read:

```
$ cat raw_data/SRR3359940_2.fastq.gz | gunzip | grep -c ATGGAAGTACGTAGCCGACGAAGGCTGTTATTATAACTAGAAGAAATAGGACTACGCCAATATTTCAGGTTTCTTTGTAA
21869
$ cat raw_data/SRR3359940_2.fastq.gz | gunzip | grep -c ^ATGGAAGTACGTAGCCGACGAAGGCTGTTATTATAACTAGAAGAAATAGGACTACGCCAATATTTCAGGTTTCTTTGTAA
21869
```
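The two 80 bp probes used above can be cut from the merged sequence rather than copied by hand. The following is a sketch with hypothetical helper names (`first_n`, `rc_last_n`), again assuming only unambiguous A/C/G/T bases.

```shell
# first_n SEQUENCE N
# Hypothetical helper: the first N bases, for searching the forward reads.
first_n() {
    printf '%s\n' "$1" | cut -c "1-$2"
}

# rc_last_n SEQUENCE N
# Hypothetical helper: reverse complement of the final N bases, for
# searching the reverse reads. Reversing first means the final N bases
# become the first N, which are then cut out and complemented.
rc_last_n() {
    printf '%s\n' "$1" | rev | cut -c "1-$2" | tr 'ACGTacgt' 'TGCAtgca'
}
```

For example, `first_n AAACCCGGT 4` prints `AAAC` and `rc_last_n AAACCCGGT 4` prints `ACCG`.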

# Work-through

This does not contradict https://github.com/HullUni-bioinformatics/Haenfling_et_al_2016/blob/master/12S/12S.ipynb which says:

> The 12S amplicon sequenced here is only 106bp long. Readlength used in the MiSeq run was 2x300bp. Our reads are thus longer than our amplicon and we so expect to find primer/adapter sequences in our reads that need to be removed as part of the raw data processing.
>
> Specifically, forward reads are expected to contain the reverse complement of the reverse primer plus the reverse Illumina adapter (FA501 - FA508), and reverse reads will contain reverse complements of the forward primers and adapters (RB701 - RB712).

Now, to CytB. Quoting https://github.com/HullUni-bioinformatics/Haenfling_et_al_2016/blob/master/CytB/CytB.ipynb

> The amplicon is expected to be > 400 bp long. With a readlength of 300 bp we don't expect to see any primer sequences, so it's not necessary to provide the Primer sequence for the trimming algorithm.

Again, no contradiction. Rather, the work-through seems to assume that the primers at the start of the forward and reverse reads have already been removed?

Note that those crude counts from grep are consistent with the paper: Table S4 lists 13454 reads for the 12S *Abramis brama*, and Table S5 lists 22141 reads for the CytB *Abramis brama*.

