Run the scripts from the parent directory of the cloned GitHub repo (so that the relative file paths still resolve)
This workflow is written for a SLURM HPC system (I used the CropDiversity cluster)
To run this code, git clone the whole repo, keeping the adapter directory and the scripts directory in their relative places so that the code runs smoothly (or edit the code yourself, but only in your own clone of the repo!)
This script is a loop that first demultiplexes the reads by the 5' SP5 primers and then, one by one, demultiplexes each output of the SP5 step by the 3' SP27 primers. It also trims adapters on the fly.
This script takes 2 extra files, which are hardcoded into it: the 5' adapter sequences (M13_amplicon_indices_forward.fa) and the reverse complement of the 3' adapter sequences (M13_amplicon_indices_reverse_rc.fa). The reverse complement is needed because we reoriented the reads beforehand with pychopper.
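As a rough illustration of how a reverse-complemented adapter file such as M13_amplicon_indices_reverse_rc.fa could be derived from the original 3' sequences, here is a minimal Python sketch. The function names are hypothetical and this is not necessarily how the file in this repo was produced; it only handles A, C, G, T and N.

```python
# Complement table for plain DNA (A, C, G, T, N); other IUPAC codes are
# deliberately left out of this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(text: str) -> str:
    """Reverse-complement every record in FASTA-formatted text.

    Headers are kept unchanged; wrapped sequences are joined onto one line.
    """
    out = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if header is not None:
                out.append(header)
                out.append(reverse_complement("".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        out.append(header)
        out.append(reverse_complement("".join(chunks)))
    return "\n".join(out) + "\n"
```

To convert a whole file, read it as text, pass it through `reverse_complement_fasta`, and write the result to a new file.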
You may notice that we do bin reads into adapter combinations which do not exist! Though this is inefficient, it's not the end of the world, since we have sequenced deeply enough per individual that enough reads are binned into the correct adapter combinations
These non-existent adapter combinations are removed after demultiplexing, along with the 'unknown' bins, since we don't want to analyse them later on.
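The clean-up logic described above can be sketched in Python as follows. This is only an illustration: the index names (`F01`, `R01`, ...) and the function name are hypothetical, and the real script may identify bins by file name rather than by tuples.

```python
from itertools import product


def bins_to_remove(all_bins, valid_pairs):
    """Return the bins that should be discarded after demultiplexing.

    all_bins    : iterable of (forward_index, reverse_index) tuples that the
                  two-stage demultiplexing produced
    valid_pairs : iterable of (forward_index, reverse_index) tuples that were
                  actually used in the wet lab
    A bin is discarded if either side is 'unknown' or the pair was never used.
    """
    valid = set(valid_pairs)
    return sorted(
        b for b in all_bins
        if "unknown" in b or b not in valid
    )


# Hypothetical example: 2 forward and 2 reverse indices, plus 'unknown'.
forward = ["F01", "F02", "unknown"]
reverse = ["R01", "R02", "unknown"]
every_bin = list(product(forward, reverse))      # 9 bins in total
used_in_lab = [("F01", "R01"), ("F02", "R02")]   # only 2 real combinations
discard = bins_to_remove(every_bin, used_in_lab)
```

Demultiplexing in two nested stages always yields the full cross product of indices, which is why the non-existent combinations appear in the first place.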
The workflow is more dynamic here, since the user may be analysing different amplicons from ours. In our wet-lab protocol:
For the rRNAs, we use 2 sets of primers to amplify overlapping segments of the nuclear rRNA cistron, so a sequence can be anywhere over 3 kb in length
For the COIs, we use 2 alternative forward primers and 1 reverse primer to amplify 'redundant' sequences of a COI barcode segment (having 2 forward primers lets us match more taxa than a single primer would), so a sequence can be 300-900 bp in length
The amplicon sorter step splits samples' reads by size and clusters them by sequence
The variables (min length, max length, prefix name for amplicon type, and input folder) are user-defined :)
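To make the four user-defined variables concrete, here is a small Python sketch of how they could be grouped per amplicon type. The class name, the input paths, and the exact thresholds are assumptions for illustration (the lengths are taken from the ranges described above, and the real scripts may use different values or pass them as command-line arguments).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AmpliconSettings:
    """The four user-defined variables for one amplicon type (hypothetical)."""
    prefix: str                 # prefix name for the amplicon type
    min_length: int             # shortest read to keep, in bp
    max_length: Optional[int]   # longest read to keep, in bp (None = no cap)
    input_dir: str              # folder holding the demultiplexed reads

    def accepts(self, read_length: int) -> bool:
        """Return True if a read of this length falls inside the window."""
        if read_length < self.min_length:
            return False
        return self.max_length is None or read_length <= self.max_length


# Illustrative values only; the paths are placeholders.
COI = AmpliconSettings("COI", 300, 900, "path/to/demultiplexed")
RRNA = AmpliconSettings("rRNA", 3000, None, "path/to/demultiplexed")
```

Keeping the settings in one place per amplicon makes it easier to rerun the sorting step for a different amplicon without editing the rest of the workflow.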
This script takes each clustered (amplicon-sorted) file and removes the primer sequences from the amplicon sequences, since these are synthetic
The user can define the primer sequences to remove based on the amplicon they are sequencing, but if more than one amplicon was sequenced in a run, the other primer sets can also be submitted as a back check
Sequences with lone primers, mismatched primers, or >1 primer of a kind are removed as a failsafe
Optionally, the user can choose to trim 'untrimmed' sequences with alternate primers
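The failsafe described above can be sketched roughly as follows in Python. This is a simplified illustration, not the script's actual method: it uses exact string matching (real trimming tools tolerate sequencing errors), and the primer names and sequences are made up.

```python
def trim_primers(seq, forward, reverse_rc, allowed_pairs):
    """Return the primer-free insert, or None if the sequence fails the checks.

    forward       : dict of forward primer name -> sequence
    reverse_rc    : dict of reverse primer name -> reverse-complemented sequence
    allowed_pairs : set of (forward_name, reverse_name) that belong together
    """
    # Record one entry per occurrence, so duplicates show up as extra hits.
    fwd_hits = [n for n, p in forward.items() for _ in range(seq.count(p))]
    rev_hits = [n for n, p in reverse_rc.items() for _ in range(seq.count(p))]

    # Lone primers (0 hits on one side) or >1 primer of a kind are rejected.
    if len(fwd_hits) != 1 or len(rev_hits) != 1:
        return None

    # Mismatched primers: forward and reverse come from different sets.
    pair = (fwd_hits[0], rev_hits[0])
    if pair not in allowed_pairs:
        return None

    start = seq.find(forward[pair[0]]) + len(forward[pair[0]])
    end = seq.find(reverse_rc[pair[1]], start)
    if end == -1:
        # Reverse primer sits upstream of the forward primer.
        return None
    return seq[start:end]


# Hypothetical primers: two alternative COI forwards, one COI reverse,
# plus an rRNA pair submitted as a back check.
FORWARD = {"coi_f1": "AAACCC", "coi_f2": "AAAGGG", "rrna_f": "TTTCCC"}
REVERSE_RC = {"coi_r": "GGGTTT", "rrna_r": "CCCAAA"}
ALLOWED = {("coi_f1", "coi_r"), ("coi_f2", "coi_r"), ("rrna_f", "rrna_r")}
```

Storing forward and reverse primers separately, with an explicit set of allowed pairs, lets two alternative forward primers share one reverse primer without the reverse primer being counted twice.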
05a_pybarrnap_rDNA_extract.sh: sbatch $0 /path/to/dataset/primerless
05b_reorganise_COIs.sh: sbatch $0 /path/to/dataset/primerless
The 5a script uses pybarrnap version 0.5.1. It takes the assembled contigs and uses a covariance model based on Rfam (14.10) to extract sequences matching 28S and 18S rDNA profiles from our amplicon contigs.
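To show what the extraction step amounts to, here is a hedged Python sketch that pulls 18S and 28S regions out of contigs given GFF3-style annotations. It assumes the annotations carry a `Name` attribute such as `18S_rRNA` or `28S_rRNA` (the convention barrnap-style tools generally use); it does not call pybarrnap itself, and the function name and record naming are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_rrna(contigs, gff_text, wanted=("18S", "28S")):
    """Extract annotated rRNA regions from contigs.

    contigs  : dict of contig name -> sequence
    gff_text : GFF3-formatted annotation text (tab-separated, 9 columns)
    wanted   : prefixes of the Name attribute to keep
    Returns a list of (record_name, sequence) tuples.
    """
    records = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        seqid, strand, attrs = cols[0], cols[6], cols[8]
        start, end = int(cols[3]), int(cols[4])  # GFF is 1-based, inclusive
        fields = dict(
            item.split("=", 1) for item in attrs.split(";") if "=" in item
        )
        name = fields.get("Name", "")
        if not any(name.startswith(w) for w in wanted):
            continue
        if seqid not in contigs:
            continue
        sub = contigs[seqid][start - 1:end]
        if strand == "-":
            # Report minus-strand hits in their own 5'->3' orientation.
            sub = sub.translate(_COMPLEMENT)[::-1]
        records.append((f"{seqid}_{name}_{start}-{end}", sub))
    return records
```

Other rRNA features on the same contig (for example 5.8S) are skipped here because only the 18S and 28S profiles are of interest in this workflow.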
5b is a straightforward script copying over the cleaned, clustered/non-redundant primerless COIs from the primerless directory to a COI directory for clarity
Primer schematics
rRNA amplification
COI amplification
The code used to process COI/28S/18S amplicon data from the 2025 NERC/NHM Lake District Freshwater Meiofauna workshop