2. Sample file
Here we introduce some of the nomenclature used in mapache. If you are familiar with the concepts of sample, library, read group, mapping quality and sequencing platform, you may go straight to the description of the sample file for mapache.
In a BAM file, the sample corresponds to the SM tag found in the header.
As the goal of mapache is to map and filter sequencing reads, the final product includes a BAM file per sample.
Although from a technical point of view a DNA sample can have many definitions, in the documentation of mapache, a sample usually refers to a single individual.
Nevertheless, users are free to define their own samples according to their research questions.
Examples of different definitions of samples
samples: ind1, ind2, ind3
Here, you sequenced the genomes of three individuals, and you want a BAM for each of them. Each sample would correspond to the name or ID of the individual.
samples: tooth1, tooth2
Imagine that two teeth 🦷🦷 were excavated from the same archaeological site, very close to each other. They may or may not belong to the same individual. In this case, you might want to have two different BAM files (tooth1.bam and tooth2.bam) for downstream analyses, each representing one tooth.
samples: Summer2012, Winter2021
Let's say that you took a water sample from a lake in Summer 2012, and then again in Winter 2021. Now, you are wondering whether one or more specific microbes that were present in 2012 are still there in 2021. In this case, you would like to get a BAM file for each time point, probably called Summer2012.bam and Winter2021.bam.
In a BAM file, the library corresponds to the LB tag found in the header.
The best way to know about the library building process and how many libraries were built is to ask your lab manager.
Once the biological sample is taken and the DNA is extracted and purified, it is time to build sequencing libraries. Usually one library is built and sequenced per sample. However, in many ancient DNA labs, it is common practice to build more than one library for different reasons (a protocol was updated, the researcher needed to sequence more DNA, the quality of the initial library was not good enough, etc.).
When is this information relevant/critical?
This depends on the project type and research questions.
If you are interested in assessing the quality of different libraries, then it is important to know which FASTQ files correspond to which libraries.
For example, in libraries built from ancient samples, one might need to have a closer look at the yield, duplication rate, and adapter content per library. More importantly, mapache can identify, mark or remove duplicates (via picardtools) separately for each library specified in the sample file.
On the other hand, while working with fresh DNA samples (like saliva), as the quality of this material differs from that of degraded samples, some researchers might be willing to accept a few duplicated reads in their BAM files, considering that identifying duplicates is a time-consuming step.
Finally, we describe the ID label.
In a BAM file from mapache, the ID corresponds to the RG tag found in the header, which specifies read groups.
Once a library has been built, it can be sequenced one or more times. Sometimes, even if it was sequenced only once, you might receive multiple FASTQ files for a single sequencing run.
In mapache, the ID refers to a user-defined identifier that is used to track a single FASTQ file (or a pair of files, for paired-end data).
Examples
Assume that DNA was extracted from a museum specimen, labelled as museum_139. Two libraries were prepared from this sample (lib1 and lib2), and they were sequenced on Illumina platforms. The library lib1 is a single-end library, and lib2 is a paired-end library. Each library was sequenced twice, and the sequencing center delivered the following files:
museum_139_lib1_S1_L001_R1.fastq.gz
museum_139_lib1_S1_L002_R1.fastq.gz
museum_139_lib2_S1_L001_R1.fastq.gz, museum_139_lib2_S1_L001_R2.fastq.gz
museum_139_lib2_S1_L002_R1.fastq.gz, museum_139_lib2_S1_L002_R2.fastq.gz
The idea of assigning an ID to the (pairs of) FASTQ files in mapache is to easily keep track of them during their processing. Thus, we recommend setting meaningful IDs for the files.
In the example above, the user could define different types of IDs, for example by labelling the files by sequencing round:
SM LB ID Data1 Data2
museum_139 lib1 round1 museum_139_lib1_S1_L001_R1.fastq.gz NULL
museum_139 lib1 round2 museum_139_lib1_S1_L002_R1.fastq.gz NULL
museum_139 lib2 round1 museum_139_lib2_S1_L001_R1.fastq.gz museum_139_lib2_S1_L001_R2.fastq.gz
museum_139 lib2 round2 museum_139_lib2_S1_L002_R1.fastq.gz museum_139_lib2_S1_L002_R2.fastq.gz
They could also be labelled with a simple suffix:
SM LB ID Data1 Data2
museum_139 lib1 lib1_1 museum_139_lib1_S1_L001_R1.fastq.gz NULL
museum_139 lib1 lib1_2 museum_139_lib1_S1_L002_R1.fastq.gz NULL
museum_139 lib2 lib2_1 museum_139_lib2_S1_L001_R1.fastq.gz museum_139_lib2_S1_L001_R2.fastq.gz
museum_139 lib2 lib2_2 museum_139_lib2_S1_L002_R1.fastq.gz museum_139_lib2_S1_L002_R2.fastq.gz
In this sense, the ID can take many values as long as they are meaningful to the user. The only condition is that the IDs must be unique within a specific library of a sample. In the example above, it would not be allowed to set lib1_1 and lib1_1 for the two files belonging to lib1.
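The first table above can also be generated from the delivered file names. The following is a minimal sketch, not part of mapache: it assumes the file names follow the pattern `<sample>_<library>_S<n>_L<lane>_R<read>.fastq.gz` seen in this example, and the names `PATTERN` and `build_rows` are hypothetical helpers introduced for illustration.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_<library>_S<n>_L<lane>_R<read>.fastq.gz
PATTERN = re.compile(
    r"^(?P<sm>.+)_(?P<lb>[^_]+)_S\d+_L(?P<lane>\d+)_R(?P<read>[12])\.fastq\.gz$"
)


def build_rows(fastq_names):
    """Return (SM, LB, ID, Data1, Data2) rows, pairing R1/R2 files by lane."""
    groups = defaultdict(dict)
    for name in fastq_names:
        match = PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        key = (match["sm"], match["lb"], match["lane"])
        groups[key][match["read"]] = name

    rows = []
    for (sm, lb, lane), reads in sorted(groups.items()):
        # Label each file (or pair) by sequencing round, e.g. L001 -> round1.
        round_id = f"round{int(lane)}"
        # Single-end data has no R2 file, so Data2 is set to NULL.
        rows.append((sm, lb, round_id, reads["1"], reads.get("2", "NULL")))
    return rows


files = [
    "museum_139_lib1_S1_L001_R1.fastq.gz",
    "museum_139_lib1_S1_L002_R1.fastq.gz",
    "museum_139_lib2_S1_L001_R1.fastq.gz",
    "museum_139_lib2_S1_L001_R2.fastq.gz",
    "museum_139_lib2_S1_L002_R1.fastq.gz",
    "museum_139_lib2_S1_L002_R2.fastq.gz",
]
rows = build_rows(files)
print("SM LB ID Data1 Data2")
for row in rows:
    print(" ".join(row))
```

Because the ID is derived from the lane within each (sample, library) pair, the resulting IDs are unique within each library, as required.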
The sample file is the most important specification as it lists all fastq files to map and their aggregation into libraries and samples.
In addition, it states the minimum mapping quality to retain reads, and it specifies the sequencing platform from which the reads were obtained.
The name of this file has to be specified in the config file.
The sample file is a plain text file that contains 6 or 7 columns (for single- and paired-end data, respectively). The columns have to be separated by spaces or tabs.
SM LB ID Data
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz
ind1 b_L2 b_L2_R1_001 reads/b_L2_R1_001.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz
For a mix of paired-end and single-end libraries, you should use the paired-end format and indicate NULL in the column corresponding to the second fastq file (Data2).
SM LB ID Data1 Data2
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz reads/a_L2_R2_001.fastq.gz
ind1 a_L2 a_L2_R1_002 reads/a_L2_R1_002.fastq.gz reads/a_L2_R2_002.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz NULL
In the first example, four fastq files will be mapped. They were generated from two different libraries (here, labelled as a_L2 and b_L2) from a single sample (ind1). The reads will be mapped and retained if the mapping quality is above 30 (MAPQ column).
In the second example, there is still only one sample (ind1), and two libraries, sequenced in paired-end (a_L2) and single-end (b_L2) mode.
The columns SM, LB, ID and PL will be used to annotate the header of the BAM files produced (SM, LB, RG and PL tags, respectively).
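To illustrate how these columns relate to the BAM header, here is a rough sketch that builds and parses a read-group (`@RG`) header line in the layout defined by the SAM format. The functions `read_group_line` and `parse_read_group` are hypothetical, and the platform value `ILLUMINA` is only an example; the exact header lines written by mapache may differ.

```python
def read_group_line(sm, lb, id_, pl=None):
    """Build a tab-separated @RG header line from sample-file values."""
    fields = ["@RG", f"ID:{id_}", f"SM:{sm}", f"LB:{lb}"]
    if pl is not None:
        fields.append(f"PL:{pl}")
    return "\t".join(fields)


def parse_read_group(line):
    """Split an @RG header line into a {tag: value} dictionary."""
    record_type, *fields = line.rstrip("\n").split("\t")
    if record_type != "@RG":
        raise ValueError("not a read-group line")
    return dict(field.split(":", 1) for field in fields)


# Values taken from the first example; the platform is an assumption.
line = read_group_line("ind1", "a_L2", "a_L2_R1_001", pl="ILLUMINA")
tags = parse_read_group(line)
print(tags["SM"], tags["LB"], tags["ID"], tags["PL"])
```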
The columns of the sample file are:
- SM: Sample name. Libraries are merged according to this name.
- LB: Library name. Fastq files are merged according to this name.
- ID: An ID for the fastq file or pair of files (examples: id1, fq_1, ind1_lib1_fq2, etc.)
- Data (single-end format): Path to the fastq file. The file may be gzipped or not. Path may be absolute or relative to the working directory.
- Data1 (paired-end format): Path to the forward fastq file (R1) for paired-end data or the fastq file for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
- Data2 (paired-end format): Path to the reverse fastq file (R2) for paired-end data or NULL for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
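The way these columns nest (fastq files within libraries, libraries within samples) can be sketched by reading a sample file and grouping its rows. This is an illustration only, not mapache's own parser; `parse_sample_file` and `group_by_sample_and_library` are hypothetical names, and the sketch assumes that paths contain no whitespace.

```python
import io
from collections import defaultdict


def parse_sample_file(handle):
    """Parse a whitespace-separated sample table into a list of dicts."""
    header = None
    rows = []
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and commented lines
        fields = line.split()
        if header is None:
            header = fields  # the first data line names the columns
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def group_by_sample_and_library(rows):
    """Nest rows as sample -> library -> list of (ID, Data1, Data2)."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in rows:
        data1 = row.get("Data1", row.get("Data"))  # single-end format uses Data
        data2 = row.get("Data2", "NULL")
        tree[row["SM"]][row["LB"]].append((row["ID"], data1, data2))
    return tree


text = """\
SM LB ID Data1 Data2
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz reads/a_L2_R2_001.fastq.gz
ind1 b_L2 b_L2_R1_002 reads/b_L2_R1_002.fastq.gz NULL
"""
tree = group_by_sample_and_library(parse_sample_file(io.StringIO(text)))
for sample, libraries in tree.items():
    for library, entries in libraries.items():
        print(sample, library, entries)
```

Because rows are keyed by column name rather than position, the sketch does not depend on the order of the columns.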
Please note
- The order of the columns is free, but the column names are specific.
- ID names have to be unique within the same library (LB).
- Names in ID, LB and SM may be anything, but may not contain points ('.').
- Commented lines (#) are ignored.
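These naming rules can be checked before launching a run. The sketch below is an assumption about how such a check might look, not mapache's own validation; `check_rows` is a hypothetical helper that takes rows already parsed into dictionaries keyed by column name.

```python
from collections import Counter


def check_rows(rows):
    """Return a list of human-readable problems found in sample-file rows."""
    problems = []

    # Rule: names in SM, LB and ID may not contain points.
    for row in rows:
        for col in ("SM", "LB", "ID"):
            if "." in row[col]:
                problems.append(f"{col} value {row[col]!r} contains a point")

    # Rule: IDs must be unique within a library of a sample.
    counts = Counter((row["SM"], row["LB"], row["ID"]) for row in rows)
    for (sm, lb, id_), n in counts.items():
        if n > 1:
            problems.append(
                f"ID {id_!r} appears {n} times in library {lb!r} of sample {sm!r}"
            )
    return problems


rows = [
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},
    {"SM": "museum_139", "LB": "lib1", "ID": "lib1_1"},  # duplicated ID
    {"SM": "museum_139", "LB": "lib2", "ID": "lib2.1"},  # contains a point
]
problems = check_rows(rows)
for problem in problems:
    print(problem)
```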
Mapache supports fastq files defined as an ftp download link (e.g., from ENA). The files are downloaded automatically and kept, even if temporary files are set to be removed. If an md5sum is specified in an additional column (MD5 for single-end reads, MD5_1/MD5_2 for paired-end reads), the downloads are tested for completeness:
SM LB ID Data MD5
ind1 a_L2 a_L2_R1_001 reads/a_L2_R1_001.fastq.gz
ind2 ftp_lib ftp_id ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR106/095/ERR10675895/ERR10675895.fastq.gz 06a3243190c072ea4dce55b8fecb7e8
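For readers unfamiliar with md5 checks, the following sketch shows how a downloaded file can be compared against an expected checksum using Python's standard library. It only illustrates the idea and is not necessarily how mapache performs the test; `md5_of_file` and `download_is_complete` are hypothetical names, and a small temporary file stands in for a downloaded fastq file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_complete(path, expected_md5):
    """Return True if the file's md5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Stand-in for a downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"@read1\nACGT\n+\nIIII\n")
path = tmp.name

expected = hashlib.md5(b"@read1\nACGT\n+\nIIII\n").hexdigest()
ok = download_is_complete(path, expected)
bad = download_is_complete(path, "0" * 32)
os.remove(path)
print(ok, bad)
```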
You need to edit your config file (config/config.yml) and indicate the path to your samples file.
Assuming you saved this file as my_samples.txt, the original config file has to be modified from this:
sample_file: config/samples.tsv
to this:
sample_file: my_samples.txt
Welcome to Mapache's wiki! Got a question? Found a bug? Please open an issue.