mapseq-processing/NOTES.txt at main · ZadorLaboratory/mapseq-processing · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Developer Notes for MAPseq pipeline
============================================

Key principles:

Code is modularized to allow usage from several layers: as library, interactively
from ipython/Jupyter, and via command-line scripts.

Configuration is handled via ConfigParser objects, with built-in package defaults,
explicit values in function calls, and overrides from CLI scripts.

Efforts made to parameterize wherever possible, to allow modification of things like
nucleotide field lengths/location.

Pandas and/or Dask using vectorized functions where possible.

Abstracted out calls to bowtie/bowtie2 to enable swap out of tools.

As experiment size has increased (1B+ reads coming) this has forced efforts to maximize memory and space
efficiency. One example is switch to using pyarrow strings vs. native Python strings, and the
usage of parquet as a binary storage format (since reading/writing data and assigning types to data
read in from TSV becomes cumbersome. Reading files in chunks and doing piecewise processing
is another example. More generally, Dask allows trading time and temporary storage space for memory.


Stage-specific Notes
==========================================

process_fastq_pairs

We use Pandas read_csv with a lambda to pull out every fourth line of FASTQ. This is chunked to minimize
memory.

aggregate_reads


filter_split