Snakemake workflow for processing MPRA barcode libraries, aligning sequencing reads, generating barcode–variant assignments, and building final reporter element count matrices.
This pipeline requires:
- cutadapt
- bwa
- subread (featureCounts)
- gzip
- sed
- python3
- R >= 4.0
python3 is used in Module/A1.barcode_mapping.py.
R packages required in Module/A2.Barcode_Digest.R and Module/A5.Cleansing_Count_Matrix.R:
- data.table
- reshape2
- Biostrings
- optparse
- ggplot2 (optional)
IGVF_CDMPRA_BCmapping/
├── Snakefile
├── config.yaml
├── Module/
│   ├── A1.barcode_mapping.py
│   ├── A2.Barcode_Digest.R
│   └── A5.Cleansing_Count_Matrix.R
├── logs/
└── (generated output files)
output_dir: "/path/to/output"
bcmapping_fastq: "/path/to/bcmap_input.fastq.gz"
libfile: "/path/to/library_file.txt"
fastq_dir: "/path/to/trim_align_fastqs"
samples:
  - sample1
  - sample2
  - ...
adapterseq: "ACTAGTACACTCCCC"
| Field | Description |
|---|---|
| output_dir | Directory to place all generated outputs |
| bcmapping_fastq | FASTQ file used for barcode mapping |
| libfile | Library design table used for variant-barcode mapping |
| fastq_dir | Directory containing FASTQ files for BWA alignment |
| samples | List of sample prefixes (without .fastq.gz) |
| adapterseq | 3’ adapter sequence for trimming with cutadapt |
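Because the sample prefixes must correspond to files in fastq_dir, it can help to check the configuration before launching the workflow. The following is a minimal sketch, not part of the pipeline: the helper names `expected_fastqs` and `missing_fastqs` are hypothetical, and it assumes each input file is named `<sample>.fastq.gz` directly inside fastq_dir.

```python
from pathlib import Path


def expected_fastqs(config):
    """Map each sample prefix to the FASTQ path it implies.

    Assumption: files are named <sample>.fastq.gz inside fastq_dir.
    """
    fastq_dir = Path(config["fastq_dir"])
    return {s: fastq_dir / f"{s}.fastq.gz" for s in config["samples"]}


def missing_fastqs(config):
    """Return the sample prefixes whose FASTQ file does not exist on disk."""
    return [s for s, p in expected_fastqs(config).items() if not p.is_file()]


if __name__ == "__main__":
    # Hypothetical values mirroring the config.yaml example.
    config = {
        "fastq_dir": "/path/to/trim_align_fastqs",
        "samples": ["sample1", "sample2"],
    }
    print("Missing FASTQ files for:", missing_fastqs(config))
```

Running a check like this before `snakemake -n -p` can catch a mismatch between the samples list and the files on disk early.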
Dry-run (check workflow): snakemake -n -p
Run full workflow: snakemake -j 20 --rerun-incomplete
Reads the barcode FASTQ, extracts the sequence line from each four-line record, and maps barcodes to variants.
Input:
- bcmapping_fastq
- libfile
Output:
- bcmap.txt
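A FASTQ record spans four lines, with the read sequence on the second line of each record. The sketch below illustrates that extraction only; it is not the code in A1.barcode_mapping.py, the function name `read_sequences` is hypothetical, and the barcode-to-variant mapping logic is left out because it is not described here.

```python
import gzip
from itertools import islice


def read_sequences(fastq_path):
    """Yield the sequence line of every record in a gzipped FASTQ file."""
    with gzip.open(fastq_path, "rt") as handle:
        # Start at the second line (index 1) and take every fourth line after it.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


if __name__ == "__main__":
    # Placeholder path taken from the config.yaml example.
    for seq in read_sequences("/path/to/bcmap_input.fastq.gz"):
        print(seq)
```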
Processes bcmap.txt to generate the files listed below.
Output:
- Reporter_assignment.tsv
- barcode.fasta
- barcode.saf
- BC_statistics.pdf
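To make the relationship between barcode.fasta and barcode.saf concrete, here is a rough sketch of how such a pair of files could be laid out. It is an illustration under assumptions, not the logic of A2.Barcode_Digest.R: the function name and input dictionary are hypothetical, each barcode is assumed to be its own FASTA record, and each SAF row is assumed to span the full barcode. The SAF header columns (GeneID, Chr, Start, End, Strand) follow the format featureCounts expects.

```python
def fasta_and_saf_lines(barcodes):
    """Build FASTA and SAF lines from a dict of barcode ID -> sequence.

    Assumption: every barcode is a separate reference sequence, and its
    SAF feature covers positions 1..len(sequence) on the + strand.
    """
    fasta = []
    saf = ["GeneID\tChr\tStart\tEnd\tStrand"]  # SAF header row
    for bc_id, seq in barcodes.items():
        fasta.append(f">{bc_id}")
        fasta.append(seq)
        saf.append(f"{bc_id}\t{bc_id}\t1\t{len(seq)}\t+")
    return fasta, saf


if __name__ == "__main__":
    # Hypothetical barcodes for illustration only.
    fasta, saf = fasta_and_saf_lines({"bc1": "ACGTACGTACGTACG", "bc2": "TTGACCAGTTGACCA"})
    print("\n".join(fasta))
    print("\n".join(saf))
```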
For each sample:
- trim reads using cutadapt
- align with bwa aln / samse
- generate sample.sam
Output:
- {sample}.sam
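The per-sample steps above can be sketched as three command lines. This is a hedged outline rather than the rules in the Snakefile: the function name, the intermediate file names (`.trimmed.fastq.gz`, `.sai`), and the choice of reference FASTA are assumptions, and the reference is assumed to have been indexed with `bwa index` beforehand.

```python
def trim_align_commands(sample, fastq_dir, adapterseq, reference, out_dir):
    """Return the cutadapt and bwa commands for one sample as argv lists."""
    fastq = f"{fastq_dir}/{sample}.fastq.gz"
    trimmed = f"{out_dir}/{sample}.trimmed.fastq.gz"  # hypothetical name
    sai = f"{out_dir}/{sample}.sai"                   # hypothetical name
    sam = f"{out_dir}/{sample}.sam"
    return [
        # Trim the 3' adapter.
        ["cutadapt", "-a", adapterseq, "-o", trimmed, fastq],
        # Compute alignment coordinates for the trimmed reads.
        ["bwa", "aln", "-f", sai, reference, trimmed],
        # Convert single-end alignments to SAM.
        ["bwa", "samse", "-f", sam, reference, sai, trimmed],
    ]


if __name__ == "__main__":
    for cmd in trim_align_commands(
        "sample1", "/path/to/trim_align_fastqs", "ACTAGTACACTCCCC",
        "barcode.fasta", "/path/to/output",
    ):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the step outside Snakemake.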
Uses the per-sample SAM files and barcode.saf (via featureCounts) to build the raw count matrix.
Output:
- MPRA_barcode_count.txt
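As a rough illustration of this step, the sketch below builds a featureCounts call that uses a SAF annotation and parses the resulting table. The function names are hypothetical and the exact options used by the Snakefile are not shown here, so treat the command as an assumption. The parser relies on the standard featureCounts output layout: a leading `#` comment line, then a header whose first six columns are Geneid, Chr, Start, End, Strand, and Length, followed by one count column per input file.

```python
def featurecounts_command(saf, sam_files, out_file="MPRA_barcode_count.txt"):
    """Return a featureCounts argv list that counts reads per SAF feature."""
    return ["featureCounts", "-F", "SAF", "-a", saf, "-o", out_file, *sam_files]


def parse_counts(lines):
    """Parse featureCounts output lines into (sample_columns, {feature: counts})."""
    rows = [l.rstrip("\n").split("\t") for l in lines if not l.startswith("#")]
    header, body = rows[0], rows[1:]
    samples = header[6:]  # columns after Length are the per-file counts
    counts = {r[0]: [int(x) for x in r[6:]] for r in body}
    return samples, counts


if __name__ == "__main__":
    print(" ".join(featurecounts_command("barcode.saf", ["sample1.sam", "sample2.sam"])))
```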
Performs:
- removal of barcodes whose counts are zero in every sample
- removal of outlier barcodes
- selection of variants supported by ≥5 barcodes
Output:
- Counts_noOutliers.tsv
- Reporter_Element.tsv
- barcode_representation.pdf
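The filtering logic can be illustrated with a small sketch. It is not the code in A5.Cleansing_Count_Matrix.R: the function name and inputs are hypothetical, outlier removal is omitted because its criterion is not described here, and summing barcode counts per variant is an assumed way of collapsing to an element × sample matrix.

```python
from collections import defaultdict


def element_matrix(counts, assignment, min_barcodes=5):
    """Collapse barcode counts to per-variant counts.

    counts:     dict of barcode -> list of per-sample counts
    assignment: dict of barcode -> variant (reporter element)
    """
    # Drop barcodes whose counts are zero in every sample.
    kept = {bc: c for bc, c in counts.items() if any(c)}

    # Group the remaining barcodes by their assigned variant.
    by_variant = defaultdict(list)
    for bc, c in kept.items():
        if bc in assignment:
            by_variant[assignment[bc]].append(c)

    # Keep variants with enough barcodes and sum counts per sample (assumption).
    return {
        variant: [sum(col) for col in zip(*rows)]
        for variant, rows in by_variant.items()
        if len(rows) >= min_barcodes
    }


if __name__ == "__main__":
    # Toy data: variant A has five informative barcodes, variant B only two.
    counts = {f"a{i}": [1, 2] for i in range(5)}
    counts.update({"a_zero": [0, 0], "b0": [4, 4], "b1": [1, 1]})
    assignment = {bc: ("A" if bc.startswith("a") else "B") for bc in counts}
    print(element_matrix(counts, assignment))
```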
| File | Description |
|---|---|
| Counts_noOutliers.tsv | Barcode counts with outlier removal |
| Reporter_Element.tsv | Final element × sample count matrix |
| barcode_representation.pdf | QC histogram of barcode-per-variant distribution |
| barcode.fasta | FASTA file of barcodes |
| barcode.saf | SAF annotation for featureCounts |
- Pipeline is modular: each step can be rerun independently.
- All log files are stored in logs/.
- FASTQ filenames must match the samples: list in config.yaml.
Hyunggyu Min
UNC Chapel Hill — Hyejung Won Lab