IGVF CDMPRA Barcode Mapping Pipeline

Snakemake workflow for processing MPRA barcode libraries, aligning sequencing reads, generating barcode–variant assignments, and building final reporter element count matrices.

Dependencies

This pipeline requires:

HPC modules

cutadapt
bwa
subread (featureCounts)

System tools

gzip
sed
python3
R >= 4.0

Python packages

Used in Module/A1.barcode_mapping.py

R packages

Required in A2.Barcode_Digest.R and A5.Cleansing_Count_Matrix.R:

data.table
reshape2
Biostrings
optparse
ggplot2 (optional)

Directory Structure

IGVF_CDMPRA_BCmapping/
├── Snakefile
├── config.yaml
├── Module/
│   ├── A1.barcode_mapping.py
│   ├── A2.Barcode_Digest.R
│   └── A5.Cleansing_Count_Matrix.R
├── logs/
└── (generated output files)

Configuration (config.yaml)

output_dir: "/path/to/output"
bcmapping_fastq: "/path/to/bcmap_input.fastq.gz"
libfile: "/path/to/library_file.txt"

fastq_dir: "/path/to/trim_align_fastqs"
samples:
  - sample1
  - sample2
  ...
adapterseq: "ACTAGTACACTCCCC"

Required fields

Field	Description
output_dir	Directory to place all generated outputs
bcmapping_fastq	FASTQ file used for barcode mapping
libfile	Library design table used for variant-barcode mapping
fastq_dir	Directory containing FASTQ files for BWA alignment
samples	List of sample prefixes (without `.fastq.gz`)
adapterseq	3’ adapter sequence for trimming with cutadapt
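
The required fields above can be checked before Snakemake is launched. The following is a minimal sketch, not part of the pipeline: the function name `missing_config_fields` is hypothetical, and it assumes the config has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`).

```python
# Field names are taken from the "Required fields" table.
REQUIRED_FIELDS = (
    "output_dir",
    "bcmapping_fastq",
    "libfile",
    "fastq_dir",
    "samples",
    "adapterseq",
)


def missing_config_fields(config):
    """Return the required field names that are absent or empty in `config`."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


# Example config dict with `adapterseq` left out on purpose.
example = {
    "output_dir": "/path/to/output",
    "bcmapping_fastq": "/path/to/bcmap_input.fastq.gz",
    "libfile": "/path/to/library_file.txt",
    "fastq_dir": "/path/to/trim_align_fastqs",
    "samples": ["sample1", "sample2"],
}
print(missing_config_fields(example))
```

An empty list means every required field is present and non-empty.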

Running the Pipeline

Dry-run (check workflow):

snakemake -n -p

Run full workflow:

snakemake -j 20 --rerun-incomplete

Workflow Overview

A1 — Barcode Mapping

Reads the barcode FASTQ, extracts the sequence line from each four-line record, and maps barcodes to variants.

Input:

bcmapping_fastq
libfile

Output:

bcmap.txt
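
The read-extraction part of this step can be sketched as follows. This is an illustration rather than the code in A1.barcode_mapping.py: the function name `fastq_sequences` is hypothetical, and the sketch assumes a gzipped FASTQ with strict four-line records. How the barcode and variant are located within each read is not described here, so that part is left out.

```python
import gzip
from itertools import islice


def fastq_sequences(path):
    """Yield the sequence line (line 2 of each 4-line record) of a gzipped FASTQ."""
    with gzip.open(path, "rt") as handle:
        # Start at the second line and step by four, which skips the
        # header, '+' separator, and quality lines of every record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()
```

Because the function is a generator, a large FASTQ is streamed one record at a time instead of being loaded into memory.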

A2 — Barcode Digest

Processes bcmap.txt to generate the reporter assignment table, the barcode FASTA and SAF files, and a barcode statistics report.

Input:

bcmap.txt

Output:

Reporter_assignment.tsv
barcode.fasta
barcode.saf
BC_statistics.pdf
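
To show how a FASTA file and a matching SAF annotation relate to each other, here is a small sketch. It is an assumption-laden illustration, not the logic of A2.Barcode_Digest.R: the function name `barcode_reference_text` is hypothetical, and it assumes that each barcode is its own reference sequence and that each SAF feature spans the full barcode. The SAF header (GeneID, Chr, Start, End, Strand) is the tab-delimited format that featureCounts reads.

```python
def barcode_reference_text(barcodes):
    """Return (fasta_text, saf_text) for a dict of barcode ID -> sequence."""
    fasta_lines = []
    saf_lines = ["GeneID\tChr\tStart\tEnd\tStrand"]
    for bc_id, seq in barcodes.items():
        fasta_lines.append(f">{bc_id}")
        fasta_lines.append(seq)
        # Assumption: one feature per barcode, covering the whole
        # sequence with 1-based inclusive coordinates on the + strand.
        saf_lines.append(f"{bc_id}\t{bc_id}\t1\t{len(seq)}\t+")
    return "\n".join(fasta_lines) + "\n", "\n".join(saf_lines) + "\n"


fasta_text, saf_text = barcode_reference_text({"bc1": "ACGTACGT"})
```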

A3 — Build Count Matrix (BWA alignment)

For each sample:

trim reads with cutadapt
align with bwa aln and bwa samse
write {sample}.sam

Output:

{sample}.sam
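
The three per-sample steps can be sketched as argument lists suitable for `subprocess.run`. This is a hedged illustration, not the Snakefile's rules: the function name `alignment_commands` and the intermediate file names (`.trimmed.fastq.gz`, `.sai`) are hypothetical, and the sketch assumes the reference passed in has already been indexed with `bwa index`.

```python
def alignment_commands(sample, fastq_dir, output_dir, adapterseq, reference):
    """Build the cutadapt and bwa command lines for one sample."""
    fastq = f"{fastq_dir}/{sample}.fastq.gz"
    trimmed = f"{output_dir}/{sample}.trimmed.fastq.gz"  # hypothetical name
    sai = f"{output_dir}/{sample}.sai"                    # hypothetical name
    sam = f"{output_dir}/{sample}.sam"
    return [
        # -a gives the 3' adapter to trim, -o the trimmed output file.
        ["cutadapt", "-a", adapterseq, "-o", trimmed, fastq],
        # bwa aln writes suffix-array coordinates to the file given by -f.
        ["bwa", "aln", "-f", sai, reference, trimmed],
        # bwa samse converts them to single-end SAM alignments.
        ["bwa", "samse", "-f", sam, reference, sai, trimmed],
    ]


commands = alignment_commands(
    "sample1", "/path/to/trim_align_fastqs", "/path/to/output",
    "ACTAGTACACTCCCC", "/path/to/output/barcode.fasta",
)
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` in order. Using `barcode.fasta` as the reference in the example call is an assumption.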

A4 — featureCounts

Uses the per-sample SAM files and barcode.saf to build the raw count matrix.

Output:

MPRA_barcode_count.txt
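
As a sketch of this step, the snippet below builds a featureCounts call for a SAF annotation and parses the resulting table. The function names are hypothetical and the exact options used in the Snakefile are not shown here. The parser relies on the standard featureCounts output layout: a leading comment line starting with `#`, then the columns Geneid, Chr, Start, End, Strand, Length, followed by one count column per input file.

```python
import csv
import io


def featurecounts_command(saf, sam_files, output):
    """Build a featureCounts call that counts reads per SAF feature."""
    return ["featureCounts", "-F", "SAF", "-a", saf, "-o", output, *sam_files]


def read_count_table(handle):
    """Parse featureCounts output into (sample_columns, {feature_id: [counts]})."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[6:]  # the first six columns describe the feature
    counts = {row[0]: [int(x) for x in row[6:]] for row in rows}
    return samples, counts


# Tiny in-memory example of a featureCounts table.
demo = io.StringIO(
    "# Program:featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.sam\tsample2.sam\n"
    "bc1\tbc1\t1\t8\t+\t8\t3\t0\n"
)
samples, counts = read_count_table(demo)
```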

A5 — Cleansing Count Matrix

Performs:

removal of barcodes with zero counts in every sample
removal of outlier barcodes
selection of variants supported by ≥5 barcodes

Output:

Counts_noOutliers.tsv
Reporter_Element.tsv
barcode_representation.pdf
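
The zero-count filter and the ≥5-barcode threshold can be sketched as below. This is an illustration, not the logic of A5.Cleansing_Count_Matrix.R: the function name `element_counts` is hypothetical, outlier removal is omitted because its criterion is not described here, and summing barcode counts per variant is an assumption about how the element × sample matrix is built.

```python
from collections import defaultdict

MIN_BARCODES = 5  # threshold stated in this README


def element_counts(barcode_counts, barcode_to_variant, min_barcodes=MIN_BARCODES):
    """Sum barcode counts per variant, keeping well-supported variants only.

    barcode_counts maps barcode -> list of counts (one per sample);
    barcode_to_variant maps barcode -> variant name.
    """
    per_variant = defaultdict(list)
    for barcode, counts in barcode_counts.items():
        if not any(counts):
            continue  # drop barcodes with zero counts in every sample
        per_variant[barcode_to_variant[barcode]].append(counts)
    return {
        # Assumption: element counts are column-wise sums over barcodes.
        variant: [sum(column) for column in zip(*rows)]
        for variant, rows in per_variant.items()
        if len(rows) >= min_barcodes
    }
```

In this sketch a variant is kept only if at least `min_barcodes` of its barcodes survive the zero-count filter.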

Final Outputs

File	Description
Counts_noOutliers.tsv	Barcode counts after outlier removal
Reporter_Element.tsv	Final element × sample count matrix
barcode_representation.pdf	QC histogram of barcode-per-variant distribution
barcode.fasta	FASTA file of barcodes
barcode.saf	SAF annotation for featureCounts

Notes

The pipeline is modular: each step can be rerun independently.
All log files are stored in logs/.
FASTQ filenames in fastq_dir must match the entries of the samples list in config.yaml, i.e. each sample needs a file named {sample}.fastq.gz.
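
The filename requirement can be verified up front with a short check. This is a sketch only: the function name `missing_fastqs` is hypothetical, and it assumes each sample's file is named `<sample>.fastq.gz` directly inside `fastq_dir`, as the `samples` field description suggests.

```python
from pathlib import Path


def missing_fastqs(fastq_dir, samples):
    """Return the samples whose <sample>.fastq.gz file is absent from fastq_dir."""
    return [
        sample
        for sample in samples
        if not (Path(fastq_dir) / f"{sample}.fastq.gz").is_file()
    ]
```

An empty result means every sample listed in config.yaml has a matching FASTQ file.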

Maintainer

Hyunggyu Min
UNC Chapel Hill — Hyejung Won Lab

thewonlab/cross_disorder_MPRA_count_matrix_pipeline
