thewonlab/cross_disorder_MPRA_count_matrix_pipeline

IGVF CDMPRA Barcode Mapping Pipeline

Snakemake workflow for processing MPRA barcode libraries, aligning sequencing reads, generating barcode–variant assignments, and building final reporter element count matrices.


Dependencies

This pipeline requires:

HPC modules

  • cutadapt
  • bwa
  • subread (featureCounts)

System tools

  • gzip
  • sed
  • python3
  • R >= 4.0

Python packages

Used in Module/A1.barcode_mapping.py

R packages

Required in A2.Barcode_Digest.R and A5.Cleansing_Count_Matrix.R:

  • data.table
  • reshape2
  • Biostrings
  • optparse
  • ggplot2 (optional)

Directory Structure

IGVF_CDMPRA_BCmapping/
├── Snakefile
├── config.yaml
├── Module/
│   ├── A1.barcode_mapping.py
│   ├── A2.Barcode_Digest.R
│   └── A5.Cleansing_Count_Matrix.R
├── logs/
└── (generated output files)


Configuration (config.yaml)

output_dir: "/path/to/output"
bcmapping_fastq: "/path/to/bcmap_input.fastq.gz"
libfile: "/path/to/library_file.txt"

fastq_dir: "/path/to/trim_align_fastqs"
samples:
  - sample1
  - sample2
  - ...
adapterseq: "ACTAGTACACTCCCC"

Required fields

| Field | Description |
|---|---|
| output_dir | Directory to place all generated outputs |
| bcmapping_fastq | FASTQ file used for barcode mapping |
| libfile | Library design table used for variant-barcode mapping |
| fastq_dir | Directory containing FASTQ files for BWA alignment |
| samples | List of sample prefixes (without .fastq.gz) |
| adapterseq | 3’ adapter sequence for trimming with cutadapt |
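The required fields can be checked before launching Snakemake. The following is a minimal sketch, not part of the pipeline: it assumes the config has already been loaded into a dict by a YAML parser and that each sample's FASTQ is named <sample>.fastq.gz inside fastq_dir. The helper names check_config and sample_fastq_paths are hypothetical.

```python
import os

# Field names taken from the "Required fields" table.
REQUIRED_FIELDS = (
    "output_dir", "bcmapping_fastq", "libfile",
    "fastq_dir", "samples", "adapterseq",
)

def check_config(config):
    """Return a list of problems found in a loaded config dict (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in config]
    samples = config.get("samples")
    if samples is not None and (not isinstance(samples, list) or not samples):
        problems.append("samples must be a non-empty list")
    adapter = config.get("adapterseq", "")
    if adapter and set(adapter.upper()) - set("ACGTN"):
        problems.append("adapterseq contains non-nucleotide characters")
    return problems

def sample_fastq_paths(config):
    """Map each sample prefix to its expected FASTQ path (assumed naming)."""
    return {s: os.path.join(config["fastq_dir"], s + ".fastq.gz")
            for s in config["samples"]}
```

Checking the paths returned by sample_fastq_paths with os.path.exists before a run would catch a mismatch between the samples list and the FASTQ filenames early.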

Running the Pipeline

Dry-run (check workflow):

    snakemake -n -p

Run full workflow:

    snakemake -j 20 --rerun-incomplete


Workflow Overview

A1 — Barcode Mapping

Reads the barcode FASTQ, extracts the sequence line of each four-line record, and maps barcodes to variants.

Input:

  • bcmapping_fastq
  • libfile

Output:

  • bcmap.txt
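The extraction step relies on the FASTQ layout: each record spans four lines and the sequence is the second. Below is a minimal sketch of that step alone, assuming a gzip-compressed FASTQ. iter_fastq_sequences is a hypothetical helper; how A1.barcode_mapping.py splits each read into barcode and variant is not shown here.

```python
import gzip
from itertools import islice

def iter_fastq_sequences(path):
    """Yield the sequence line (line 2 of every 4-line record) of a gzipped FASTQ."""
    with gzip.open(path, "rt") as handle:
        # Start at line index 1 and step by 4: header, SEQUENCE, '+', quality.
        for line in islice(handle, 1, None, 4):
            yield line.rstrip("\n")
```

Because the function is a generator, the FASTQ is streamed rather than loaded into memory at once.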

A2 — Barcode Digest

Processes bcmap.txt to generate the barcode–variant assignments, the barcode reference files, and a QC report.

Input:

  • bcmap.txt

Output:

  • Reporter_assignment.tsv
  • barcode.fasta
  • barcode.saf
  • BC_statistics.pdf
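To illustrate how barcode.fasta and barcode.saf can relate to each other, here is a minimal sketch. It assumes each barcode is written as its own FASTA record and annotated as one full-length feature in SAF, the five-column tab-delimited format (GeneID, Chr, Start, End, Strand) read by featureCounts. This layout is an assumption, not taken from A2.Barcode_Digest.R, and both function names are hypothetical.

```python
def barcode_fasta(barcodes):
    """Render one FASTA record per barcode, named by the barcode sequence itself."""
    return "".join(f">{bc}\n{bc}\n" for bc in barcodes)

def barcode_saf(barcodes):
    """Render a SAF table with one full-length feature per barcode record."""
    rows = ["GeneID\tChr\tStart\tEnd\tStrand"]
    for bc in barcodes:
        # Assumed layout: the feature covers the whole record, positions 1..len.
        rows.append(f"{bc}\t{bc}\t1\t{len(bc)}\t+")
    return "\n".join(rows) + "\n"
```

With this layout, a read that aligns to a barcode record is counted toward that barcode's feature.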

A3 — Build Count Matrix (BWA alignment)

For each sample:

  • trim reads with cutadapt, using adapterseq as the 3’ adapter
  • align the trimmed reads with bwa aln / bwa samse
  • write {sample}.sam

Output:

  • {sample}.sam
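The per-sample steps above can be sketched as a list of shell commands. This is an illustration under assumptions, not the Snakefile's actual rules: it assumes the alignment reference is barcode.fasta from A2 (already indexed once with bwa index), and the intermediate file names and the function alignment_commands are hypothetical.

```python
def alignment_commands(sample, fastq_dir, adapter, reference="barcode.fasta"):
    """Return the shell commands for trimming and aligning one sample."""
    raw = f"{fastq_dir}/{sample}.fastq.gz"
    trimmed = f"{sample}.trimmed.fastq.gz"  # hypothetical intermediate name
    sai = f"{sample}.sai"                   # hypothetical intermediate name
    return [
        # -a gives the 3' adapter to remove; -o names the trimmed output.
        f"cutadapt -a {adapter} -o {trimmed} {raw}",
        # bwa aln writes suffix-array coordinates for the trimmed reads.
        f"bwa aln {reference} {trimmed} > {sai}",
        # bwa samse converts them into single-end SAM alignments.
        f"bwa samse {reference} {sai} {trimmed} > {sample}.sam",
    ]
```

For example, alignment_commands("sample1", "/path/to/trim_align_fastqs", "ACTAGTACACTCCCC") returns three commands, the last of which writes sample1.sam.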

A4 — featureCounts

Runs featureCounts on the per-sample SAM files, with barcode.saf as the annotation, to build the raw barcode count matrix.

Output:

  • MPRA_barcode_count.txt
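A minimal sketch of how such a featureCounts call could be assembled. The flags used (-F for annotation format, -a for the annotation file, -o for the output, -T for threads) are standard featureCounts options, but the exact options in the Snakefile may differ. The function name and the thread default are assumptions.

```python
def featurecounts_command(samples, saf="barcode.saf",
                          out="MPRA_barcode_count.txt", threads=1):
    """Build the featureCounts argument list for all per-sample SAM files."""
    return [
        "featureCounts",
        "-F", "SAF",          # annotation is in SAF format, not GTF
        "-a", saf,            # barcode annotation from A2
        "-o", out,            # raw count matrix
        "-T", str(threads),   # number of threads
    ] + [f"{s}.sam" for s in samples]
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with sample names.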

A5 — Cleansing Count Matrix

Filters the raw count matrix by:

  • removing barcodes whose counts are zero in every sample
  • removing outlier barcodes
  • keeping only variants supported by ≥5 barcodes

Output:

  • Counts_noOutliers.tsv
  • Reporter_Element.tsv
  • barcode_representation.pdf
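Two of the filters above can be sketched in a few lines. This is an illustrative Python sketch, not the logic of A5.Cleansing_Count_Matrix.R: it assumes counts maps each barcode to its per-sample counts and assignment maps each barcode to its variant. Both function names are hypothetical. The outlier criterion and the aggregation into Reporter_Element.tsv are not specified here, so they are omitted.

```python
from collections import Counter

def drop_all_zero(counts):
    """Keep only barcodes with a non-zero count in at least one sample."""
    return {bc: row for bc, row in counts.items() if any(row)}

def keep_supported_variants(counts, assignment, min_barcodes=5):
    """Keep barcodes whose variant is backed by at least min_barcodes barcodes."""
    # Count the remaining barcodes per variant.
    support = Counter(assignment[bc] for bc in counts)
    return {bc: row for bc, row in counts.items()
            if support[assignment[bc]] >= min_barcodes}
```

Applying drop_all_zero first means that barcodes never observed in any sample do not count toward a variant's support.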

Final Outputs

| File | Description |
|---|---|
| Counts_noOutliers.tsv | Barcode counts with outlier removal |
| Reporter_Element.tsv | Final element × sample count matrix |
| barcode_representation.pdf | QC histogram of barcode-per-variant distribution |
| barcode.fasta | FASTA file of barcodes |
| barcode.saf | SAF annotation for featureCounts |

Notes

  • The pipeline is modular: each step can be rerun independently.
  • All log files are stored in logs/.
  • FASTQ filenames must match the entries of the samples list in config.yaml.

Maintainer

Hyunggyu Min
UNC Chapel Hill — Hyejung Won Lab
