elalgil/BioSequence-Pseudo-Alignment

Bio Sequence / Elal Gilboa

🔧 Extensions:

  • EXTQUALITY – Filters out low-quality reads, low-quality Kmers, and highly redundant Kmers during the mapping process.
  • EXTREVCOMP – Determines the best orientation for each read (forward/reverse) based on the number of unique Kmers.
  • EXTCOVERAGE – Calculates genome coverage based on unique and ambiguous read alignments.
  • EXTSIM – Filters highly similar genomes from the reference Kmer DB and provides filtering statistics.

🚀 Usage

Run from the command line:

python3 main.py -t [chosen_task] [chosen_flags]

✅ Valid Flag Combinations:

Reference Creation

-t reference -g genome.fa -k 21 -r output.kdb

Required:

  • -g: genome file path (FASTA format)
  • -k: kmer size (21–31)
  • -r: output reference DB file path (.kdb)

Optional (for EXTSIM):

  • --filter-similar
  • --similarity-threshold THRESHOLD

Dump Reference

-t dumpref -r ref.kdb

Or:

-t dumpref -g genome.fa -k 21

Align Reads

-t align -r ref.kdb -a output.aln --reads input.fq [options]

Or:

-t align -g genome.fa -k 21 -a output.aln --reads input.fq [options]

Options:

  • -m: unique threshold
  • -p: ambiguous threshold

Dump Alignments

-t dumpalign -a output.aln

In this form, -a cannot be combined with any other flag.

Or:

-t dumpalign -r ref.kdb --reads input.fq [options]

Or:

-t dumpalign -g genome.fa -k 21 --reads input.fq [options]
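The flag combinations above can be checked with a small argparse layer. The following is a minimal sketch, not the project's actual parser: the function names build_parser and validate_args, the dest names, and the rule that -r and -g/-k are mutually exclusive are assumptions made for illustration. Only the flags themselves and the 21–31 range for -k come from the lists above.

```python
import argparse
from typing import List

TASKS = ("reference", "dumpref", "align", "dumpalign")


def build_parser() -> argparse.ArgumentParser:
    """Declare the core flags listed in this README (names after dest= are hypothetical)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-t", dest="task", required=True, choices=TASKS)
    parser.add_argument("-g", dest="genome_file")
    parser.add_argument("-k", dest="kmer_size", type=int)
    parser.add_argument("-r", dest="reference_file")
    parser.add_argument("-a", dest="align_file")
    parser.add_argument("--reads", dest="reads_file")
    return parser


def validate_args(args: argparse.Namespace) -> List[str]:
    """Return a list of human-readable problems; an empty list means the combination is valid."""
    errors: List[str] = []
    has_ref = args.reference_file is not None
    has_genome = args.genome_file is not None and args.kmer_size is not None

    if args.kmer_size is not None and not 21 <= args.kmer_size <= 31:
        errors.append("-k must be between 21 and 31")

    if args.task == "reference":
        if not (has_genome and has_ref):
            errors.append("reference requires -g, -k and -r")
    elif args.task == "dumpref":
        # Assumption: exactly one source of reference data is expected.
        if has_ref == has_genome:
            errors.append("dumpref requires either -r, or -g with -k")
    elif args.task == "align":
        if has_ref == has_genome:
            errors.append("align requires either -r, or -g with -k")
        if args.align_file is None or args.reads_file is None:
            errors.append("align requires -a and --reads")
    elif args.task == "dumpalign":
        if args.align_file is not None:
            if has_ref or args.genome_file is not None or args.reads_file is not None:
                errors.append("dumpalign with -a cannot be combined with other flags")
        else:
            if has_ref == has_genome:
                errors.append("dumpalign requires -a, or -r, or -g with -k")
            if args.reads_file is None:
                errors.append("dumpalign without -a requires --reads")
    return errors


args = build_parser().parse_args(
    ["-t", "reference", "-g", "genome.fa", "-k", "21", "-r", "output.kdb"]
)
problems = validate_args(args)
```

Collecting every problem in a list, rather than exiting at the first one, lets the caller report all invalid flags in a single message.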

🔌 Extension Parameters

  • EXTQUALITY: --min-read-quality MRQ, --min-kmer-quality MKQ, --max-genomes MG
  • EXTREVCOMP: --reverse-complement
  • EXTCOVERAGE: --coverage, --genomes g1,g2,..., --min-coverage MC, --full-coverage
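As an illustration of what --reverse-complement might do, the sketch below picks the orientation of a read by counting how many of its Kmers appear in a set of unique reference Kmers. It is a simplified assumption of the idea, not the project's code: the function names, the use of a plain set for the unique Kmers, and the tie-breaking rule (ties keep the forward orientation) are all hypothetical.

```python
from typing import Set, Tuple

# Translation table for DNA bases; other characters (for example N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_unique_hits(sequence: str, k: int, unique_kmers: Set[str]) -> int:
    """Count the Kmers of `sequence` that are present in `unique_kmers`."""
    return sum(
        1
        for start in range(len(sequence) - k + 1)
        if sequence[start:start + k] in unique_kmers
    )


def best_orientation(read: str, k: int, unique_kmers: Set[str]) -> Tuple[str, str]:
    """Return (orientation, sequence) for the strand with more unique Kmer hits."""
    reverse = reverse_complement(read)
    forward_hits = count_unique_hits(read, k, unique_kmers)
    reverse_hits = count_unique_hits(reverse, k, unique_kmers)
    # Assumption: on a tie the forward orientation is kept.
    if reverse_hits > forward_hits:
        return "reverse", reverse
    return "forward", read


orientation, oriented_read = best_orientation("AACG", 3, {"CGT"})
```

In this toy input the forward read contains no Kmer from the set, while its reverse complement CGTT contains CGT, so the reverse orientation is chosen.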

🧠 Design

Data Structures:

  • The reference Kmer DB is implemented as a dict[str, List[int]], which gives average-case O(1) lookup per Kmer during pseudo-alignment.
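A dictionary of this shape can be built with a single pass over a sequence. The sketch below is only an illustration: it assumes the List[int] values hold the start positions of each Kmer within one genome, which this README does not state, and build_kmer_index is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(sequence: str, k: int) -> Dict[str, List[int]]:
    """Map every Kmer of `sequence` to the list of its start positions (assumed meaning)."""
    index: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


index = build_kmer_index("ACGTACGT", 4)
# A Kmer that occurs twice keeps both positions; a missing Kmer is a cheap miss.
positions = index.get("ACGT", [])
missing = index.get("TTTT", [])
```

Because lookups are hash-based, checking each Kmer of a read costs the same on average regardless of how many Kmers the reference holds.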

Classes:

  • Kmer: holds sequence data, locations, uniqueness, and reference sources.
  • Read: holds read data, quality, header, mapping status, and orientation.
  • Reference: holds genome data, header, Kmers, and ID for similarity calculations.
  • Alignment: manages mapping results, coverage, and quality statistics.
  • KmerDB: contains all reference Kmers and optional similarity statistics.

File Formats:

  • *.aln: pickled Alignment object, compressed using gzip
  • *.kdb: pickled KmerDB and genome lengths dictionary (from FASTA headers)
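Writing and reading a gzip-compressed pickle, as described for *.aln, takes only the standard library. The following is a minimal sketch with hypothetical helper names (save_compressed, load_compressed); a plain dictionary stands in for the Alignment object, whose real layout is not shown in this README.

```python
import gzip
import os
import pickle
import tempfile
from typing import Any


def save_compressed(obj: Any, path: str) -> None:
    """Pickle `obj` and write it through a gzip stream."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str) -> Any:
    """Read a gzip-compressed pickle back into a Python object."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in payload; the real file would hold an Alignment instance.
payload = {"read_1": "unique", "read_2": "ambiguous"}
with tempfile.TemporaryDirectory() as tmp_dir:
    aln_path = os.path.join(tmp_dir, "output.aln")
    save_compressed(payload, aln_path)
    restored = load_compressed(aln_path)
```

Note that unpickling can execute arbitrary code, so *.aln and *.kdb files should only be loaded when they come from a trusted source.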

FASTA/Q Loading:

  • FASTQ: reads 4-line blocks and creates Read objects.
  • FASTA: reads each genome header and its sequence, and yields Reference objects through a generator.
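Both loaders can be written as generators over lines, so that large files are never held in memory at once. The sketch below is an assumption of how such loaders might look: it yields plain tuples instead of the project's Read and Reference objects, and the names iter_fasta and iter_fastq are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iter_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from consecutive 4-line blocks."""
    stripped = (raw.rstrip("\n") for raw in lines)
    while True:
        block = list(islice(stripped, 4))
        if not block:
            return
        if len(block) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = block
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        yield header[1:], sequence, quality


genomes = list(iter_fasta([">g1 demo\n", "ACGT\n", "AC\n", ">g2\n", "TT\n"]))
reads = list(iter_fastq(["@r1\n", "ACGT\n", "+\n", "IIII\n"]))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example iter_fasta(open("genome.fa")).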

Developed by Elal Gilboa
Hebrew University – Bioinformatics Project

About

Final project in Python for the Introduction to Computer Science course at HUJI, Fall semester 2024.
