- EXTQUALITY – Filters out low-quality reads, low-quality Kmers, and highly redundant Kmers during the mapping process.
- EXTREVCOMP – Determines the best orientation for each read (forward/reverse) based on the number of unique Kmers.
- EXTCOVERAGE – Calculates genome coverage based on unique and ambiguous read alignments.
- EXTSIM – Filters highly similar genomes from the reference Kmer DB and provides filtering statistics.
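
As an illustration of the EXTREVCOMP idea, here is a minimal sketch of choosing a read's orientation by counting hits against a set of unique reference Kmers. The names `reverse_complement`, `kmers`, and `best_orientation` are hypothetical and not taken from the project. The tie-breaking rule (keep the forward read) is also an assumption.

```python
# Hypothetical helpers; the project's own implementation may differ.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Yield every overlapping Kmer of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def best_orientation(read: str, k: int, unique_kmers: set) -> str:
    """Return the orientation with more hits in unique_kmers.

    Ties keep the forward orientation (an assumption of this sketch).
    """
    rc = reverse_complement(read)
    fwd_hits = sum(1 for km in kmers(read, k) if km in unique_kmers)
    rev_hits = sum(1 for km in kmers(rc, k) if km in unique_kmers)
    return rc if rev_hits > fwd_hits else read
```

For example, with `unique_kmers = {"CGT"}` and `k = 3`, the read `AACG` has no forward hits, while its reverse complement `CGTT` has one, so the reverse orientation is chosen.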
Run from the command line:
`python3 main.py -t [chosen_task] [chosen_flags]`

`-t reference -g genome.fa -k 21 -r output.kdb`

Required:
- `-g`: genome file path (FASTA format)
- `-k`: Kmer size (21–31)
- `-r`: output reference DB file path (.kdb)

Optional (for EXTSIM):
- `--filter-similar`
- `--similarity-threshold THRESHOLD`

`-t dumpref -r ref.kdb`

Or:

`-t dumpref -g genome.fa -k 21`

`-t align -r ref.kdb -a output.aln --reads input.fq [options]`

Or:

`-t align -g genome.fa -k 21 -a output.aln --reads input.fq [options]`

Options:
- `-m`: unique threshold
- `-p`: ambiguous threshold

`-t dumpalign -a output.aln`

Can only be used alone.

Or:

`-t dumpalign -r ref.kdb --reads input.fq [options]`

Or:

`-t dumpalign -g genome.fa -k 21 --reads input.fq [options]`

- EXTQUALITY: `--min-read-quality MRQ`, `--min-kmer-quality MKQ`, `--max-genomes MG`
- EXTREVCOMP: `--reverse-complement`
- EXTCOVERAGE: `--coverage`, `--genomes g1,g2,...`, `--min-coverage MC`, `--full-coverage`

- The reference Kmer DB is implemented as a `dict[str, List[int]]`, giving average-case O(1) lookup time during pseudo-alignment.
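
To make the lookup structure concrete, here is a minimal sketch of building such a dictionary. It assumes the integers in each list are genome IDs, which the text above does not specify (they could equally be positions). The function name `build_kmer_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(genomes: Dict[int, str], k: int) -> Dict[str, List[int]]:
    """Map each Kmer to the IDs of the genomes that contain it.

    Assumption: the List[int] values are genome IDs, recorded once per genome.
    """
    index = defaultdict(list)
    for genome_id, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            ids = index[seq[i:i + k]]
            # Genomes are processed one at a time, so checking the last
            # entry is enough to avoid duplicate IDs.
            if not ids or ids[-1] != genome_id:
                ids.append(genome_id)
    return dict(index)


genomes = {0: "ACGTA", 1: "CGTAG"}
index = build_kmer_index(genomes, k=3)
```

Under this assumption, a Kmer that maps to a single ID (here `ACG` or `TAG`) is unique to one genome, while `CGT` and `GTA` are shared by both. A missing Kmer can be handled with `index.get(kmer, [])`.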
- `Kmer`: holds sequence data, locations, uniqueness, and reference sources.
- `Read`: holds read data, quality, header, mapping status, and orientation.
- `Reference`: holds genome data, header, Kmers, and ID for similarity calculations.
- `Alignment`: manages mapping results, coverage, and quality statistics.
- `KmerDB`: contains all reference Kmers and optional similarity statistics.

- `*.aln`: pickled `Alignment` object, compressed using `gzip`
- `*.kdb`: pickled `KmerDB` and genome lengths dictionary (from FASTA headers)
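
A gzip-compressed pickle like the `*.aln` format can be written and read with the standard library alone. The sketch below is illustrative only: `save_compressed` and `load_compressed` are hypothetical names, and the project's actual file layout may differ.

```python
import gzip
import pickle


def save_compressed(obj, path: str) -> None:
    """Pickle obj and write it to path through a gzip stream."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str):
    """Read a gzip-compressed pickle back into a Python object."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)
```

Because unpickling can execute arbitrary code, files in these formats should only be loaded when they come from a trusted source.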

- FASTQ: reads 4-line blocks and creates `Read` objects.
- FASTA: reads genome header + sequence to create `Reference` objects using generators.
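
The two parsing strategies can be sketched with plain generators. This version yields tuples instead of the project's `Read` and `Reference` objects, whose constructors are not shown here. The names `parse_fastq` and `parse_fasta` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from 4-line FASTQ blocks."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = next(it, "")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "")
        yield header[1:], seq, qual


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines seen before the first header are discarded
            # when that header arrives.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and records are produced one at a time without loading the whole file into memory.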
Developed by Elal Gilboa
Hebrew University – Bioinformatics Project