hlilab/FuFiHLA (forked from jingqing-hu/FuFiHLA)

This repo is for backup only. Please check the parent repo for details.

FuFiHLA: Full Field HLA allele typing for Long Reads

License: MIT Python

FuFiHLA is a pipeline for full field HLA allele typing and consensus sequence construction from long-read sequencing data. It currently supports PacBio HiFi data on six clinically important transplant genes: HLA-A, -B, -C, -DQA1, -DQB1, -DRB1.

Highlights

  • Reference-free: does not depend on a specific reference genome version such as GRCh38 or CHM13
  • Improved consensus accuracy compared to StarPhase

Citation: TBD


Installation

Install from Bioconda (recommended):

conda install -c bioconda -c conda-forge fufihla

Quick Test

Using the test.fa.gz file in the test folder, run:

fufihla --fa test.fa.gz --out test_dir

The output includes:

  • test_dir/ → pipeline logs
  • test_dir.out → result output
  • test_dir.err → stderr log

Usage

To use the latest reference allele sequences from IMGT, type:

fufihla-ref-prep

This creates a directory called ref_data, which contains the reference allele sequences in ref.gene.fa.gz.

Run the pipeline:

# with the default reference allele sequences (IPD-IMGT/HLA version 3.61.0)
fufihla --fa <input_reads.fa.gz> --out <output_dir>
# or with a specific version of the reference data (bracketed options are optional)
fufihla --fa <input_reads.fa.gz> --out <output_dir> --refdir <reference_data_directory> [--hifi | --ont] [--debug]

Arguments

  • <input_reads.fa.gz> : raw PacBio HiFi reads (.fa/.fa.gz/.fq/.fq.gz)
  • <output_dir> : directory for pipeline outputs
  • --refdir <reference_data_directory> (optional): path to the reference allele dataset; if omitted, the default bundled set is used
  • --hifi / --ont (optional): select PacBio HiFi or Nanopore long reads as input; the default is --hifi
  • --debug (optional): keep all intermediate files; otherwise only the consensus results are kept
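
When the pipeline is driven from a script, the arguments above can be assembled programmatically. The following is a minimal sketch, not part of FuFiHLA: the helper name build_fufihla_command and the example file names are hypothetical, and only the flags documented above are used.

```python
import shlex
from typing import List, Optional


def build_fufihla_command(
    reads: str,
    out_dir: str,
    ref_dir: Optional[str] = None,
    platform: str = "hifi",
    debug: bool = False,
) -> List[str]:
    """Assemble a fufihla argument list from the documented options.

    Hypothetical helper: it only mirrors the flags listed in this README.
    """
    if platform not in ("hifi", "ont"):
        raise ValueError("platform must be 'hifi' or 'ont'")
    cmd = ["fufihla", "--fa", reads, "--out", out_dir]
    if ref_dir is not None:
        cmd += ["--refdir", ref_dir]
    # The README states --hifi is the default; it is passed explicitly here.
    cmd.append(f"--{platform}")
    if debug:
        cmd.append("--debug")
    return cmd


if __name__ == "__main__":
    # Example file and directory names are placeholders.
    cmd = build_fufihla_command("reads.fa.gz", "sample1", ref_dir="ref_data", platform="ont")
    print(shlex.join(cmd))
    # To actually launch the pipeline: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string lets the command be passed to subprocess.run without shell quoting concerns.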

Outputs

A typical run produces:

<output_dir>/consensus/*_asm*.fa    → consensus allele FASTA sequences
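
For a quick sanity check of the consensus files, the sketch below collects the length of every record matching the pattern above. It is an illustration only: the function name consensus_lengths is hypothetical, and it assumes the files are plain (uncompressed) FASTA.

```python
import glob
import os
from typing import Dict


def consensus_lengths(out_dir: str) -> Dict[str, int]:
    """Map each consensus record name to its sequence length in bp."""
    lengths: Dict[str, int] = {}
    pattern = os.path.join(out_dir, "consensus", "*_asm*.fa")
    for path in sorted(glob.glob(pattern)):
        name = None
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # Header line: keep the first whitespace-delimited token.
                    name = line[1:].split()[0] if len(line) > 1 else ""
                    lengths[name] = 0
                elif name is not None and line:
                    # Sequence line: add its length to the current record.
                    lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # "test_dir" is the output directory from the Quick Test above.
    for record, length in consensus_lengths("test_dir").items():
        print(f"{record}\t{length}")
```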

Allele calls are printed to <output_dir>.out in PAF-like format with minimap2 tags. Example:

HLA-A*01:01:01:01  cons_HLA-A*01_01_01_01  ...  cs:Z::3503
HLA-A*26:01:01:01  cons_HLA-A*26_01_01_01  ...  cs:Z::3517
  • Column 1 → the allele name called by FuFiHLA
  • Column 2 → the consensus sequence, built upon the allele named in its suffix
  • Last column (cs:Z) → minimap2 cs tag encoding base-level matches/mismatches:
    • Known Alleles: cs:Z::3503 → perfect match over 3503 bp
    • Novel Alleles → cs:Z contains substitutions (*), insertions (+), or deletions (-)
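
The known/novel distinction described above can be checked in a script by decoding the cs tag. The sketch below is an illustration under assumptions: the function name parse_call is hypothetical, fields are split on whitespace, and only the short-form cs operations listed above (match runs, substitutions, insertions, deletions) are handled.

```python
import re
from typing import Dict, Optional

# Short-form cs operations: ":N" match run, "*xy" substitution,
# "+seq" insertion, "-seq" deletion.
CS_OP = re.compile(r":(\d+)|\*([a-z]{2})|\+([a-z]+)|-([a-z]+)")


def parse_call(line: str) -> Optional[Dict[str, object]]:
    """Summarize one allele-call line; return None if it has no cs:Z tag."""
    fields = line.split()
    cs = next((f[5:] for f in fields if f.startswith("cs:Z:")), None)
    if len(fields) < 2 or cs is None:
        return None
    counts = {"matched_bp": 0, "substitutions": 0, "insertions": 0, "deletions": 0}
    for m in CS_OP.finditer(cs):
        if m.group(1):
            counts["matched_bp"] += int(m.group(1))
        elif m.group(2):
            counts["substitutions"] += 1
        elif m.group(3):
            counts["insertions"] += 1
        else:
            counts["deletions"] += 1
    # Any difference from the reference allele marks the call as novel.
    novel = any(counts[k] for k in ("substitutions", "insertions", "deletions"))
    return {"allele": fields[0], "consensus": fields[1], "novel": novel, **counts}


if __name__ == "__main__":
    # "test_dir.out" is the result file from the Quick Test above.
    with open("test_dir.out") as handle:
        for raw in handle:
            call = parse_call(raw)
            if call is not None:
                print(call["allele"], "novel" if call["novel"] else "known")
```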

Running tips

  • Reads extracted from an existing BAM file give results similar to those obtained from WGS reads.
## Save the six gene locations in BED format (tab-separated), based on the gene annotation file
printf 'chr6\t%s\t%s\n' \
  29942254 29945755 \
  31268254 31272571 \
  31353362 31357442 \
  32578769 32589848 \
  32636717 32643200 \
  32660031 32667132 > sel.bed

## Extract the reads covering the target gene regions
## (--region-file uses the BAM index, so ${bam} must be coordinate-sorted and indexed)
samtools view -bh "${bam}" --region-file sel.bed | samtools fasta | gzip -c > out.fa.gz
