FuFiHLA is a pipeline for full-field HLA allele typing and consensus sequence construction from long-read sequencing data. It currently supports PacBio HiFi data for six clinically important transplant genes: HLA-A, -B, -C, -DQA1, -DQB1, and -DRB1.
- Reference-free: does not depend on a specific version of a reference genome such as GRCh38 or CHM13
- Improved consensus accuracy compared to StarPhase
Citation: TBD
Install from Bioconda (recommended):
conda install -c bioconda -c conda-forge fufihla
With test.fa.gz in the test folder, run:
fufihla --fa test.fa.gz --out test_dir
The output includes:
test_dir/ → pipeline logs
test_dir.out → result output
test_dir.err → stderr log
To use the latest reference allele sequences from IMGT, type:
fufihla-ref-prep
This creates a directory called ref_data, which contains the reference allele sequences in ref.gene.fa.gz.
Run the pipeline:
# with default reference allele sequences, version IPD-IMGT/HLA-V3.61.0
fufihla --fa <input_reads.fa.gz> --out <output_dir>
# or with the specific version of reference data
fufihla --fa <input_reads.fa.gz> --out <output_dir> --refdir <reference_data_directory> [--hifi|--ont] [--debug]
Arguments
- <input_reads.fa.gz>: raw PacBio HiFi reads (.fa/.fa.gz/.fq/.fq.gz)
- <output_dir>: directory for pipeline outputs
- --refdir <reference_data_directory> (optional): path to the reference allele dataset; if omitted, the default bundled set is used
- --hifi/--ont (optional): choose PacBio HiFi or Nanopore long reads as input; default is --hifi
- --debug (optional): keep all intermediate files; otherwise only consensus results are kept
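When several samples need typing, the invocation above can be generated once per input file. The sketch below is a minimal, hypothetical helper (print_fufihla_cmds is not part of FuFiHLA) that only prints commands, using just the --fa and --out options described above. It assumes each sample is a single .fa.gz file in one directory and that paths contain no whitespace.

```shell
# print_fufihla_cmds: print one fufihla command per *.fa.gz file in a directory.
# Hypothetical helper for illustration; it only prints commands, it does not run them.
# Assumption: paths contain no whitespace, so the printed lines are safe to pipe to sh.
print_fufihla_cmds() {
  in_dir=$1      # directory holding the per-sample read files
  out_root=$2    # parent directory for the per-sample output directories
  for fa in "$in_dir"/*.fa.gz; do
    [ -e "$fa" ] || continue              # skip when the glob matches nothing
    sample=$(basename "$fa" .fa.gz)       # sample name = file name without .fa.gz
    printf 'fufihla --fa %s --out %s\n' "$fa" "$out_root/$sample"
  done
}

# Dry run:  print_fufihla_cmds reads results
# Execute:  print_fufihla_cmds reads results | sh
```

Printing the commands first makes it easy to inspect them before anything is executed.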
A typical run produces:
<output_dir>/consensus/*_asm*.fa → consensus allele FASTA sequences
Allele calls are printed to <output_dir>.out in PAF-like format with minimap2 tags.
Example:
HLA-A*01:01:01:01 cons_HLA-A*01_01_01_01 ... cs:Z::3503
HLA-A*26:01:01:01 cons_HLA-A*26_01_01_01 ... cs:Z::3517
- Column 1 → the allele name called by FuFiHLA
- Column 2 → the consensus sequence, built from the allele named in its suffix
- Last column (cs:Z) → minimap2 cs tag encoding base-level matches/mismatches:
  - Known alleles: cs:Z::3503 → perfect match over 3503 bp
  - Novel alleles → cs:Z contains substitutions (*), insertions (+), or deletions (-)
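The known/novel distinction described above can be checked mechanically from the cs tag. The following is a minimal sketch, not part of FuFiHLA: classify_calls is a hypothetical helper, and it assumes the result file is whitespace-separated with the cs:Z tag in the last column, as in the example lines.

```shell
# classify_calls: read allele-call lines on stdin and label each call.
# Hypothetical helper for illustration.
# Assumption: fields are whitespace-separated and the cs:Z tag is the last field.
classify_calls() {
  awk '{
    cs = $NF
    if (cs ~ /^cs:Z::[0-9]+$/) status = "known"    # one uninterrupted match block
    else if (cs ~ /^cs:Z:/)    status = "novel"    # tag contains *, + or - operations
    else                       status = "unknown"  # no cs tag in the last column
    print $1 "\t" status
  }'
}

# Usage: classify_calls < test_dir.out
```

A tag made of a single ":<length>" block means the consensus matches the reported allele exactly; any other cs tag is reported as novel.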
- Reads extracted from existing BAM files give results similar to those from the full set of WGS reads.
## save the six gene locations into bed format based on the gene annotation file
echo "
chr6 29942254 29945755
chr6 31268254 31272571
chr6 31353362 31357442
chr6 32578769 32589848
chr6 32636717 32643200
chr6 32660031 32667132" > sel.bed
## Extract the reads covering the target gene region
samtools view -bh "${bam}" --region-file sel.bed | samtools fasta - | gzip -c > out.fa.gz
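Before passing the extracted reads to the pipeline, it can help to confirm that the extraction actually produced sequences. The sketch below is a hypothetical helper (count_fasta_records is not part of FuFiHLA or samtools) that counts FASTA header lines in a gzip-compressed file such as out.fa.gz.

```shell
# count_fasta_records: count sequences in a gzip-compressed FASTA file.
# Hypothetical helper for illustration; each record is assumed to start with ">".
count_fasta_records() {
  gzip -dc "$1" | grep -c '^>'
}

# Usage: count_fasta_records out.fa.gz
```

A count of 0 usually points to a mismatch between the chromosome names in sel.bed and those in the BAM header (for example chr6 versus 6), or to a missing BAM index.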