Protein Homology Identification via Domain-Related Architecture
A simple way to search and validate identified Pfam domains of interest against a curated InterProScan Domain Architecture (IDA) file to check whether or not your proteins match a domain composition found in the InterPro Database.
A Python-based package of scripts that performs an initial homology search using MMseqs2 to identify targeted proteins of interest in small or large datasets. Top hits are searched against the Pfam database using pfam_scan, and verified domains are checked and compared against a custom input InterProScan Domain Architecture (IDA) file relative to your target protein of interest.
A recursive search is then performed using the full-length proteins with validated IDAs as the subject database and the original input as the query, with initial matches filtered out. This process captures potentially more distant proteins that may have been missed in the initial homology search but are functionally relevant.
➡️ Official documentation is hosted on PHIDRA.
phidra can be run either on a machine with a properly set up Python environment or by creating a custom Conda environment if you do not wish to change your current setup (recommended).
- MMseqs2 required for initial and recursive homology search.
- HMMER required for Pfam database creation and protein domain identification through pfam_scan.
- Python >= 3.8
- Conda setup (recommended)
#Create environment with >= Python 3.8
conda create --name phidra python=3.8
#Activate environment
conda activate phidra
#Install hmmer/mmseqs2 (Required)
conda install -c conda-forge -c bioconda mmseqs2
conda install bioconda::hmmer
#Clone tool into working directory
git clone https://github.com/zschreib/phidra
cd phidra
#Grabs required python packages
pip install -r requirements.txt
- Python setup (Python >= 3.8). Requires MMseqs2 and HMMER to be installed already.
git clone https://github.com/zschreib/phidra
cd phidra
pip install -r requirements.txt
- Tool and version check. If you receive any errors here, do not proceed until fixed.
mmseqs -h
hmmscan -h
python phidra_run.py -v
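The same check can be scripted before launching a run. This is a minimal sketch, assuming only that the executables are named `mmseqs` and `hmmscan` as in the commands above; the helper `missing_tools` is hypothetical and not part of phidra.

```python
import shutil


def missing_tools(tools=("mmseqs", "hmmscan")):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Finding a tool on PATH does not prove it runs correctly, so the `-h` and `-v` commands above remain the definitive check.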
- If you already have a custom or existing Pfam-HMM formatted database you can skip this step.
- You can optionally include the Pfam-B database to perform a less restrictive search, which can help identify more remote or novel domain relationships that may be missed by the curated Pfam-A profiles.
mkdir pfam_database
cd pfam_database
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip -c Pfam-A.hmm.dat.gz > Pfam-A.hmm.dat
gunzip -c Pfam-A.hmm.gz > Pfam-A.hmm
rm Pfam-A.hmm.gz Pfam-A.hmm.dat.gz
hmmpress Pfam-A.hmm
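`hmmpress` writes four binary index files next to the HMM file (`.h3f`, `.h3i`, `.h3m`, `.h3p`). A quick way to confirm the database was pressed is sketched below; the helper name `missing_pressed_files` is hypothetical and not part of phidra.

```python
from pathlib import Path

# Suffixes of the index files that hmmpress creates alongside the HMM file.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmm_path):
    """Return the names of hmmpress index files that are absent for hmm_path."""
    hmm = Path(hmm_path)
    return [
        hmm.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(hmm) + suffix).exists()
    ]


if __name__ == "__main__":
    absent = missing_pressed_files("Pfam-A.hmm")
    print("Missing index files:", absent if absent else "none")
```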
- For DNA polymerase A, I have identified PF00476 (DNA_pol_A) as the core domain.
- Search for the domain on InterPro (https://www.ebi.ac.uk/interpro/entry/pfam/PF00476/domain_architecture/) and grab the full IDA profile.
File should include (by default):
- IDA ID (unique hash)
- Domain combinations
- Protein counts within IDA
- Representative sequence
- Representative length
- Domain positions/coordinates
IDA ID IDA Text Unique Proteins Representative Accession Representative Length Representative Domains
82911c7e8cf0ed5121595d5944b2a2f9a2c4f49e PF02739:IPR020046-PF01367:IPR020045-PF01612:IPR002562-PF00476:IPR001098 22766 P00582 928 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[9-170],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[171-272],PF01612{DNA_pol_A_exo1}:IPR002562{3'-5'_exonuclease_dom}[330-516],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[551-925]
d4033548d92ed83469d80faa39cea59a849060a8 PF00476:IPR001098 18435 P00581 704 PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[333-701]
16983c3a921f1409678537d9977eca4ed176e7c2 PF02739:IPR020046-PF01367:IPR020045-PF22619:IPR054690-PF00476:IPR001098 10607 Q04957 877 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[4-170],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[172-266],PF22619{DNA_polI_exo1}:IPR054690{DNA_polI_exonuclease}[318-457],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[498-875]
ce54c2b3dc9b223aa1f4992353c0972accb51c74 PF02739:IPR020046-PF01367:IPR020045-PF00476:IPR001098 6038 O84500 866 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[3-166],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[167-259],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[495-865]
f54dc0def957e931720f1460942192debd6092ca PF01612:IPR002562-PF00476:IPR001098 4576 Q05254 595 PF01612{DNA_pol_A_exo1}:IPR002562{3'-5'_exonuclease_dom}[19-210],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[243-587]
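To inspect an IDA file programmatically, each row can be split into its Pfam accessions and domain coordinates. The sketch below assumes the file is tab-separated with the six columns in the order shown above; the function name `parse_ida_row` and the returned dictionary keys are illustrative and are not phidra's internal parser.

```python
import re

# Matches one entry of the "Representative Domains" column, for example
# PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[243-587]
# The InterPro part is treated as optional (an assumption).
DOMAIN_RE = re.compile(
    r"(PF\d+)\{([^}]*)\}(?::IPR\d+\{[^}]*\})?\[(\d+)-(\d+)\]"
)


def parse_ida_row(line):
    """Parse one tab-separated IDA row into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    ida_id, ida_text, n_proteins, rep_acc, rep_len, rep_domains = fields
    # "PF01612:IPR002562-PF00476:IPR001098" -> ["PF01612", "PF00476"]
    pfams = [entry.split(":")[0] for entry in ida_text.split("-")]
    coords = [
        (acc, name, int(start), int(end))
        for acc, name, start, end in DOMAIN_RE.findall(rep_domains)
    ]
    return {
        "ida_id": ida_id,
        "pfams": pfams,
        "unique_proteins": int(n_proteins),
        "representative": rep_acc,
        "representative_length": int(rep_len),
        "domain_coordinates": coords,
    }


if __name__ == "__main__":
    row = "\t".join([
        "f54dc0def957e931720f1460942192debd6092ca",
        "PF01612:IPR002562-PF00476:IPR001098",
        "4576",
        "Q05254",
        "595",
        "PF01612{DNA_pol_A_exo1}:IPR002562{3'-5'_exonuclease_dom}[19-210],"
        "PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[243-587]",
    ])
    print(parse_ida_row(row))
```

The coordinates are matched with a regular expression rather than split on `-`, because several InterPro short names contain hyphens themselves.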
Core domain (PF00476) commonly associates with:
- PF02739 (5' exonuclease N-terminal)
- PF01367 (5' exonuclease)
- PF01612 (DNA polymerase A exonuclease)
- Compare unknown or known sequence hits to the subject database against known IDA profiles.
- Explore domain organization and composition.
Domain architectures help validate protein predictions by ensuring essential functional domains are present, and they also provide structured, functionally meaningful features that can be leveraged as high-quality inputs for machine-learning models to improve training and downstream classification.
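As an illustration of the comparison idea, the sketch below checks whether the ordered Pfam domains found on a protein match one of the five architectures listed in the example table. It assumes an exact, order-sensitive match; phidra's actual matching rule may differ, and the names `KNOWN_ARCHITECTURES` and `architecture_matches` are hypothetical.

```python
# Ordered Pfam architectures taken from the example IDA rows above.
KNOWN_ARCHITECTURES = {
    ("PF02739", "PF01367", "PF01612", "PF00476"),
    ("PF00476",),
    ("PF02739", "PF01367", "PF22619", "PF00476"),
    ("PF02739", "PF01367", "PF00476"),
    ("PF01612", "PF00476"),
}


def architecture_matches(protein_pfams, known=KNOWN_ARCHITECTURES):
    """Return True if the ordered Pfam list equals a known architecture.

    Assumption: the match must be exact and order-sensitive.
    """
    return tuple(protein_pfams) in known


if __name__ == "__main__":
    # Domains listed N- to C-terminus for two made-up proteins.
    print(architecture_matches(["PF01612", "PF00476"]))
    print(architecture_matches(["PF00476", "PF01612"]))
```

A looser rule, such as requiring only that the core domain PF00476 is present, would accept more proteins at the cost of weaker validation.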
usage: phidra_run.py [-h] [-v] -i INPUT_FASTA -db SUBJECT_DB -pfam PFAM_HMM_DB -ida IDA_FILE -f FUNCTION -o OUTPUT_DIR
[-t THREADS] [-e EVALUE]
Identifies homologous proteins and associated Pfam domains from input protein sequences, while comparing against
InterPro Domain Architectures to analyze domain-level similarities and functional relationships.
Help:
-h, --help Show this help message and exit
-v, --version Show program version and exit
Required arguments:
-i INPUT_FASTA, --input_fasta INPUT_FASTA
Query FASTA for mmseqs search (default: None)
-db SUBJECT_DB, --subject_db SUBJECT_DB
Subject FASTA for mmseqs createdb (default: None)
-pfam PFAM_HMM_DB, --pfam_hmm_db PFAM_HMM_DB
Pfam HMM format database path (default: None)
-ida IDA_FILE, --ida_file IDA_FILE
IDA TSV file (default: None)
-f FUNCTION, --function FUNCTION
User label for this run (default: None)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Base output directory (default: None)
Optional arguments:
-t THREADS, --threads THREADS
Threads for tools supporting -cpu/--threads (default: 1)
-e EVALUE, --evalue EVALUE
E-value threshold for mmseqs easy-search (e.g., 1E-3, 1e-5) (default: 1E-3)
Example run using the provided examples directory:
python phidra_run.py -i examples/query/polA_test.fa -db examples/subject/pola_16_k12_ref.fa -pfam [pfam_DB_location] -ida examples/IDA/polA_IDA_list.tsv -f polA -o examples/output -t 5 -e 1E-3
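For scripted or batch runs, the command can also be assembled in Python. This is a minimal sketch that uses only the flags listed in the usage text; the helper names `build_phidra_command` and `run_phidra` are hypothetical, and the Pfam database path is left for you to supply.

```python
import subprocess
import sys


def build_phidra_command(input_fasta, subject_db, pfam_db, ida_file,
                         function, output_dir, threads=1, evalue="1E-3"):
    """Assemble the phidra_run.py argument list from the documented flags."""
    return [
        sys.executable, "phidra_run.py",
        "-i", str(input_fasta),
        "-db", str(subject_db),
        "-pfam", str(pfam_db),
        "-ida", str(ida_file),
        "-f", str(function),
        "-o", str(output_dir),
        "-t", str(threads),
        "-e", str(evalue),
    ]


def run_phidra(**kwargs):
    """Run phidra from the repository root; raise on a non-zero exit code."""
    subprocess.run(build_phidra_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.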
output/
├── final_results/
│   ├── pfam_coverage_report.tsv                  # Summary of Pfam domain coverage across all search hits
│   ├── summary.tsv                               # Summary table combining counts for each iteration
│   ├── unvalidated_ida_pfams/                    # Pfam domains found but not validated by InterPro Domain Architecture (IDA)
│   │   ├── domains.fa                            # FASTA of individual Pfam domain hits
│   │   ├── full_proteins.fa                      # Full-length proteins containing those domains
│   │   └── pfam_unvalidated_merged_report.tsv    # Merged table of unvalidated domain results
│   └── validated_ida_pfams/                      # Pfam domains validated by IDA recursion
│       ├── domains.fa                            # FASTA of validated Pfam domain hits
│       ├── full_proteins.fa                      # Full-length proteins containing validated domains
│       └── pfam_validated_merged_report.tsv      # Merged table of validated domain results
│
├── mmseqs/
│   ├── initial/                                  # First-pass MMseqs2 search output
│   │   ├── bits.tsv                              # Top hit table by best bitscore
│   │   ├── hits.fa                               # FASTA of sequences with best e-value
│   │   ├── hits.tsv                              # Top hit table by best e-value
│   │   └── res.m8                                # MMseqs2 table output (m8 format) for all significant initial hits
│   └── recursive/                                # Secondary MMseqs2 search using hits as queries
│       └── res.m8                                # MMseqs2 table output (m8 format) for all significant recursive hits
│                                                 # No recursive hits so tophit/FASTA not created
└── pfam/
    ├── initial/                                  # Initial Pfam HMMER domain search results
    │   ├── pfam_coverage_report.tsv              # Coverage summary of initial Pfam search
    │   ├── unvalidated_ida_report/               # Domains not validated by user IDA but hit Pfam domain
    │   │   ├── domains.fa                        # FASTA of unvalidated Pfam domain hits
    │   │   ├── full_proteins.fa                  # Full-length proteins for unvalidated hits
    │   │   └── pfam_unvalidated_report.tsv       # Tabular report of unvalidated domain results
    │   └── validated_ida_report/                 # Domains validated by IDA
    │       ├── domains.fa                        # FASTA of validated Pfam domain hits
    │       ├── full_proteins.fa                  # Full-length proteins for validated hits
    │       └── pfam_validated_report.tsv         # Tabular report of validated domain results
    └── recursive/                                # Results from recursive Pfam search
        (empty)                                   # No recursive hits so pfam files not created
- The named output directory contains the complete output of the phidra pipeline.
- Includes homology search results, IDA-validated and unvalidated Pfam domain calls with both sequence-level and domain-level FASTA files, and comprehensive summary tables.
- A detailed description of each file is provided above to help navigate and interpret the results.
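If you want to post-process `res.m8` yourself, the sketch below picks the highest-scoring target for each query. It assumes the file uses MMseqs2's default 12-column tabular layout (query, target, identity, alignment length, mismatches, gap openings, query start, query end, target start, target end, E-value, bit score) and that a "top hit" is chosen per query; neither assumption is confirmed for how phidra builds `bits.tsv`. The function name and the demo rows are synthetic.

```python
import csv


def best_hits_by_bitscore(m8_lines):
    """Map each query ID to its (target ID, bit score) with the highest score."""
    best = {}
    for row in csv.reader(m8_lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, target, bits = row[0], row[1], float(row[11])
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return best


if __name__ == "__main__":
    # Synthetic rows for illustration only, not real phidra output.
    demo = [
        "q1\tt1\t0.90\t100\t10\t0\t1\t100\t1\t100\t1e-30\t150",
        "q1\tt2\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180",
        "q2\tt1\t0.80\t90\t18\t0\t1\t90\t5\t94\t1e-20\t110",
    ]
    for query, (target, bits) in best_hits_by_bitscore(demo).items():
        print(query, target, bits)
```

To process a real file, pass an open file handle, for example `best_hits_by_bitscore(open("res.m8"))`.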
- Browse the stable v1.x branch
- Or check the v1.0.0 release
If you found this tool useful, please cite:
Primary reference (method):
Schreiber, Zachary D. Unraveling Viral Gene Associations Through Integrative Computational Approaches. PhD dissertation, University of Delaware, 2025.
Software (implementation and version used):
Schreiber, Zachary D. PHIDRA: Protein Homology Identification via Domain-Related Architecture (version 2.0) [Computer software]. GitHub. https://github.com/zschreib/phidra
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.