A production-ready pipeline that performs BLAST searches against Swiss-Prot (reviewed proteins only) and enriches results with comprehensive UniProt annotations. Perfect for protein sequence identification and functional annotation.
graph LR
A[CSV Input<br/>Protein Sequences] --> B[BLAST Search<br/>Swiss-Prot Database]
B --> C[UniProt Annotations<br/>Function, EC numbers, etc.]
C --> D[Top-10 Results<br/>Ranked by quality]
D --> E[Multiple Outputs<br/>CSV, FASTA, Stats]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#e8f5e8
style D fill:#f3e5f5
style E fill:#e1f5fe
- Swiss-Prot Only: Searches only reviewed, high-quality Swiss-Prot database entries
- Comprehensive Annotations: Extracts function, catalytic activity, cofactors, EC numbers, organism info, and more
- Top-10 Results: Returns the best 10 matches per query with detailed scoring
- Multiple Output Formats: Summary tables, detailed results, FASTA sequences, and statistics
- Production Ready: Robust error handling, retry logic, and comprehensive logging
- Easy to Use: Simple command-line interface with sensible defaults
- Python ≥3.10
- Internet connection for EMBL-EBI BLAST and UniProt APIs
- Valid email address for fair-use compliance
- Install dependencies:
  pip install -r requirements.txt
- Run with your protein sequences:
  python swissprot_blast.py input_seq.csv
That's it! The system will process your sequences and generate comprehensive results.
The system works out-of-the-box with sensible defaults. To customize, edit config.yaml:
blast:
  email: " @gmail.com"             # Your email for fair-use compliance
  database: "uniprotkb_swissprot"  # Swiss-Prot only (enforced)
  batch_size: 30                   # Sequences processed per batch
selection:
  topk: 10                         # Top hits per query
  min_identity_pct: 20             # Minimum identity threshold
  min_coverage_pct: 20             # Minimum coverage threshold
  max_evalue: 1e-2                 # Maximum E-value threshold
output:
  prefix: "swissprot_results"      # Output file prefix
  include_fasta: true              # Generate FASTA file

Your CSV file needs these columns:
- Name: Query identifier (e.g., "Iai47_00045")
- Protein_sequence: Amino acid sequence (standard 20 amino acids + X/B/Z/J/U/O)
Example:
Name,Protein_sequence
Iai47_00045,MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW
Iai47_00065,MRNFDLTPLYRSAIGFDRLFNLLESNQNQSNGGYPPYNVELVDENHYRITIAVAGFSQSELDITAHDNVLIVRGAHPEEQAERKYLYQGIAERNFERKFQLADHIVVRDARLENGLLSIDLERLVPEEAKPRRIEILK
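Before submitting a large file, it can help to check the input against the format described above. The following is a minimal sketch, not the pipeline's own validation: the function name `validate_input` and the error wording are assumptions, and the sample sequence is a shortened prefix of the second example entry.

```python
import csv
import io

REQUIRED_COLUMNS = ("Name", "Protein_sequence")
# Standard 20 amino acids plus the extra codes X/B/Z/J/U/O.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "XBZJUO")


def validate_input(handle):
    """Return (valid_rows, problems) for an input CSV opened as a text stream.

    Hypothetical helper for illustration; not part of swissprot_blast.py.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")

    valid, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["Name"] or "").strip()
        seq = (row["Protein_sequence"] or "").strip().upper()
        bad = sorted(set(seq) - ALLOWED_RESIDUES)
        if not name:
            problems.append((line_no, "empty Name"))
        elif not seq:
            problems.append((line_no, f"{name}: empty sequence"))
        elif bad:
            problems.append((line_no, f"{name}: unexpected characters {bad}"))
        else:
            valid.append((name, seq))
    return valid, problems


sample = io.StringIO(
    "Name,Protein_sequence\n"
    "Iai47_00065,MRNFDLTPLYRSAIGFDRLFNLLESNQNQ\n"  # shortened for the example
    "bad_entry,MKT1*\n"                            # contains non-residue characters
)
valid_rows, problems = validate_input(sample)
for line_no, message in problems:
    print(f"line {line_no}: {message}")
```

For a real file, pass `open("input_seq.csv", newline="")` instead of the in-memory sample.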
After running, you'll get these files:
- swissprot_results_wide.csv: Summary table with top 10 hits per query (easy to read)
- swissprot_results_long.csv: Detailed results with all hit information
- swissprot_results_hits.fasta: Protein sequences of the best matches
- swissprot_results_stats.json: Processing summary and statistics
- run.log: General application log
- swissprot_results_blast_jobs.json: BLAST job tracking details
- swissprot_results_uniprot_requests.json: UniProt API request logs
Here's what you can expect from a successful run:
Statistics Summary:
- Total queries: 2 sequences processed
- Success rate: 100% (both sequences found matches)
- Average identity: 84.25% (high-quality matches)
- Average coverage: 99.28% (nearly full-length alignments)
- All hits reviewed: Only Swiss-Prot entries (highest quality)
Sample Results:
- Iai47_00045 → D-galactonate dehydratase (EC 4.2.1.6) from Serratia proteamaculans
- Iai47_00065 → Heat shock protein IbpA (chaperone) from Citrobacter koseri
graph TD
S[Unknown Protein Sequence<br/>MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW]
S --> ID[D-galactonate dehydratase<br/>EC 4.2.1.6]
S --> ORG[Serratia proteamaculans]
S --> FUNC[Catalyzes dehydration of<br/>D-galactonate to 2-keto-3-deoxy-D-galactonate]
S --> COF[Mg(2+) cofactor]
style S fill:#e3f2fd
style ID fill:#e8f5e8
style ORG fill:#fff3e0
style FUNC fill:#f3e5f5
style COF fill:#e1f5fe
The swissprot_results_wide.csv file has one row per query with columns like:
- Query_Name, Query_Sequence, Query_Length, Status, n_hits
- Hit1_Accession, Hit1_Identity, Hit1_Coverage, Hit1_Evalue, Hit1_Bitscore
- Hit1_Reviewed, Hit1_Organism, Hit1_Function, Hit1_EC, Hit1_Keywords
- ... repeated for Hit2 through Hit10
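The repeating HitN_ column pattern can be pictured as flattening a ranked list of hits into a single row. This is only an illustrative sketch: `to_wide_row` is a hypothetical helper, the accessions are made up, and the Status column is left out because its possible values are not documented here.

```python
def to_wide_row(query_name, query_sequence, hits, max_hits=10):
    """Flatten a ranked list of hit dicts into one wide-format row.

    Hypothetical helper; the pipeline's real implementation may differ.
    """
    row = {
        "Query_Name": query_name,
        "Query_Sequence": query_sequence,
        "Query_Length": len(query_sequence),
        "n_hits": len(hits),
    }
    # Each hit contributes one group of columns: Hit1_*, Hit2_*, ...
    for i, hit in enumerate(hits[:max_hits], start=1):
        for field, value in hit.items():
            row[f"Hit{i}_{field}"] = value
    return row


# Made-up hits, already sorted best-first.
hits = [
    {"Accession": "ACC_A", "Identity": 84.0, "Coverage": 99.0},
    {"Accession": "ACC_B", "Identity": 61.5, "Coverage": 97.0},
]
wide = to_wide_row("query_1", "MKITRLTTYR", hits)
print(sorted(wide))
```

Queries with fewer than ten hits simply have no values for the remaining HitN_ columns in this sketch; how the real file fills them is not specified above.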
The swissprot_results_long.csv file has one row per hit with columns:
- query_name, rank, accession, identity, coverage, evalue, bitscore
- organism, protein_name, function, catalytic_activity, cofactor
- ec_numbers, keywords, subcellular_location, gene_names
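The selection settings in config.yaml (topk, min_identity_pct, min_coverage_pct, max_evalue) can be thought of as a filter over rows shaped like the long format. The sketch below assumes that identity and coverage are stored as percentages and that hits are ranked by bitscore; the actual ranking rule is not stated in this README, and `select_hits` and the accessions are hypothetical.

```python
import csv
import io

# Thresholds mirror the defaults shown in config.yaml.
SELECTION = {
    "topk": 10,
    "min_identity_pct": 20,
    "min_coverage_pct": 20,
    "max_evalue": 1e-2,
}


def select_hits(rows, selection=SELECTION):
    """Group hits by query, drop those failing a threshold, keep the top-k."""
    by_query = {}
    for row in rows:
        passes = (
            float(row["identity"]) >= selection["min_identity_pct"]
            and float(row["coverage"]) >= selection["min_coverage_pct"]
            and float(row["evalue"]) <= selection["max_evalue"]
        )
        if passes:
            by_query.setdefault(row["query_name"], []).append(row)
    # Assumption: a higher bitscore ranks first.
    return {
        query: sorted(hits, key=lambda h: float(h["bitscore"]), reverse=True)[
            : selection["topk"]
        ]
        for query, hits in by_query.items()
    }


# Made-up rows using a subset of the long-format columns.
long_csv = io.StringIO(
    "query_name,rank,accession,identity,coverage,evalue,bitscore\n"
    "q1,1,ACC_A,84.0,99.0,1e-50,400\n"
    "q1,2,ACC_B,15.0,90.0,1e-3,40\n"   # fails the identity threshold
    "q1,3,ACC_C,30.0,50.0,0.5,25\n"    # fails the E-value threshold
    "q1,4,ACC_D,45.0,80.0,1e-20,150\n"
)
selected = select_hits(csv.DictReader(long_csv))
print([hit["accession"] for hit in selected["q1"]])
```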
- Batch Processing: Processes sequences in batches for efficiency
- Robust Error Handling: Automatic retries with exponential backoff
- Comprehensive Logging: Detailed logs for troubleshooting
- Memory Efficient: Handles large datasets without memory issues
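Batching and retrying with exponential backoff, as listed above, follow a common pattern. The sketch below shows one way to write it; it is not taken from the pipeline's source, and the helper names, attempt count, and delay values are assumptions. Only the batch size of 30 comes from config.yaml.

```python
import random
import time


def batched(items, batch_size=30):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Stand-in for a network call that fails twice before succeeding.
calls = {"n": 0}


def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "job-ok"


waits = []
# Record the delays instead of actually sleeping, so the demo runs instantly.
result = with_retries(flaky_submit, sleep=waits.append)
batches = list(batched(list(range(65)), batch_size=30))
print(result, [len(b) for b in batches])
```

The small random jitter spreads out retries so that several failed requests do not all hit the service again at the same moment.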
selection:
  taxonomy_id: 511145  # E. coli K-12

output:
  include_sequences_wide: true

python swissprot_blast.py input.csv my_custom_results

- "Missing required columns": Make sure your CSV has Name and Protein_sequence columns
- "BLAST email is required": Update the email in config.yaml
- "Configuration file not found": Ensure config.yaml exists in your working directory
- No results found: Check that your sequences are valid protein sequences

- Check run.log for detailed error messages
- Review swissprot_results_stats.json for the processing summary
- Ensure you have an internet connection for the BLAST and UniProt APIs

- EMBL-EBI for BLAST services
- UniProt Consortium for annotation data
- NCBI for taxonomy information
Ready to identify your protein sequences? Just run python swissprot_blast.py input_seq.csv and get comprehensive results!
