Swiss-Prot BLAST & Annotation System

A production-ready pipeline that runs BLAST searches against Swiss-Prot (reviewed proteins only) and enriches the results with UniProt annotations. It is intended for protein sequence identification and functional annotation.

🔄 How It Works

graph LR
    A[📁 CSV Input<br/>Protein Sequences] --> B[🔍 BLAST Search<br/>Swiss-Prot Database]
    B --> C[🧬 UniProt Annotations<br/>Function, EC numbers, etc.]
    C --> D[📊 Top-10 Results<br/>Ranked by quality]
    D --> E[📄 Multiple Outputs<br/>CSV, FASTA, Stats]
    
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#f3e5f5
    style E fill:#e1f5fe

🚀 Key Features

Swiss-Prot Only: Searches only reviewed, high-quality Swiss-Prot database entries
Comprehensive Annotations: Extracts function, catalytic activity, cofactors, EC numbers, organism info, and more
Top-10 Results: Returns the best 10 matches per query with detailed scoring
Multiple Output Formats: Summary tables, detailed results, FASTA sequences, and statistics
Production Ready: Robust error handling, retry logic, and comprehensive logging
Easy to Use: Simple command-line interface with sensible defaults

📋 Requirements

Python ≥3.10
Internet connection for EMBL-EBI BLAST and UniProt APIs
Valid email address for fair-use compliance

🛠️ Quick Start

Install dependencies:
```
pip install -r requirements.txt
```
Run with your protein sequences:
```
python swissprot_blast.py input_seq.csv
```

That's it! The system will process your sequences and generate comprehensive results.

⚙️ Configuration (Optional)

The system works out-of-the-box with sensible defaults. To customize, edit config.yaml:

blast:
  email: " @gmail.com"        # Your email for fair-use compliance
  database: "uniprotkb_swissprot"  # Swiss-Prot only (enforced)
  batch_size: 30                   # Sequences processed per batch

selection:
  topk: 10                         # Top hits per query
  min_identity_pct: 20            # Minimum identity threshold
  min_coverage_pct: 20            # Minimum coverage threshold
  max_evalue: 1e-2                # Maximum E-value threshold

output:
  prefix: "swissprot_results"      # Output file prefix
  include_fasta: true              # Generate FASTA file
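
To make the selection settings concrete, here is a minimal sketch of how thresholds like these could be applied to a list of hits. It is not the pipeline's actual code: the `Hit` class, the `select_hits` function, the inclusive comparisons, and the ranking by bit score are all assumptions made for illustration, and the accessions are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit (not the tool's own type)."""
    accession: str
    identity_pct: float
    coverage_pct: float
    evalue: float
    bitscore: float


def select_hits(hits, topk=10, min_identity_pct=20.0,
                min_coverage_pct=20.0, max_evalue=1e-2):
    """Keep hits that pass every threshold, then return the best `topk`."""
    kept = [
        h for h in hits
        if h.identity_pct >= min_identity_pct
        and h.coverage_pct >= min_coverage_pct
        and h.evalue <= max_evalue
    ]
    # Assumption: rank by bit score, highest first.
    kept.sort(key=lambda h: h.bitscore, reverse=True)
    return kept[:topk]


hits = [
    Hit("ACC_A", 95.0, 99.0, 1e-150, 700.0),
    Hit("ACC_B", 15.0, 80.0, 1e-5, 60.0),   # identity below 20%
    Hit("ACC_C", 40.0, 60.0, 0.5, 30.0),    # E-value above 1e-2
    Hit("ACC_D", 55.0, 90.0, 1e-40, 250.0),
]
selected = select_hits(hits)
print([h.accession for h in selected])
```

The default arguments mirror the values in the config excerpt, so changing `topk` or a threshold in the sketch corresponds to editing the matching key under `selection`.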

📊 Input Format

Your CSV file needs these columns:

Name: Query identifier (e.g., "Iai47_00045")
Protein_sequence: Amino acid sequence (standard 20 amino acids + X/B/Z/J/U/O)

Example:

Name,Protein_sequence
Iai47_00045,MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW
Iai47_00065,MRNFDLTPLYRSAIGFDRLFNLLESNQNQSNGGYPPYNVELVDENHYRITIAVAGFSQSELDITAHDNVLIVRGAHPEEQAERKYLYQGIAERNFERKFQLADHIVVRDARLENGLLSIDLERLVPEEAKPRRIEILK
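
Before submitting a large file, it can help to check it against the format above. The following is a small standalone sketch using only the standard library; `validate_rows` is a hypothetical helper, not part of `swissprot_blast.py`, and upper-casing the sequence before checking is an assumption.

```python
import csv
import io

REQUIRED_COLUMNS = ("Name", "Protein_sequence")
# Standard 20 amino acids plus the extra codes X/B/Z/J/U/O listed above.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "XBZJUO")


def validate_rows(handle):
    """Return (valid, problems) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")

    valid, problems = [], []
    for row in reader:
        name = (row["Name"] or "").strip()
        seq = (row["Protein_sequence"] or "").strip().upper()
        unexpected = sorted(set(seq) - ALLOWED_RESIDUES)
        if not seq:
            problems.append((name, "empty sequence"))
        elif unexpected:
            problems.append((name, f"unexpected characters: {''.join(unexpected)}"))
        else:
            valid.append((name, seq))
    return valid, problems


sample_csv = io.StringIO(
    "Name,Protein_sequence\n"
    "good_seq,MKITRLTTYRLPPRW\n"
    "bad_seq,MKIT1RL*\n"
)
valid, problems = validate_rows(sample_csv)
print(valid)
print(problems)
```

To check a real file, pass `open("input_seq.csv", newline="")` instead of the in-memory sample.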

📁 Output Files

After running, you'll get these files:

Main Results

swissprot_results_wide.csv - Summary table with top 10 hits per query (easy to read)
swissprot_results_long.csv - Detailed results with all hit information
swissprot_results_hits.fasta - Protein sequences of the best matches
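
The FASTA output can be read back without extra dependencies. This is a generic sketch: `read_fasta` is a hypothetical helper, and the header layout the pipeline writes is not documented here, so the whole header line is used as the key.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping header line -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one first.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = io.StringIO(">hit_1 example header\nMKIT\nRLTT\n>hit_2\nMRNF\n")
records = read_fasta(example)
print(records)
```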

Statistics & Logs

swissprot_results_stats.json - Processing summary and statistics
run.log - General application log
swissprot_results_blast_jobs.json - BLAST job tracking details
swissprot_results_uniprot_requests.json - UniProt API request logs

📈 Example Results

Here's what you can expect from a successful run:

Statistics Summary:

Total queries: 2 sequences processed
Success rate: 100% (both sequences found matches)
Average identity: 84.25% (high-quality matches)
Average coverage: 99.28% (nearly full-length alignments)
All hits reviewed: Only Swiss-Prot entries (highest quality)

Sample Results:

Iai47_00045 → D-galactonate dehydratase (EC 4.2.1.6) from Serratia proteamaculans
Iai47_00065 → Heat shock protein IbpA (chaperone) from Citrobacter koseri

🧬 Example: From Sequence to Function

graph TD
    S[🧬 Unknown Protein Sequence<br/>MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW] 
    
    S --> ID[🎯 D-galactonate dehydratase<br/>EC 4.2.1.6]
    S --> ORG[🦠 Serratia proteamaculans]
    S --> FUNC[⚗️ Catalyzes dehydration of<br/>D-galactonate to 2-keto-3-deoxy-D-galactonate]
    S --> COF[🧪 Mg(2+) cofactor]
    
    style S fill:#e3f2fd
    style ID fill:#e8f5e8
    style ORG fill:#fff3e0
    style FUNC fill:#f3e5f5
    style COF fill:#e1f5fe

📊 Output Format Details

Wide Format (Summary Table)

The swissprot_results_wide.csv file has one row per query with columns like:

Query_Name, Query_Sequence, Query_Length, Status, n_hits
Hit1_Accession, Hit1_Identity, Hit1_Coverage, Hit1_Evalue, Hit1_Bitscore
Hit1_Reviewed, Hit1_Organism, Hit1_Function, Hit1_EC, Hit1_Keywords
... repeated for Hit2 through Hit10
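
The repeating column pattern can be generated programmatically, which is handy when selecting columns from the wide table. The sketch below builds only the names listed above; the real file may contain further columns (the list says "columns like"), and `wide_columns` is a hypothetical helper.

```python
QUERY_COLUMNS = ["Query_Name", "Query_Sequence", "Query_Length", "Status", "n_hits"]
HIT_FIELDS = ["Accession", "Identity", "Coverage", "Evalue", "Bitscore",
              "Reviewed", "Organism", "Function", "EC", "Keywords"]


def wide_columns(topk=10):
    """Return the query columns followed by HitN_<field> for N = 1..topk."""
    columns = list(QUERY_COLUMNS)
    for rank in range(1, topk + 1):
        columns.extend(f"Hit{rank}_{field}" for field in HIT_FIELDS)
    return columns


columns = wide_columns()
print(len(columns))
print(columns[5], columns[-1])
```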

Long Format (Detailed Results)

The swissprot_results_long.csv file has one row per hit with columns:

query_name, rank, accession, identity, coverage, evalue, bitscore
organism, protein_name, function, catalytic_activity, cofactor
ec_numbers, keywords, subcellular_location, gene_names
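
Because the long format has one row per hit, picking the top hit for each query is a short loop. The sketch below assumes `rank` is an integer where 1 is the best hit; the sample rows use placeholder values and only a subset of the columns.

```python
import csv
import io


def best_hit_per_query(handle):
    """Return {query_name: row} keeping the lowest-ranked row per query."""
    best = {}
    for row in csv.DictReader(handle):
        current = best.get(row["query_name"])
        if current is None or int(row["rank"]) < int(current["rank"]):
            best[row["query_name"]] = row
    return best


sample_long = io.StringIO(
    "query_name,rank,accession,identity,organism\n"
    "query_A,2,ACC_2,61.0,Organism two\n"
    "query_A,1,ACC_1,88.5,Organism one\n"
    "query_B,1,ACC_3,79.9,Organism three\n"
)
best = best_hit_per_query(sample_long)
for query, row in best.items():
    print(query, row["accession"], row["identity"])
```

For a real run, replace the in-memory sample with `open("swissprot_results_long.csv", newline="")`.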

⚡ Performance & Reliability

Batch Processing: Processes sequences in batches for efficiency
Robust Error Handling: Automatic retries with exponential backoff
Comprehensive Logging: Detailed logs for troubleshooting
Memory Efficient: Handles large datasets without memory issues
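
As an illustration of the batching and retry ideas listed above, here is a minimal sketch. It is not taken from `swissprot_blast.py`: the function names, the attempt count, and the base delay are assumptions, and real code would usually catch narrower exception types than `Exception`.

```python
import time


def batched(items, batch_size=30):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on failure with delays of 1x, 2x, 4x ... base_delay."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a call that fails twice before succeeding. Delays are recorded
# instead of slept so the example finishes immediately.
delays = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky, sleep=delays.append)
batches = list(batched(list(range(65)), batch_size=30))
print(result, delays, [len(b) for b in batches])
```

Passing `sleep` as a parameter keeps the retry logic testable without waiting; in normal use the default `time.sleep` applies.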

🔧 Advanced Options

Prioritize Specific Organisms

selection:
  taxonomy_id: 511145  # NCBI taxonomy ID for E. coli K-12 substr. MG1655
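
How the pipeline uses `taxonomy_id` internally is not described here. One plausible reading, sketched below purely as an assumption, is that hits from the chosen taxon are moved to the front while the original ranking is preserved within each group. The `prioritize_taxon` function and the dictionary layout are hypothetical.

```python
def prioritize_taxon(hits, taxonomy_id):
    """Return hits with the preferred taxon first, otherwise in original order."""
    # sorted() is stable, so relative order inside each group is unchanged.
    # The key is False (sorts first) for matching hits and True for the rest.
    return sorted(hits, key=lambda h: h["taxonomy_id"] != taxonomy_id)


ranked_hits = [
    {"accession": "ACC_1", "taxonomy_id": 9606},
    {"accession": "ACC_2", "taxonomy_id": 511145},
    {"accession": "ACC_3", "taxonomy_id": 10090},
    {"accession": "ACC_4", "taxonomy_id": 511145},
]
ordered = [h["accession"] for h in prioritize_taxon(ranked_hits, 511145)]
print(ordered)
```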

Include Full Sequences in Summary

output:
  include_sequences_wide: true

Custom Output Prefix

python swissprot_blast.py input.csv my_custom_results

🔍 Troubleshooting

Common Issues

"Missing required columns": Make sure your CSV has Name and Protein_sequence columns
"BLAST email is required": Update the email in config.yaml
"Configuration file not found": Ensure config.yaml exists in your working directory
No results found: Check that your sequences are valid protein sequences

Getting Help

Check run.log for detailed error messages
Review swissprot_results_stats.json for processing summary
Ensure you have an internet connection for the BLAST and UniProt APIs

🙏 Acknowledgments

EMBL-EBI for BLAST services
UniProt Consortium for annotation data
NCBI for taxonomy information

Ready to identify your protein sequences? Just run python swissprot_blast.py input_seq.csv and get comprehensive results!
