saadnaseem/sequence_to_swissprot_entries

Workflow Diagram

Swiss-Prot BLAST & Annotation System

A production-ready pipeline that performs BLAST searches against Swiss-Prot (reviewed proteins only) and enriches results with comprehensive UniProt annotations. Perfect for protein sequence identification and functional annotation.

🔄 How It Works

graph LR
    A[📁 CSV Input<br/>Protein Sequences] --> B[🔍 BLAST Search<br/>Swiss-Prot Database]
    B --> C[🧬 UniProt Annotations<br/>Function, EC numbers, etc.]
    C --> D[📊 Top-10 Results<br/>Ranked by quality]
    D --> E[📄 Multiple Outputs<br/>CSV, FASTA, Stats]
    
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#f3e5f5
    style E fill:#e1f5fe

🚀 Key Features

  • Swiss-Prot Only: Searches only reviewed, high-quality Swiss-Prot database entries
  • Comprehensive Annotations: Extracts function, catalytic activity, cofactors, EC numbers, organism info, and more
  • Top-10 Results: Returns the best 10 matches per query with detailed scoring
  • Multiple Output Formats: Summary tables, detailed results, FASTA sequences, and statistics
  • Production Ready: Robust error handling, retry logic, and comprehensive logging
  • Easy to Use: Simple command-line interface with sensible defaults

📋 Requirements

  • Python ≥3.10
  • Internet connection for EMBL-EBI BLAST and UniProt APIs
  • Valid email address for fair-use compliance

🛠️ Quick Start

  1. Install dependencies:

    pip install -r requirements.txt

  2. Run with your protein sequences:

    python swissprot_blast.py input_seq.csv

That's it! The system will process your sequences and generate comprehensive results.

⚙️ Configuration (Optional)

The system works out-of-the-box with sensible defaults. To customize, edit config.yaml:

blast:
  email: " @gmail.com"        # Your email for fair-use compliance
  database: "uniprotkb_swissprot"  # Swiss-Prot only (enforced)
  batch_size: 30                   # Sequences processed per batch

selection:
  topk: 10                         # Top hits per query
  min_identity_pct: 20            # Minimum identity threshold
  min_coverage_pct: 20            # Minimum coverage threshold
  max_evalue: 1e-2                # Maximum E-value threshold

output:
  prefix: "swissprot_results"      # Output file prefix
  include_fasta: true              # Generate FASTA file
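
The selection thresholds above can be read as a filter followed by a top-k cut. The sketch below illustrates that reading under stated assumptions: the `Hit` class and `select_hits` function are hypothetical names, the thresholds are treated as inclusive, and ranking by bitscore (ties broken by E-value) is a guess, since the README only says results are "ranked by quality".

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit (not the pipeline's own type)."""
    accession: str
    identity_pct: float
    coverage_pct: float
    evalue: float
    bitscore: float


def select_hits(hits, topk=10, min_identity_pct=20.0,
                min_coverage_pct=20.0, max_evalue=1e-2):
    """Keep hits that pass all three thresholds, then return the best topk."""
    passing = [
        h for h in hits
        if h.identity_pct >= min_identity_pct
        and h.coverage_pct >= min_coverage_pct
        and h.evalue <= max_evalue
    ]
    # Assumed ranking: highest bitscore first, ties broken by lowest E-value.
    passing.sort(key=lambda h: (-h.bitscore, h.evalue))
    return passing[:topk]
```

Defaults mirror the values shown in config.yaml, so `select_hits(hits)` with no extra arguments corresponds to the out-of-the-box settings.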

📊 Input Format

Your CSV file needs these columns:

  • Name: Query identifier (e.g., "Iai47_00045")
  • Protein_sequence: Amino acid sequence (standard 20 amino acids + X/B/Z/J/U/O)

Example:

Name,Protein_sequence
Iai47_00045,MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW
Iai47_00065,MRNFDLTPLYRSAIGFDRLFNLLESNQNQSNGGYPPYNVELVDENHYRITIAVAGFSQSELDITAHDNVLIVRGAHPEEQAERKYLYQGIAERNFERKFQLADHIVVRDARLENGLLSIDLERLVPEEAKPRRIEILK
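
Before submitting a large file, it can help to check it against the format described above. This is a minimal sketch, not the pipeline's own validation: `validate_rows` is a hypothetical helper, and the allowed alphabet is taken directly from the column description (20 standard residues plus X/B/Z/J/U/O).

```python
import csv
import io

REQUIRED_COLUMNS = ("Name", "Protein_sequence")
# 20 standard amino acids plus the ambiguity/rare codes listed above.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "XBZJUO")


def validate_rows(handle):
    """Return (valid_rows, problems) for an open CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")
    valid, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["Name"] or "").strip()
        seq = (row["Protein_sequence"] or "").strip().upper()
        bad = sorted(set(seq) - ALLOWED_RESIDUES)
        if not name or not seq:
            problems.append((line_no, "empty name or sequence"))
        elif bad:
            problems.append((line_no, f"unexpected characters: {''.join(bad)}"))
        else:
            valid.append((name, seq))
    return valid, problems


# Tiny in-memory example; the second row contains an invalid character.
sample = io.StringIO("Name,Protein_sequence\nq1,MKITRL\nq2,MK1T\n")
valid, problems = validate_rows(sample)
print(valid)
print(problems)
```

For a real file, pass `open("input_seq.csv", newline="")` instead of the in-memory sample.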

📁 Output Files

After running, you'll get these files:

Main Results

  • swissprot_results_wide.csv - Summary table with top 10 hits per query (easy to read)
  • swissprot_results_long.csv - Detailed results with all hit information
  • swissprot_results_hits.fasta - Protein sequences of the best matches

Statistics & Logs

  • swissprot_results_stats.json - Processing summary and statistics
  • run.log - General application log
  • swissprot_results_blast_jobs.json - BLAST job tracking details
  • swissprot_results_uniprot_requests.json - UniProt API request logs

📈 Example Results

Here's what you can expect from a successful run:

Statistics Summary:

  • Total queries: 2 sequences processed
  • Success rate: 100% (both sequences found matches)
  • Average identity: 84.25% (high-quality matches)
  • Average coverage: 99.28% (nearly full-length alignments)
  • All hits reviewed: Only Swiss-Prot entries (highest quality)

Sample Results:

  • Iai47_00045 → D-galactonate dehydratase (EC 4.2.1.6) from Serratia proteamaculans
  • Iai47_00065 → Heat shock protein IbpA (chaperone) from Citrobacter koseri

🧬 Example: From Sequence to Function

graph TD
    S[🧬 Unknown Protein Sequence<br/>MKITRLTTYRLPPRWMFLKIETDEGIVGWGEPVIEGRARSVEAAVHELSEYLIGQDPSRINDLWQVMYRGGFYRGGPILMSAIAGIDQALWDIKGKALGVPVYQLLGGLVRDRIKAYSWVGGDRPADVIEGITKLRTIGFDTFKLNGCEEMGIIDSALKVDAAVNTVAQIREAFGKEIEFGLDFHGRVSAPMAKVLIKELEPYRPLFIEEPVLAEQAEYYPRLAAQTHIPIAAGERMFSRFEFKRVLEAGGLAILQPDLSHAGGITECYKIAAMAESYDVALAPHCPLGPIALAACLHIDFVSRNAVFQEQSMGIHYNQGAELLDYVLNKDDFKMDDGHFYPLNKPGLGVEINEELVIARSKNAPDWRNPLWRSADGSVAEW] 
    
    S --> ID[🎯 D-galactonate dehydratase<br/>EC 4.2.1.6]
    S --> ORG[🦠 Serratia proteamaculans]
    S --> FUNC[⚗️ Catalyzes dehydration of<br/>D-galactonate to 2-keto-3-deoxy-D-galactonate]
    S --> COF[🧪 Mg(2+) cofactor]
    
    style S fill:#e3f2fd
    style ID fill:#e8f5e8
    style ORG fill:#fff3e0
    style FUNC fill:#f3e5f5
    style COF fill:#e1f5fe

📊 Output Format Details

Wide Format (Summary Table)

The swissprot_results_wide.csv file has one row per query with columns like:

  • Query_Name, Query_Sequence, Query_Length, Status, n_hits
  • Hit1_Accession, Hit1_Identity, Hit1_Coverage, Hit1_Evalue, Hit1_Bitscore
  • Hit1_Reviewed, Hit1_Organism, Hit1_Function, Hit1_EC, Hit1_Keywords
  • ... repeated for Hit2 through Hit10

Long Format (Detailed Results)

The swissprot_results_long.csv file has one row per hit with columns:

  • query_name, rank, accession, identity, coverage, evalue, bitscore
  • organism, protein_name, function, catalytic_activity, cofactor
  • ec_numbers, keywords, subcellular_location, gene_names

⚡ Performance & Reliability

  • Batch Processing: Processes sequences in batches for efficiency
  • Robust Error Handling: Automatic retries with exponential backoff
  • Comprehensive Logging: Detailed logs for troubleshooting
  • Memory Efficient: Handles large datasets without memory issues
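
Batching and retrying with exponential backoff are general techniques, and the following sketch shows one common way to combine them. It is not the pipeline's actual code: `batched` and `retry_with_backoff` are hypothetical helpers, and the attempt count, delays, and jitter are illustrative assumptions. Only the batch size of 30 comes from the default in config.yaml.

```python
import random
import time


def batched(items, batch_size=30):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped), then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter avoids many clients retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.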

🔧 Advanced Options

Prioritize Specific Organisms

selection:
  taxonomy_id: 511145  # E. coli K-12

Include Full Sequences in Summary

output:
  include_sequences_wide: true

Custom Output Prefix

python swissprot_blast.py input.csv my_custom_results

🔍 Troubleshooting

Common Issues

  1. "Missing required columns": Make sure your CSV has Name and Protein_sequence columns
  2. "BLAST email is required": Update the email in config.yaml
  3. "Configuration file not found": Ensure config.yaml exists in your working directory
  4. No results found: Check that your sequences are valid protein sequences

Getting Help

  • Check run.log for detailed error messages
  • Review swissprot_results_stats.json for processing summary
  • Ensure you have an internet connection for the BLAST and UniProt APIs

🙏 Acknowledgments

  • EMBL-EBI for BLAST services
  • UniProt Consortium for annotation data
  • NCBI for taxonomy information

Ready to identify your protein sequences? Just run python swissprot_blast.py input_seq.csv and get comprehensive results!
