
R wrapper for running offline BLASTp alignments on curated protein sequence pairs. Originally built for personal use, now being developed into a reusable package for broader community use.


shahnawazkcl/ncbi_blast_wrapper


NCBI BLASTp Wrapper (R) — WIP


Small R pipeline that runs offline BLASTp pairwise alignments for curated pairs of protein sequences.

Project status: early setup / scaffold. Expect breaking changes.

Requirements

  • NCBI BLAST+ (tested with 2.12.0+)
  • R ≥ 4.0 with package: Biostrings
  • (Optional) RStudio

The original scaffold was developed against BLAST+ 2.12.0+ and R 4.0.3.
The package dependency is declared by the library(Biostrings) call in ncbi_blastp_wrap.R.
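
A quick way to confirm that the command-line requirements are on PATH before running anything is sketched below. This is not part of the repository; `check_tools` is a hypothetical helper name, and it only checks that the executables exist, not their versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): report whether each tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The two executables the quick start relies on.
check_tools blastp Rscript || echo "Install the missing tools before running the pipeline."
```

If both are found, `blastp -version` and `Rscript --version` print the installed versions for comparison with the ones listed above.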

Repo layout

.
├─ ncbi_blastp_wrap.R # main pipeline
├─ prepare_input.sh # builds ./data from *.fasta in repo root
├─ transcript_type_info.csv # mapping of pairs: name, prin, alt
└─ data/ # example multi-FASTA files

Input expectations

  • transcript_type_info.csv with columns name, prin, alt, for example:

    name     prin                           alt
    ESRRB_1  ENST00000512784_domains.fasta  ENST00000505752_domains.fasta
    ESRRB_2  ENST00000512784_domains.fasta  ENST00000644823_domains.fasta

  • ./data/<name>.fasta multi-FASTA files whose sequence headers contain the transcript IDs used above (e.g., >zf-C4_1_ENST00000512784).
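
The relationship between the CSV and the FASTA headers can be checked with a small script before running the pipeline. The sketch below is not part of the repository and rests on two assumptions that the README does not confirm: that the CSV is comma-delimited with a header row, and that the transcript ID is the prin/alt value with the `_domains.fasta` suffix removed. The function names and the demo sequences are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: prin/alt values look like <transcript ID>_domains.fasta.
transcript_id() {
  printf '%s\n' "${1%_domains.fasta}"
}

# Count FASTA header lines in a file that contain the given ID.
count_headers() {
  local fasta=$1 id=$2
  grep '^>' "$fasta" | grep -c -F -- "$id" || true
}

# For each CSV row, print: name, role, transcript ID, matching header count.
# Assumption: comma-delimited file with a header row "name,prin,alt".
report_pairs() {
  local csv=$1 fasta=$2 name prin alt
  tail -n +2 "$csv" | while IFS=, read -r name prin alt; do
    [ -n "$name" ] || continue
    printf '%s prin %s %s\n' "$name" "$(transcript_id "$prin")" \
      "$(count_headers "$fasta" "$(transcript_id "$prin")")"
    printf '%s alt %s %s\n' "$name" "$(transcript_id "$alt")" \
      "$(count_headers "$fasta" "$(transcript_id "$alt")")"
  done
}

# Demo on throwaway files; the sequences are made-up placeholders.
demo_dir=$(mktemp -d)
printf 'name,prin,alt\nESRRB_1,ENST00000512784_domains.fasta,ENST00000505752_domains.fasta\n' \
  > "$demo_dir/transcript_type_info.csv"
printf '>zf-C4_1_ENST00000512784\nMKTAYIAK\n>zf-C4_1_ENST00000505752\nMKTAYLAK\n' \
  > "$demo_dir/example.fasta"
report_pairs "$demo_dir/transcript_type_info.csv" "$demo_dir/example.fasta"
```

A count of 0 for either role means no header in that FASTA file mentions the transcript ID, which is worth resolving before running the alignment step.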

Quick start

  1. Install BLAST+ and R deps (Biostrings).
  2. Prepare data (either):
  • Put your multi-FASTA files under ./data, or
  • Place *.fasta in repo root and run:
    bash prepare_input.sh
  3. Configure ncbi_blastp_wrap.R if needed:
  • path_to_domain_files <- "data/" (default)
  4. Run:
Rscript ncbi_blastp_wrap.R

What it does (pipeline)

  • Splits each multi-FASTA into per-sequence files and tags them as query or subject based on transcript_type_info.csv.

  • For each name, finds the matching query/subject pair and runs:

    blastp -query <query.fasta> -subject <subject.fasta> -outfmt 0
  • Saves text outputs under ./alignment/ (early versions may write to a misspelled alingment/ folder instead).
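
The splitting step can be illustrated outside R with a short awk-based sketch. This is not the code in ncbi_blastp_wrap.R; `split_fasta` is a hypothetical name, the file-naming rule (header text up to the first whitespace) is an assumption, and the query/subject tagging is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write each record of a multi-FASTA file to its own file.
# Assumption: output files are named after the header text up to the first
# whitespace, with characters outside [A-Za-z0-9._-] replaced by "_".
split_fasta() {
  local fasta=$1 out_dir=$2
  mkdir -p "$out_dir"
  awk -v dir="$out_dir" '
    /^>/ {
      if (out != "") close(out)        # finish the previous record
      id = substr($1, 2)               # drop the leading ">"
      gsub(/[^A-Za-z0-9._-]/, "_", id)
      out = dir "/" id ".fasta"
      print > out                      # header line starts a new file
      next
    }
    out != "" { print > out }          # sequence lines follow the header
  ' "$fasta"
}

# Demo on a throwaway file; the sequences are made-up placeholders.
demo_dir=$(mktemp -d)
printf '>zf-C4_1_ENST00000512784\nMKTAYIAK\n>zf-C4_1_ENST00000505752\nMKTAYLAK\n' \
  > "$demo_dir/example.fasta"
split_fasta "$demo_dir/example.fasta" "$demo_dir/split"
ls "$demo_dir/split"
```

Records that share the same ID would overwrite each other in this sketch, so headers need to be unique within a file.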

Outputs

  • For each domain/family name, a folder is created containing:

    • alignment/Alignment_<query>_<subject>_out.txt (pairwise BLASTp report, -outfmt 0).
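
The naming scheme above can be sketched as a small shell function that builds the report path and the matching blastp call. This is illustrative only: the function names and the `DRY_RUN` switch are hypothetical, using the FASTA base name for `<query>` and `<subject>` is an assumption, and writing the report with blastp's `-out` option is a choice made here rather than something the README specifies.

```shell
#!/usr/bin/env bash
# Build the report path: <out_dir>/alignment/Alignment_<query>_<subject>_out.txt
# Assumption: <query> and <subject> are the FASTA base names without ".fasta".
alignment_path() {
  local out_dir=$1 query=$2 subject=$3 q s
  q=$(basename "$query" .fasta)
  s=$(basename "$subject" .fasta)
  printf '%s/alignment/Alignment_%s_%s_out.txt\n' "$out_dir" "$q" "$s"
}

# Run one pairwise alignment. With DRY_RUN=1 the command is only printed.
run_pair() {
  local out_dir=$1 query=$2 subject=$3 report
  report=$(alignment_path "$out_dir" "$query" "$subject")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf 'blastp -query %s -subject %s -outfmt 0 -out %s\n' \
      "$query" "$subject" "$report"
    return 0
  fi
  mkdir -p "$(dirname "$report")"
  blastp -query "$query" -subject "$subject" -outfmt 0 -out "$report"
}

# Print the command for one pair without needing BLAST+ installed.
DRY_RUN=1 run_pair zf-C4 query.fasta subject.fasta
```

Dropping `DRY_RUN=1` runs the alignment for real, which requires blastp on PATH and both FASTA files to exist.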

Roadmap / known issues (early stage)

  • Define or replace open_input_files(); fix save_metafile() variable names/scope.
  • Normalize output folder to alignment/.
  • Add argument parsing (input dir, CSV path, outdir, -outfmt).
  • Add unit tests and CI for R/BLAST+ presence.
  • Example notebook / vignette.

License

TBD.
