Small R pipeline that runs offline BLASTp pairwise alignments for curated pairs of protein sequences.
Project status: early setup / scaffold. Expect breaking changes.
- NCBI BLAST+ (tested with 2.12.0+)
- R ≥ 4.0 with package: `Biostrings`
- (Optional) RStudio
The original scaffold targeted BLAST+ 2.12.0+, R 4.0.3.
See library(Biostrings) in the script for package needs.
.
├─ ncbi_blastp_wrap.R # main pipeline
├─ prepare_input.sh # builds ./data from *.fasta in repo root
├─ transcript_type_info.csv # mapping of pairs: name, prin, alt
└─ data/ # example multi-FASTA files
`transcript_type_info.csv` with columns:
| name | prin | alt |
|---|---|---|
| ESRRB_1 | ENST00000512784_domains.fasta | ENST00000505752_domains.fasta |
| ESRRB_2 | ENST00000512784_domains.fasta | ENST00000644823_domains.fasta |
`./data/<name>.fasta`: multi-FASTA files whose sequence headers contain the transcript IDs used above (e.g., `>zf-C4_1_ENST00000512784`).
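The relationship between the pair table and the FASTA headers can be sketched in base R as follows. This is an illustration only: the helpers `transcript_id()` and `headers_for()` are hypothetical names, not functions from `ncbi_blastp_wrap.R`, and the sketch assumes the transcript ID is the part of the file name before `_domains.fasta`, as in the example rows above. The sequences are placeholders.

```r
# Pair table with the same columns as transcript_type_info.csv (rows from the example above).
pairs <- read.csv(text = "name,prin,alt
ESRRB_1,ENST00000512784_domains.fasta,ENST00000505752_domains.fasta
ESRRB_2,ENST00000512784_domains.fasta,ENST00000644823_domains.fasta",
  stringsAsFactors = FALSE)

# Hypothetical helper: strip the "_domains.fasta" suffix to get the transcript ID.
transcript_id <- function(file_name) sub("_domains\\.fasta$", "", file_name)

# Hypothetical helper: keep the FASTA header lines that mention a transcript ID.
headers_for <- function(fasta_lines, id) {
  headers <- fasta_lines[startsWith(fasta_lines, ">")]
  headers[grepl(id, headers, fixed = TRUE)]
}

# Toy multi-FASTA content; the sequences are placeholders, not real domains.
fasta_lines <- c(">zf-C4_1_ENST00000512784", "MKTAYIAK",
                 ">zf-C4_1_ENST00000505752", "MKTAYLAK")

prin_id <- transcript_id(pairs$prin[1])
alt_id  <- transcript_id(pairs$alt[1])
prin_headers <- headers_for(fasta_lines, prin_id)
alt_headers  <- headers_for(fasta_lines, alt_id)
```

A pair whose `prin_headers` or `alt_headers` comes back empty has no matching sequence in the multi-FASTA file, which is worth checking before running any alignment.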
- Install BLAST+ and R deps (`Biostrings`).
- Prepare data (either):
  - Put your multi-FASTA files under `./data`, or
  - Place `*.fasta` in repo root and run: `bash prepare_input.sh`
- Configure `ncbi_blastp_wrap.R` if needed: `path_to_domain_files <- "data/"` (default)
- Run: `Rscript ncbi_blastp_wrap.R`
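Before running the script, it can help to confirm that the prerequisites are in place. The following is a minimal preflight sketch in base R, not part of the repository; the variable names other than `path_to_domain_files` are illustrative.

```r
# Default input directory, as configured in ncbi_blastp_wrap.R.
path_to_domain_files <- "data/"

# Sys.which() returns "" when the executable is not on PATH.
blastp_path <- unname(Sys.which("blastp"))

# requireNamespace() returns FALSE instead of failing when the package is missing.
has_biostrings <- requireNamespace("Biostrings", quietly = TRUE)

# list.files() returns character(0) if the directory is missing or empty.
n_fasta <- length(list.files(path_to_domain_files, pattern = "\\.fasta$"))

if (!nzchar(blastp_path)) message("blastp not found on PATH; install NCBI BLAST+.")
if (!has_biostrings) message("Biostrings is not installed (it is distributed via Bioconductor).")
if (n_fasta == 0) message("No .fasta files found under ", path_to_domain_files)
```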
- Splits each multi-FASTA into per-sequence files and tags them as query or subject based on `transcript_type_info.csv`.
- For each `name`, finds the matching query/subject pair and runs: `blastp -query <query.fasta> -subject <subject.fasta> -outfmt 0`
- Saves text outputs under `./alignment/` (folder name may appear as `alingment/` in early versions).
- For each domain/family `name`, a folder is created containing: `alignment/Alignment_<query>_<subject>_out.txt` (pairwise BLASTp report, `-outfmt 0`).
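The split-and-align steps described above can be sketched in base R roughly as follows. This is not the code in `ncbi_blastp_wrap.R` (which uses `Biostrings`): `split_fasta()` and `blastp_job()` are hypothetical helpers, the file names in the usage lines are made up, and the flat `alignment/` output directory is an assumption about the layout.

```r
# Hypothetical helper: split multi-FASTA lines into a named list, one entry per sequence.
split_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]               # drop blank lines
  is_header <- startsWith(lines, ">")
  stopifnot(length(lines) > 0, is_header[1])  # expect the input to start with a header
  record <- cumsum(is_header)                 # record index for every line
  records <- split(lines, record)
  names(records) <- sub("^>", "", lines[is_header])
  records
}

# Hypothetical helper: build the blastp arguments and the report path for one pair.
blastp_job <- function(query_file, subject_file, outdir = "alignment") {
  tag <- function(path) tools::file_path_sans_ext(basename(path))
  out_file <- file.path(
    outdir,
    sprintf("Alignment_%s_%s_out.txt", tag(query_file), tag(subject_file))
  )
  list(
    args = c("-query", query_file, "-subject", subject_file, "-outfmt", "0"),
    out  = out_file
  )
}

# Usage with placeholder content and made-up per-sequence file names.
records <- split_fasta(c(">a", "MK", "TA", ">b", "GG"))
job <- blastp_job("tmp/a.fasta", "tmp/b.fasta")

# To actually run the alignment (requires BLAST+ on PATH and both files on disk):
# dir.create(dirname(job$out), showWarnings = FALSE, recursive = TRUE)
# system2("blastp", args = job$args, stdout = job$out)
```

Building the argument vector separately from the `system2()` call keeps the command easy to inspect or log before anything is executed.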
- Define or replace `open_input_files()`; fix `save_metafile()` variable names/scope.
- Normalize output folder to `alignment/`.
- Add argument parsing (input dir, CSV path, outdir, `-outfmt`).
- Add unit tests and CI for R/BLAST+ presence.
- Example notebook / vignette.
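As one possible shape for the planned argument parsing, here is a base R sketch built on `commandArgs()`. Nothing here exists in the repository yet: the function name `parse_cli()`, the `--key=value` syntax, and the option names are all assumptions chosen for illustration, with defaults taken from the values mentioned above.

```r
# Hypothetical helper: parse "--key=value" options, falling back to defaults.
parse_cli <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- list(
    input_dir = "data/",
    csv       = "transcript_type_info.csv",
    outdir    = "alignment/",
    outfmt    = "0"
  )
  for (arg in args) {
    # m holds the full match, the key, and the value (length 3) or is empty on no match.
    m <- regmatches(arg, regexec("^--([a-z_]+)=(.*)$", arg))[[1]]
    if (length(m) != 3 || !(m[2] %in% names(opts))) {
      stop("Unrecognized argument: ", arg)
    }
    opts[[m[2]]] <- m[3]
  }
  opts
}

# Example: override the output directory and the BLAST output format.
opts <- parse_cli(c("--outdir=results/", "--outfmt=6"))
```

Invoked from the shell, this would correspond to something like `Rscript ncbi_blastp_wrap.R --outdir=results/ --outfmt=6`, assuming the script calls `parse_cli()` at startup.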
TBD.