Setup Guide

Complete setup instructions for BioScreen. Run these steps in order.

Prerequisites

  • Python 3.12 (Python 3.14 has FAISS/torch segfaults)
  • Modal account for GPU compute (pip install modal && modal setup)
  • NVIDIA NIM API key from https://build.nvidia.com

Step-by-Step Setup

# 1. Create Python 3.12 venv
python3.12 -m venv venv312 && source venv312/bin/activate
pip install -r requirements.txt

# 2. Set up .env
cp .env.example .env
# Edit .env and add your NVIDIA_API_KEY (nvapi-...)

# 3. Build toxin DB — 2000 proteins, ESM2-650M embeddings on Modal GPU (~45s)
pip install modal   # if not already installed
modal run scripts/build_db_modal.py

# 4. Enrich with GO terms from UniProt (~40s)
python3 scripts/enrich_go_terms.py

# 5. Install Foldseek binary (this archive is the macOS universal build)
curl -sL https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz -o /tmp/foldseek.tar.gz
tar -xzf /tmp/foldseek.tar.gz -C /tmp/
cp /tmp/foldseek/bin/foldseek venv312/bin/

# 6. Predict toxin structures via ESMFold NIM API on Modal (~3 min)
modal run scripts/predict_structures_modal.py

# 7. Build Foldseek toxin database from predicted structures
mkdir -p data/foldseek_toxin_db
foldseek createdb data/toxin_structures data/foldseek_toxin_db/toxins

# 8. (Optional) Download full PDB database for generic structural search (~2.2GB, ~10 min)
mkdir -p data/foldseek_db
foldseek databases PDB data/foldseek_db/pdb /tmp/foldseek_tmp
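
Step 5 downloads the macOS universal archive. On other platforms a different archive is needed. The sketch below shows one way to pick a download URL from the OS and CPU architecture. The base URL and the macOS file name come from step 5; the Linux file names are assumptions based on Foldseek's published builds and should be checked against the Foldseek release page before use. `foldseek_archive_url` is a hypothetical helper, not part of BioScreen.

```python
import platform

# Base URL taken from step 5 of this guide.
FOLDSEEK_BASE_URL = "https://mmseqs.com/foldseek"


def foldseek_archive_url(system: str, machine: str) -> str:
    """Return a Foldseek archive URL for the given OS name and CPU architecture."""
    system = system.lower()
    machine = machine.lower()
    if system == "darwin":
        # File name used in step 5.
        name = "foldseek-osx-universal.tar.gz"
    elif system == "linux" and machine in ("aarch64", "arm64"):
        # Assumed file name; verify before use.
        name = "foldseek-linux-arm64.tar.gz"
    elif system == "linux" and machine in ("x86_64", "amd64"):
        # Assumed file name; requires a CPU with AVX2 support.
        name = "foldseek-linux-avx2.tar.gz"
    else:
        raise ValueError(f"No known Foldseek build for {system}/{machine}")
    return f"{FOLDSEEK_BASE_URL}/{name}"
```

Calling `foldseek_archive_url(platform.system(), platform.machine())` yields a URL for the current machine, which can replace the one in the `curl` command of step 5.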

Verification

Run this to check everything is set up correctly:

source venv312/bin/activate && python3 -c "
import os, json

checks = []

# 1. FAISS toxin DB
ok = os.path.exists('data/toxin_db.faiss') and os.path.getsize('data/toxin_db.faiss') > 1000
checks.append(('Toxin FAISS index', ok, 'modal run scripts/build_db_modal.py'))

# 2. Toxin metadata with GO terms
ok = False
if os.path.exists('data/toxin_meta.json'):
    with open('data/toxin_meta.json') as f:
        meta = json.load(f)
    ok = len(meta) >= 2000 and any(m.get('go_terms') for m in meta)
checks.append(('Toxin metadata + GO terms', ok, 'python3 scripts/enrich_go_terms.py'))

# 3. Toxin structures
n = len([f for f in os.listdir('data/toxin_structures') if f.endswith('.pdb')]) if os.path.isdir('data/toxin_structures') else 0
checks.append((f'Toxin structures ({n} PDBs)', n >= 400, 'modal run scripts/predict_structures_modal.py'))

# 4. Foldseek binary
import shutil
ok = shutil.which('foldseek') is not None
checks.append(('Foldseek binary', ok, 'See step 5 in setup'))

# 5. Foldseek toxin DB
ok = os.path.exists('data/foldseek_toxin_db/toxins.dbtype')
checks.append(('Foldseek toxin DB', ok, 'foldseek createdb data/toxin_structures data/foldseek_toxin_db/toxins'))

# 6. Foldseek PDB DB (optional)
ok = os.path.exists('data/foldseek_db/pdb.dbtype')
checks.append(('Foldseek PDB DB (optional)', ok, 'foldseek databases PDB data/foldseek_db/pdb /tmp/foldseek_tmp'))

# 7. .env with API key
ok = os.path.exists('.env')
has_key = False
if ok:
    with open('.env') as f:
        has_key = 'nvapi-' in f.read()
checks.append(('.env with NVIDIA_API_KEY', ok and has_key, 'cp .env.example .env and add your key'))

# 8. Python version
import sys
ok = sys.version_info[:2] == (3, 12)
checks.append((f'Python 3.12 (have {sys.version_info[0]}.{sys.version_info[1]})', ok, 'python3.12 -m venv venv312'))

print()
for name, ok, fix in checks:
    status = '✅' if ok else '❌'
    print(f'  {status} {name}')
    if not ok:
        print(f'     Fix: {fix}')
print()
"

Expected output when everything is set up:

  ✅ Toxin FAISS index
  ✅ Toxin metadata + GO terms
  ✅ Toxin structures (496 PDBs)
  ✅ Foldseek binary
  ✅ Foldseek toxin DB
  ✅ Foldseek PDB DB (optional)
  ✅ .env with NVIDIA_API_KEY
  ✅ Python 3.12 (have 3.12)
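
The verification script only prints its results, so a shell script or CI job cannot tell whether setup succeeded. A minimal sketch of one way to turn the same check list into an exit code follows. It assumes each check tuple gains a fourth `required` field, which the original script does not have, so that the optional PDB database does not fail the run. `summarize` is a hypothetical helper.

```python
from typing import List, Tuple

# (name, ok, fix, required); the `required` flag is an addition for this sketch.
Check = Tuple[str, bool, str, bool]


def summarize(checks: List[Check]) -> Tuple[str, int]:
    """Format the checks as a report and return (report, exit_code)."""
    lines = []
    failed_required = 0
    for name, ok, fix, required in checks:
        status = "✅" if ok else "❌"
        lines.append(f"  {status} {name}")
        if not ok:
            lines.append(f"     Fix: {fix}")
            if required:
                failed_required += 1
    # Exit code 1 only when at least one required check failed.
    return "\n".join(lines), (1 if failed_required else 0)
```

At the end of the script, `report, code = summarize(checks)` followed by `print(report)` and `sys.exit(code)` would make `python3 verify.py && echo ok` behave as expected, where `verify.py` is a placeholder file name.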

Running

source venv312/bin/activate

# Run tests
KMP_DUPLICATE_LIB_OK=TRUE python3 -m pytest tests/test_pipeline.py -v

# Run API server
KMP_DUPLICATE_LIB_OK=TRUE uvicorn app.main:app --reload --port 8000

# Run demo scenarios
KMP_DUPLICATE_LIB_OK=TRUE python3 scripts/demo_scenarios.py

# Run 10 demo scenarios
KMP_DUPLICATE_LIB_OK=TRUE python3 scripts/demo_10_scenarios.py

# Run Streamlit frontend (separate terminal)
KMP_DUPLICATE_LIB_OK=TRUE streamlit run frontend/streamlit_app.py

What the Data Files Are

These files are gitignored (too large) and must be generated locally:

| File | Size | Generated by | Required? |
|---|---|---|---|
| data/toxin_db.faiss | ~10MB | build_db_modal.py | Yes |
| data/toxin_meta.json | ~700KB | build_db_modal.py + enrich_go_terms.py | Yes |
| data/toxin_structures/*.pdb | ~25MB | predict_structures_modal.py | Yes (for full path) |
| data/foldseek_toxin_db/ | ~400KB | foldseek createdb | Yes (for full path) |
| data/foldseek_db/ | ~4.2GB | foldseek databases PDB | Optional |
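
To compare a local checkout against the approximate sizes listed above, a small script can total the bytes under each path. This is a sketch only: the paths are taken from the list above, while `size_on_disk` and `human` are hypothetical helper names.

```python
import os

# Paths from the data file list above.
DATA_PATHS = [
    "data/toxin_db.faiss",
    "data/toxin_meta.json",
    "data/toxin_structures",
    "data/foldseek_toxin_db",
    "data/foldseek_db",
]


def size_on_disk(path: str) -> int:
    """Bytes used by a file, or by all files under a directory; 0 if missing."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # os.walk yields nothing for a missing path, so the result stays 0.
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def human(num_bytes: int) -> str:
    """Render a byte count with a binary-scaled unit."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


for path in DATA_PATHS:
    print(f"{path}: {human(size_on_disk(path))}")
```

A path that reports `0.0 B` has not been generated yet; the "Generated by" entry for that file names the script or command to run.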

Troubleshooting

FAISS segfault / OpenMP crash: Set KMP_DUPLICATE_LIB_OK=TRUE before running anything that imports FAISS. The crash comes from a known conflict between torch and FAISS on macOS. The server sets this variable automatically in app/main.py.

Tokenizer crash: Set TOKENIZERS_PARALLELISM=false. The server also sets this automatically.
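
For standalone scripts that do not go through the server, the same two variables can be set from Python, provided this happens before FAISS, torch, or tokenizers are imported. The following is a minimal sketch of that pattern. It is not a copy of what app/main.py does, and `apply_runtime_workarounds` is a hypothetical name.

```python
import os


def apply_runtime_workarounds(env=os.environ):
    """Set the two workaround variables unless the caller already set them."""
    # setdefault keeps any value exported in the shell.
    env.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")
    env.setdefault("TOKENIZERS_PARALLELISM", "false")
    return env


apply_runtime_workarounds()
# Import faiss, torch, or tokenizers only after this point, so the
# variables are already in place when those libraries initialize.
```

Using `setdefault` rather than plain assignment means a value passed on the command line, as in the Running section, still takes precedence.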

ESMFold NIM API errors: Check your NVIDIA_API_KEY in .env. Get one free at https://build.nvidia.com.
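
One thing worth ruling out is a key that was never filled in. The verification script only looks for the substring `nvapi-` anywhere in .env, which a placeholder such as `nvapi-...` would also satisfy. A stricter check might look like the sketch below. It assumes .env uses plain `KEY=value` lines and that a real key has something other than dots after the `nvapi-` prefix; both helper names are hypothetical.

```python
from typing import Optional


def read_env_value(text: str, key: str) -> Optional[str]:
    """Return the value assigned to `key` in .env-style text, or None."""
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def looks_like_real_key(value: Optional[str]) -> bool:
    """Reject missing keys, wrong prefixes, and the 'nvapi-...' placeholder."""
    if not value or not value.startswith("nvapi-"):
        return False
    suffix = value[len("nvapi-"):]
    return bool(suffix) and set(suffix) != {"."}
```

With these, `looks_like_real_key(read_env_value(open(".env").read(), "NVIDIA_API_KEY"))` returns False for an untouched copy of a template that still contains the placeholder. It does not confirm that the key is accepted by the API.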

Modal auth issues: Run modal setup to authenticate. Requires a Modal account (free tier works).

Python 3.14 crashes: Use Python 3.12. FAISS and tokenizers have ABI issues on 3.14.