Complete setup instructions for BioScreen. Run these steps in order.
- Python 3.12 (Python 3.14 has FAISS/torch segfaults)
- Modal account for GPU compute (pip install modal && modal setup)
- NVIDIA NIM API key from https://build.nvidia.com
# 1. Create Python 3.12 venv
python3.12 -m venv venv312 && source venv312/bin/activate
pip install -r requirements.txt
# 2. Set up .env
cp .env.example .env
# Edit .env and add your NVIDIA_API_KEY (nvapi-...)
# 3. Build toxin DB — 2000 proteins, ESM2-650M embeddings on Modal GPU (~45s)
pip install modal # if not already installed
modal run scripts/build_db_modal.py
# 4. Enrich with GO terms from UniProt (~40s)
python3 scripts/enrich_go_terms.py
# 5. Install Foldseek binary
curl -sL https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz -o /tmp/foldseek.tar.gz
tar -xzf /tmp/foldseek.tar.gz -C /tmp/
cp /tmp/foldseek/bin/foldseek venv312/bin/
# 6. Predict toxin structures via ESMFold NIM API on Modal (~3 min)
modal run scripts/predict_structures_modal.py
# 7. Build Foldseek toxin database from predicted structures
mkdir -p data/foldseek_toxin_db
foldseek createdb data/toxin_structures data/foldseek_toxin_db/toxins
# 8. (Optional) Download full PDB database for generic structural search (~2.2GB, ~10 min)
mkdir -p data/foldseek_db
foldseek databases PDB data/foldseek_db/pdb /tmp/foldseek_tmp

Run this to check everything is set up correctly:
source venv312/bin/activate && python3 -c "
import os, json
checks = []
# 1. FAISS toxin DB
ok = os.path.exists('data/toxin_db.faiss') and os.path.getsize('data/toxin_db.faiss') > 1000
checks.append(('Toxin FAISS index', ok, 'modal run scripts/build_db_modal.py'))
# 2. Toxin metadata with GO terms
ok = False
if os.path.exists('data/toxin_meta.json'):
    with open('data/toxin_meta.json') as f:
        meta = json.load(f)
    ok = len(meta) >= 2000 and any(m.get('go_terms') for m in meta)
checks.append(('Toxin metadata + GO terms', ok, 'python3 scripts/enrich_go_terms.py'))
# 3. Toxin structures
n = len([f for f in os.listdir('data/toxin_structures') if f.endswith('.pdb')]) if os.path.isdir('data/toxin_structures') else 0
checks.append((f'Toxin structures ({n} PDBs)', n >= 400, 'modal run scripts/predict_structures_modal.py'))
# 4. Foldseek binary
import shutil
ok = shutil.which('foldseek') is not None
checks.append(('Foldseek binary', ok, 'See step 5 in setup'))
# 5. Foldseek toxin DB
ok = os.path.exists('data/foldseek_toxin_db/toxins.dbtype')
checks.append(('Foldseek toxin DB', ok, 'foldseek createdb data/toxin_structures data/foldseek_toxin_db/toxins'))
# 6. Foldseek PDB DB (optional)
ok = os.path.exists('data/foldseek_db/pdb.dbtype')
checks.append(('Foldseek PDB DB (optional)', ok, 'foldseek databases PDB data/foldseek_db/pdb /tmp/foldseek_tmp'))
# 7. .env with API key
ok = os.path.exists('.env')
has_key = False
if ok:
    with open('.env') as f:
        has_key = 'nvapi-' in f.read()
checks.append(('.env with NVIDIA_API_KEY', ok and has_key, 'cp .env.example .env and add your key'))
# 8. Python version
import sys
ok = sys.version_info[:2] == (3, 12)
checks.append((f'Python 3.12 (have {sys.version_info[0]}.{sys.version_info[1]})', ok, 'python3.12 -m venv venv312'))
print()
for name, ok, fix in checks:
    status = '✅' if ok else '❌'
    print(f' {status} {name}')
    if not ok:
        print(f' Fix: {fix}')
print()
"

Expected output when everything is set up:
✅ Toxin FAISS index
✅ Toxin metadata + GO terms
✅ Toxin structures (496 PDBs)
✅ Foldseek binary
✅ Foldseek toxin DB
✅ Foldseek PDB DB (optional)
✅ .env with NVIDIA_API_KEY
✅ Python 3.12 (have 3.12)
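
If the check is run from CI or a pre-commit hook, the same list of results could be reduced to an exit code. The sketch below is an assumption, not part of the repository: `exit_code` is a hypothetical helper, and it treats any check whose name contains "(optional)" as non-blocking, mirroring how the PDB database is labelled above.

```python
def exit_code(checks):
    """Return 0 if every required check passed, 1 otherwise.

    `checks` is a list of (name, ok, fix) tuples, the same shape the
    verification script builds. Checks labelled "(optional)" never fail
    the run (an assumed convention for this sketch).
    """
    failed = [name for name, ok, _fix in checks
              if not ok and "(optional)" not in name]
    return 1 if failed else 0


# Example: a missing optional PDB database does not block the run.
example = [
    ("Toxin FAISS index", True, "modal run scripts/build_db_modal.py"),
    ("Foldseek PDB DB (optional)", False,
     "foldseek databases PDB data/foldseek_db/pdb /tmp/foldseek_tmp"),
]
print(exit_code(example))  # prints 0
```

Passing the result to `sys.exit` would let a shell wrapper stop early when a required artifact is missing.
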
source venv312/bin/activate
# Run tests
KMP_DUPLICATE_LIB_OK=TRUE python3 -m pytest tests/test_pipeline.py -v
# Run API server
KMP_DUPLICATE_LIB_OK=TRUE uvicorn app.main:app --reload --port 8000
# Run demo scenarios
KMP_DUPLICATE_LIB_OK=TRUE python3 scripts/demo_scenarios.py
# Run 10 demo scenarios
KMP_DUPLICATE_LIB_OK=TRUE python3 scripts/demo_10_scenarios.py
# Run Streamlit frontend (separate terminal)
KMP_DUPLICATE_LIB_OK=TRUE streamlit run frontend/streamlit_app.py

These files are gitignored (too large) and must be generated locally:
| File | Size | Generated by | Required? |
|---|---|---|---|
| data/toxin_db.faiss | ~10MB | build_db_modal.py | Yes |
| data/toxin_meta.json | ~700KB | build_db_modal.py + enrich_go_terms.py | Yes |
| data/toxin_structures/*.pdb | ~25MB | predict_structures_modal.py | Yes (for full path) |
| data/foldseek_toxin_db/ | ~400KB | foldseek createdb | Yes (for full path) |
| data/foldseek_db/ | ~4.2GB | foldseek databases PDB | Optional |

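
To compare a local checkout against the table above, the sizes can be measured directly. This is a minimal sketch using only the standard library; `artifact_sizes` is a hypothetical helper, and the paths are copied from the table rather than read from any project configuration.

```python
import glob
import os

# Paths taken from the table above; a trailing slash marks a directory.
ARTIFACTS = [
    "data/toxin_db.faiss",
    "data/toxin_meta.json",
    "data/toxin_structures/*.pdb",
    "data/foldseek_toxin_db/",
    "data/foldseek_db/",
]


def artifact_sizes(root="."):
    """Map each artifact pattern to its total size in bytes, or None if absent."""
    report = {}
    for pattern in ARTIFACTS:
        full = os.path.join(root, pattern)
        if pattern.endswith("/"):
            # Directory: sum every regular file underneath it.
            present = os.path.isdir(full)
            files = [os.path.join(d, name)
                     for d, _dirs, names in os.walk(full) for name in names]
        else:
            # Single file or glob pattern.
            files = glob.glob(full)
            present = bool(files)
        report[pattern] = sum(os.path.getsize(p) for p in files) if present else None
    return report


if __name__ == "__main__":
    for pattern, size in artifact_sizes().items():
        label = "missing" if size is None else f"{size / 1e6:.1f} MB"
        print(f"{pattern}: {label}")
```

Run from the repository root, this prints one line per artifact, so a file that is missing or far smaller than the table suggests points at the setup step to repeat.
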
FAISS segfault / OpenMP crash:
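Always set KMP_DUPLICATE_LIB_OK=TRUE before running anything that imports both torch and FAISS. This works around a known conflict between the two libraries on macOS, typically caused by each one loading its own copy of the OpenMP runtime. The server sets the variable automatically in app/main.py.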
Tokenizer crash:
Set TOKENIZERS_PARALLELISM=false. Also set automatically by the server.
ESMFold NIM API errors:
Check your NVIDIA_API_KEY in .env. Get one free at https://build.nvidia.com.
Modal auth issues:
Run modal setup to authenticate. Requires a Modal account (free tier works).
Python 3.14 crashes:
Use Python 3.12. FAISS and tokenizers have ABI issues on 3.14.
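
The two environment variables from the OpenMP and tokenizer notes above are generally only effective if they are set before torch, FAISS, or tokenizers is imported. The following is a minimal sketch of how a standalone script might do that. It makes no claim about how app/main.py actually implements it, and `apply_runtime_workarounds` is a hypothetical name.

```python
import os

# Values taken from the troubleshooting notes above.
WORKAROUNDS = {
    "KMP_DUPLICATE_LIB_OK": "TRUE",
    "TOKENIZERS_PARALLELISM": "false",
}


def apply_runtime_workarounds(env=os.environ):
    """Set each workaround variable unless a value is already present."""
    for key, value in WORKAROUNDS.items():
        env.setdefault(key, value)
    return {key: env[key] for key in WORKAROUNDS}


# Call this at the very top of a script, before the heavy imports.
apply_runtime_workarounds()
```

Using `setdefault` keeps any value already exported in the shell, so a command prefixed with KMP_DUPLICATE_LIB_OK=TRUE behaves the same with or without the helper.
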