This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
BioScreen is a structure-based biosecurity screening tool for AI-designed proteins. It detects dangerous proteins by evaluating function and structure, not just sequence homology — addressing the gap where AI-designed proteins (RFdiffusion, ProteinMPNN) can fold into toxin structures while sharing near-zero sequence similarity.
# Run API server (dev mode with hot reload)
uvicorn app.main:app --reload
# Run Streamlit frontend (separate terminal)
streamlit run frontend/streamlit_app.py
# Run all tests
pytest
# Run a single test
pytest tests/test_pipeline.py::test_score_returns_value_in_range
# Build toxin reference database (requires network access to UniProt)
python scripts/build_db.py

The core screening flow for a protein sequence:

1. Sequence validation (`sequence.py`): validates input, detects type (protein/DNA/RNA), translates nucleotides to protein if needed
2. Embedding (`embedding.py`): generates ESM-2 embeddings (`facebook/esm2_t33_650M_UR50D`)
3. Similarity search (`similarity.py`): cosine similarity via FAISS (fast path) + optional Foldseek structural alignment (full path)
4. Structure prediction (`structure.py`): ESMFold via NVIDIA NIM API (always runs)
5. Active site detection (`active_site.py`): identifies binding pockets in PDB structures and compares active site geometry between query and known toxins (BioPython + numpy)
6. Function prediction (`function.py`): GO term / EC number classification
7. Risk scoring (`scoring.py`): weighted combination of embedding similarity (0.5), structural similarity (0.3), function overlap (0.2), with non-linear transforms and synergy bonuses for multiple high-confidence signals

All screening runs the full pipeline: Steps 1→2→3→4→5→6→7. Structure prediction (ESMFold + Foldseek) is always enabled.
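
The weighted combination in the risk scoring step can be sketched as below. Only the three weights (0.5, 0.3, 0.2) come from the description above; the function name `sketch_risk_score`, the squaring transform, the 0.8 high-confidence cutoff, and the 0.1 synergy bonus are hypothetical placeholders chosen for illustration, not the values used in `scoring.py`.

```python
# Weights taken from the pipeline description.
WEIGHTS = {"embedding": 0.5, "structure": 0.3, "function": 0.2}

# Hypothetical values, chosen only to illustrate the idea.
HIGH_CONFIDENCE = 0.8
SYNERGY_BONUS = 0.1


def sketch_risk_score(
    embedding_sim: float, structure_sim: float, function_overlap: float
) -> float:
    """Combine three signals in [0, 1] into a risk score in [0, 1]."""
    signals = {
        "embedding": embedding_sim,
        "structure": structure_sim,
        "function": function_overlap,
    }
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} signal must be in [0, 1], got {value}")

    # Assumed non-linear transform: squaring damps weak signals.
    base = sum(WEIGHTS[name] * value**2 for name, value in signals.items())

    # Assumed synergy rule: each strong signal beyond the first adds a bonus.
    strong = sum(value >= HIGH_CONFIDENCE for value in signals.values())
    bonus = SYNERGY_BONUS * max(0, strong - 1)

    return min(1.0, base + bonus)
```

Clamping the result keeps the score in a fixed range, which is what a test such as `test_score_returns_value_in_range` would presumably check.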
- App state via lifespan (`main.py`): ESM-2 model and FAISS toxin DB are loaded once at startup via FastAPI's `asynccontextmanager` lifespan, then attached to `app.state` for route access.
- Configuration (`config.py`): All settings via `pydantic-settings` `BaseSettings`, read from `.env` file. Cached singleton via `@lru_cache`. Includes screening thresholds, model paths, and API keys.
- Pydantic v2 schemas (`models/schemas.py`): Request/response models with validators (e.g., FASTA header stripping on `ScreeningRequest.sequence`).
- Toxin database (`database/`): FAISS index + JSON metadata sidecar. Built from UniProt Tox-Prot via `scripts/build_db.py`.
- Routes (`api/routes.py`): All API endpoints defined in a single `APIRouter`, mounted under `/api` prefix in `main.py`.

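
The FASTA header stripping mentioned for `ScreeningRequest.sequence` could look roughly like the plain function below. This is a standard-library sketch only: in the repository the logic lives in a Pydantic validator whose exact behavior is not shown here, and the name `strip_fasta_header` is hypothetical.

```python
def strip_fasta_header(raw: str) -> str:
    """Drop FASTA header lines (those starting with '>') and join the rest."""
    lines = [line.strip() for line in raw.strip().splitlines()]
    # Keep only non-empty residue lines; header lines start with '>'.
    body = [line for line in lines if line and not line.startswith(">")]
    return "".join(body)
```

Input that is already a bare sequence passes through unchanged, so the same function can be applied to every request.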
Behavioral monitoring layer that detects convergent optimization patterns (e.g., a user iteratively modifying sequences toward a toxin). Key components:
- `session_store.py`: rolling-window session store (50-entry, 1-hour TTL) tracking per-session screening history
- `analyzer.py`: `SessionAnalyzer` that detects anomalous patterns across a session's entries
- `schemas.py`: Pydantic models for `SessionEntry`, `SessionState`, `AnomalyAlert`
- Module-level singletons (`default_store`, `default_analyzer`) used by the API layer

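
A rolling-window store with the limits described above (50 entries, 1-hour TTL) can be sketched with a bounded `deque` per session. The class name `RollingSessionStore`, its methods, and the choice to expire individual entries (rather than whole sessions) are assumptions made for illustration; the actual `session_store.py` may differ.

```python
import time
from collections import deque

MAX_ENTRIES = 50          # from the description above
TTL_SECONDS = 60 * 60     # 1 hour, from the description above


class RollingSessionStore:
    """Hypothetical per-session history with a size cap and per-entry TTL."""

    def __init__(self, max_entries=MAX_ENTRIES, ttl=TTL_SECONDS, clock=time.monotonic):
        self._sessions = {}
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time

    def add(self, session_id, entry):
        # deque(maxlen=...) silently drops the oldest entry when full.
        window = self._sessions.setdefault(
            session_id, deque(maxlen=self._max_entries)
        )
        window.append((self._clock(), entry))

    def history(self, session_id):
        window = self._sessions.get(session_id)
        if window is None:
            return []
        # Evict entries older than the TTL before returning the history.
        cutoff = self._clock() - self._ttl
        while window and window[0][0] < cutoff:
            window.popleft()
        return [entry for _, entry in window]
```

Injecting the clock keeps TTL behavior testable without sleeping.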
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/screen` | POST | Screen single sequence |
| `/api/batch` | POST | Screen multiple sequences |
| `/api/health` | GET | Health/readiness check |
| `/api/toxins` | GET | List toxin DB entries |
| `/api/compare` | POST | Compare query structure with toxin via superposition |
| `/api/session/{id}` | GET | Get session state/history |
| `/api/session/{id}/alerts` | GET | Get anomaly alerts for session |

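
Calling the single-sequence endpoint from Python might look like the sketch below. The `sequence` field name comes from `ScreeningRequest.sequence`; the base URL (`localhost` on the default port 8000), the helper name `build_screen_request`, and the assumption that no other request fields are required are all illustrative guesses.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local dev server address


def build_screen_request(sequence: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/screen."""
    payload = json.dumps({"sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/screen",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the API server running, the request can be sent with `urllib.request.urlopen(build_screen_request("MKTLLVAG"))` and the JSON response decoded with `json.load`.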
Streamlit-based UI with multi-page layout:
- `streamlit_app.py`: main app entry point
- `pages/single_screen.py`: single sequence screening page
- `pages/session_analysis.py`: session history and anomaly analysis
- `components/api_client.py`: HTTP client for the backend API
- `components/protein_3d.py`: py3Dmol 3D protein structure viewer
- `components/result_viewer.py`: screening result display
- `components/summary_cards.py`: summary card widgets
- `components/styles.py`: shared CSS/styling
- `video_generator.py`: captures py3Dmol via headless Playwright + composites stats overlays with PIL, outputs MP4 via ffmpeg

Requires .env file (copy from .env.example). Key variables:
- `NVIDIA_API_KEY`: for ESMFold NIM API (structure prediction)
- `ESMFOLD_API_URL`: ESMFold NIM endpoint URL
- `DEVICE`: `cpu` or `cuda`
- `APP_ENV`: `development` or `production`
- `LOG_LEVEL`: logging level (default `INFO`)
- `API_HOST` / `API_PORT`: server bind address (default `0.0.0.0:8000`)
- `ESM2_MODEL_NAME`: HuggingFace model ID for embeddings
- `TOXIN_DB_PATH` / `TOXIN_META_PATH`: paths to FAISS index and metadata JSON
- `FOLDSEEK_BIN` / `FOLDSEEK_DB_PATH`: Foldseek binary and database paths
- `UNIPROT_BATCH_SIZE` / `MAX_TOXIN_RECORDS`: UniProt build settings for `scripts/build_db.py`
- `EMBEDDING_SIM_THRESHOLD` / `STRUCTURE_SIM_THRESHOLD` / `RISK_HIGH_THRESHOLD` / `RISK_MEDIUM_THRESHOLD`: screening thresholds

Tests use pytest + pytest-asyncio. Test files in tests/:
- `test_pipeline.py`: sequence validation and risk scoring
- `test_schemas.py`: Pydantic model validation
- `test_analyzer.py`: session anomaly detection logic
- `test_monitoring_init.py`: monitoring module singleton setup
- `test_session_store.py`: session store CRUD and TTL behavior
- `test_session_routes.py`: session API endpoint integration tests

Tests do not require GPU or external APIs.