BioNeighbor is an open-source, on-device cancer-research toolkit. Turn a cancer treatment plan into a citation-grounded second opinion — locally, privately, in plain language.
Treatment Auditor: synthesize six public medical databases (NCI PDQ, ClinicalTrials.gov, DDInter, ChEMBL, RxNorm, OpenFDA FAERS) into a printable PDF audit, powered by Gemma 4 running on your own machine. No cloud, no telemetry, no patient data leaves the device.
BioNeighbor also retains its original molecular-similarity engine (FAISS + RDKit + ChEMBL) — the collaborative-filtering-inspired feature the project takes its name from — for exploring "neighbor" compounds to known drugs.
BioNeighbor is an open-source cancer-research toolkit centered on an on-device AI Treatment Auditor: describe a cancer treatment plan — disease/subtype, stage, prescribed drugs, scheduled treatments, symptoms — and the system runs a multi-pass audit across public medical data sources, synthesizes the findings via a local Ollama model, and exports a printable PDF.
Evidence sources:
- NCI PDQ — standard-of-care guideline text from cancer.gov
- ClinicalTrials.gov — recruiting and completed trials, per-modality and per-drug
- RxNorm — brand→generic name normalization
- DDInter — pairwise drug-drug interactions (Major / Moderate / Minor)
- ChEMBL — mechanism-of-action / target overlap between drugs
- openFDA FAERS — post-market adverse-event reaction frequencies
The Auditor runs from either a SwiftUI macOS GUI or a cross-platform Python CLI (macOS, Linux, Windows); both produce identical reports. The project also retains its original molecular-similarity engine (FAISS + RDKit + ChEMBL) for exploring "neighbor" compounds — the collaborative-filtering-inspired feature the project takes its name from. Everything runs locally; nothing leaves your machine. Research tool only, not medical advice.
BioNeighbor combines:
- Public biochemical datasets (ChEMBL, BindingDB)
- Molecular fingerprints and embeddings (computed via RDKit or other cheminformatics tools)
- Nearest-neighbor and similarity search engines (FAISS / vector search)
- Optional collaborative filtering for hybrid recommendation of molecules
- Molecule-centric search: Start with a known drug or bioactive compound and find similar molecules.
- Biological target awareness: Incorporates pathway and protein target information (e.g., adenosine-related targets) when available.
- Offline operation: No server required — the Python engine for molecular similarity runs locally, called from a Mac app or other front-end.
- Interactive visualization: Molecule structures can be visualized in 2D or 3D using embedded viewers (e.g., 3Dmol.js, NGL Viewer, or SwiftUI wrapper).
- CF-inspired neighbor recommendations: Uses the concept of collaborative filtering applied to molecules and their activity profiles to prioritize promising candidates.
- Cancer Research workspace: Drill into a cancer type, pick a drug, and see synonyms, indications, structurally similar drugs, and a unified Clinical Trial Outcomes section pulled from ClinicalTrials.gov — multi-arm trials, primary outcomes, per-arm values with 95% CIs, and CI-overlap flags so you can spot likely-real differences vs noise.
- On-device AI trial summaries (optional): Point the app at a local Ollama install and get a plain-English summary of every clinical trial listed for a drug. Runs entirely on your machine; no data leaves the device. Default model is gemma4:26b; configurable in Settings.
- Treatment Auditor (multi-source deep audit): Describe a cancer treatment plan (disease/subtype, stage plus free-text stage detail such as "metastasized to bone", prescribed drugs, scheduled treatments, symptoms/side effects) and the on-device AI runs a multi-pass audit. Before the LLM passes, four deterministic safety lookups run and surface as factual callouts above the AI prose:
  - Brand→generic dedupe via RxNorm (e.g. Taxol + paclitaxel collapse to a single ingredient before fan-out).
  - Pairwise drug-drug interactions from DDInter (Major / Moderate / Minor severity; see Loading DDInter below).
  - Mechanism-of-action target overlap from ChEMBL (flags when two drugs hit the same gene, e.g. anastrozole + letrozole both inhibit CYP19A1).
  - openFDA FAERS post-market reaction frequencies matched against the user's symptoms (e.g. "fatigue is the #1 reported reaction for tamoxifen with 386 reports out of 5,613").

  Then the LLM phase: NCI PDQ standard-of-care text from cancer.gov, ClinicalTrials.gov searches for radiation / surgery / chemotherapy / targeted-therapy trials in the subtype (independent of the patient's drugs), and per-drug trial outcomes. Each source gets its own streaming mini-summary; a final synthesis pass combines them with the deterministic findings and explicit "Further reading" citations to the PDQ URL and the most relevant NCT IDs. Each step shows up as a progress row so the wait feels like work. Research tool only, not medical advice. Tumor-mutation matching is tracked as a planned follow-up.
- Printable PDF reports for the Treatment Auditor: After a deep audit completes, click Save as PDF… to export a self-contained, paginated report capturing all inputs, the four deterministic findings (RxNorm merges, DDInter interactions, ChEMBL target overlap, FAERS top reactions + symptom matches), the multi-pass methodology (CT.gov v2 search terms / filters / pagination, the PDQ URL fetched and section-scoring rules, the per-drug ChEMBL→NCT path), the live audit pipeline log, every per-source mini-summary, the final synthesis, and a References section with every NCT ID, the PDQ URL, the data-source URLs for each finding type, and a link back to this repo — enough detail that someone could repeat the audit by hand. Example: example_reports/treatment-audit-her2-20260507-1433.pdf.
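Two of the deterministic lookups described above (brand→generic dedupe and mechanism-of-action target overlap) can be illustrated with a minimal sketch. This is not the project's implementation: the lookup tables are hypothetical stand-ins for RxNorm and ChEMBL responses, the function names are invented for illustration, and only the Taxol/paclitaxel and anastrozole/letrozole/CYP19A1 examples come from the text above.

```python
from itertools import combinations

# Hypothetical in-memory stand-ins for RxNorm and ChEMBL lookups.
# "TUBULIN" is an illustrative placeholder label, not a curated gene symbol.
BRAND_TO_GENERIC = {"taxol": "paclitaxel"}
DRUG_TARGETS = {
    "anastrozole": {"CYP19A1"},
    "letrozole": {"CYP19A1"},
    "paclitaxel": {"TUBULIN"},
}


def dedupe_ingredients(names):
    """Collapse brand and generic names to unique ingredients, keeping order."""
    seen, result = set(), []
    for name in names:
        key = name.strip().lower()
        generic = BRAND_TO_GENERIC.get(key, key)
        if generic not in seen:
            seen.add(generic)
            result.append(generic)
    return result


def target_overlaps(drugs):
    """Return (drug_a, drug_b, shared_targets) for every pair sharing a target."""
    overlaps = []
    for a, b in combinations(drugs, 2):
        shared = DRUG_TARGETS.get(a, set()) & DRUG_TARGETS.get(b, set())
        if shared:
            overlaps.append((a, b, sorted(shared)))
    return overlaps


plan = ["Taxol", "paclitaxel", "anastrozole", "letrozole"]
ingredients = dedupe_ingredients(plan)
print(ingredients)
print(target_overlaps(ingredients))
```

Deduplicating before the pairwise step matters: otherwise Taxol and paclitaxel would be compared against each other and reported as overlapping on every target.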
The drug-drug interaction section of the Treatment Auditor is backed by DDInter (ddinter.scbdd.com), an academic interaction database that ships eight per-ATC-class CSVs (~13 MB total, ~236k pairwise interactions across ~1.8k approved drugs).
License — read this before running the loader. DDInter is published under CC BY-NC-SA 4.0:
- ✅ Free for non-commercial use (research, personal, academic).
- ✅ No registration or login required to download.
- ⚠️ Attribution required when redistributing audit reports that include DDInter data.
- ⚠️ ShareAlike: derivative data works must be CC BY-NC-SA. (BioNeighbor's MIT-licensed source code is unaffected; only the dataset carries this restriction.)
- ❌ Commercial use prohibited. Do not run the loader on a server or in a product that monetizes the resulting interactions.
To load:
# macOS / Linux
source venv/bin/activate
# Windows (PowerShell)
.\venv\Scripts\Activate.ps1
python backend/load_ddinter_interactions.py # downloads + ingests into SQLite
python backend/load_ddinter_interactions.py --refresh # force re-download
The CSVs land in data/ddinter_cache/ (gitignored, never committed). The loader runs schema migrations first, wipes the drug_interactions table, then bulk-inserts ~236k rows. The whole run takes under 30 s on a fresh checkout, including the download.
Without this step the audit still runs, but the interactions section gracefully shows "DDInter not loaded" rather than a misleading empty result. Audits where the data isn't loaded explicitly say "interaction data unavailable" in the AI synthesis instead of asserting "no interactions found."
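The distinction between "data not loaded" and "no interactions found" can be sketched as follows. This is an illustration under assumptions, not the backend's actual code: the drug_interactions table name comes from the text above, but the column names, the function name, and the placeholder drug names are hypothetical.

```python
import sqlite3


def lookup_interactions(conn, drug_a, drug_b):
    """Return None when no interaction data is loaded, else a (possibly empty) list.

    Column names (drug_a, drug_b, severity) are assumed for this sketch.
    """
    total = conn.execute("SELECT COUNT(*) FROM drug_interactions").fetchone()[0]
    if total == 0:
        return None  # data unavailable, which is not the same as "no interactions"
    rows = conn.execute(
        "SELECT severity FROM drug_interactions "
        "WHERE (drug_a = ? AND drug_b = ?) OR (drug_a = ? AND drug_b = ?)",
        (drug_a, drug_b, drug_b, drug_a),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drug_interactions (drug_a TEXT, drug_b TEXT, severity TEXT)"
)

# Empty table: the loader has not been run yet.
not_loaded = lookup_interactions(conn, "drug_x", "drug_y")

# Placeholder row standing in for loaded DDInter data.
conn.execute("INSERT INTO drug_interactions VALUES ('drug_x', 'drug_y', 'Major')")
found = lookup_interactions(conn, "drug_y", "drug_x")   # order-insensitive match
none_found = lookup_interactions(conn, "drug_x", "drug_z")

print(not_loaded, found, none_found)
```

Returning None rather than an empty list lets the caller report "interaction data unavailable" instead of implying that a search was performed and came back clean.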
Why DDInter and not DrugBank? Earlier versions of BioNeighbor used DrugBank's XML download. DrugBank's terms forbid using their data to "build products" without a commercial license, which conflicts with BioNeighbor's MIT licensing and free distribution. DDInter's CC BY-NC-SA terms are stated cleanly, the data is free for non-commercial use without registration, and the operational integration is simpler (no per-user license required to download).
Molecules tab - Browse and explore molecules in the database
Diseases tab - Browse diseases and their associated drugs and molecules
Drugs tab - View all drugs with detailed information
Download Data tab - Download molecules, drugs, and diseases with real-time progress tracking
Advanced Search tab - Search for similar molecules by SMILES or ChEMBL ID
Cancer Research tab - Drug detail with Clinical Trial Outcomes and an optional on-device AI summary (Ollama + Gemma 4) condensing every trial into a few paragraphs.
Treatment Auditor tab - Enter your cancer type/subtype, prescribed drugs (autocompleted against ChEMBL), scheduled treatments, and side effects. Research tool only, not medical advice.
Treatment Auditor - Streaming on-device AI audit covering efficacy signals, alternative regimens, side-effect concerns, plan gaps, and uncertainty, citing ClinicalTrials.gov NCT IDs.
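The CI-overlap flag mentioned for the Clinical Trial Outcomes section can be illustrated with a small interval check. This is a sketch of the general idea only; how the app actually computes its flags is not described here, and the arm values below are made up for demonstration.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True when two (low, high) confidence intervals share any point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


# Made-up per-arm 95% CIs for a hypothetical response-rate outcome.
arm_1 = (0.42, 0.58)
arm_2 = (0.55, 0.70)
arm_3 = (0.10, 0.20)

print(intervals_overlap(arm_1, arm_2))  # overlapping: difference may be noise
print(intervals_overlap(arm_1, arm_3))  # disjoint: difference is more likely real
```

Overlap is only a rough screening heuristic: intervals can overlap slightly while the difference is still statistically significant, so a flag of this kind points a reader at the trial's own analysis rather than replacing it.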
- Drug repurposing: Discover alternative molecules similar to existing drugs to target new pathways.
- Adenosine pathway research: Explore candidates that modulate adenosine production (e.g., CD73, CD39, A2A receptor inhibitors).
- Molecular discovery for other pathways: Flexible framework supports any pathway or target with available activity data.
- Educational / research tool: Provides an approachable interface for exploring chemical space and molecular similarity without needing deep ML expertise.
BioNeighbor separates frontend and backend logic while remaining fully offline. There are two frontends and one backend:
- GUI (macOS only): SwiftUI application under macos_app/.
  - Browse molecules, diseases, and drugs
  - Search for similar molecules by SMILES or ChEMBL ID
  - Visualize molecules using embedded 2D/3D viewers
  - Download data from multiple sources with real-time progress tracking
  - Built with RxSwift for reactive programming patterns
- CLI (cross-platform: macOS, Linux, Windows): treatment_auditor_cli.py drives the Treatment Auditor pipeline headlessly against the same Flask backend. Pure stdlib for the audit itself; the optional --pdf flag asks the backend to render a PDF identical to the macOS app's "Save as PDF…" output, so a Linux server or Windows workstation can produce the same audit reports without any Mac/Xcode dependency. Useful for regression testing, batch audits, and any deployment that doesn't have a Mac in front of it.
- Backend / local engine: Python Flask API server with:
  - RDKit for fingerprint and descriptor computation
  - FAISS for nearest-neighbor vector search
  - SQLite database for molecules, drugs, diseases, and relationships
  - Multi-API integration (openFDA, ClinicalTrials.gov, PubChem, RxNorm, NLM)
  - Real-time progress tracking for downloads
  - Database schema management and migrations
The frontend communicates with the local Python engine via:
- HTTP REST API on http://127.0.0.1:5000
- Automatic backend process management
BioNeighbor supports multiple data sources with automatic fallback and comprehensive disease-drug relationships:
Molecules:
- PubChem FTP (Primary - Recommended for bulk downloads)
  - Full SDF files via FTP (300-500 MB per file, ~500,000 compounds)
  - No API rate limits
  - URL: PubChem FTP CURRENT-Full
  - Automatically downloads, decompresses, and converts to SMILES
- PubChem API (Fallback)
  - Individual molecule downloads by name or CID
  - Rate-limited but reliable for smaller batches
  - Automatic retry with exponential backoff
- ZINC Database (Alternative)
  - Curated drug-like and lead-like subsets
  - Bulk SMILES file downloads
  - URL: ZINC Database
- ChEMBL (Live)
  - Powers the Cancer Research tab end-to-end: live drug-name search with write-through caching, full drug detail (synonyms, indications, structure, similar drugs), approved-drug ingestion, and NCT lookup via drug_indication.
  - Accessed via chembl_webresource_client; the long-running 500-error outages tracked in chembl/chembl_webresource_client#134 have been resolved.
  - Calls run through a small thread-pool helper with hard timeouts (5–20 s depending on endpoint) so a slow ChEMBL response never hangs the UI.
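The thread-pool-with-hard-timeout pattern mentioned for ChEMBL calls can be sketched with the standard library. This is a generic illustration, not the project's helper: call_with_timeout and its parameters are hypothetical names, and the slow function simply sleeps to simulate an unresponsive endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool; the worker count is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, *args, timeout=5.0, default=None):
    """Run fn(*args) in the pool and give up waiting after `timeout` seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()  # only prevents a start; a running call cannot be interrupted
        return default


def slow_lookup(name):
    """Simulated slow remote call."""
    time.sleep(0.3)
    return {"name": name}


fast = call_with_timeout(lambda: "ok", timeout=1.0)
slow = call_with_timeout(slow_lookup, "tamoxifen", timeout=0.05, default="timed out")
print(fast, slow)
```

The timeout only unblocks the caller; the worker thread keeps running until the underlying call returns, which is why a bounded pool and per-endpoint timeouts are usually paired.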
Drugs:
- RxNorm API - Standardized drug names and ingredients (bulk downloads)
- PubChem - Comprehensive drug information (indications, MOA, ingredients)
- openFDA - FDA-approved drugs by condition
- ClinicalTrials.gov - Drugs in clinical trials
Diseases:
- NLM Clinical Tables - 2,400+ medical conditions with ICD codes and synonyms
- Bulk download via JSON file: ctss-downloads
- API access: conditions v3 API docs
Data Download Priority:
- PubChem FTP (for bulk molecule downloads)
- RxNorm + PubChem (for bulk drug downloads)
- NLM Clinical Tables (for disease data)
- PubChem API (for individual downloads)
- ZINC database (alternative source)
- Sample data (for testing)
Users can download data through the in-app interface with real-time progress tracking.
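The priority order above amounts to a fallback chain: try each source in turn and move on when one fails. A minimal sketch, assuming each source is wrapped in a loader callable (the loader functions and their simulated failures below are hypothetical, not the project's download scripts):

```python
def load_with_fallback(sources, max_molecules):
    """Try (name, loader) pairs in priority order; return the first non-empty result.

    Returns (source_name, molecules, errors), where errors records each source
    that was skipped and why.
    """
    errors = []
    for name, loader in sources:
        try:
            molecules = loader(max_molecules)
        except Exception as exc:  # a failed source must not abort the chain
            errors.append((name, str(exc)))
            continue
        if molecules:
            return name, molecules, errors
        errors.append((name, "no molecules returned"))
    raise RuntimeError(f"all sources failed: {errors}")


# Simulated loaders for demonstration only.
def pubchem_ftp_loader(n):
    raise ConnectionError("FTP unreachable (simulated)")


def pubchem_api_loader(n):
    return []  # simulated empty response


def sample_loader(n):
    return ["CC(=O)Oc1ccccc1C(=O)O"][:n]  # aspirin SMILES used elsewhere in this README


source, molecules, errors = load_with_fallback(
    [
        ("pubchem_ftp", pubchem_ftp_loader),
        ("pubchem_api", pubchem_api_loader),
        ("sample", sample_loader),
    ],
    max_molecules=10,
)
print(source, molecules, errors)
```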
BioNeighbor leverages a collaborative filtering metaphor:
- Molecules = “items”
- Targets / pathways = “users”
- Activity / binding data = “ratings”
This analogy allows CF-inspired models to prioritize molecules based on structural similarity and shared biological activity.
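As a rough illustration of the metaphor, the sketch below builds a tiny molecule-by-target activity matrix and ranks molecules by item-item cosine similarity, a common collaborative-filtering technique. The molecule names, target names, and activity values are made up, and this is not a description of BioNeighbor's actual model.

```python
import math

# Rows are molecules ("items"), columns are targets ("users"),
# values are activity scores ("ratings"). All values are illustrative.
TARGETS = ["target_1", "target_2", "target_3"]
ACTIVITY = {
    "mol_A": [8.0, 7.0, 0.0],
    "mol_B": [7.5, 6.5, 0.0],
    "mol_C": [0.0, 0.0, 9.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(query, k=2):
    """Rank the other molecules by similarity of their activity profiles."""
    scores = [
        (name, cosine(ACTIVITY[query], profile))
        for name, profile in ACTIVITY.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


print(neighbors("mol_A"))
```

Here mol_B ranks first for mol_A because both are active against the same two targets, while mol_C, active only against a third target, scores zero.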
Cross-platform (backend + CLI — macOS, Linux, Windows):
- Python 3.9+ (Python 3.11 or 3.12 recommended)
  - macOS: brew install python3 or brew install python@3.12
  - Linux: distro package (apt install python3.12, dnf install python3.12, etc.) or pyenv
  - Windows: install from python.org or use WSL
  - Or any platform: conda install python=3.11
- Pango / cairo for the Treatment Auditor's PDF endpoint (WeasyPrint)
  - macOS: brew install pango (auto-installed by setup.sh)
  - Linux: apt install libpango-1.0-0 libpangoft2-1.0-0 (or your distro's equivalent)
  - Windows: see WeasyPrint installation docs
- Internet connection (for initial dataset download and live API calls)
macOS GUI app only (skip if you only want the CLI):
- macOS 13.0 or later
- Xcode 14+
- Clone the repository:
  git clone <repository-url>
  cd bio-neighbor
- Run the setup script:
  ./setup.sh
  This will:
  - Create a Python virtual environment
  - Install all Python dependencies (RDKit, FAISS, etc.)
  - Create necessary directories

  Note: If RDKit installation fails via pip, you can use conda:
  conda install -c conda-forge rdkit
  Or see INSTALL_RDKIT.md for alternative installation methods.
- Activate the virtual environment:
  # macOS / Linux
  source venv/bin/activate
  # Windows (PowerShell)
  .\venv\Scripts\Activate.ps1
  # Windows (cmd.exe)
  venv\Scripts\activate.bat
- Initialize database schema (first time only):
  python backend/db_migrations.py
  This will:
  - Create all database tables (molecules, diseases, drugs, etc.)
  - Set up indexes and foreign keys
  - Track schema version for future migrations

  Note: The schema is automatically initialized when running setup or download scripts, but you can run this manually to ensure the database is ready.
- Set up the data and build the search index:
  python backend/main.py setup --max-molecules 10000
  This will:
  - Automatically download from the ZINC database (recommended, no API limits)
  - Fall back to PubChem, ChEMBL, or sample data if needed
  - Compute molecular fingerprints using RDKit
  - Build the FAISS similarity search index
  - Automatically run database migrations if needed

  Note: For 10,000+ molecules, the ZINC database is recommended. If automatic download fails, see DOWNLOAD_DATA.md for manual download instructions.
- Test the backend (optional):
  # Search for molecules similar to aspirin
  python backend/main.py search "CC(=O)Oc1ccccc1C(=O)O" --top-k 5
  # Start the API server
  python backend/api.py --mode http
- Build and run the macOS app:
  - Open Xcode
  - Create a new macOS App project in the macos_app/ directory
  - Add all Swift files from macos_app/BioNeighbor/
  - Build and run (⌘R)
  - The app will automatically start the backend if needed
The Cancer Research tab can summarize every clinical trial listed for a drug into a short, plain-English paragraph. Inference happens locally via Ollama, so trial data never leaves your machine.
- Install Ollama 0.20+ (brew install ollama, or download from ollama.com) and run ollama serve in a terminal.
- Pull a Gemma 4 model. On a 16 GB Mac use ollama pull gemma4 (≈9.6 GB E4B). On a 32 GB+ Mac use ollama pull gemma4:26b (≈18 GB MoE) for noticeably better summaries.
- Enable the feature in BioNeighbor. Open the app, go to BioNeighbor → Settings → AI Assistant (Ollama), flip on Enable on-device AI summaries, set the model name (default gemma4:26b), and click Test connection.
- Use it. In the Cancer Research tab, open a drug with clinical trials and click the ✨ Summarize with AI button next to the trials count. Output is summary-only (never medical advice) and is derived from the same ClinicalTrials.gov data already shown on the page.
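For readers who want to check the local model outside the app, the sketch below builds a request against Ollama's /api/generate endpoint on its default port (11434). It is not how the macOS app talks to Ollama (that lives in OllamaService.swift); the prompt wording and the function names are assumptions made for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local address


def build_summary_request(trial_text, model="gemma4:26b"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        # Prompt wording is an assumption for this sketch.
        "prompt": "Summarize these clinical trials in plain English:\n\n" + trial_text,
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(trial_text, model="gemma4:26b", timeout=120):
    """Send the request to a running `ollama serve` and return the generated text."""
    request = build_summary_request(trial_text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


request = build_summary_request("NCT placeholder: example trial text")
print(request.full_url, request.get_method())
```

Calling summarize(...) requires ollama serve to be running and the model to be pulled; building the request separately makes the payload easy to inspect without a live server.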
The backend provides a comprehensive REST API on http://127.0.0.1:5000:
Search & Discovery:
- POST /search - Search for similar molecules by SMILES string
- POST /search/chembl - Search by ChEMBL ID
- POST /search/by-disease - Search similar molecules to disease-related drugs
- GET /search/molecules - Autocomplete search for molecules
- GET /search/drugs - Autocomplete search for drugs
- GET /search/diseases - Autocomplete search for diseases

Molecules:
- GET /molecules - List molecules with pagination and search
- GET /molecule/<index> - Get molecule by index
- GET /molecule/<index>/thumbnail - Get molecule thumbnail image
- GET /molecule/<index>/3d - Get 3D coordinates for molecule
- POST /render - Render molecule structure image

Diseases:
- GET /diseases - List all diseases
- GET /diseases/<name>/molecules - Get molecules for a disease
- GET /diseases/<name>/drugs - Get drugs for a disease
- GET /diseases/<name>/top-molecules - Get top molecules for a disease

Drugs:
- GET /drugs - List all drugs
- GET /drugs/<drug_id> - Get drug by ID
- GET /drugs/<drug_id>/molecules - Get active ingredient molecules for a drug

Data Downloads:
- POST /download/molecules - Download molecules (by count, name, or full SDF file)
- POST /download/drugs - Download drugs (by name, disease, or bulk)
- POST /download/diseases - Download diseases (by name or bulk from NLM)
- GET /download/status/<task_id> - Get download progress status

Statistics:
- GET /stats - Get database statistics (molecules, drugs, diseases, relationships)
- GET /health - Health check
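A client that starts a download would typically poll GET /download/status/<task_id> until the task finishes. The sketch below shows that loop with the standard library. The response shape is an assumption: the "state" field and its "completed" / "failed" values are hypothetical and should be checked against the backend's actual JSON.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:5000"


def fetch_status(task_id):
    """GET the status document for one download task from the local backend."""
    url = f"{BASE_URL}/download/status/{task_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read())


def wait_for_download(task_id, fetch=fetch_status, interval=1.0, max_polls=600):
    """Poll until the task reports a terminal state, then return the last status."""
    for _ in range(max_polls):
        status = fetch(task_id)
        # ASSUMPTION: the status JSON carries a "state" field with these values.
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")


# Offline demonstration with a fake fetcher instead of a live server.
fake_states = iter([{"state": "running"}, {"state": "running"}, {"state": "completed"}])
result = wait_for_download("task-1", fetch=lambda _task_id: next(fake_states), interval=0)
print(result)
```

Passing the fetch function as a parameter keeps the loop testable without a running server.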
The backend also provides a CLI for testing:
# Setup data and index
python backend/main.py setup --max-molecules 10000
# Search by SMILES
python backend/main.py search "CC(=O)Oc1ccccc1C(=O)O" --top-k 10
# Search by ChEMBL ID
python backend/main.py search-chembl CHEMBL25 --top-k 5
treatment_auditor_cli.py runs the same multi-source audit pipeline as the
GUI's Treatment Auditor tab from the command line, so plans can be
regression-tested against backend changes without driving the UI. It hits the
running Flask server (./start_server.sh) and runs all seven deterministic steps:
NCI PDQ fetch, ClinicalTrials.gov modality search ×4 (radiation / surgery /
chemotherapy / targeted), RxNorm brand→generic dedupe, DDInter pairwise
interactions, ChEMBL mechanism-of-action target overlap, openFDA FAERS top
reactions + symptom matches, and per-drug ClinicalTrials.gov outcomes. Pure
stdlib: no extra pip install is needed for the audit itself; the PDF
endpoint adds a WeasyPrint dependency on the backend side only.
# With the backend running on the default port (Ollama synthesis is on by
# default to match the macOS app — pass --no-ollama to skip):
python treatment_auditor_cli.py --plan examples/treatment_auditor_plan.example.json
# Render a PDF identical to the macOS "Save as PDF…" output:
python treatment_auditor_cli.py --plan plan.json --pdf audit.pdf
# Human-readable summary instead of raw JSON:
python treatment_auditor_cli.py --plan plan.json --format text
# Skip steps you're not testing this run:
python treatment_auditor_cli.py --plan plan.json --skip faers,drug-trials
# Skip the LLM synthesis (deterministic findings only, much faster):
python treatment_auditor_cli.py --plan plan.json --no-ollama --output audit.json
The plan file mirrors the GUI form fields (cancer type, subtype, stage,
prescribed drugs with optional ChEMBL IDs, treatments, symptoms with
severity). See examples/treatment_auditor_plan.example.json
for a complete example. Cancer type / subtype accept either human names
(matched case-insensitively against the seed taxonomy) or numeric
cancer_type_id / subtype_id for fully unambiguous runs.
Output is a single JSON document with {plan, steps, source_summaries?, final_audit?} — each
step is {"ok": true/false, ...} so a failed source (e.g., DDInter not
loaded, PDQ unavailable for a hematologic cancer) doesn't abort the
run, matching the GUI's best-effort behavior. Per-step progress is written
to stderr, so --output file.json always gets clean JSON on stdout.
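Because every step carries its own "ok" flag, a regression script can check the output without caring which sources happened to be available. A minimal sketch, assuming steps is a JSON object keyed by step name (the key names and the "error" field in the sample are hypothetical; only the top-level keys and the "ok" flag come from the description above):

```python
import json


def failed_steps(audit):
    """Return the names of steps whose "ok" flag is false or missing."""
    return [name for name, step in audit["steps"].items() if not step.get("ok")]


# Hand-written sample in the documented top-level shape; step names are assumed.
sample = json.loads(
    """
    {
      "plan": {},
      "steps": {
        "pdq": {"ok": true},
        "ddinter": {"ok": false, "error": "DDInter not loaded"},
        "faers": {"ok": true}
      }
    }
    """
)

print(failed_steps(sample))
```

In practice the document would be loaded from the file passed to --output, for example with json.load(open("audit.json")).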
The database schema is managed through a migration system:
- backend/db_schema.py: Defines all table structures
- backend/db_migrations.py: Handles schema migrations and versioning
- backend/SCHEMA.md: Complete schema documentation
Running migrations:
# Check current schema version
python backend/db_migrations.py --check
# Run migrations (automatic - run this if you get schema errors)
python backend/db_migrations.py
# Force recreate all tables (DANGEROUS - deletes all data!)
python backend/db_migrations.py --force-recreate
Migrations run automatically when:
- Running python backend/main.py setup
- Running any download script (molecules, drugs, diseases)
- Starting the API server
See backend/SCHEMA.md for complete schema documentation.
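For readers unfamiliar with versioned migrations, the sketch below shows one common SQLite approach: store the schema version in PRAGMA user_version and apply only the statements that are newer. This is a generic illustration; how backend/db_migrations.py actually tracks versions is not shown here, and the table and index definitions are placeholders.

```python
import sqlite3

# Ordered migration statements; position in the list (1-based) is the version.
# These definitions are placeholders, not the project's real schema.
MIGRATIONS = [
    "CREATE TABLE IF NOT EXISTS molecules (id INTEGER PRIMARY KEY, smiles TEXT)",
    "CREATE INDEX IF NOT EXISTS idx_molecules_smiles ON molecules (smiles)",
]


def migrate(conn):
    """Apply any migrations newer than the stored version; return the new version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(statement)
            # PRAGMA values cannot be bound as parameters; version is a trusted int.
            conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)
second_run = migrate(conn)  # idempotent: nothing left to apply
print(first_run, second_run)
```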
bio-neighbor/
├── backend/ # Python backend
│ ├── api.py # Flask HTTP API server
│ ├── main.py # CLI entry point
│ ├── search_engine.py # Similarity search engine
│ ├── data_loader.py # Dataset loading utilities
│ ├── fingerprints.py # Molecular fingerprint computation
│ ├── index_builder.py # FAISS index building
│ ├── molecule_renderer.py # 2D structure rendering
│ ├── db_schema.py # Database schema definitions
│ ├── db_migrations.py # Schema migration system
│ ├── download_molecules.py # Molecule download scripts
│ ├── download_drugs_*.py # Drug download scripts (RxNorm, bulk)
│ ├── download_diseases_nlm.py # Disease download from NLM
│ ├── download_by_name.py # Download by name (molecules/drugs/diseases)
│ ├── multi_api_disease_loader.py # Multi-API drug search
│ ├── progress_tracker.py # Real-time progress tracking
│ ├── stream_process_output.py # Subprocess output streaming
│ └── test_*.py # Test suites
├── macos_app/ # SwiftUI macOS app
│ └── BioNeighbor/
│ └── BioNeighbor/
│ ├── BioNeighborApp.swift # Main app entry point
│ ├── BackendService.swift # Python backend service integration
│ ├── OllamaService.swift # Local Ollama client (AI trial summaries)
│ ├── BrowseView.swift # Molecules tab
│ ├── DiseaseBrowseView.swift # Diseases tab
│ ├── DiseasesDownloadView.swift # Disease download view
│ ├── DiseasesDownloadViewRx.swift # Disease download (RxSwift)
│ ├── DownloadStatisticsView.swift # Download statistics
│ ├── DrugCard.swift # Drug card component
│ ├── DrugDataDownloadView.swift # Download Data tab
│ ├── DrugDetailView.swift # Drug detail view
│ ├── DrugsView.swift # Drugs tab
│ ├── DrugsDownloadView.swift # Drug download view
│ ├── DrugsDownloadViewRx.swift # Drug download (RxSwift)
│ ├── Models.swift # Data models
│ ├── Molecule3DView.swift # 3D molecule visualization
│ ├── MoleculeCard.swift # Molecule card component
│ ├── MoleculeDetailView.swift # Molecule detail view
│ ├── MoleculesDownloadView.swift # Molecule download view
│ ├── MoleculesDownloadViewRx.swift # Molecule download (RxSwift)
│ ├── ReactiveDownloadService.swift # RxSwift download service
│ ├── ResultsView.swift # Search results view
│ ├── SearchView.swift # Advanced Search tab
│ └── TreatmentAuditorView.swift # Treatment Auditor tab
├── data/ # Data files
│ ├── molecules.db # SQLite database
│ ├── faiss_index.bin # FAISS search index
│ ├── fingerprints.pkl # Molecular fingerprints
│ └── progress/ # Progress tracking files
├── examples/ # Sample inputs for CLI tools
│ └── treatment_auditor_plan.example.json
├── images/ # Screenshots
├── venv/ # Python virtual environment
├── setup.sh # Setup script
├── treatment_auditor_cli.py # CLI version of the Treatment Auditor (issue #62)
└── README.md # This file
The project name BioNeighbor reflects its CF-inspired approach:
“Find the biological neighbors of a molecule in chemical and activity space.”
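To make "neighbors in chemical space" concrete, the sketch below ranks molecules by Tanimoto similarity over fingerprint bit sets using a brute-force scan. It is a stand-in for illustration only: in BioNeighbor the fingerprints come from RDKit and the search runs through a FAISS index, and the bit sets and molecule names here are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of "on" fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k(query, library, k=3):
    """Brute-force nearest neighbors: score every molecule, keep the k best."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up fingerprints: each set lists the indices of bits that are set.
LIBRARY = {
    "mol_1": {1, 2, 3, 4, 5},
    "mol_2": {1, 2},
    "mol_3": {7, 8},
}
query_bits = {1, 2, 3, 4}

print(top_k(query_bits, LIBRARY, k=2))
```

A linear scan like this is fine for a few thousand molecules; an approximate index becomes worthwhile as the library grows.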
- Real-time Progress Tracking: See exactly what's happening during downloads with detailed progress information
- Multi-API Integration: Automatically searches multiple APIs (openFDA, ClinicalTrials.gov, PubChem, RxNorm) for comprehensive drug discovery
- Database Schema Management: Versioned schema with automatic migrations
- Reactive Programming: Built with RxSwift for responsive, asynchronous operations
- Comprehensive Testing: Unit tests, UI tests, and integration tests
- Bulk Downloads: Download entire datasets (molecules, drugs, diseases) with progress tracking
- Offline Operation: All data stored locally in SQLite database
- Optional training of collaborative filtering models locally
- Enhanced visualization of molecular clusters and pathways
- Additional dataset integrations
- Performance optimizations for large-scale searches
- Core code: MIT
- Datasets: Check individual dataset licenses (ChEMBL, BindingDB, PubChem).