Perturb Agent is a computational framework to identify patient-specific pathway perturbations from TCGA transcriptomic data and map them to potential therapeutic targets.
Status: 🚧 Under development
The pipeline integrates:
- Streamlit for interactive exploration and visualization
  - streamlit==1.55.0
  - protobuf==3.20.3
  - click==8.0.4
- Docker for reproducibility and the R interface
- Nextflow for scalable, reproducible data processing
- Python (ML/AI layer): pathway scoring, feature attribution, and target prioritization
- uv for Python project and dependency management
- ruff as the Python code formatter and fixer
  - uv run ruff check src/libs/*.py --fix > ruff.txt
  - uv run ruff format src/libs/*.py
https://perturb-agent.onrender.com/
- 57 primary sites.
- 11428 cases.
- 245657 samples.
- 480826 annotated mutations.
- 18961 different genes.
project → project_id - gdc.list_gdc_progams()
primary_sites → pid and disease_type - gdc.get_primary_sites(program=program)
cases → case_id (UUID) - gdc.build_cases(pid=pid, subtype=subtype, stage=stage)
- subtypes → cancer subtypes, tissue subtypes
- stages → stage_id (AJCC)
samples → sample type: [tumor, normal] and file access
barcodes → patients
annotated mutations (from cBioPortal)
See: tcga_gdc_and_cBioPortal_mutations_loop.ipynb
- Chatbot
- Query any GDC/TCGA cancer type
- Acts as the orchestration layer for pipeline execution
- tool1 — Mutation clusterization
- Given a disease
- Retrieve all mutations
- Create a pivot table: barcodes x symbols
- Clusterization
- Pairwise distance using Jaccard distance - pairwise_distances(X, metric="jaccard")
- Hierarchical clustering + dendrogram (seaborn)
- Cluster with UMAP
- Find groups using knn with k=8
Jaccard distance
Jaccard distance is a measure of dissimilarity between two sets, derived directly from Jaccard similarity. While Jaccard similarity measures how much two sets overlap, Jaccard distance measures how different they are. It is defined as one minus the Jaccard similarity.
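
As a minimal sketch of the definition above, the snippet below computes the Jaccard distance between two sets of mutated genes using only the standard library. The gene sets are made up for illustration, and treating two empty sets as identical (distance 0) is a convention chosen here, not something the pipeline is known to do.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical mutated-gene sets for two patient barcodes (illustrative only).
patient_a = {"TP53", "KRAS", "PIK3CA"}
patient_b = {"TP53", "PIK3CA", "ARID1A", "CDKN2A"}

# 2 shared genes out of 5 distinct genes: similarity 0.4, distance 0.6.
distance = jaccard_distance(patient_a, patient_b)
print(round(distance, 3))
```

On a binary barcodes x symbols pivot table, the same quantity is what pairwise_distances(X, metric="jaccard") computes for every pair of rows.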
- Most mutated genes for disease = 'Esophagus'
- Mutation heatmap for disease = 'Esophagus'
- UMAP applying knn with k = 8
- tool2 — Differential Expression (per patient)
- Retrieve all patient cases (barcodes)
- For each patient:
- Obtain gene expression (raw counts)
- Compute DEGs per patient:
- Control: TCGA solid tissue normal samples
- Method: DESeq2
- Thresholds:
- |log2FC| ≥ 1
- FDR < 0.05
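
The thresholds above can be sketched as a simple filter. This is an illustration only: the `GeneResult` record and `select_degs` helper are hypothetical names, and the values are invented. In real DESeq2 output the corresponding columns are `log2FoldChange` and `padj`.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    symbol: str
    log2fc: float  # log2 fold change, tumor vs. normal
    fdr: float     # multiple-testing adjusted p-value


def select_degs(results, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes with |log2FC| >= lfc_cutoff and FDR < fdr_cutoff."""
    return [
        r for r in results
        if abs(r.log2fc) >= lfc_cutoff and r.fdr < fdr_cutoff
    ]


# Made-up values for illustration; real values would come from DESeq2 output.
results = [
    GeneResult("GENE_A", 2.3, 0.001),   # up-regulated, significant
    GeneResult("GENE_B", -1.0, 0.049),  # down-regulated, exactly on the |log2FC| cutoff
    GeneResult("GENE_C", 0.8, 0.0001),  # effect size too small
    GeneResult("GENE_D", 3.1, 0.05),    # FDR not strictly below 0.05
]

degs = select_degs(results)
print([r.symbol for r in degs])
```

Note that the fold-change cutoff is inclusive while the FDR cutoff is strict, matching the two inequalities as written.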
- tool3 — Pathway Perturbation Modeling
- Retrieve Reactome pathways and gene sets
- Map DEGs onto Reactome pathways
- For each pathway:
- Identify DEGs present in the pathway
- Expand DEG signal using the Reactome functional interaction graph:
- Include first-order neighbors (1-hop) of DEGs within the pathway graph
- Expansion is restricted to pathway-local topology
- Pathway selection
- Find the minimum gene count N using hypergeometric statistics
- Select pathways having n >= N genes
- For each selected pathway:
- Construct a perturbation profile including:
- Highlighted DEGs
- network-propagated genes (neighbors)
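
The expansion and selection steps of tool3 could be sketched roughly as follows. Several things here are assumptions rather than the project's actual implementation: the helper names (`expand_one_hop`, `hypergeom_sf`, `minimum_hits`) are hypothetical, n is taken to be the number of DEGs found in a pathway, N is taken to be the smallest count whose upper-tail hypergeometric probability falls below a chosen alpha (0.05 here), and interactions are given as undirected gene pairs.

```python
from math import comb


def expand_one_hop(degs, pathway_genes, interactions):
    """Return (seed DEGs in the pathway, their 1-hop neighbors).

    Only edges with both endpoints inside the pathway are followed, so
    the expansion stays within the pathway-local topology.
    """
    pathway = set(pathway_genes)
    seeds = set(degs) & pathway
    neighbors = set()
    for a, b in interactions:
        if a in pathway and b in pathway:
            if a in seeds:
                neighbors.add(b)
            if b in seeds:
                neighbors.add(a)
    return seeds, neighbors - seeds


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    return sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    ) / total


def minimum_hits(population, pathway_size, n_degs, alpha=0.05):
    """Smallest N with P(X >= N) < alpha, or None if no N qualifies."""
    for n in range(1, min(pathway_size, n_degs) + 1):
        if hypergeom_sf(n, population, pathway_size, n_degs) < alpha:
            return n
    return None


# Toy pathway: gene "X" is a DEG outside the pathway, "Y" a non-pathway neighbor.
seeds, neighbors = expand_one_hop(
    degs={"A", "X"},
    pathway_genes={"A", "B", "C", "D", "E"},
    interactions=[("A", "B"), ("B", "C"), ("A", "Y"), ("X", "D")],
)
print(sorted(seeds), sorted(neighbors))

# Toy universe: 20 genes, a 5-gene pathway, 4 DEGs in total.
print(minimum_hits(population=20, pathway_size=5, n_degs=4))
```

With these toy inputs only the edge A-B qualifies for expansion, and a pathway needs at least 3 DEGs to pass the assumed cutoff.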
- tool4 - Patient Representation & Clustering
- Represent each patient as a pathway perturbation vector
- Cluster patients based on pathway-level features
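
One possible way to build such a vector is sketched below. The feature definition is an assumption: each entry is the fraction of a pathway's genes that are perturbed (DEGs plus propagated neighbors) in a patient. The pathway identifiers, sizes, and the `perturbation_vector` helper are hypothetical, and the actual pipeline may use a different score.

```python
def perturbation_vector(patient_profile, pathway_sizes, pathway_order):
    """Fraction of each pathway's genes that are perturbed in one patient."""
    return [
        len(patient_profile.get(p, ())) / pathway_sizes[p]
        for p in pathway_order
    ]


# Hypothetical pathways and per-patient perturbed gene sets (illustrative only).
pathway_order = ["PATHWAY_1", "PATHWAY_2", "PATHWAY_3"]
pathway_sizes = {"PATHWAY_1": 10, "PATHWAY_2": 4, "PATHWAY_3": 5}
patients = {
    "P1": {"PATHWAY_1": {"A", "B"}, "PATHWAY_2": {"C"}},
    "P2": {"PATHWAY_1": {"A", "B"}, "PATHWAY_3": {"D", "E", "F", "G", "H"}},
}

# One row per patient, one column per pathway, in a fixed column order.
vectors = {
    pid: perturbation_vector(profile, pathway_sizes, pathway_order)
    for pid, profile in patients.items()
}
for pid, vec in vectors.items():
    print(pid, vec)
```

Keeping a fixed `pathway_order` ensures every patient vector has the same columns, which any downstream clustering step would require.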
- tool5 — Biological and Therapeutic Annotation
- For each patient cluster and pathway:
- Gene–phenotype associations:
- Gene–disease associations:
- Drug associations:
- LINCS (perturbation signatures)
- Allosteric Database
- Drug-Gene Interaction Database (DGIdb)
- DrugBank
- ChEMBL
- Mechanism of action (MOA):
- inferred from LINCS perturbation profiles
- tool6 — Visualization & Reporting
- Dashboard includes:
- Pathway-level views:
- perturbed genes
- network structure
- DEG vs propagated genes
- Cluster-level summaries:
- shared pathways/genes
- candidate drugs and MOA
- LLM-generated summaries using TAHOE perturbed datasets
- Tahoe Bio LLMs: a gigascale single-cell perturbational atlas (May 2025)
- biological interpretation
- therapeutic hypotheses
- Possible 'new' questions:
- Given a primary site, a subtype, and stage
- are all samples similar?
- what kind of information does a clustering return?
- For each clustering, are the clusters similar to:
- EXCEPTIONAL_RESPONDERS?
- Organoids?
- Docker config:
PhD Flavio Lichtenstein
email: flalix@gmail.com
phone: +55-11-96560-1960
location: São Paulo, Brazil


