🧪 perturb_agent

Perturb Agent is a computational framework to identify patient-specific pathway perturbations from TCGA transcriptomic data and map them to potential therapeutic targets.

Status: 🚧 Under development

⚙️ Pipeline Overview

The pipeline integrates:

Streamlit for interactive exploration and visualization

streamlit==1.55.0
protobuf==3.20.3
click==8.0.4
Docker for reproducibility and R interface
Nextflow for scalable, reproducible data processing
Python (ML/AI layer) — pathway scoring, feature attribution, and target prioritization
uv for Python project and dependency management
ruff for Python code formatting and automatic lint fixes

uv run ruff check src/libs/*.py --fix > ruff.txt
uv run ruff format src/libs/*.py

⚙️ First results

💡 The running version can be found at

https://perturb-agent.onrender.com/

Results from interfacing with GDC TCGA data:

57 primary sites.
11428 cases.
245657 samples.
480826 annotated mutations.
18961 different genes.

💡 GDC flow

project → project_id - gdc.list_gdc_progams()
primary_sites → pid and disease_type - gdc.get_primary_sites(program=program)
cases → case_id (UUID) - gdc.build_cases(pid=pid, subtype=subtype, stage=stage)

subtypes → cancer subtypes, tissue subtypes
stages → stage_id (AJCC)

samples → sample type: [tumor, normal] and file access
barcodes → patients
annotated mutations (from cBioPortal)

See: tcga_gdc_and_cBioPortal_mutations_loop.ipynb
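The case-selection step of the flow above can be illustrated with the public GDC REST API's nested filter format. This is only a sketch and not the project's `gdc` module: the helper names `build_cases_filter` and `build_cases_params` are hypothetical, and the field names (`project.project_id`, `primary_site`, `diagnoses.ajcc_pathologic_stage`) are assumptions that should be checked against the GDC data dictionary. No request is sent; the code only builds the payload.

```python
import json


def build_cases_filter(project_id, primary_site=None, stage=None):
    """Build a GDC-style nested filter (hypothetical helper, not the repo's API)."""
    clauses = [
        {"op": "in", "content": {"field": "project.project_id", "value": [project_id]}}
    ]
    if primary_site is not None:
        clauses.append(
            {"op": "in", "content": {"field": "primary_site", "value": [primary_site]}}
        )
    if stage is not None:
        # Assumed field name for the AJCC stage; verify before use.
        clauses.append(
            {
                "op": "in",
                "content": {"field": "diagnoses.ajcc_pathologic_stage", "value": [stage]},
            }
        )
    return {"op": "and", "content": clauses}


def build_cases_params(filters, fields=("case_id", "submitter_id"), size=100):
    """Query parameters as they could be sent to https://api.gdc.cancer.gov/cases."""
    return {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }


filters = build_cases_filter("TCGA-ESCA", primary_site="Esophagus", stage="Stage II")
params = build_cases_params(filters)
print(json.dumps(filters, indent=2))
```

The resulting `params` dictionary could then be passed to any HTTP client; pagination and error handling are left out on purpose.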

💡 Core components

Chatbot
- Query any GDC/TCGA cancer type
- Acts as the orchestration layer for pipeline execution

tool1 — Mutation clusterization
1. Given a disease
2. Retrieve all mutations
3. Create a pivot table: barcodes x symbols
4. Clusterization
  - Pairwise Jaccard distances between patients - pairwise_distances(X, metric="jaccard")
  - Hierarchical clustering + dendrogram (seaborn)
  - Cluster with UMAP
  - Find groups using knn with k=8
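A minimal sketch of steps 3 and 4 on toy data, assuming a long-format table with one row per (barcode, symbol) mutation call. The barcodes and the choice of average linkage with two clusters are illustrative assumptions, and SciPy's `pdist` is used here in place of `pairwise_distances`; the UMAP and knn steps are not covered.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy long-format mutation table (hypothetical barcodes and calls).
mutations = pd.DataFrame(
    {
        "barcode": ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"],
        "symbol": ["TP53", "KRAS", "TP53", "KRAS", "EGFR", "PIK3CA", "EGFR", "PIK3CA"],
    }
)

# Step 3: binary pivot table, barcodes x symbols (True = gene mutated in patient).
X = pd.crosstab(mutations["barcode"], mutations["symbol"]).astype(bool)

# Step 4a: condensed vector of pairwise Jaccard distances between patients.
dist = pdist(X.to_numpy(), metric="jaccard")

# Step 4b: average-linkage hierarchical clustering, cut into two groups.
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
groups = dict(zip(X.index, labels))
print(groups)
```

On real data the number of clusters would be chosen from the dendrogram rather than fixed in advance.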

Jaccard distance

Jaccard distance is a measure of dissimilarity between two sets, derived directly from Jaccard similarity. While Jaccard similarity measures how much two sets overlap, Jaccard distance measures how different they are. It is defined as one minus the Jaccard similarity.

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A,B) = 1 - J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
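As a small worked example of the definition, the sketch below computes the Jaccard distance between the mutated-gene sets of two hypothetical patients. Treating two empty sets as identical is a convention chosen here, not something stated in this README.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical mutated-gene sets for two patients.
patient_1 = {"TP53", "KRAS", "EGFR"}
patient_2 = {"TP53", "EGFR", "PIK3CA"}

# Intersection has 2 genes, union has 4 genes.
d = jaccard_distance(patient_1, patient_2)
print(d)
```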

Most mutated genes for disease = 'Esophagus'

Mutation heatmap for disease = 'Esophagus'

UMAP applying knn with k = 8

tool2 — Differential Expression (per patient)
1. Retrieve all patient cases (barcodes)
2. For each patient:
  - Obtain gene expression (raw counts)
3. Compute DEGs per patient:
  - Control: TCGA solid tissue normal samples
  - Method: DESeq2
  - Thresholds:
    - |log2FC| ≥ 1
    - FDR < 0.05
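The thresholding step can be sketched as a filter over DESeq2-style result rows. The gene names and values below are made up for illustration; `log2FoldChange` and `padj` follow DESeq2's result column names, and rows whose adjusted p-value is missing (DESeq2 can report NA after independent filtering) are treated as non-significant.

```python
import math


def is_deg(log2fc, padj, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Apply |log2FC| >= cutoff and FDR < cutoff; missing values never pass."""
    if padj is None or log2fc is None:
        return False
    if math.isnan(padj) or math.isnan(log2fc):
        return False
    return abs(log2fc) >= lfc_cutoff and padj < fdr_cutoff


# Hypothetical per-patient results (tumor vs. solid tissue normal).
results = [
    {"gene": "GENE_A", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2FoldChange": -1.0, "padj": 0.049},
    {"gene": "GENE_C", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "GENE_D", "log2FoldChange": 3.1, "padj": 0.2},
    {"gene": "GENE_E", "log2FoldChange": -2.0, "padj": float("nan")},
]

deg_rows = [r for r in results if is_deg(r["log2FoldChange"], r["padj"])]
degs = [r["gene"] for r in deg_rows]
up = [r["gene"] for r in deg_rows if r["log2FoldChange"] > 0]
down = [r["gene"] for r in deg_rows if r["log2FoldChange"] < 0]
print(degs, up, down)
```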

tool3 — Pathway Perturbation Modeling
1. Retrieve Reactome pathways and gene sets
2. Map DEGs onto Reactome pathways
3. For each pathway:
  - Identify DEGs present in the pathway
  - Expand DEG signal using the Reactome functional interaction graph:
    - Include first-order neighbors (1-hop) of DEGs within the pathway graph
    - Expansion is restricted to pathway-local topology
4. Pathway selection
  - Find the minimum number of genes N required for significance under the hypergeometric test
  - Select pathways having n >= N genes
5. For each selected pathway:
  - Construct a perturbation profile including:
    - highlighted DEGs
    - network-propagated genes (neighbors)
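Steps 3 and 4 can be sketched in plain Python. The exact test is not spelled out here, so this is one possible reading: the 1-hop expansion only follows edges whose two endpoints both belong to the pathway, and N is the smallest overlap whose hypergeometric upper-tail probability falls below a significance level `alpha` (0.05 is an assumption). All function names and the toy graph are hypothetical.

```python
from math import comb


def expand_one_hop(degs, pathway_genes, edges):
    """Return (DEGs in the pathway, their 1-hop neighbours via pathway-local edges)."""
    pathway_genes = set(pathway_genes)
    seeds = set(degs) & pathway_genes
    neighbours = set()
    for u, v in edges:
        # Keep the expansion pathway-local: both endpoints must be in the pathway.
        if u not in pathway_genes or v not in pathway_genes:
            continue
        if u in seeds and v not in seeds:
            neighbours.add(v)
        if v in seeds and u not in seeds:
            neighbours.add(u)
    return seeds, neighbours


def hypergeom_sf(k, background, pathway_size, n_degs):
    """P(overlap >= k) when n_degs genes are drawn from the background."""
    upper = min(pathway_size, n_degs)
    hits = sum(
        comb(pathway_size, i) * comb(background - pathway_size, n_degs - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(background, n_degs)


def min_significant_overlap(background, pathway_size, n_degs, alpha=0.05):
    """Smallest overlap N with P(overlap >= N) < alpha, or None if none exists."""
    for k in range(1, min(pathway_size, n_degs) + 1):
        if hypergeom_sf(k, background, pathway_size, n_degs) < alpha:
            return k
    return None


# Toy pathway graph; "X" and "Y" are outside the pathway.
pathway = {"A", "B", "C", "D"}
edges = [("A", "B"), ("B", "C"), ("A", "Y"), ("X", "D")]
seeds, neighbours = expand_one_hop({"A", "X"}, pathway, edges)

# 20 background genes, pathway of 5 genes, 4 DEGs in the patient.
N = min_significant_overlap(background=20, pathway_size=5, n_degs=4)
print(seeds, neighbours, N)
```

A pathway would then be kept when its number of DEGs n satisfies n >= N.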

tool4 - Patient Representation & Clustering
1. Represent each patient as a pathway perturbation vector
2. Cluster patients based on pathway-level features
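One way to picture a pathway perturbation vector is shown below. The scoring (fraction of each pathway's genes that are perturbed in the patient) is an assumption made for illustration, as are the pathway and patient names; the pipeline's actual features may differ.

```python
import math


def perturbation_vector(patient_genes, pathways, pathway_order):
    """Fraction of each pathway's genes perturbed in one patient (pathways non-empty)."""
    genes = set(patient_genes)
    return [len(genes & pathways[name]) / len(pathways[name]) for name in pathway_order]


# Hypothetical pathways and per-patient perturbed genes (DEGs plus neighbours).
pathways = {
    "PATHWAY_1": {"A", "B", "C", "D"},
    "PATHWAY_2": {"E", "F"},
}
order = sorted(pathways)
patients = {
    "P1": {"A", "B", "E"},
    "P2": {"A", "B", "F"},
    "P3": {"C"},
}

# Rows are patients, columns follow `order`.
matrix = {pid: perturbation_vector(g, pathways, order) for pid, g in patients.items()}

# Any distance-based clustering can then operate on these rows.
d_12 = math.dist(matrix["P1"], matrix["P2"])
d_13 = math.dist(matrix["P1"], matrix["P3"])
print(matrix, d_12, d_13)
```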

tool5 — Biological and Therapeutic Annotation

For each patient cluster and pathway:
1. Gene–phenotype associations:
2. Gene–disease associations:
  - MalaCards
  - DisGeNET
3. Drug associations:
  - LINCS (perturbation signatures)
  - Allosteric Database
  - Drug-Gene Interaction Database (DGIdb)
  - DrugBank
  - ChEMBL
4. Mechanism of action (MOA):
  - inferred from LINCS perturbation profiles

tool6 — Visualization & Reporting

Dashboard includes:
1. Pathway-level views:
  - perturbed genes
    - network structure
    - DEG vs propagated genes
2. Cluster-level summaries:
  - shared pathways/genes
  - candidate drugs and MOA
3. LLM-generated summaries using TAHOE perturbed datasets
  - Tahoe Bio LLMs: a gigascale single-cell perturbational atlas (May 2025)
    - biological interpretation
    - therapeutic hypotheses

Possible 'new' questions:
1. Given a primary site, a subtype, and a stage
  - are all samples similar?
  - what kind of information does a clusterization return?
2. For each clusterization, are the clusters similar to:
  - EXCEPTIONAL_RESPONDERS?
  - Organoids?

Docker config:

⚙️ Under development

PhD Flavio Lichtenstein
email: flalix@gmail.com
phone: +55-11-96560-1960
location: Brazil/Sao Paulo
