flalix/perturb_agent

🧪 perturb_agent

Perturb Agent is a computational framework to identify patient-specific pathway perturbations from TCGA transcriptomic data and map them to potential therapeutic targets.

Status: 🚧 Under development

⚙️ Pipeline Overview

The pipeline integrates:

  1. Streamlit for interactive exploration and visualization

    streamlit==1.55.0
    protobuf==3.20.3
    click==8.0.4

  2. Docker for reproducibility and R interface

  3. Nextflow for scalable, reproducible data processing

  4. Python (ML/AI layer) — pathway scoring, feature attribution, and target prioritization

  5. uv for Python project and dependency management

  6. ruff for Python code formatting and automatic lint fixes

  • uv run ruff check src/libs/*.py --fix > ruff.txt
  • uv run ruff format src/libs/*.py

⚙️ First results

💡 The running version can be found at

https://perturb-agent.onrender.com/

Results from interfacing with GDC TCGA data:

  • 57 primary sites.
  • 11428 cases.
  • 245657 samples.
  • 480826 annotated mutations.
  • 18961 different genes.

💡 GDC flow

project → project_id - gdc.list_gdc_progams()
primary_sites → pid and disease_type - gdc.get_primary_sites(program=program)
cases → case_id (UUID) - gdc.build_cases(pid=pid, subtype=subtype, stage=stage)

  • subtypes → cancer subtypes, tissue subtypes
  • stages → stage_id (AJCC)

samples → sample type: [tumor, normal] and file access
barcodes → patients
annotated mutations (from cBioPortal)

See: tcga_gdc_and_cBioPortal_mutations_loop.ipynb
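
The `gdc` helper functions named above are not shown in this README. As a rough illustration of the first step (project → project_id), the sketch below builds a query URL for the public GDC REST API `projects` endpoint without sending it. The helper name `build_projects_url` and the selected fields are assumptions made for illustration; they are not the repository's implementation.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

GDC_PROJECTS_ENDPOINT = "https://api.gdc.cancer.gov/projects"


def build_projects_url(program: str = "TCGA", size: int = 100) -> str:
    """Build (but do not send) a GDC query listing the projects of one program.

    Hypothetical helper: it only illustrates the shape of a GDC filter.
    """
    filters = {
        "op": "in",
        "content": {"field": "program.name", "value": [program]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "project_id,primary_site,disease_type",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_PROJECTS_ENDPOINT}?{urlencode(params)}"


url = build_projects_url("TCGA")
query = parse_qs(urlparse(url).query)  # decode the query string again
print(query["fields"][0])
```

Passing the resulting URL to any HTTP client would return one record per project, from which `project_id`, `primary_site`, and `disease_type` can be read.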


💡 Core components

  1. Chatbot
    • Query any GDC/TCGA cancer type
    • Acts as the orchestration layer for pipeline execution

  1. tool1 — Mutation clustering
    1. Given a disease
    2. Retrieve all mutations
    3. Create a pivot table: barcodes x symbols
    4. Clustering
      • Pairwise distances using the Jaccard distance - pairwise_distances(X, metric="jaccard")
      • Hierarchical clustering + dendrogram (seaborn)
      • Cluster with UMAP
      • Find groups using knn with k=8

Jaccard distance

Jaccard distance measures the dissimilarity between two sets and is derived directly from Jaccard similarity. Jaccard similarity measures how much two sets overlap; Jaccard distance measures how different they are, and is defined as one minus the Jaccard similarity.

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$

$d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
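
Steps 3 and 4 of tool1 can be sketched in plain Python: build the binary barcodes x symbols table from mutation records, then compute pairwise Jaccard distances. The barcodes and gene lists are toy data, and the function names `pivot` and `jaccard_distance` are hypothetical. For non-empty rows, `pairwise_distances(X, metric="jaccard")` on a boolean matrix should give the same values.

```python
from itertools import combinations

# Toy mutation records (barcode, gene symbol); purely illustrative values.
records = [
    ("P1", "TP53"), ("P1", "KRAS"), ("P1", "PIK3CA"),
    ("P2", "TP53"), ("P2", "PIK3CA"), ("P2", "ARID1A"),
    ("P3", "NOTCH1"),
]


def pivot(records):
    """Return (barcodes, symbols, binary matrix) with one row per barcode."""
    barcodes = sorted({b for b, _ in records})
    symbols = sorted({s for _, s in records})
    mutated = set(records)
    matrix = [[int((b, s) in mutated) for s in symbols] for b in barcodes]
    return barcodes, symbols, matrix


def jaccard_distance(row_a, row_b):
    """1 - |A ∩ B| / |A ∪ B| for two binary rows (0.0 if both are empty)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


barcodes, symbols, X = pivot(records)
distances = {
    (barcodes[i], barcodes[j]): jaccard_distance(X[i], X[j])
    for i, j in combinations(range(len(barcodes)), 2)
}
print(distances)
```

Here P1 and P2 share 2 of 4 distinct mutated genes, so their distance is 0.5, while P3 shares none with either and sits at distance 1.0.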


  1. Most mutated genes for disease = 'Esophagus'

[Figure: mutation frequency]

  1. Mutation heatmap for disease = 'Esophagus'

[Figure: heatmap]

  1. UMAP applying knn with k = 8

[Figure: UMAP]


  1. tool2 — Differential Expression (per patient)
    1. Retrieve all patient cases (barcodes)
    2. For each patient:
      • Obtain gene expression (raw counts)
    3. Compute DEGs per patient:
      • Control: TCGA solid tissue normal samples
      • Method: DESeq2
      • Thresholds:
        • |log2FC| ≥ 1
        • FDR < 0.05
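
Once DESeq2 has produced per-gene statistics for a patient, the thresholds above reduce to a simple filter. The sketch below assumes each result row carries `log2FoldChange` and `padj`, the column names DESeq2 uses for the log2 fold change and the adjusted p-value. The gene names and values are invented, and `select_degs` is a hypothetical helper.

```python
LOG2FC_MIN = 1.0  # |log2FC| >= 1
FDR_MAX = 0.05    # FDR < 0.05


def select_degs(results):
    """Keep genes passing both thresholds; rows with a missing padj are dropped."""
    degs = []
    for row in results:
        padj = row["padj"]
        if padj is None:  # DESeq2 can report NA for filtered genes
            continue
        if abs(row["log2FoldChange"]) >= LOG2FC_MIN and padj < FDR_MAX:
            direction = "up" if row["log2FoldChange"] > 0 else "down"
            degs.append({**row, "direction": direction})
    return degs


# Toy per-patient results (tumor vs. solid tissue normal); values are made up.
toy = [
    {"gene": "GENE_A", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2FoldChange": -1.0, "padj": 0.04},
    {"gene": "GENE_C", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "GENE_D", "log2FoldChange": -3.1, "padj": 0.20},
    {"gene": "GENE_E", "log2FoldChange": 1.8, "padj": None},
]
degs = select_degs(toy)
print([(d["gene"], d["direction"]) for d in degs])
```

Only GENE_A and GENE_B pass: GENE_C fails the fold-change cutoff, GENE_D fails the FDR cutoff, and GENE_E has no adjusted p-value.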

  1. tool3 — Pathway Perturbation Modeling
    1. Retrieve Reactome pathways and gene sets
    2. Map DEGs onto Reactome pathways
    3. For each pathway:
      • Identify DEGs present in the pathway
      • Expand DEG signal using the Reactome functional interaction graph:
        • Include first-order neighbors (1-hop) of DEGs within the pathway graph
        • Expansion is restricted to pathway-local topology
    4. Pathway selection
      • Determine the minimum number of genes N from hypergeometric statistics
      • Select pathways having n >= N genes
    5. For each selected pathway:
      • Construct a perturbation profile including:
        • Highlighted DEGs
        • network-propagated genes (neighbors)
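
Two of the steps above can be sketched with the standard library only: the pathway-local 1-hop expansion and the minimum N from the hypergeometric distribution. The README does not define N precisely, so the sketch assumes N is the smallest overlap whose upper-tail probability P(X >= N) falls below a significance level `alpha`. All function names, the toy graph, and the default `alpha=0.05` are hypothetical.

```python
from math import comb


def expand_one_hop(degs, pathway_genes, edges):
    """Return (seed DEGs in the pathway, their non-DEG 1-hop neighbors).

    Only edges with both endpoints inside the pathway are considered.
    """
    pathway = set(pathway_genes)
    seeds = set(degs) & pathway
    neighbors = set()
    for a, b in edges:
        if a in pathway and b in pathway:
            if a in seeds and b not in seeds:
                neighbors.add(b)
            if b in seeds and a not in seeds:
                neighbors.add(a)
    return seeds, neighbors


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement."""
    upper = min(successes, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(population, draws)


def minimum_hits(population, successes, draws, alpha=0.05):
    """Smallest k with P(X >= k) < alpha, or None if no k qualifies."""
    for k in range(min(successes, draws) + 1):
        if hypergeom_upper_tail(k, population, successes, draws) < alpha:
            return k
    return None


# Toy pathway graph: gene X and DEG Y lie outside the pathway and are ignored.
seeds, neighbors = expand_one_hop(
    degs={"A", "Y"},
    pathway_genes={"A", "B", "C", "D"},
    edges=[("A", "B"), ("B", "C"), ("C", "D"), ("A", "X")],
)
print(seeds, neighbors)

# 100 background genes, a pathway of 10 genes, 10 DEGs for the patient.
N = minimum_hits(population=100, successes=10, draws=10)
print(N)
```

A pathway would then be selected when its number of DEGs n satisfies n >= N, with N recomputed per pathway because it depends on the pathway size.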

  1. tool4 - Patient Representation & Clustering
    1. Represent each patient as a pathway perturbation vector
    2. Cluster patients based on pathway-level features
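
As one possible reading of step 1, each patient can be encoded as the fraction of genes perturbed in each pathway. The README does not specify the feature definition, so this choice, the helper `perturbation_vector`, and all pathway and gene names below are assumptions for illustration only.

```python
from math import dist

# Hypothetical pathway gene sets and per-patient perturbed genes.
pathways = {
    "PATHWAY_1": {"G1", "G2", "G3", "G4"},
    "PATHWAY_2": {"G5", "G6"},
}
perturbed = {
    "P1": {"G1", "G2", "G5"},
    "P2": {"G1", "G2", "G6"},
    "P3": {"G5", "G6"},
}


def perturbation_vector(genes, pathways):
    """Fraction of each pathway's genes that are perturbed, in sorted pathway order."""
    return [
        len(genes & pathways[name]) / len(pathways[name])
        for name in sorted(pathways)
    ]


vectors = {p: perturbation_vector(g, pathways) for p, g in perturbed.items()}
print(vectors)

# Euclidean distance between patients in pathway space.
print(dist(vectors["P1"], vectors["P2"]), dist(vectors["P1"], vectors["P3"]))
```

P1 and P2 perturb different genes but the same pathways to the same extent, so they coincide in pathway space, while P3 lies apart. Any clustering method can then be run on these vectors.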

  1. tool5 — Biological and Therapeutic Annotation

  1. tool6 — Visualization & Reporting
  • Dashboard includes:

    1. Pathway-level views:
      • perturbed genes
        • network structure
        • DEG vs propagated genes
    2. Cluster-level summaries:
      • shared pathways/genes
      • candidate drugs and MOA
    3. LLM-generated summaries using Tahoe perturbation datasets
      • Tahoe Bio LLMs: a gigascale single-cell perturbational atlas (May 2025)
        • biological interpretation
        • therapeutic hypotheses

  1. Possible 'new' questions:

    1. Given a primary site, a subtype, and a stage
      • are all samples similar?
      • what kind of information does a clustering return?
    2. For each clustering, are the clusters similar to:
      • EXCEPTIONAL_RESPONDERS?
      • Organoids?

  1. Docker config:


⚙️ Under development

PhD Flavio Lichtenstein
email: flalix@gmail.com
phone: +55-11-96560-1960
local: Brazil/Sao Paulo

About

Perturb Agent focuses on therapeutic targets (under development)
