dillnelson2o/biosynth-jacks-agent


Biosynth Agent: Reverse Engineering Plant Secondary Metabolites

Biosynth Agent is a computational workbench designed for biologists studying plant specialized metabolism. It automates the cognitive process of retrosynthesis—working backward from a complex plant molecule to identify the biosynthetic gene clusters (BGCs) likely responsible for its creation.

🌿 Why This Matters for Plant Biology

Plants are the world's greatest chemists, producing hundreds of thousands of specialized metabolites (terpenes, alkaloids, phenolics). However, the "logic" of how these molecules are assembled is often buried in complex genomes.

This tool helps you learn and predict that logic by treating biosynthesis as a structured language:

  1. The Words: Chemical motifs (amides, double bonds, cyclizations).
  2. The Grammar: Enzymatic transformations (acyl-adenylation, desaturation, decarboxylation).
  3. The Speaker: The specific plant or host proteome executing these steps.

🧠 Scientific Principles: Logic in Biosynthesis

This pipeline is built on the core logical frameworks of scientific inquiry:

1. Deduction (Chemical Deconstruction)

  • Premise: A molecule exists (e.g., an Echinacea alkamide).
  • Rule: If an amide bond is present, it was formed by condensing an amine with a carboxylic acid, typically after the acid has been activated (e.g., by acyl-adenylation).
  • Execution: The agent uses Graph Theory (via RDKit) to computationally "cut" the molecule at these logical seams, identifying the precursor modules.
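The amide rule above can be sketched in a few lines of RDKit. This is a minimal illustration, not the code in fragmentation.py: the helper name split_at_amides and the SMARTS pattern are assumptions made for this example, and the input is the alkamide SMILES from the usage section.

```python
from rdkit import Chem

# Assumed pattern for this sketch: carbonyl carbon, its oxygen, and the amide nitrogen.
AMIDE_SMARTS = Chem.MolFromSmarts("[CX3](=[OX1])[NX3]")


def split_at_amides(smiles: str) -> list[str]:
    """Cut every amide C-N bond and return fragment SMILES (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # Each match is (carbonyl C, carbonyl O, amide N), in SMARTS atom order.
    bond_ids = sorted(
        {
            mol.GetBondBetweenAtoms(c_idx, n_idx).GetIdx()
            for c_idx, _o_idx, n_idx in mol.GetSubstructMatches(AMIDE_SMARTS)
        }
    )
    if not bond_ids:
        return [Chem.MolToSmiles(mol)]

    # FragmentOnBonds marks each cut position with a dummy atom ("*") by default.
    pieces = Chem.FragmentOnBonds(mol, bond_ids)
    return [Chem.MolToSmiles(frag) for frag in Chem.GetMolFrags(pieces, asMols=True)]


# Raw string so the backslash in the stereo bond is not read as a Python escape.
fragments = split_at_amides(r"CC#CC#CCC/C=C/C=C\C(=O)NCC(C)C")
print(fragments)
```

For this molecule the cut yields two pieces: the polyunsaturated acyl chain and the isobutylamine-derived fragment, each carrying a dummy atom where the amide bond was.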

2. Induction (Enzyme Hypothesis)

  • Observation: The precursor modules (e.g., a branched-chain amine) mimic known biological substrates.
  • Inference: Therefore, a specific family of enzymes (e.g., PLP-dependent decarboxylases) is likely required.
  • Execution: The agent maps these chemical modules to probabilistic enzyme families stored in a Knowledge Base.
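The mapping step can be pictured as a weighted lookup table. The sketch below is an assumption about the general shape of such a knowledge base, not the contents of data/enzyme_kb.json: the module labels, the extra family names, and all weights are placeholders invented for illustration.

```python
# Illustrative knowledge base: module label -> {enzyme family: prior weight}.
# Labels and weights are placeholders, NOT values taken from data/enzyme_kb.json.
ENZYME_KB = {
    "amide_bond": {"acyl-adenylating enzyme": 0.6, "acyltransferase": 0.4},
    "branched_chain_amine": {
        "PLP-dependent decarboxylase": 0.8,
        "aminotransferase": 0.2,
    },
    "double_bond": {"desaturase": 1.0},
}


def rank_enzyme_families(module: str, kb: dict = ENZYME_KB) -> list[tuple[str, float]]:
    """Return candidate families for a module, highest normalized weight first."""
    weights = kb.get(module, {})
    total = sum(weights.values())
    if total == 0:
        return []  # unknown module: no hypothesis rather than a guess
    return sorted(
        ((family, weight / total) for family, weight in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )


for family, probability in rank_enzyme_families("branched_chain_amine"):
    print(f"{family}: {probability:.2f}")
```

Normalizing per module keeps the weights comparable across modules with different numbers of candidate families.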

3. Verification (Genomic Realization)

  • Hypothesis: The host organism possesses genes encoding these functions.
  • Test: The agent mines the proteome using Hidden Markov Models (HMMs) to find and rank physical protein candidates.
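To make the ranking step concrete, here is a sketch that parses HMMER's tabular output (the file written by hmmsearch --tblout) and sorts hits by full-sequence bit score. The Hit class, the function names, the 50-bit cutoff, and the sample rows are all assumptions for illustration. The rows are synthetic and deliberately unsorted.

```python
from dataclasses import dataclass

# Synthetic rows in hmmsearch --tblout layout (made-up IDs and scores).
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
prot_B - example_profile - 0.012 12.4 0.0 0.015 12.1 0.0 1.0 1 0 0 1 1 0 0 hypothetical protein
prot_A - example_profile - 1.2e-80 265.3 0.1 1.5e-80 265.0 0.1 1.0 1 0 0 1 1 1 1 putative decarboxylase
prot_C - example_profile - 3.4e-25 88.7 0.3 4.0e-25 88.5 0.3 1.1 1 0 0 1 1 1 1 hypothetical protein
"""


@dataclass
class Hit:
    protein: str
    profile: str
    evalue: float
    bitscore: float


def parse_tblout(text: str) -> list[Hit]:
    """Read target name, query name, full-sequence E-value and bit score."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split(maxsplit=18)  # 18 fixed columns, then free-text description
        hits.append(Hit(cols[0], cols[2], float(cols[4]), float(cols[5])))
    return hits


def rank_hits(hits: list[Hit], min_bitscore: float = 50.0) -> list[Hit]:
    """Drop weak hits (assumed cutoff) and sort the rest, best first."""
    kept = [hit for hit in hits if hit.bitscore >= min_bitscore]
    return sorted(kept, key=lambda hit: hit.bitscore, reverse=True)


for hit in rank_hits(parse_tblout(SAMPLE_TBLOUT)):
    print(hit.protein, hit.bitscore)
```

A fixed cutoff is the simplest filter. Curated profile databases often ship per-profile gathering thresholds, which would be a natural refinement.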

📂 Repository Architecture

The codebase is structured to mirror the flow of biological information:

Plaintext
src/biosynth_agent/
├── chemistry/              # The "Chemist" (RDKit)
│   ├── fragmentation.py    # Deductive logic: Splitting molecules into precursors
│   └── rdkit_motifs.py     # Feature extraction: Identifying functional groups
├── planning/               # The "Architect" (Search Algorithms)
│   ├── beam.py             # Beam Search: Ranking the most likely pathways
│   ├── enzyme_mapping.py   # Knowledge Base: Linking chemistry to enzymes
│   └── search.py           # Scoring: Evaluating host feasibility (µ - ασ)
├── genomics/               # The "Geneticist" (HMMER)
│   └── gene_candidates.py  # Mining proteomes for specific gene sequences
└── cli.py                  # Orchestrator for the design phase

🛠️ Installation

Prerequisites

  • Python 3.10+

  • HMMER: Required for the genomic mining step.

    • Linux: sudo apt-get install hmmer
    • macOS: brew install hmmer

Setup

Bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/biosynth-agent.git
cd biosynth-agent

# Set up environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -e .

⚡ Usage: A Plant Study Example

Let's analyze a bioactive alkamide (often found in plants like Spilanthes or Echinacea). We want to understand how a host organism could synthesize it.

Step 1: Deconstruct the Molecule (Logic Layer)

We start with the SMILES string. The agent will fragment it and propose a pathway.

Bash
python -m biosynth_agent.cli \
  --smiles "CC#CC#CCC/C=C/C=C\C(=O)NCC(C)C" \
  --prefix echinacea_target \
  --hosts "cyanobacteria,yeast" \
  --beam 5

  • What happens: The agent applies deductive rules to split the amide bond, identifying an acyl chain and an amine. It then uses Beam Search to rank the best enzyme families to perform this coupling.
  • Output: results/echinacea_target_designpack.json
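For readers new to Beam Search, the sketch below shows the general technique on a two-step pathway: at each step, every partial pathway is extended with every candidate enzyme family, and only the best few (the beam width, presumably what --beam controls) are kept. The step probabilities and family names are invented placeholders, and this is not the implementation in beam.py.

```python
import math


def beam_search(steps: list[dict[str, float]], beam_width: int = 5):
    """Rank enzyme-family pathways by summed log-probability.

    steps: one dict per pathway step, mapping family name -> probability.
    Returns a list of (pathway tuple, log-probability), best first.
    """
    beams = [((), 0.0)]  # start with one empty pathway
    for candidates in steps:
        expanded = [
            (path + (family,), score + math.log(p))
            for path, score in beams
            for family, p in candidates.items()
            if p > 0  # log(0) is undefined; skip impossible candidates
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]  # prune to the beam width
    return beams


# Placeholder probabilities for illustration only.
pathway_steps = [
    {"PLP-dependent decarboxylase": 0.8, "aminotransferase": 0.2},  # amine formation
    {"acyl-adenylating enzyme": 0.6, "acyltransferase": 0.4},       # amide coupling
]

for path, log_p in beam_search(pathway_steps, beam_width=2):
    print(" -> ".join(path), f"(p = {math.exp(log_p):.2f})")
```

Working in log space avoids numerical underflow on long pathways. A narrow beam is faster but can discard a pathway that starts weakly and finishes strongly.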

Step 2: Find the Genes (Physical Layer)

Now, we ask: "Does my specific host (e.g., a cyanobacterium used as a model for the chloroplast) have the genes to do this?"

Bash
python -m biosynth_agent.gene_cli \
  --designpack results/echinacea_target_designpack.json \
  --host cyanobacteria \
  --proteome data/proteomes/syn6803.faa \
  --hmm_dir data/hmms \
  --out results/echinacea_genes.json

  • What happens: The agent takes the "Enzyme Hypotheses" from Step 1 and scans the proteome with the matching HMM profiles. It uses the resulting bit scores to separate high-probability candidates from noise.
  • Output: results/echinacea_genes.json
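How gene_cli drives HMMER internally is not described here. One plausible approach, shown purely as a sketch, is to build one hmmsearch command per profile and write tabular output for later ranking. The function name and the output naming scheme are assumptions. The --tblout and --noali flags are standard hmmsearch options.

```python
from pathlib import Path


def build_hmmsearch_commands(hmm_files, proteome, out_dir):
    """Return one hmmsearch argument list per HMM profile (hypothetical helper)."""
    commands = []
    for hmm_file in hmm_files:
        # One table per profile, named after the profile file (assumed scheme).
        table = Path(out_dir) / f"{Path(hmm_file).stem}.tbl"
        commands.append(
            [
                "hmmsearch",
                "--noali",             # skip alignment output; only scores are needed
                "--tblout", str(table),
                str(hmm_file),         # profile comes first...
                str(proteome),         # ...then the sequence database
            ]
        )
    return commands


commands = build_hmmsearch_commands(
    ["data/hmms/example_family.hmm"],  # placeholder profile name
    "data/proteomes/syn6803.faa",
    "results",
)
print(" ".join(commands[0]))
```

Each argument list can be passed to subprocess.run(cmd, check=True) once HMMER is installed. Passing a list rather than a shell string avoids quoting problems with unusual file names.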

🧪 Critical Thinking & Customization

As a computational biologist, you are encouraged to modify the "Brain" of the agent:

  1. Refine the Logic: Edit src/biosynth_agent/chemistry/fragmentation.py to add new rules for plant-specific bonds (e.g., ester linkages in rosmarinic acid).
  2. Tune the Scoring: Adjust the HostProfile in src/biosynth_agent/planning/search.py to model plant-specific constraints (e.g., cytosolic vs. plastidial localization penalties).
  3. Expand the Knowledge: Add new enzyme families to data/enzyme_kb.json to cover more diverse specialized metabolites.
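As a starting point for tuning the scoring, here is one reading of the µ - ασ score, sketched under stated assumptions: µ and σ are taken to be the mean and standard deviation of per-step feasibility estimates, and α sets how strongly uncertainty is penalized. The HostProfileSketch class and its fields are hypothetical and do not describe the real HostProfile in search.py.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class HostProfileSketch:
    """Hypothetical stand-in for a host profile; fields are assumptions."""
    name: str
    alpha: float = 1.0  # weight on uncertainty (sigma)
    localization_penalty: dict[str, float] = field(default_factory=dict)


def host_score(estimates, profile, compartment=None):
    """Risk-adjusted feasibility: mu - alpha * sigma - optional compartment penalty."""
    mu = mean(estimates)
    sigma = pstdev(estimates)  # population standard deviation of the estimates
    penalty = profile.localization_penalty.get(compartment, 0.0)
    return mu - profile.alpha * sigma - penalty


# Placeholder numbers for illustration only.
plant_like = HostProfileSketch(
    name="plant_like_host",
    alpha=1.0,
    localization_penalty={"plastid": 0.2},
)

steady = [0.7, 0.7, 0.7]  # consistent estimates
noisy = [0.4, 0.7, 1.0]   # same mean, higher spread

print(host_score(steady, plant_like))
print(host_score(noisy, plant_like))
print(host_score(steady, plant_like, compartment="plastid"))
```

With α > 0, a pathway with consistent estimates outranks one with the same average but more spread, which is the point of subtracting ασ.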

📄 License

MIT License.
