genesetGPT is a Python package that enables researchers to precisely summarize individual genes and larger gene sets using LLMs. Both OpenAI and Anthropic models are currently supported for gene set summarization via each organization's APIs. The LLMs are strictly guided by functional information pulled from databases such as the Human Protein Atlas, UniProt, and NCBI's Entrez database, along with user-provided biological context concerning the system being studied.
In order to install and start using genesetGPT we recommend a uv-based workflow. From here on out, we assume a Unix-based system, though the commands for a Windows system are very similar. First, create and navigate to a directory that will house your analysis (we'll call it gene-set-analysis here, but feel free to choose your own name) like so:
mkdir gene-set-analysis
cd gene-set-analysis
Next, initialize your uv project and create a virtual environment (named .venv by default), making sure to use our required minimum Python version:
uv init --python 3.12
uv venv --python 3.12
Activate your virtual environment:
source .venv/bin/activate
Now you can install genesetGPT and its dependencies from this GitHub repository using uv's pip interface:
uv pip install git+https://github.com/jr-leary7/genesetGPT.git
To use genesetGPT, you'll need at minimum an OpenAI or Anthropic API key (linked to a funded account) and a MIM API key that you've previously registered for. Optionally, you can provide an Entrez API key linked to your NCBI account; this is free, and only serves to increase the rate limit of your requests to that database from 3/sec to 10/sec. We recommend storing these keys in a plaintext file called .env in the root directory of your project, then loading them into Python using a combination of the load_dotenv() function from the python-dotenv package and the os.getenv() function.
For example, you should format your .env file to look like this:
MIM_API_KEY='01234'
ANTHROPIC_API_KEY='56789'
Import the necessary libraries, then set your API keys from .env as environment variables:
import os
from dotenv import load_dotenv
load_dotenv()
Lastly, define your API keys as variables in your Python session:
mim_key = os.getenv('MIM_API_KEY')
anthropic_key = os.getenv('ANTHROPIC_API_KEY')
Warning
Be careful never to commit the .env file containing your API keys to a code-hosting service such as GitHub. To prevent this, add .env to the .gitignore file in your project's root directory immediately after creating the .env file. In addition, avoid sharing a single API key between multiple users.
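Before running anything that calls a paid API, it can be useful to confirm that the setup described above is in place. The following is a minimal sketch and is not part of genesetGPT: the helper names missing_keys and env_file_is_ignored are hypothetical, the key names are taken from the example .env file, and the .gitignore check is a plain line match that does not implement git's full pattern rules.

```python
import os
from pathlib import Path

# Key names follow the example .env file; adjust them to the provider you use.
REQUIRED_KEYS = ("MIM_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(names=REQUIRED_KEYS, environ=None):
    """Return the names in `names` that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


def env_file_is_ignored(project_root="."):
    """Return True if .gitignore in `project_root` has a line naming .env.

    Only the exact entries `.env` and `/.env` are recognized; wildcard
    patterns are deliberately not interpreted.
    """
    gitignore = Path(project_root) / ".gitignore"
    if not gitignore.is_file():
        return False
    entries = {line.strip() for line in gitignore.read_text().splitlines()}
    return bool(entries & {".env", "/.env"})


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing API keys:", ", ".join(absent))
    if not env_file_is_ignored():
        print("Warning: .env is not listed in .gitignore")
```

Run this after calling load_dotenv() so that the values from .env are already present in the environment.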
Load our package and the other necessary ones (the Anthropic LLM backend is used going forward), then import an example set of 50 genes that were significantly differentially expressed in a cluster of B cells in the 10X Genomics PBMC3k dataset. See this script for processing details.
import anthropic
import pandas as pd
import genesetgpt as gpt
bcell_genes = gpt.load_example_gene_set()
Next, load your API keys as described in the previous section of this README.
mim_key = os.getenv('MIM_API_KEY')
entrez_key = os.getenv('ENTREZ_API_KEY')
entrez_email = os.getenv('ENTREZ_EMAIL')
claude_key = os.getenv('ANTHROPIC_API_KEY')
Use these two helper functions to load DataFrames containing mappings between Ensembl, Entrez, HGNC symbol, and MIM IDs.
all_hs_genes = gpt.fetch_gene_table()
mim_table = gpt.fetch_mim_table()
Now you can construct a DataFrame containing per-gene summarization prompts based on information pulled from Entrez, HPA, UniProt, etc.
user_prompt_df = gpt.build_prompt_df(
    gene_list=bcell_genes,
    gene_id_table=all_hs_genes,
    mim_mapping_table=mim_table,
    mim_api_key=mim_key,
    entrez_email=entrez_email,
    entrez_api_key=entrez_key
)
Next, initialize your Claude LLM client using your API key.
claude_client = anthropic.Anthropic(api_key=claude_key)
Each individual gene is then concisely summarized and confidence-scored based on the gene-level prompts from the previous step.
gene_sumys = gpt.summarize_individual_genes(
    user_prompt_df=user_prompt_df,
    provider='anthropic',
    client=claude_client,
    model='claude-haiku-4-5',
    n_workers=4
)
Lastly, the entire gene set is summarized, scored, and named based on the per-gene LLM summaries.
module_sumy = gpt.summarize_module(
    module_genes=bcell_genes,
    gene_sumy_df=gene_sumys,
    provider='anthropic',
    client=claude_client,
    model='claude-haiku-4-5'
)
module_sumy_df = module_sumy['module_summary_df']
In this repository's notebooks subdirectory there are several marimo notebooks (a drop-in replacement for Jupyter notebooks that stores everything as versionable Python code) demonstrating how to use the package.
Important
Each example notebook imports additional dependencies that are not included with the default genesetGPT install, e.g., scikit-learn, scanpy[skmisc], and squidpy for the spatially variable gene modules case study. When launched, each marimo notebook will immediately tell you which additional dependencies are missing from your virtual environment and provide instructions for adding them.
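If you would rather check for these extras before launching a notebook, a sketch like the following uses only the standard library. The function name find_missing_modules is hypothetical, and the listed module names are the assumed import names for the packages mentioned above (scikit-learn is imported as sklearn); marimo's own prompt remains the authoritative check.

```python
from importlib.util import find_spec


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed for the extras used by the spatial case study.
SPATIAL_EXTRAS = ("sklearn", "scanpy", "squidpy")

if __name__ == "__main__":
    missing = find_missing_modules(SPATIAL_EXTRAS)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All optional dependencies found.")
```

Because find_spec only locates a module without importing it, the check is quick even for heavy packages.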
For example, to load the spatially-resolved transcriptomics case study notebook, execute the following in your terminal (with your virtual environment activated):
marimo edit notebooks/spatial_case_study.py
If you encounter any issues with the package or need assistance in performing your analysis, please open an issue or reach out via email.