PubMind is a large language model (LLM)-assisted framework for Publication Mutation and information Discovery, designed to extract variant–disease–pathogenicity relationships directly from biomedical literature.

WGLab/PubMind

PubMind is an AI-driven framework that uses large language models (LLMs) to extract genetic variant–disease–pathogenicity associations directly from biomedical literature. It combines fine-tuned BERT models for input filtering with instruction-tuned LLMs for extracting variant, disease, and functional evidence, covering SNVs, CNVs, SVs, and gene fusions. Extracted variants are normalized to genomic and transcript coordinates and stored in PubMind-DB, a web-accessible knowledgebase. Applied to >41M PubMed abstracts and >5M PMC full texts, PubMind-DB contains ~0.7M consolidated unique variants with rich annotations, of which only ~10% overlap with ClinVar—yet >80% of those show concordant pathogenicity labels, including full agreement for four-star expert-reviewed variants. PubMind provides a scalable, generalizable, and open-source framework that transforms unstructured text into structured genomic knowledge, supporting variant interpretation and precision medicine.

Prerequisites

Please refer to requirements.txt for required packages.

Run PubMind

See run_PubMind.ipynb for a worked example of how to use PubMind. All inputs and outputs from that example run are in the example folder.

The PubMind framework includes the following modules:

  1. Filtering Module (fine-tuned BERT model)
    • Wangwpi/PubMind_finetuned_BERT (Hugging Face)
  2. Inference Module (instruction-tuned LLM)
    • meta-llama/Llama-3.3-70B-Instruct (Hugging Face)
  3. Normalization Module
    • Quality filter (gene name, pathogenicity)
    • Variant parser (cDNA, protein, rsID)
    • Map to transcript
    • Map to genome coordinates
    • MONDO disease name
    • HPO term
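
To make the "variant parser" step of the Normalization Module more concrete, here is a minimal sketch of how cDNA, protein, and rsID mentions could be pulled out of a sentence with regular expressions. This is not PubMind's actual parser. The names `Mention` and `parse_variants` and the three patterns are hypothetical, and the patterns cover only a small subset of HGVS notation (simple substitutions, deletions, duplications, and insertions).

```python
import re
from dataclasses import dataclass

# Simplified, illustrative patterns (assumptions, not PubMind's own rules).
# Real HGVS nomenclature has many more forms than these cover.
CDNA_RE = re.compile(
    r"\bc\.\d+(?:[+-]\d+)?(?:_\d+(?:[+-]\d+)?)?"
    r"(?:[ACGT]>[ACGT]|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)"
)
PROTEIN_RE = re.compile(
    r"\bp\.(?:[A-Z][a-z]{2}|[A-Z])\d+(?:[A-Z][a-z]{2}|[A-Z]|\*|=)"
)
RSID_RE = re.compile(r"\brs\d+\b")


@dataclass(frozen=True)
class Mention:
    kind: str   # "cdna", "protein", or "rsid"
    text: str   # the matched string, exactly as written in the input
    start: int  # character offset of the match in the input


def parse_variants(text: str) -> list[Mention]:
    """Return all variant mentions found in `text`, ordered by position."""
    mentions = []
    for kind, pattern in (
        ("cdna", CDNA_RE),
        ("protein", PROTEIN_RE),
        ("rsid", RSID_RE),
    ):
        for match in pattern.finditer(text):
            mentions.append(Mention(kind, match.group(0), match.start()))
    return sorted(mentions, key=lambda m: m.start)


if __name__ == "__main__":
    sentence = "BRAF c.1799T>A (p.Val600Glu, rs113488022) was reported."
    for mention in parse_variants(sentence):
        print(mention.kind, mention.text)
```

Keeping the character offset with each mention makes it possible to link a cDNA change, a protein change, and an rsID that appear next to each other before the later mapping steps to transcript and genome coordinates.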

PubMind-DB

PubMind-DB can be accessed at https://pubmind.wglab.org/

Reference (Preprint)

Wang, P. and K. Wang (2025). PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models. bioRxiv: 2025.10.13.682183.

License

PubMind is freely available for academic use. For license details, please refer to this page.
