Bioinformatics Text-Mining System

The Bioinformatics Text-Mining System is a platform for extracting chemical information from large datasets, generating chemical abstracts from that information, and fine-tuning language models on the results. It combines SQL databases, large language models, and a set of evaluation metrics to streamline the extraction and analysis of chemical data for research and development.

Overview

This system encompasses a series of processes, from data extraction to model evaluation, aimed at efficiently handling chemical information:

  • Data Extraction: Utilizing SQL, the system extracts data from ChemFOnt and transforms it into triplets for further analysis.
  • Abstract Generation: Leveraging Mistral-7b-OpenOrca and GPT-3.5-Turbo, the system generates abstracts based on the extracted triplets.
  • Model Fine-Tuning: Using the generated abstracts, the system fine-tunes models such as Starling-7B-Alpha and Mixtral-8x7B-Instruct-V0.1 to enhance their ability to process chemical articles and return relevant information.
  • Evaluation: The system evaluates the fine-tuned models using standard metrics (Precision, Recall, and F1-score) along with two novel metrics, bonus Precision and bonus Recall. Jaro-Winkler similarity is used for text similarity comparison. The fine-tuned Mixtral achieves a false positive percentage of 8%, outperforming GPT-3.5-Turbo on this measure.
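
As an illustration of the text similarity comparison mentioned above, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It is not taken from this repository: the function names `jaro` and `jaro_winkler` are hypothetical, and the prefix scale of 0.1 with a maximum prefix length of 4 are the commonly used defaults, assumed here rather than confirmed by the project.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

A score near 1.0 indicates near-identical strings, so a threshold on this score is one possible way to treat slightly different spellings of the same term as a match. Whether and where the system applies such a threshold is not stated here.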

Files

  • Abstract_Generation: Contains abstracts generated by Mistral-7b-OpenOrca and GPT-3.5-Turbo.
  • ChemFont_Tagger_Files: Edited version of the ChemFont_Tagging directory, containing text files for all predicates in ChemFOnt.
  • Clusters: Files containing different clustered variations of the abstracts/triplets.
  • Data: Includes all training and test sets used in the system.
  • Evaluation: Holds all resulting evaluations and graphs generated during model evaluation.
  • Finetune: Contains fine-tuning scripts utilized in the system.
  • Scripts: Houses all other scripts and notebooks used for various tasks.
  • Text_files: Miscellaneous text files used within the system.

This repository serves as a centralized hub for accessing and managing the diverse components of the Bioinformatics Text-Mining System, facilitating efficient data analysis and exploration in the field of bioinformatics and chemistry.

Installation

  1. Clone the Repository:

    git clone https://github.com/navdeep5/Bioinformatics_Text_Mining_Chatbot.git
  2. Install Dependencies:

    pip install -r requirements.txt
    • Ensure you have access to Jupyter Notebooks and to GPUs for training (A100s were used for this project).

Usage

To use the Bioinformatics Text-Mining System, follow these steps:

  1. Open Scripts for the scripts and notebooks that handle triplet extraction and generation, abstract generation, dataset preparation, result cleaning, graph drawing, etc.
  2. Open Finetune for the fine-tuning scripts used in this system.
  3. Run Manual Evaluation to create Excel spreadsheets for you to mark, then re-run it to compute metrics from the marked spreadsheets.

Features

  • Data extraction from SQL databases
  • Abstract generation using language models
  • Model fine-tuning for enhanced information retrieval
  • Evaluation using various metrics including Precision, Recall, and F1-score
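
For reference, the standard metrics listed above can be sketched as follows for sets of extracted triplets. This is a minimal illustration under stated assumptions, not the repository's evaluation code: the function name `precision_recall_f1` is hypothetical, the placeholder triplets are made up for the example, and matching is exact. The bonus Precision and bonus Recall metrics are not covered because they are not defined in this README.

```python
from typing import Iterable, Tuple

# A triplet is assumed to be (subject, predicate, object).
Triplet = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Iterable[Triplet], gold: Iterable[Triplet]
) -> Tuple[float, float, float]:
    """Exact-match Precision, Recall, and F1 between two sets of triplets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against division by zero when either set is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder triplets, invented purely for illustration.
gold = [
    ("compound_A", "predicate_1", "value_X"),
    ("compound_A", "predicate_2", "value_Y"),
    ("compound_B", "predicate_1", "value_Z"),
]
predicted = [
    ("compound_A", "predicate_1", "value_X"),
    ("compound_A", "predicate_2", "value_Y"),
    ("compound_B", "predicate_3", "value_W"),
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Here two of the three predicted triplets appear in the gold set, so Precision and Recall are both 2/3.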

Contributing

Contributions to the Bioinformatics Text-Mining System are welcome! If you have any suggestions, feature requests, or bug reports, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact Information

For inquiries or collaborations, feel free to reach out via email.
