Bioinformatics Text-Mining System

The Bioinformatics Text-Mining System is a platform for extracting chemical information from large datasets, generating chemical abstracts from it, and fine-tuning language models on those abstracts. It combines SQL databases, large language models, and a set of evaluation metrics to streamline the extraction and analysis of chemical data for research and development purposes.

Overview

This system encompasses a series of processes, from data extraction to model evaluation, aimed at efficiently handling chemical information:

Data Extraction: Utilizing SQL, the system extracts data from ChemFOnt and transforms it into triplets for further analysis.
Abstract Generation: Leveraging Mistral-7b-OpenOrca and GPT-3.5-Turbo, the system generates abstracts based on the extracted triplets.
Model Fine-Tuning: Using the generated abstracts, the system fine-tunes models such as Starling-7B-Alpha and Mixtral-8x7B-Instruct-V0.1 to enhance their ability to process chemical articles and return relevant information.
Evaluation: The system evaluates the fine-tuned models using standard metrics (Precision, Recall, and F1-score) along with novel metrics (bonus Precision and bonus Recall). Jaro-Winkler similarity is used for text similarity comparison. The fine-tuned Mixtral achieves a false positive percentage of 8%, outperforming GPT-3.5-Turbo on this measure.
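
The data extraction step in the list above can be illustrated with a minimal sketch. It is not the repository's actual extraction code: the table name `chem_relations`, its columns, and the demo rows are hypothetical stand-ins, and an in-memory SQLite database replaces the real ChemFOnt SQL source so the example runs on its own.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical relations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chem_relations ("
        "id INTEGER PRIMARY KEY, compound_name TEXT, relation TEXT, target TEXT)"
    )
    # Illustrative rows only; these are not taken from ChemFOnt.
    conn.executemany(
        "INSERT INTO chem_relations (compound_name, relation, target) "
        "VALUES (?, ?, ?)",
        [
            ("caffeine", "has_role", "stimulant"),
            ("ethanol", "has_role", "solvent"),
        ],
    )
    conn.commit()
    return conn


def rows_to_triplets(conn):
    """Read each row and return it as a (subject, predicate, object) tuple."""
    cursor = conn.execute(
        "SELECT compound_name, relation, target FROM chem_relations ORDER BY id"
    )
    return [(subject, predicate, obj) for subject, predicate, obj in cursor]


if __name__ == "__main__":
    connection = build_demo_db()
    for triplet in rows_to_triplets(connection):
        print(triplet)
```

Each triplet produced this way could then be passed to the abstract generation step as structured input; how the repository formats triplets for the language models is not specified here.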

Files

Abstract_Generation: Contains abstracts generated by Mistral-7b-OpenOrca and GPT-3.5-Turbo.
ChemFont_Tagger_Files: Edited version of the ChemFont_Tagging directory, containing text files for all predicates in ChemFOnt.
Clusters: Files containing different clustered variations of the abstracts/triplets.
Data: Includes all training and test sets used in the system.
Evaluation: Holds all resulting evaluations and graphs generated during model evaluation.
Finetune: Contains fine-tuning scripts utilized in the system.
Scripts: Houses all other scripts and notebooks used for various tasks.
Text_Files: Miscellaneous text files used within the system.

This repository serves as a centralized hub for accessing and managing the diverse components of the Bioinformatics Text-Mining System, facilitating efficient data analysis and exploration in the field of bioinformatics and chemistry.

Installation

Clone the Repository:

git clone https://github.com/navdeep5/Bioinformatics_Text_Mining_Chatbot.git

Install Dependencies:
```
pip install -r requirements.txt
```
- Ensure you have access to Jupyter Notebook and to GPUs for training (A100s were used for this project).

Usage

To use the Bioinformatics Text-Mining System, start from the following entry points:

Use Scripts to access all scripts including those used for triplet extraction and generation, abstract generation, preparing datasets, cleaning results, drawing graphs, etc.
Use Finetune to access all fine-tuning scripts used in this system.
Use Manual Evaluation to create Excel spreadsheets, mark them by hand, and then re-run it to compute the metrics.
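
As a rough illustration of the mark-then-compute workflow, the sketch below derives Precision, Recall, and F1-score from hand-assigned marks. It makes several assumptions that are not taken from the repository: a CSV string stands in for the Excel spreadsheet so that only the standard library is needed, the column names `triplet` and `mark` are hypothetical, and each row is assumed to be marked `TP`, `FP`, or `FN`. The bonus Precision and bonus Recall metrics are not covered.

```python
import csv
import io


def metrics_from_marks(csv_text):
    """Compute precision, recall and F1 from rows marked TP, FP or FN."""
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        mark = row["mark"].strip().upper()
        if mark in counts:  # rows with any other mark are ignored
            counts[mark] += 1

    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical marked sheet, exported as CSV for this sketch.
    marked_sheet = (
        "triplet,mark\n"
        "caffeine|has_role|stimulant,TP\n"
        "ethanol|has_role|solvent,TP\n"
        "ethanol|has_role|stimulant,FP\n"
        "glucose|has_role|metabolite,FN\n"
    )
    print(metrics_from_marks(marked_sheet))
```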

Features

Data extraction from SQL databases
Abstract generation using language models
Model fine-tuning for enhanced information retrieval
Evaluation using various metrics including Precision, Recall, and F1-score
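
The evaluation also relies on Jaro-Winkler similarity for comparing text. The sketch below shows the standard textbook formulation, assuming the common defaults of a 0.1 prefix scale and a prefix capped at 4 characters. The repository may use a library implementation or different parameters, so treat this as an illustration of the metric and not as the project's own code.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

A score near 1 indicates nearly identical strings, which makes the metric tolerant of small spelling differences between an extracted term and its reference.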

Contributing

Contributions to the Bioinformatics Text-Mining System are welcome! If you have any suggestions, feature requests, or bug reports, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact Information

For inquiries or collaborations, feel free to reach out via email.
