The Bioinformatics Text-Mining System is a platform for extracting chemical information from large datasets, generating abstracts from it, and fine-tuning language models on those abstracts. It combines SQL databases, large language models, and evaluation metrics to streamline the extraction and analysis of chemical data for research and development.
The pipeline covers four stages, from data extraction to model evaluation:
- Data Extraction: SQL queries extract data from ChemFOnt and convert it into triplets for further analysis.
- Abstract Generation: Mistral-7b-OpenOrca and GPT-3.5-Turbo generate abstracts from the extracted triplets.
- Model Fine-Tuning: The generated abstracts are used to fine-tune models such as Starling-7B-Alpha and Mixtral-8x7B-Instruct-V0.1 so that they can process chemical articles and return relevant information.
- Evaluation: The fine-tuned models are evaluated with standard metrics (Precision, Recall, and F1-score) and with two novel metrics, bonus Precision and bonus Recall. Jaro-Winkler similarity is used to compare text. The fine-tuned Mixtral achieves a false positive percentage of 8%, outperforming GPT-3.5-Turbo on this measure.
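To make the evaluation stage above more concrete, here is a minimal sketch of how Jaro-Winkler similarity can feed into Precision, Recall, and F1-score. It is not taken from the repository's scripts: the function names, the 0.9 match threshold, and the greedy one-to-one matching are assumptions chosen for illustration, and the bonus Precision and bonus Recall metrics are not shown because their definitions are not given here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters match only if their positions differ by at most `window`.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


def precision_recall_f1(predicted, gold, threshold=0.9):
    """Score predictions against gold strings (assumed scheme, see lead-in).

    A prediction is a true positive if its best still-unused gold string
    has a case-insensitive Jaro-Winkler similarity >= threshold.
    """
    unused = list(gold)
    tp = 0
    for p in predicted:
        best = max(unused, key=lambda g: jaro_winkler(p.lower(), g.lower()), default=None)
        if best is not None and jaro_winkler(p.lower(), best.lower()) >= threshold:
            tp += 1
            unused.remove(best)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy strings for illustration only, not results from this project.
gold = ["acetylsalicylic acid", "caffeine"]
predicted = ["Acetylsalicylic acid", "glucose"]
print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```

A similarity threshold lets near-identical spellings count as matches instead of requiring exact string equality; the right threshold depends on the data and would need to be tuned.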
The repository is organized into the following directories:
- Abstract_Generation: Contains abstracts generated by Mistral-7b-OpenOrca and GPT-3.5-Turbo.
- ChemFont_Tagger_Files: Edited version of the ChemFont_Tagging directory, containing text files for all predicates in ChemFOnt.
- Clusters: Files containing different clustered variations of the abstracts/triplets.
- Data: Includes all training and test sets used in the system.
- Evaluation: Holds all resulting evaluations and graphs generated during model evaluation.
- Finetune: Contains fine-tuning scripts utilized in the system.
- Scripts: Houses all other scripts and notebooks used for various tasks.
- Text_files: Miscellaneous text files used within the system.
Together, these directories keep every component of the Bioinformatics Text-Mining System in one place, which simplifies data analysis and exploration in bioinformatics and chemistry.
- Clone the Repository:
    git clone https://github.com/navdeep5/Bioinformatics_Text_Mining_Chatbot.git
- Install Dependencies:
    pip install -r requirements.txt
- Ensure you have access to Jupyter Notebooks and to GPUs for training (A100s were used for this project).
To use the Bioinformatics Text-Mining System, follow these steps:
- Use Scripts to access all scripts including those used for triplet extraction and generation, abstract generation, preparing datasets, cleaning results, drawing graphs, etc.
- Use Finetune to access all fine-tuning scripts used in this system.
- Use Manual Evaluation to create Excel spreadsheets, mark them by hand, and then re-run it to compute metrics.
The system's main features are:
- Data extraction from SQL databases
- Abstract generation using language models
- Model fine-tuning for enhanced information retrieval
- Evaluation using various metrics including Precision, Recall, and F1-score
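As an illustration of the first feature, the sketch below shows one way rows from a SQL database could be turned into (subject, predicate, object) triplets. The ChemFOnt schema is not described here, so the `compounds` and `annotations` tables, their columns, the `extract_triplets` helper, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def extract_triplets(conn):
    """Return (subject, predicate, object) tuples from the hypothetical schema."""
    query = """
        SELECT c.name, a.predicate, a.object
        FROM annotations AS a
        JOIN compounds AS c ON c.id = a.compound_id
        ORDER BY c.name, a.predicate, a.object
    """
    return [tuple(row) for row in conn.execute(query)]


# In-memory database with made-up tables and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compounds (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE annotations (
        compound_id INTEGER NOT NULL REFERENCES compounds(id),
        predicate TEXT NOT NULL,
        object TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO compounds (id, name) VALUES (?, ?)",
    [(1, "caffeine"), (2, "aspirin")],
)
conn.executemany(
    "INSERT INTO annotations (compound_id, predicate, object) VALUES (?, ?, ?)",
    [(1, "has_role", "stimulant"), (2, "has_role", "analgesic")],
)

triplets = extract_triplets(conn)
for subject, predicate, obj in triplets:
    print(subject, predicate, obj)
```

Keeping the join and ordering inside the query means the triplets come back in a stable order, which makes downstream steps such as prompt construction easier to reproduce.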
Contributions to the Bioinformatics Text-Mining System are welcome! If you have any suggestions, feature requests, or bug reports, please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
For inquiries or collaborations, feel free to reach out via email.