Avalderrama04/DQVis-Generation

 
 


GQVis Dataset: Natural Language to Genomics Visualization

This repository contains the code for generating the GQVis dataset available on Hugging Face.

The code generates a collection of natural language queries about genomics data, each paired with a visualization specification written in the Gosling grammar.

📂 Dataset on Hugging Face: HIDIVE/GQVis


🚀 Overview

Overview figure of the data generation pipeline

  1. Template Generation creates abstract questions and specifications with placeholders for samples, entities, and locations, along with constraints on those samples and entities.
  2. Data-schema/All-schema are the dataset schemas we defined, retrieved from 4DN, ENCODE, and Chromoscope.
  3. Template Expansion reifies the template questions and specifications against the provided schemas, producing every possibility that satisfies the constraints.
  4. Paraphraser uses an LLM framework to paraphrase the input questions so that they cover different levels of expertise and formality.
  5. Multi-step defines links, chains, and scripts to generate multi-step queries.
  6. Alt-Gosling exports bulk Alt-Gosling text based on the resulting .csv file.
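
The template expansion step can be illustrated with a minimal sketch. This is not the repository's implementation: the `<S>`/`<E>` placeholder syntax, the `template` and `schema` dictionaries, and the `expand` function are all hypothetical names chosen for illustration. The sketch only shows the general idea of filling placeholders with every schema field that satisfies a constraint.

```python
from itertools import product

# Hypothetical template: <S> stands for a sample, <E> for an entity (field).
# Each entity placeholder carries a constraint on the fields it may bind to.
template = {
    "question": "What is the distribution of <E> in <S>?",
    "constraints": {"E": lambda field: field["type"] == "quantitative"},
}

# Hypothetical schema: one sample with three typed fields.
schema = {
    "samples": [
        {
            "name": "peaks",
            "fields": [
                {"name": "signal", "type": "quantitative"},
                {"name": "strand", "type": "nominal"},
                {"name": "score", "type": "quantitative"},
            ],
        }
    ]
}


def expand(template, schema):
    """Return every concrete question whose field choices satisfy the constraints."""
    questions = []
    entity_names = sorted(template["constraints"])
    for sample in schema["samples"]:
        # Candidate fields for each entity placeholder, filtered by its constraint.
        candidates = [
            [f for f in sample["fields"] if template["constraints"][e](f)]
            for e in entity_names
        ]
        for combo in product(*candidates):
            # Skip assignments that bind the same field to two placeholders.
            if len({f["name"] for f in combo}) < len(combo):
                continue
            text = template["question"].replace("<S>", sample["name"])
            for entity, field in zip(entity_names, combo):
                text = text.replace(f"<{entity}>", field["name"])
            questions.append(text)
    return questions


for question in expand(template, schema):
    print(question)
```

In this sketch the nominal field `strand` is filtered out by the constraint, so only the two quantitative fields produce questions. The same loop could fill a specification template alongside the question so that each query stays paired with its visualization.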

🗂️ Folder Structure

.
├── datasets/        # Source structured data files
├── main.py          # Entry point for dataset generation
├── template_generation.py  # Template generation (abstract questions and specifications)
├── out/             # Generated datasets (optional exports)
└── README.md        # This file

About

Code for generating the training data used to fine-tune the LLM.
