This repository contains the code for generating the GQVis dataset available on Hugging Face.
The code generates a collection of natural-language queries on genomics data, each paired with a visualization specification written in the Gosling grammar.
📂 Dataset on Hugging Face: HIDIVE/GQVis
- Template Generation creates abstract questions and specifications with placeholders for samples, entities, and locations, as well as constraints on those samples and entities.
- Data-schema/All-schema are our defined dataset schemas retrieved from 4DN, ENCODE, and Chromoscope.
- Template Expansion reifies the template questions and specifications against the provided schemas, producing every combination that satisfies the constraints.
- Paraphraser will use an LLM framework to paraphrase input questions to cover different styles of expertise and formality in the input.
- Multi-step defines links, chains, and scripts to generate multi-step queries.
- Alt-Gosling exports bulk Alt-Gosling text based on the resulting .csv file.
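
The template generation and expansion steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `<S>` and `<E>` placeholder syntax, the `SCHEMA` and `TEMPLATE` layouts, and the `fill` and `expand` helpers are hypothetical and are not taken from this repository, and the `spec` dictionary is a toy stand-in rather than a valid Gosling specification.

```python
from itertools import product

# Hypothetical schema: one sample and three entities with a data type each.
SCHEMA = {
    "samples": ["sample_A"],
    "entities": [
        {"name": "peaks", "type": "interval"},
        {"name": "coverage", "type": "quantitative"},
        {"name": "genes", "type": "interval"},
    ],
}

# Hypothetical template: placeholders <S> (sample) and <E> (entity),
# plus a constraint that restricts which entities may fill <E>.
TEMPLATE = {
    "question": "Show <E> for <S>.",
    "spec": {"data": "<S>", "field": "<E>", "mark": "rect"},  # toy spec, not real Gosling
    "constraints": {"<E>": lambda entity: entity["type"] == "interval"},
}


def fill(value, bindings):
    """Recursively replace placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        for placeholder, replacement in bindings.items():
            value = value.replace(placeholder, replacement)
        return value
    if isinstance(value, dict):
        return {key: fill(item, bindings) for key, item in value.items()}
    if isinstance(value, list):
        return [fill(item, bindings) for item in value]
    return value


def expand(template, schema):
    """Return one question/spec pair per sample-entity combination that passes the constraint."""
    allowed = template["constraints"].get("<E>", lambda entity: True)
    entities = [entity for entity in schema["entities"] if allowed(entity)]
    rows = []
    for sample, entity in product(schema["samples"], entities):
        bindings = {"<S>": sample, "<E>": entity["name"]}
        rows.append({
            "question": fill(template["question"], bindings),
            "spec": fill(template["spec"], bindings),
        })
    return rows


expanded = expand(TEMPLATE, SCHEMA)
for row in expanded:
    print(row["question"], row["spec"])
```

In this sketch the constraint filters out the `coverage` entity, so only the two interval entities produce question/specification pairs. The actual pipeline may represent templates, constraints, and schemas quite differently.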
.
├── datasets/ # Source structured data files
├── main.py # Entry point for dataset generation
├── template_generation.py # Template generation (abstract questions and specifications)
├── out/ # Generated datasets (optional exports)
└── README.md # This file
