ConardLi/easy-dataset

A powerful tool for creating fine-tuning datasets for Large Language Models


Features | Quick Start | Documentation | Contributing | License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
  • Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Automatically builds global domain labels for datasets, providing project-wide understanding
  • Answer Generation: Uses the configured LLM API to generate comprehensive answers and Chain of Thought (CoT) reasoning
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses
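
Because Easy Dataset talks to any OpenAI-format API, configuring a provider amounts to supplying a base URL, a model name, and an API key. As a minimal sketch (the model name and prompts below are placeholders, not Easy Dataset's own code), this is the request body shape such an API expects:

```python
import json

def build_chat_request(model, system_prompt, user_prompt):
    """Build an OpenAI-format /v1/chat/completions request body.

    Any provider that accepts this payload shape (plus a base URL
    and API key) can serve as an answer-generation backend.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.7,
    }

body = build_chat_request(
    "gpt-4o-mini",  # placeholder model name
    "You are a domain expert annotator.",
    "Summarize the key ideas of this text segment.",
)
print(json.dumps(body, indent=2))
```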

Quick Demo

Demo video: ed3.mp4

Local Run

Download Client

| Windows   | MacOS                     | Linux    |
| --------- | ------------------------- | -------- |
| Setup.exe | Intel / M (Apple Silicon) | AppImage |
Install with NPM

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Install dependencies:

```bash
npm install
```

  3. Build and start the server:

```bash
npm run build
npm run start
```

  4. Open your browser and visit http://localhost:1717

Using the Official Docker Image

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Edit the docker-compose.yml file:

```yaml
services:
  easy-dataset:
    image: ghcr.io/conardli/easy-dataset
    container_name: easy-dataset
    ports:
      - '1717:1717'
    volumes:
      - ${LOCAL_DB_PATH}:/app/local-db
      - ${LOCAL_PRISMA_PATH}:/app/prisma
    restart: unless-stopped
```

Note: Set the LOCAL_DB_PATH and LOCAL_PRISMA_PATH environment variables to the actual paths where you want to store the local database. To keep the database paths consistent with an NPM start, point them at the local-db and prisma folders in the repository directory.

  3. Start with docker-compose:

```bash
docker-compose up -d
```

  4. Open a browser and visit http://localhost:1717

Building with a Local Dockerfile

If you want to build the image yourself, use the Dockerfile in the project root directory:

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Build the Docker image:

```bash
docker build -t easy-dataset .
```

  3. Run the container:

```bash
docker run -d \
  -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  -v {LOCAL_PRISMA_PATH}:/app/prisma \
  --name easy-dataset \
  easy-dataset
```

Note: Replace {YOUR_LOCAL_DB_PATH} and {LOCAL_PRISMA_PATH} with the actual paths where you want to store the local database. To keep the database paths consistent with an NPM start, use the local-db and prisma folders in the repository directory.

  4. Open a browser and visit http://localhost:1717

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings

Process Documents

  1. Upload your files in the "Text Split" section (PDF, Markdown, TXT, and DOCX are supported);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed;
  5. Export your dataset
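
The two export formats differ mainly in shape: Alpaca uses flat instruction/input/output fields, while ShareGPT nests a list of conversation turns. As an illustrative sketch (field names follow the common conventions for these formats, not necessarily Easy Dataset's exact output), converting one record looks like this:

```python
def alpaca_to_sharegpt(rec, system_prompt=None):
    """Convert one Alpaca-style record to a ShareGPT-style record.

    Alpaca:   {"instruction": ..., "input": ..., "output": ...}
    ShareGPT: {"conversations": [{"from": "human"|"gpt", "value": ...}, ...]}
    """
    human = rec["instruction"]
    if rec.get("input"):
        # Alpaca's optional input is appended to the user turn.
        human += "\n" + rec["input"]
    conv = []
    if system_prompt:
        conv.append({"from": "system", "value": system_prompt})
    conv.append({"from": "human", "value": human})
    conv.append({"from": "gpt", "value": rec["output"]})
    return {"conversations": conv}

record = {
    "instruction": "What is dataset distillation?",
    "input": "",
    "output": "Dataset distillation compresses a large dataset ...",
}
sharegpt = alpaca_to_sharegpt(record, system_prompt="Answer concisely.")
```

Either shape can then be written out as a single JSON array or as one JSON object per line (JSONL).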

Documentation

Community Practice

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (`git checkout -b feature/amazing-feature`)
  3. Make your changes
  4. Commit your changes (`git commit -m 'Add some amazing feature'`)
  5. Push to the branch (`git push origin feature/amazing-feature`)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Citation

If this work is helpful, please kindly cite as:

```bibtex
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}
```


Built with ❤️ by ConardLi • Follow me: WeChat Official Account | Bilibili | Juejin | Zhihu | Youtube
