ConardLi/easy-dataset

A powerful tool for creating fine-tuning datasets for Large Language Models


Features | Quick Start | Documentation | Contributing | License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
  • Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Automatically builds global domain labels for datasets, providing project-wide understanding
  • Answer Generation: Uses the configured LLM API to generate comprehensive answers and Chain of Thought (CoT) reasoning
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses
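
Because Easy Dataset talks to any OpenAI-format API, configuring a provider amounts to supplying a base URL, a model name, and an API key. As a minimal sketch (the model name and prompts below are placeholders, not Easy Dataset's own code), this is the request body shape such an API expects:

```python
import json

def build_chat_request(model, system_prompt, user_prompt):
    """Build an OpenAI-format /v1/chat/completions request body.

    Any provider that accepts this payload shape (plus a base URL
    and API key) can serve as an answer-generation backend.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.7,
    }

body = build_chat_request(
    "gpt-4o-mini",  # placeholder model name
    "You are a domain expert annotator.",
    "Summarize the key ideas of this text segment.",
)
print(json.dumps(body, indent=2))
```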

Quick Demo

Demo video: ed3.mp4

Local Run

Download Client

| Windows   | MacOS                     | Linux    |
| --------- | ------------------------- | -------- |
| Setup.exe | Intel / M (Apple Silicon) | AppImage |
Install with NPM

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Install dependencies:

```bash
npm install
```

  3. Build and start the server:

```bash
npm run build
npm run start
```

  4. Open your browser and visit http://localhost:1717

Using the Official Docker Image

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Edit the docker-compose.yml file:

```yaml
services:
  easy-dataset:
    image: ghcr.io/conardli/easy-dataset
    container_name: easy-dataset
    ports:
      - '1717:1717'
    volumes:
      - ${LOCAL_DB_PATH}:/app/local-db
      - ${LOCAL_PRISMA_PATH}:/app/prisma
    restart: unless-stopped
```

Note: Set the LOCAL_DB_PATH and LOCAL_PRISMA_PATH environment variables to the actual paths where you want to store the local database. To keep the database paths consistent with an NPM start, point them at the local-db and prisma folders in the repository directory.

  3. Start with docker-compose:

```bash
docker-compose up -d
```

  4. Open a browser and visit http://localhost:1717

Building with a Local Dockerfile

If you want to build the image yourself, use the Dockerfile in the project root directory:

  1. Clone the repository:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

  2. Build the Docker image:

```bash
docker build -t easy-dataset .
```

  3. Run the container:

```bash
docker run -d \
  -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  -v {LOCAL_PRISMA_PATH}:/app/prisma \
  --name easy-dataset \
  easy-dataset
```

Note: Replace {YOUR_LOCAL_DB_PATH} and {LOCAL_PRISMA_PATH} with the actual paths where you want to store the local database. To keep the database paths consistent with an NPM start, use the local-db and prisma folders in the repository directory.

  4. Open a browser and visit http://localhost:1717

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings

Process Documents

  1. Upload your files in the "Text Split" section (PDF, Markdown, TXT, and DOCX are supported);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed;
  5. Export your dataset
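
The two export formats differ mainly in shape: Alpaca uses flat instruction/input/output fields, while ShareGPT nests a list of conversation turns. As an illustrative sketch (field names follow the common conventions for these formats, not necessarily Easy Dataset's exact output), converting one record looks like this:

```python
def alpaca_to_sharegpt(rec, system_prompt=None):
    """Convert one Alpaca-style record to a ShareGPT-style record.

    Alpaca:   {"instruction": ..., "input": ..., "output": ...}
    ShareGPT: {"conversations": [{"from": "human"|"gpt", "value": ...}, ...]}
    """
    human = rec["instruction"]
    if rec.get("input"):
        # Alpaca's optional input is appended to the user turn.
        human += "\n" + rec["input"]
    conv = []
    if system_prompt:
        conv.append({"from": "system", "value": system_prompt})
    conv.append({"from": "human", "value": human})
    conv.append({"from": "gpt", "value": rec["output"]})
    return {"conversations": conv}

record = {
    "instruction": "What is dataset distillation?",
    "input": "",
    "output": "Dataset distillation compresses a large dataset ...",
}
sharegpt = alpaca_to_sharegpt(record, system_prompt="Answer concisely.")
```

Either shape can then be written out as a single JSON array or as one JSON object per line (JSONL).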

Documentation

Community Practice

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (`git checkout -b feature/amazing-feature`)
  3. Make your changes
  4. Commit your changes (`git commit -m 'Add some amazing feature'`)
  5. Push to the branch (`git push origin feature/amazing-feature`)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Citation

If this work is helpful, please kindly cite as:

```bibtex
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}
```


Built with ❤️ by ConardLi • Follow me: WeChat Official Account | Bilibili | Juejin | Zhihu | Youtube
