A powerful tool for creating fine-tuning datasets for Large Language Models
Features • Quick Start • Documentation • Contributing • License
If you like this project, please give it a Star ⭐️, or buy the author a coffee => Donate ❤️!
Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.
- Intelligent Document Processing: Supports recognition and processing of multiple formats, including PDF, Markdown, and DOCX
- Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
- Intelligent Question Generation: Extracts relevant questions from each text segment
- Domain Labels: Intelligently builds a global domain label tree for datasets, providing a global understanding of the source material
- Answer Generation: Uses the configured LLM API to generate comprehensive answers and Chain of Thought (CoT) reasoning
- Flexible Editing: Edit questions, answers, and datasets at any stage of the process
- Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
- User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
- Custom System Prompts: Add custom system prompts to guide model responses
| Windows | MacOS | Linux |
| ------- | ----- | ----- |
| Setup.exe | Intel / M series (Apple Silicon) | AppImage |
- Clone the repository:

  ```bash
  git clone https://github.com/ConardLi/easy-dataset.git
  cd easy-dataset
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Build and start the application:

  ```bash
  npm run build
  npm run start
  ```

- Open your browser and visit http://localhost:1717
If you want to build the image yourself, you can use the Dockerfile in the project root:
- Clone the repository:

  ```bash
  git clone https://github.com/ConardLi/easy-dataset.git
  cd easy-dataset
  ```

- Build the Docker image:

  ```bash
  docker build -t easy-dataset .
  ```

- Run the container:

  ```bash
  docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
  ```

  Note: Please replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.

- Open your browser and visit http://localhost:1717
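If you prefer Docker Compose, the same container can be described declaratively. Below is a minimal sketch that mirrors the `docker run` command above; the service name and the `./local-db` host path are illustrative choices, not values required by the project:

```yaml
# docker-compose.yml: minimal sketch mirroring the docker run command above
services:
  easy-dataset:
    build: .                      # build from the Dockerfile in the project root
    ports:
      - "1717:1717"               # serve the app on http://localhost:1717
    volumes:
      - ./local-db:/app/local-db  # persist the local file database on the host
    restart: unless-stopped
```

Start it with `docker compose up -d` and open http://localhost:1717 as before.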
- Click the "Create Project" button on the homepage;
- Enter a project name and description;
- Configure your preferred LLM API settings
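"OpenAI format" here simply means the provider exposes an OpenAI-style chat completions endpoint, so the settings boil down to a base URL, a model name, and an API key. As a rough sketch of what such an endpoint accepts (the URL, model name, and key below are placeholders, not values required by Easy Dataset):

```bash
# Illustrative OpenAI-format chat completions request; all values are placeholders
curl https://your-provider.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

Any provider that answers requests of this shape can be configured here.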
- Upload your files in the "Text Split" section (supports PDF, Markdown, txt, DOCX);
- View and adjust the automatically split text segments;
- View and adjust the global domain tree
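The domain tree is a hierarchical set of labels inferred from your documents and is reused later to organize the generated questions. As a purely hypothetical illustration (the labels are invented, not output produced by the tool), a tree for a cardiology corpus might look like:

```
Cardiovascular medicine
├── Diagnostics
│   ├── ECG interpretation
│   └── Imaging
└── Treatment
    ├── Pharmacology
    └── Interventional procedures
```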
- Batch construct questions based on text blocks;
- View and edit the generated questions;
- Organize questions using the label tree
- Batch construct datasets based on questions;
- Generate answers using the configured LLM;
- View, edit, and optimize the generated answers
- Click the "Export" button in the Datasets section;
- Choose your preferred format (Alpaca or ShareGPT);
- Select the file format (JSON or JSONL);
- Add custom system prompts as needed;
- Export your dataset
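For orientation, a single exported record looks roughly like the examples below. They follow the widely used Alpaca and ShareGPT conventions with invented content; Easy Dataset's exact field set may differ slightly, so check your own export.

Alpaca-style record:

```json
{
  "instruction": "What causes a prolonged QT interval?",
  "input": "",
  "output": "A prolonged QT interval can result from ..."
}
```

ShareGPT-style record:

```json
{
  "conversations": [
    { "from": "system", "value": "You are a cardiology assistant." },
    { "from": "human", "value": "What causes a prolonged QT interval?" },
    { "from": "gpt", "value": "A prolonged QT interval can result from ..." }
  ]
}
```

With JSON export the records are typically collected in one array; with JSONL each record occupies a single line of the file.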
```
easy-dataset/
├── app/ # Next.js application directory
│ ├── api/ # API routes
│ │ ├── llm/ # LLM API integration
│ │ │ ├── ollama/ # Ollama API integration
│ │ │ └── openai/ # OpenAI API integration
│ │ ├── projects/ # Project management API
│ │ │ ├── [projectId]/ # Project-specific operations
│ │ │ │ ├── chunks/ # Text chunk operations
│ │ │ │ ├── datasets/ # Dataset generation and management
│ │ │ │ ├── generate-questions/ # Batch question generation
│ │ │ │ ├── questions/ # Question management
│ │ │ │ └── split/ # Text splitting operations
│ │ │ └── user/ # User-specific project operations
│ ├── projects/ # Frontend project pages
│ │ └── [projectId]/ # Project-specific pages
│ │ ├── datasets/ # Dataset management UI
│ │ ├── questions/ # Question management UI
│ │ ├── settings/ # Project settings UI
│ │ └── text-split/ # Text processing UI
│ └── page.js # Homepage
├── components/ # React components
│ ├── datasets/ # Dataset-related components
│ ├── home/ # Homepage components
│ ├── projects/ # Project management components
│ ├── questions/ # Question management components
│ └── text-split/ # Text processing components
├── lib/ # Core libraries and tools
│ ├── db/ # Database operations
│ ├── i18n/ # Internationalization
│ ├── llm/ # LLM integration
│ │ ├── common/ # Common LLM tools
│ │ ├── core/ # Core LLM clients
│ │ └── prompts/ # Prompt templates
│ │ ├── answer.js # Answer generation prompts (Chinese)
│ │ ├── answerEn.js # Answer generation prompts (English)
│ │ ├── question.js # Question generation prompts (Chinese)
│ │ ├── questionEn.js # Question generation prompts (English)
│ │ └── ... other prompts
│ └── text-splitter/ # Text splitting tools
├── locales/ # Internationalization resources
│ ├── en/ # English translations
│ └── zh-CN/ # Chinese translations
├── public/ # Static resources
│ └── imgs/ # Image resources
└── local-db/ # Local file database
    └── projects/ # Project data storage
```
- View the demo video of this project: Easy Dataset Demo Video
- For detailed documentation on all features and APIs, visit our Documentation Site
Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
- Fork the repository
- Create a new branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request (submit to the DEV branch)
Please make sure tests are updated where appropriate and that your changes adhere to the existing coding style.
Contact us: https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.