Skip to content

ConardLi/easy-dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub Repo stars GitHub Downloads (all assets, all releases) GitHub Release AGPL 3.0 License GitHub contributors GitHub last commit

A powerful tool for creating fine-tuning datasets for Large Language Models

简体中文 | English

FeaturesQuick StartDocumentationContributingLicense

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
  • Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Intelligently builds global domain labels for datasets, with global understanding capabilities
  • Answer Generation: Uses LLM API to generate comprehensive answers and Chain of Thought (COT)
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses

Quick Demo

ed3.mp4

Local Run

Download Client

Windows MacOS Linux

Setup.exe

Intel

M

AppImage

Install with NPM

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  1. Install dependencies:
   npm install
  1. Start the development server:
   npm run build

   npm run start
  1. Open your browser and visit http://localhost:1717

Build with Local Dockerfile

If you want to build the image yourself, you can use the Dockerfile in the project root:

  1. Clone the repository:

    git clone https://github.com/ConardLi/easy-dataset.git
    cd easy-dataset
  2. Build the Docker image:

    docker build -t easy-dataset .
  3. Run the container:

    docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset

    Note: Please replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.

  4. Open your browser and visit http://localhost:1717

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings

Process Documents

  1. Upload your files in the "Text Split" section (supports PDF, Markdown, txt, DOCX);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed;
  5. Export your dataset

Project Structure

easy-dataset/
├── app/                                # Next.js application directory
│   ├── api/                            # API routes
│   │   ├── llm/                        # LLM API integration
│   │   │   ├── ollama/                 # Ollama API integration
│   │   │   └── openai/                 # OpenAI API integration
│   │   ├── projects/                   # Project management API
│   │   │   ├── [projectId]/            # Project-specific operations
│   │   │   │   ├── chunks/             # Text chunk operations
│   │   │   │   ├── datasets/           # Dataset generation and management
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/          # Question management
│   │   │   │   └── split/              # Text splitting operations
│   │   │   └── user/                   # User-specific project operations
│   ├── projects/                       # Frontend project pages
│   │   └── [projectId]/                # Project-specific pages
│   │       ├── datasets/               # Dataset management UI
│   │       ├── questions/              # Question management UI
│   │       ├── settings/               # Project settings UI
│   │       └── text-split/             # Text processing UI
│   └── page.js                         # Homepage
├── components/                         # React components
│   ├── datasets/                       # Dataset-related components
│   ├── home/                           # Homepage components
│   ├── projects/                       # Project management components
│   ├── questions/                      # Question management components
│   └── text-split/                     # Text processing components
├── lib/                                # Core libraries and tools
│   ├── db/                             # Database operations
│   ├── i18n/                           # Internationalization
│   ├── llm/                            # LLM integration
│   │   ├── common/                     # Common LLM tools
│   │   ├── core/                       # Core LLM clients
│   │   └── prompts/                    # Prompt templates
│   │       ├── answer.js               # Answer generation prompts (Chinese)
│   │       ├── answerEn.js             # Answer generation prompts (English)
│   │       ├── question.js             # Question generation prompts (Chinese)
│   │       ├── questionEn.js           # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                  # Text splitting tools
├── locales/                            # Internationalization resources
│   ├── en/                             # English translations
│   └── zh-CN/                          # Chinese translations
├── public/                             # Static resources
│   └── imgs/                           # Image resources
└── local-db/                           # Local file database
    └── projects/                       # Project data storage

Documentation

Community Practice

Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Star History

Star History Chart

Built with ❤️ by ConardLi • Follow me: WeChat Official AccountBilibiliJuejinZhihuYoutube