37 changes: 37 additions & 0 deletions examples/pre-training/tools/preprocess/README.md
@@ -0,0 +1,37 @@
English | [简体中文](./README_zh.md)

# Pretraining Data Conversion Tool

This tool converts plain-text datasets into indexed binary files suitable for model pretraining.

## Prerequisites: Model Weights

Download the released model weights from our repository. For details, see [Introduction to ERNIE 4.5](/README.md).

Note: Each model has its own tokenizer, so a converted dataset is model-dependent. Convert the data with the tokenizer of the model you intend to train.

## Prepare the Text Dataset

Download a dataset or create your own in `jsonl` format (one JSON object per line). Each line must be a JSON object containing a `"text"` field with the document content. For example:

```json
{"text": "An Open-Source Deep Learning Platform Originated from Industrial Practice..."}
{"text": "PaddlePaddle is dedicated to facilitating innovations and applications of deep learning..."}
...
```
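The format described above can be produced and checked with a few lines of Python. This is a minimal sketch, not part of the tool: the helper names `write_jsonl` and `validate_jsonl` are hypothetical, and skipping blank lines is an assumption, since this README does not say how the converter treats them.

```python
import json
from pathlib import Path


def write_jsonl(documents, path):
    """Write one JSON object per line, each with a single "text" field."""
    with Path(path).open("w", encoding="utf-8") as f:
        for doc in documents:
            # ensure_ascii=False keeps non-ASCII text human-readable in the file.
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                # Assumption: blank lines are ignored here; the converter's
                # own behavior for blank lines is not documented in this README.
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON") from exc
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f'line {line_no}: missing string "text" field')
            count += 1
    return count
```

Running `validate_jsonl` on the file before conversion catches malformed lines early, which is cheaper than discovering them partway through tokenizing a large dataset.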

## Generate the Pretraining Dataset

```bash
python -u create_pretraining_data.py \
--model_name "/path/to/your/ERNIE-4.5-21B-A3B-Base-Paddle" \
--data_format "JSON" \
--input_path "/path/to/your/text/dataset.jsonl" \
--append_eos \
--output_prefix "./pretrain_data" \
--workers 1 \
--log_interval 10000 \
--data_impl "mmap"
```

The output is saved as `./pretrain_data.bin` and `./pretrain_data.idx`.
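
A quick sanity check after conversion might look like the sketch below. It only confirms that both files named above exist and are non-empty; it does not parse the binary format. The function name `check_outputs` is hypothetical and not part of the tool.

```python
from pathlib import Path


def check_outputs(output_prefix):
    """Return the size in bytes of the .bin and .idx files for a given prefix."""
    sizes = {}
    for suffix in (".bin", ".idx"):
        # The suffix is appended to the prefix string as-is,
        # e.g. "./pretrain_data" -> "./pretrain_data.bin".
        path = Path(f"{output_prefix}{suffix}")
        if not path.is_file():
            raise FileNotFoundError(f"expected output file not found: {path}")
        size = path.stat().st_size
        if size == 0:
            raise ValueError(f"output file is empty: {path}")
        sizes[path.name] = size
    return sizes
```

For the command above, this would be called as `check_outputs("./pretrain_data")`.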
37 changes: 37 additions & 0 deletions examples/pre-training/tools/preprocess/README_zh.md
@@ -0,0 +1,37 @@
[English](README.md) | 简体中文

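# Pretraining Data Conversion Tool

This tool converts plain-text datasets into the `bin/idx` format used for pretraining.

## Download the Pretrained Weights

Download the released pretrained weights. For details, see [Introduction to ERNIE 4.5](/README.md).

Note: Each model has its own tokenizer, so you must download the model you intend to train before generating the training data.

## Prepare the Text Data

This tool takes input in `jsonl` format. Each line is one pretraining document, stored in the `"text"` field. For example: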

```json
{"text": "动静统一自动并行只需在单卡基础上进行少量的张量切分标记,飞桨能自动寻找最高效的分布式并行策略..."}
{"text": "同一套框架支持训练和推理,实现训练、推理代码复用和无缝衔接,为大模型的全流程提供了..."}
...
```

## Generate the Pretraining Data

```bash
python -u create_pretraining_data.py \
--model_name "/path/to/your/ERNIE-4.5-21B-A3B-Base-Paddle" \
--data_format "JSON" \
--input_path "/path/to/your/text/dataset.jsonl" \
--append_eos \
--output_prefix "./pretrain_data" \
--workers 1 \
--log_interval 10000 \
--data_impl "mmap"
```

The output is saved as `./pretrain_data.bin` and `./pretrain_data.idx`.