Commit d56155c

[Feature] Support lmdb format in Dataset Preparer (#1762)
* [Dataset Preparer] Support lmdb format
* fix
* fix
* fix
* fix
* fix
* readme
* readme
1 parent 33cbc9b commit d56155c

9 files changed, +347 -26 lines changed

dataset_zoo/icdar2015/textrecog.py

+3-1
@@ -61,7 +61,9 @@
     parser=dict(type='ICDARTxtTextRecogAnnParser', encoding='utf-8-sig'),
     packer=dict(type='TextRecogPacker'),
     dumper=dict(type='JsonDumper'))
-delete = ['annotations']
+delete = [
+    'annotations', 'ic15_textrecog_train_img_gt', 'ic15_textrecog_test_img'
+]
 config_generator = dict(
     type='TextRecogConfigGenerator',
     test_anns=[

docs/en/user_guides/data_prepare/dataset_preparer.md

+52-6
@@ -15,11 +15,15 @@ Only one line of command is needed to complete the data download, decompression,
 python tools/dataset_converters/prepare_dataset.py [$DATASET_NAME] --task [$TASK] --nproc [$NPROC]
 ```

-| ARGS | Type | Description |
-| ------------ | ---- | ----------------------------------------------------------------------------------------------------------------------------------------- |
-| dataset_name | str | (required) dataset name. |
-| --task | str | Convert the dataset to the format of a specified task supported by MMOCR. options are: 'textdet', 'textrecog', 'textspotting', and 'kie'. |
-| --nproc | int | Number of processes to be used. Defaults to 4. |
+| ARGS | Type | Description |
+| ------------------ | ---- | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| dataset_name | str | (required) dataset name. |
+| --nproc | int | Number of processes to be used. Defaults to 4. |
+| --task | str | Convert the dataset to the format of a specified task supported by MMOCR. options are: 'textdet', 'textrecog', 'textspotting', and 'kie'. |
+| --splits | str | Splits of the dataset to be prepared. Multiple splits can be accepted. Defaults to `train val test`. |
+| --lmdb | str | Store the data in LMDB format. Only valid when the task is `textrecog`. |
+| --overwrite-cfg | str | Whether to overwrite the dataset config file if it already exists in `configs/{task}/_base_/datasets`. |
+| --dataset-zoo-path | str | Path to the dataset config file. If not specified, the default path is `./dataset_zoo`. |

 For example, the following command shows how to use the script to prepare the ICDAR2015 dataset for text detection task.

@@ -37,6 +41,44 @@ To check the supported datasets of Dataset Preparer, please refer to [Dataset Zo

 ## Advanced Usage

+### LMDB Format
+
+In text recognition tasks, we usually use LMDB format to store data to speed up data loading. When using the `prepare_dataset.py` script to prepare data, you can store data to the LMDB format by the `--lmdb` parameter. For example:
+
+```bash
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb
+```
+
+As soon as the dataset is prepared, Dataset Preparer will generate `icdar2015_lmdb.py` in the `configs/textrecog/_base_/datasets/` directory. You can inherit this file and point the `dataloader` to the LMDB dataset. Moreover, the LMDB dataset needs to be loaded by [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray), thus you also need to modify `pipeline`.
+
+For example, if we want to change the training set of `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py` to icdar2015 generated before, we need to perform the following modifications:
+
+1. Modify `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py`:
+
+```python
+_base_ = [
+    '../_base_/datasets/icdar2015_lmdb.py',  # point to icdar2015 lmdb dataset
+    ...
+]
+
+train_list = [_base_.icdar2015_lmdb_textrecog_train]
+...
+```
+
+2. Modify `train_pipeline` in `configs/textrecog/crnn/_base_crnn_mini-vgg.py`, change `LoadImageFromFile` to `LoadImageFromNDArray`:
+
+```python
+train_pipeline = [
+    dict(
+        type='LoadImageFromNDArray',
+        color_type='grayscale',
+        file_client_args=file_client_args,
+        ignore_empty=True,
+        min_size=2),
+    ...
+]
+```
+
 ### Configuration of Dataset Preparer

 Dataset preparer uses a modular design to enhance extensibility, which allows users to extend it to other public or private datasets easily. The configuration files of the dataset preparers are stored in the `dataset_zoo/`, where all the configs of currently supported datasets can be found here. The directory structure is as follows:
@@ -95,6 +137,10 @@ Data:

 It is not mandatory to use the metafile in the dataset preparation process (so users can ignore this file when preparing private datasets), but in order to better understand the information of each public dataset, we recommend that users read the metafile before preparing the dataset, which will help to understand whether the datasets meet their needs.

+```{warning}
+The following section is outdated as of MMOCR 1.0.0rc6.
+```
+
 #### Config of Dataset Preparer

 Next, we will introduce the conventional fields and usage of the dataset preparer configuration files.
@@ -186,7 +232,7 @@ Therefore, we provide two built-in gatherers, `pair_gather` and `mono_gather`, t

 When the image and annotation file are matched, the original annotations will be parsed. Since the annotation format is usually varied from dataset to dataset, the parsers are usually dataset related. Then, the parser will pack the required data into the MMOCR format.

-Finally, we can specify the dumpers to decide the data format. Currently, we only support `JsonDumper` and `WildreceiptOpensetDumper`, where the former is used to save the data in the standard MMOCR Json format, and the latter is used to save the data in the Wildreceipt format. In the future, we plan to support `LMDBDumper` to save the annotation files in LMDB format.
+Finally, we can specify the dumpers to decide the data format. Currently, we support `JsonDumper`, `WildreceiptOpensetDumper`, and `TextRecogLMDBDumper`. They are used to save the data in the standard MMOCR Json format, Wildreceipt format, and the LMDB format commonly used in academia in the field of text recognition, respectively.

 ### Use DataPreparer to prepare customized dataset
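
Note on the argument table: the usage string in the zh_cn diff (`[--lmdb] [--overwrite-cfg]`) shows both options as value-less flags, although the tables list their type as `str`. A minimal sketch of a parser that would accept that command line, using only the standard `argparse` module. This is not the script's actual code: `build_parser` is a hypothetical name, the help texts are paraphrased from the table, and `--task` gets no default here because the docs do not state one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog='prepare_dataset.py')
    # One or more dataset names, matching `datasets [datasets ...]`.
    parser.add_argument('datasets', nargs='+', help='dataset name(s)')
    parser.add_argument('--nproc', type=int, default=4,
                        help='number of processes to be used')
    parser.add_argument('--task',
                        choices=['textdet', 'textrecog', 'textspotting', 'kie'],
                        help='task format to convert the dataset to')
    parser.add_argument('--splits', nargs='+',
                        choices=['train', 'val', 'test'],
                        default=['train', 'val', 'test'],
                        help='splits of the dataset to be prepared')
    # Boolean switches: present means True, absent means False.
    parser.add_argument('--lmdb', action='store_true',
                        help='store the data in LMDB format (textrecog only)')
    parser.add_argument('--overwrite-cfg', action='store_true',
                        help='overwrite an existing dataset config')
    parser.add_argument('--dataset-zoo-path', default='./dataset_zoo',
                        help='path to the dataset config files')
    return parser


args = build_parser().parse_args(
    ['icdar2015', '--task', 'textrecog', '--lmdb'])
print(args.datasets, args.task, args.lmdb, args.splits)
```

Modelling `--lmdb` and `--overwrite-cfg` as `store_true` switches matches the example command `prepare_dataset.py icdar2015 --task textrecog --lmdb`, where `--lmdb` takes no value.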

docs/zh_cn/user_guides/data_prepare/dataset_preparer.md

+54-10
@@ -1,4 +1,4 @@
-# 数据准备 (Beta)
+c# 数据准备 (Beta)

 ```{note}
 Dataset Preparer 目前仍处在公测阶段,欢迎尝鲜试用!如遇到任何问题,请及时向我们反馈。
@@ -11,16 +11,18 @@ MMOCR 提供了统一的一站式数据集准备脚本 `prepare_dataset.py`。
 仅需一行命令即可完成数据的下载、解压、格式转换,及基础配置的生成。

 ```bash
-python tools/dataset_converters/prepare_dataset.py [$DATASET_NAME] [--task $TASK] [--nproc $NPROC] [--overwrite-cfg] [--dataset-zoo-path $DATASET_ZOO_PATH]
+python tools/dataset_converters/prepare_dataset.py [-h] [--nproc NPROC] [--task {textdet,textrecog,textspotting,kie}] [--splits SPLITS [SPLITS ...]] [--lmdb] [--overwrite-cfg] [--dataset-zoo-path DATASET_ZOO_PATH] datasets [datasets ...]
 ```

-| 参数 | 类型 | 说明 |
-| ------------------ | ---- | ----------------------------------------------------------------------------------------------------- |
-| dataset_name | str | (必须)需要准备的数据集名称。 |
-| --task | str | 将数据集格式转换为指定任务的 MMOCR 格式。可选项为: 'textdet', 'textrecog', 'textspotting' 和 'kie'。 |
-| --nproc | str | 使用的进程数,默认为 4。 |
-| --overwrite-cfg | str | 若数据集的基础配置已经在 `configs/{task}/_base_/datasets` 中存在,依然重写该配置 |
-| --dataset-zoo-path | str | 存放数据库配置文件的路径。若不指定,则默认为 `./dataset_zoo` |
+| 参数 | 类型 | 说明 |
+| ------------------ | -------------------------- | ----------------------------------------------------------------------------------------------------- |
+| dataset_name | str | (必须)需要准备的数据集名称。 |
+| --nproc | str | 使用的进程数,默认为 4。 |
+| --task | str | 将数据集格式转换为指定任务的 MMOCR 格式。可选项为: 'textdet', 'textrecog', 'textspotting' 和 'kie'。 |
+| --splits | \['train', 'val', 'test'\] | 希望准备的数据集分割,可以接受多个参数。默认为 `train val test`。 |
+| --lmdb | str | 把数据储存为 LMDB 格式,仅当任务为 `textrecog` 时生效。 |
+| --overwrite-cfg | str | 若数据集的基础配置已经在 `configs/{task}/_base_/datasets` 中存在,依然重写该配置 |
+| --dataset-zoo-path | str | 存放数据库配置文件的路径。若不指定,则默认为 `./dataset_zoo` |

 例如,以下命令展示了如何使用该脚本为 ICDAR2015 数据集准备文本检测任务所需的数据。

@@ -38,6 +40,44 @@ python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task te

 ## 进阶用法

+### LMDB 格式
+
+在文本识别任务中,我们通常使用 LMDB 格式来存储数据,以加快数据的读取速度。在使用 `prepare_dataset.py` 脚本准备数据时,可以通过 `--lmdb` 参数来指定将数据转换为 LMDB 格式。例如:
+
+```bash
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb
+```
+
+数据集准备完成后,Dataset Preparer 会在 `configs/textrecog/_base_/datasets/` 中生成 `icdar2015_lmdb.py` 配置。你可以继承该配置,并将 `dataloader` 指向 LMDB 数据集。然而,LMDB 数据集的读取需要配合 [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray),因此你也同样需要修改 `pipeline`。
+
+例如,我们想要将 `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py` 的训练集改为刚刚生成的 icdar2015,则需要作如下修改:
+
+1. 修改 `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py`:
+
+```python
+_base_ = [
+    '../_base_/datasets/icdar2015_lmdb.py',  # 指向 icdar2015 lmdb 数据集
+    ...  # 省略
+]
+
+train_list = [_base_.icdar2015_lmdb_textrecog_train]
+...
+```
+
+2. 修改 `configs/textrecog/crnn/_base_crnn_mini-vgg.py` 中的 `train_pipeline`, 将 `LoadImageFromFile` 改为 `LoadImageFromNDArray`:
+
+```python
+train_pipeline = [
+    dict(
+        type='LoadImageFromNDArray',
+        color_type='grayscale',
+        file_client_args=file_client_args,
+        ignore_empty=True,
+        min_size=2),
+    ...
+]
+```
+
 ### 数据集配置

 数据集自动化准备脚本使用了模块化的设计,极大地增强了扩展性,用户能够很方便地配置其他公开数据集或私有数据集。数据集自动化准备脚本的配置文件被统一存储在 `dataset_zoo/` 目录下,用户可以在该目录下找到所有已由 MMOCR 官方支持的数据集准备脚本配置文件。该文件夹的目录结构如下:
@@ -96,6 +136,10 @@ Data:

 该文件在数据集准备过程中并不是强制要求的(因此用户在使用添加自己的私有数据集时可以忽略该文件),但为了用户更好地了解各个公开数据集的信息,我们建议用户在使用数据集准备脚本前阅读对应的元文件信息,以了解该数据集的特征是否符合用户需求。

+```{warning}
+自 MMOCR 1.0.0rc6 起,接下来的章节可能会与实际实现有所出入。
+```
+
 #### 数据集准备脚本配置文件

 下面,我们将介绍数据集准备脚本配置文件 `textXXX.py` 的默认字段与使用方法。
@@ -235,7 +279,7 @@ OCR 数据集通常有两种标注保存形式,一种为多个标注文件对

 ###### `dumper`

-之后,我们可以通过指定不同的 dumper 来决定要将数据保存为何种格式。目前,我们仅支持 `JsonDumper` 与 `WildreceiptOpensetDumper`,其中,前者用于将数据保存为标准的 MMOCR Json 格式,而后者用于将数据保存为 Wildreceipt 格式。未来,我们计划支持 `LMDBDumper` 用于保存 LMDB 格式的标注文件。
+之后,我们可以通过指定不同的 dumper 来决定要将数据保存为何种格式。目前,我们支持 `JsonDumper`、`WildreceiptOpensetDumper` 和 `TextRecogLMDBDumper`。他们分别用于将数据保存为标准的 MMOCR Json 格式、Wildreceipt 格式,及文本识别领域学术界常用的 LMDB 格式。

 ###### `delete`

mmocr/datasets/preparers/config_generators/base.py

+11
@@ -81,6 +81,17 @@ def _prepare_anns(self, train_anns: Optional[List[Dict]],
                                  ' None!')
             for ann_dict in ann_list:
                 assert 'ann_file' in ann_dict
+                suffix = ann_dict['ann_file'].split('.')[-1]
+                if suffix == 'json':
+                    dataset_type = 'OCRDataset'
+                elif suffix == 'lmdb':
+                    assert self.task == 'textrecog', \
+                        'LMDB format only works for textrecog now.'
+                    dataset_type = 'RecogLMDBDataset'
+                else:
+                    raise NotImplementedError(
+                        'ann file only supports JSON file or LMDB file')
+                ann_dict['dataset_type'] = dataset_type
                 if ann_dict.get('dataset_postfix', ''):
                     key = f'{self.dataset_name}_{ann_dict["dataset_postfix"]}_{self.task}_{split}'  # noqa
                 else:
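
Note on the hunk above: the suffix check it adds can be read as a small pure function from annotation file name to dataset class name. A minimal sketch under stated assumptions, not the code in this commit: `infer_dataset_type` is a hypothetical name, and it raises `ValueError` where the diff uses `assert`, so that the task check still runs under `python -O`.

```python
def infer_dataset_type(ann_file: str, task: str) -> str:
    """Pick the dataset class name from the annotation file suffix."""
    suffix = ann_file.split('.')[-1]
    if suffix == 'json':
        return 'OCRDataset'
    if suffix == 'lmdb':
        # The diff restricts LMDB annotations to the textrecog task.
        if task != 'textrecog':
            raise ValueError('LMDB format only works for textrecog now.')
        return 'RecogLMDBDataset'
    raise NotImplementedError(
        'ann file only supports JSON file or LMDB file')


print(infer_dataset_type('textrecog_train.json', 'textrecog'))
print(infer_dataset_type('textrecog_train.lmdb', 'textrecog'))
```

Because the decision depends only on the file suffix, a file name without a `.json` or `.lmdb` ending falls through to `NotImplementedError`.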

mmocr/datasets/preparers/config_generators/textrecog_config_generator.py

+25-6
@@ -36,19 +36,38 @@ class TextRecogConfigGenerator(BaseDatasetConfigGenerator):

     Example:
         It generates a dataset config like:
-        >>> ic15_rec_data_root = 'data/icdar2015/'
+        >>> icdar2015_textrecog_data_root = 'data/icdar2015/'
         >>> icdar2015_textrecog_train = dict(
         >>>     type='OCRDataset',
-        >>>     data_root=ic15_rec_data_root,
+        >>>     data_root=icdar2015_textrecog_data_root,
         >>>     ann_file='textrecog_train.json',
-        >>>     test_mode=False,
         >>>     pipeline=None)
         >>> icdar2015_textrecog_test = dict(
         >>>     type='OCRDataset',
-        >>>     data_root=ic15_rec_data_root,
+        >>>     data_root=icdar2015_textrecog_data_root,
         >>>     ann_file='textrecog_test.json',
         >>>     test_mode=True,
         >>>     pipeline=None)
+
+        It generates a lmdb format dataset config like:
+        >>> icdar2015_lmdb_textrecog_data_root = 'data/icdar2015'
+        >>> icdar2015_lmdb_textrecog_train = dict(
+        >>>     type='RecogLMDBDataset',
+        >>>     data_root=icdar2015_lmdb_textrecog_data_root,
+        >>>     ann_file='textrecog_train.lmdb',
+        >>>     pipeline=None)
+        >>> icdar2015_lmdb_textrecog_test = dict(
+        >>>     type='RecogLMDBDataset',
+        >>>     data_root=icdar2015_lmdb_textrecog_data_root,
+        >>>     ann_file='textrecog_test.lmdb',
+        >>>     test_mode=True,
+        >>>     pipeline=None)
+        >>> icdar2015_lmdb_1811_textrecog_test = dict(
+        >>>     type='RecogLMDBDataset',
+        >>>     data_root=icdar2015_lmdb_textrecog_data_root,
+        >>>     ann_file='textrecog_test_1811.lmdb',
+        >>>     test_mode=True,
+        >>>     pipeline=None)
     """

     def __init__(
@@ -100,8 +119,8 @@ def _gen_dataset_config(self) -> str:
         cfg = ''
         for key_name, ann_dict in self.anns.items():
             cfg += f'\n{key_name} = dict(\n'
-            cfg += ' type=\'OCRDataset\',\n'
-            cfg += ' data_root=' + f'{self.dataset_name}_{self.task}_data_root,\n'  # noqa: E501
+            cfg += f' type=\'{ann_dict["dataset_type"]}\',\n'
+            cfg += f' data_root={self.dataset_name}_{self.task}_data_root,\n'  # noqa: E501
             cfg += f' ann_file=\'{ann_dict["ann_file"]}\',\n'
             if ann_dict['split'] in ['test', 'val']:
                 cfg += ' test_mode=True,\n'
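
Note on the hunk above: the generator now takes the dataset `type` from each annotation entry instead of hard-coding `'OCRDataset'`. A standalone sketch of the same string-building loop, written as a free function for illustration. `gen_dataset_config` is a hypothetical name, the four-space indent inside the emitted text is an assumption, and the closing `pipeline=None)` line is taken from the docstring example rather than from the visible diff.

```python
from typing import Dict


def gen_dataset_config(dataset_name: str, task: str,
                       anns: Dict[str, dict]) -> str:
    """Render one `name = dict(...)` block per annotation entry."""
    cfg = ''
    for key_name, ann_dict in anns.items():
        cfg += f'\n{key_name} = dict(\n'
        # The type comes from the entry, e.g. 'RecogLMDBDataset'.
        cfg += f"    type='{ann_dict['dataset_type']}',\n"
        cfg += f'    data_root={dataset_name}_{task}_data_root,\n'
        cfg += f"    ann_file='{ann_dict['ann_file']}',\n"
        if ann_dict['split'] in ['test', 'val']:
            cfg += '    test_mode=True,\n'
        # Assumption: closing line as shown in the docstring example.
        cfg += '    pipeline=None)\n'
    return cfg


anns = {
    'icdar2015_lmdb_textrecog_test': dict(
        dataset_type='RecogLMDBDataset',
        ann_file='textrecog_test.lmdb',
        split='test'),
}
print(gen_dataset_config('icdar2015_lmdb', 'textrecog', anns))
```

The emitted text refers to a `..._data_root` variable by name, so it is only valid Python once that variable has been defined earlier in the generated config file.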
@@ -1,6 +1,10 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from .base import BaseDumper
 from .json_dumper import JsonDumper
+from .lmdb_dumper import TextRecogLMDBDumper
 from .wild_receipt_openset_dumper import WildreceiptOpensetDumper

-__all__ = ['BaseDumper', 'JsonDumper', 'WildreceiptOpensetDumper']
+__all__ = [
+    'BaseDumper', 'JsonDumper', 'WildreceiptOpensetDumper',
+    'TextRecogLMDBDumper'
+]
