A structure-aligned OBI dataset constructed to mitigate the long-tail problems in current OBI datasets

OBI-Future/Oracle-P15K

Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark ☯️

The first attempt to apply a diffusion model to realistic and controllable OBI generation

1 School of Computer Science and Technology, East China Normal University
2 Institute of Image Communication and Information Processing, Shanghai Jiao Tong University
3 School of Humanities, Shanghai Jiao Tong University
* Both authors contributed equally to this research. Corresponding authors


Overview of the proposed Oracle-P15K dataset. The dataset comprises 14,542 OBI images with structure-aligned, expert-annotated glyphs. Building on this dataset, we present a pseudo OBI image generator, OBIDiff, to alleviate the long-tail distribution problem in current OBI datasets. Extensive experiments demonstrate both the necessity of Oracle-P15K and the effectiveness of OBIDiff in improving the performance of downstream OBI tasks.

Release 🚀

  • [2025/4/13] ⚡️ GitHub repo for Oracle-P15K is online.

Motivations 💡

Existing OBI datasets suffer from a long-tail distribution problem: a few character classes have many samples, while most classes have very few. Consequently, OBI-related models perform well on majority classes but underperform on minority classes. To address this, we construct Oracle-P15K, a large-scale structure-aligned OBI dataset comprising 14,542 images infused with domain knowledge from OBI experts. Oracle-P15K can also serve as a comprehensive benchmark for researchers to develop and evaluate methods for other OBI information processing tasks, such as OBI denoising and recognition.
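To make the long-tail problem concrete, the sketch below measures class imbalance in a list of labels and computes how many pseudo samples each class would need to reach a common size. It is an illustration only: the function names, the `tail_threshold` and `target_per_class` parameters, and the toy labels are all hypothetical and are not taken from the Oracle-P15K code or paper.

```python
from collections import Counter


def long_tail_summary(labels, tail_threshold=10):
    """Summarize class imbalance in a list of class labels.

    tail_threshold is a hypothetical cutoff: classes with fewer samples
    than this are treated as minority (tail) classes.
    """
    counts = Counter(labels)
    sizes = sorted(counts.values(), reverse=True)
    return {
        "num_classes": len(counts),
        # Ratio between the largest and the smallest class.
        "imbalance_ratio": sizes[0] / sizes[-1],
        "tail_classes": sorted(c for c, n in counts.items() if n < tail_threshold),
    }


def augmentation_budget(labels, target_per_class):
    """Number of generated samples each class needs to reach target_per_class."""
    counts = Counter(labels)
    return {c: max(0, target_per_class - n) for c, n in counts.items()}


# Toy labels standing in for OBI character classes (not real dataset statistics).
labels = ["A"] * 50 + ["B"] * 20 + ["C"] * 3 + ["D"] * 1
summary = long_tail_summary(labels, tail_threshold=10)
budget = augmentation_budget(labels, target_per_class=20)
print(summary)
print(budget)
```

In this toy setting, a generator such as OBIDiff would be asked to supply the counts in `budget` for the tail classes only, leaving the head classes untouched.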

Construction Pipeline 🧩

The pipeline focuses on building structure-aligned image pairs for OBI generation and denoising models.

Pseudo OBI Generator 🤖

OBIDiff consists of an autoencoder, a Stable Diffusion (SD) model, a glyph encoder, and a style encoder. Given a clean glyph image and a target rubbing-style image, it transfers the noise style of the original rubbing to the glyph image.

Results on OBI Generation and Denoising Tasks 📌

Qualitative results on the OBI generation tasks
Quantitative results on the OBI generation tasks
  • Fitted kernel distribution of four low-level features: brightness, contrast, sharpness, and spatial information (SI).
  • Recognition accuracy of augmented images generated by the proposed OBIDiff and other OBI generation methods with respect to the scale of data augmentation.
Qualitative results on the OBI denoising tasks
Quantitative results on the OBI denoising tasks
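The four low-level features named in the generation results (brightness, contrast, sharpness, and spatial information) can each be computed per image before a kernel density is fitted over the whole set. The sketch below uses common definitions as assumptions, since the exact formulas used in the paper are not given here: brightness as mean intensity, contrast as the standard deviation of intensity, sharpness as the variance of a 4-neighbour Laplacian, and SI as the standard deviation of the Sobel gradient magnitude. All function names are hypothetical, and images are plain nested lists of grayscale values to keep the example dependency-free.

```python
import math
import statistics


def _interior(img):
    """Coordinates of pixels that have a full 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [(y, x) for y in range(1, h - 1) for x in range(1, w - 1)]


def brightness(img):
    """Assumed definition: mean grayscale intensity."""
    return statistics.fmean([p for row in img for p in row])


def contrast(img):
    """Assumed definition: population standard deviation of intensity."""
    return statistics.pstdev([p for row in img for p in row])


def sharpness(img):
    """Assumed definition: variance of the 4-neighbour Laplacian response."""
    lap = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y, x in _interior(img)
    ]
    return statistics.pvariance(lap)


def spatial_information(img):
    """Assumed definition: standard deviation of the Sobel gradient magnitude."""
    mags = []
    for y, x in _interior(img):
        gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]) - (
            img[y - 1][x - 1] + 2 * img[y][x - 1] + img[y + 1][x - 1]
        )
        gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]) - (
            img[y - 1][x - 1] + 2 * img[y - 1][x] + img[y - 1][x + 1]
        )
        mags.append(math.hypot(gx, gy))
    return statistics.pstdev(mags)


# Toy images: a flat patch, a vertical edge, and a single bright dot.
flat = [[100] * 4 for _ in range(4)]
edge = [[0, 0, 255, 255] for _ in range(4)]
dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 255

for name, img in [("flat", flat), ("edge", edge), ("dot", dot)]:
    print(name, brightness(img), contrast(img), sharpness(img), spatial_information(img))
```

Computing these values for real rubbings and for generated images, then comparing the two sets of distributions, is one way to check whether generated images match real ones at the low-level statistics.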

User Preference Study 👥

We develop a web-based user interface with automated navigation to facilitate the evaluation process.

Contact ✉️

Please contact the first author of this paper for queries.

Citation 📎

If you find our work interesting, please feel free to cite our paper:

@misc{li2025mitigatinglongtaildistributionoracle,
      title={Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark}, 
      author={Jinhao Li and Zijian Chen and Runze Dong and Tingzhu Chen and Changbo Wang and Guangtao Zhai},
      year={2025},
      eprint={2504.09555},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.09555}, 
}

Acknowledgements 🏆

This work was supported by the National Social Science Foundation of China (24Z300404220) and the Shanghai Philosophy and Social Science Planning Project (2023BYY003).
