Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information
This repository is the official implementation of Group Position Embedding. Group Position Embedding (GPE) is a novel and efficient technique that enhances the layout understanding capabilities of LLMs without architectural changes or additional pre-training. GPE does this by splitting the attention heads into groups and feeding each group distinct positional embeddings, which encodes the layout information needed for document comprehension. For more details, please refer to our paper.
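The head-grouping idea described above can be illustrated with a small sketch. This is not the repository's implementation: it assumes rotary position embeddings as the positional encoding, contiguous equal-sized head groups, and three example position signals per token (reading order, x coordinate, y coordinate). The function names `rope_rotate` and `apply_group_positions` are hypothetical.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate one head vector by a rotary embedding for `position`.

    Assumption: rotary embeddings stand in for whatever positional
    encoding the underlying model actually uses.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * dim
    for i in range(dim // 2):
        # Each adjacent pair of dimensions is rotated at its own frequency.
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x1 * s + x2 * c
    return out


def apply_group_positions(heads, group_positions):
    """Apply a different position id to each group of heads for ONE token.

    heads: list of per-head vectors (queries or keys) for the token.
    group_positions: one position id per head group (hypothetical choice
    of signals, e.g. reading order, x coordinate, y coordinate).
    """
    num_groups = len(group_positions)
    if num_groups == 0 or len(heads) % num_groups != 0:
        raise ValueError("number of heads must be divisible by number of groups")
    heads_per_group = len(heads) // num_groups
    return [
        rope_rotate(head, group_positions[idx // heads_per_group])
        for idx, head in enumerate(heads)
    ]


if __name__ == "__main__":
    # Six heads of dimension 4, split into three groups of two heads.
    token_heads = [[1.0, 0.0, 0.5, 0.5] for _ in range(6)]
    # Hypothetical layout signals for this token: reading order 5, x = 2, y = 7.
    rotated = apply_group_positions(token_heads, [5, 2, 7])
    for idx, head in enumerate(rotated):
        print(idx, [round(v, 3) for v in head])
```

Because only the position ids fed to each group change, a sketch like this leaves the attention computation and the model weights untouched, which is consistent with the claim that no architectural change is required.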
2025.02.06: We release the dataset.
2025.02.05: We release the paper.
python ./inference.py --model_path MODEL_PATH --image_path IMAGE_PATH --question "YOUR_QUESTION"
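To run the command above over several images, one option is a small wrapper that builds the argument list for each call. This is only a sketch: `build_inference_command`, the sample file names, and the sample question are hypothetical, and only the `--model_path`, `--image_path`, and `--question` flags are taken from the command above.

```python
import shlex


def build_inference_command(model_path, image_path, question, script="./inference.py"):
    """Return the argument list for one inference.py call (hypothetical helper)."""
    return [
        "python", script,
        "--model_path", model_path,
        "--image_path", image_path,
        "--question", question,
    ]


if __name__ == "__main__":
    images = ["page_1.png", "page_2.png"]  # hypothetical file names
    for image in images:
        cmd = build_inference_command("MODEL_PATH", image, "YOUR_QUESTION")
        # Print a shell-safe version of the command; to execute it instead,
        # pass `cmd` to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Building the command as a list keeps questions that contain spaces or quotes intact without manual shell escaping.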
Run eval.py, passing the directory that contains the prediction files:
python ./eval.py --input_dir EVAL_PATH
Forms is an English tabular dataset sourced from FeTaQA.
Websites is a Chinese tabular dataset obtained by web crawlers.
Slides is a Chinese tabular dataset sourced from PowerPoint slides. Cells may contain irregular content such as line breaks or complex symbols.
SynthTables is our synthetic tabular dataset, built from entity names and values extracted from public entity extraction datasets. The data is randomly rotated and partially discarded to increase task difficulty.
Newspapers is a Chinese dataset sourced from the newspaper subset of M6Doc, which contains multiple columns of text. The layouts are complex, mixing horizontal text, vertical text, and pictures.
SynthDocs is our multi-column document dataset generated by SynthDoc from public MRC datasets. It breaks up the typographic structure of the text, so that semantically coherent passages are split across multi-column document layouts.
If you find GPE useful in your research, please cite the following paper:
@inproceedings{Zhu2025GroupPosition,
title = {Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information},
author = {Zhu Yuke and Zhang Yue and Liu Dongdong and Xie Chi and Xiong Zihu and Zheng Bo and Guo Sheng},
booktitle = {Proceedings of the 2025 International Conference on Learning Representations},
year = {2025},
}
For help or issues using GPE, please submit a GitHub issue.
For other communications related to GPE, please contact Yuke Zhu ([email protected]) or Sheng Guo ([email protected]).