Footer content may be recognized as body text (already worked around locally) #2347

LovelyGkotta · 2025-01-22T09:16:04Z

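Description of the bug

While using the tool I ran into the problem described in the title.
My guess is that the layout model mishandles the footer class (category_id == 2): regions that belong to it are recognized as some other class.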

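Solution:
After the layout model finishes inference on a page, run a post-processing step
that iterates over every item in the prediction result pred_res.
If an item's category_id is 2, it is merged into the existing bounding boxes in abandon_bboxes: if it is close to one of them, the two boxes are merged; if no close box is found, the item's bounding box is appended to abandon_bboxes.
If an item's category_id is not 2 but it is close to one of the boxes in abandon_bboxes, its category_id is changed to 2.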

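The result is as follows:

To use it, add the post-processing step right after the custom_model(img) inference call in docanalyze_by_custom_model.doc_analyze.

The code is below; corrections are welcome wherever it falls short.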

def footer_header_filter(pred_res, abandon_bboxes, threshold=4):
    def get_bbox(poly):
        """
        Compute the bounding box of a polygon.
        :param poly: List[float], polygon points [x1, y1, x2, y2, ..., xn, yn]
        :return: Tuple[min_x, min_y, max_x, max_y]
        """
        xs = poly[::2]  # even indices are x coordinates
        ys = poly[1::2]  # odd indices are y coordinates
        return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))

    def is_close(bbox1, bbox2, threshold):
        """
        Check whether two bounding boxes are close to each other.
        """
        # Compute the center points
        center_x1, center_y1 = (bbox1[0] + bbox1[2]) / 2, (bbox1[1] + bbox1[3]) / 2
        center_x2, center_y2 = (bbox2[0] + bbox2[2]) / 2, (bbox2[1] + bbox2[3]) / 2

        # Euclidean distance between the two centers
        distance = ((center_x1 - center_x2) ** 2 + (center_y1 - center_y2) ** 2) ** 0.5
        # Close means the distance does not exceed the threshold
        return distance <= threshold

    def merge_two(bbox1, bbox2):
        """
        Merge two bounding boxes into the smallest box that covers both.
        """
        return (
            min(bbox1[0], bbox2[0]),  # left edge: take the minimum
            min(bbox1[1], bbox2[1]),  # top edge: take the minimum
            max(bbox1[2], bbox2[2]),  # right edge: take the maximum
            max(bbox1[3], bbox2[3])  # bottom edge: take the maximum
        )

    for item in pred_res:
        current_bbox = get_bbox(item['poly'])
        found = False

        # If the current category is 2, merge it into abandon_bboxes
        if item['category_id'] == 2:
            for i in range(len(abandon_bboxes)):
                if is_close(abandon_bboxes[i], current_bbox, threshold):
                    abandon_bboxes[i] = merge_two(abandon_bboxes[i], current_bbox)
                    found = True
                    break
            if not found:
                abandon_bboxes.append(current_bbox)

        # If the current category is not 2 but the box is close to one in
        # abandon_bboxes, relabel the item as category 2
        else:
            for merged_bbox in abandon_bboxes:
                if is_close(merged_bbox, current_bbox, threshold):
                    item['category_id'] = 2
                    break
    return abandon_bboxes
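
For readers wondering where the function gets called, here is a minimal sketch of a per-page loop. It is not the actual magic-pdf code: `analyze_pages` is a hypothetical wrapper, `custom_model` stands in for the layout model call, and `footer_filter` stands in for `footer_header_filter` above. Carrying `abandon_bboxes` from one page to the next is an assumption based on the function taking and returning that list.

```python
from typing import Any, Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]
LayoutItem = Dict[str, Any]


def analyze_pages(
    page_images: List[Any],
    custom_model: Callable[[Any], List[LayoutItem]],
    footer_filter: Callable[[List[LayoutItem], List[BBox], int], List[BBox]],
    threshold: int = 4,
) -> List[List[LayoutItem]]:
    """Run layout inference page by page and post-process each result.

    Hypothetical wrapper: both callables are passed in so the sketch does
    not depend on magic-pdf internals.
    """
    # Assumption: footer boxes found on earlier pages are reused on later ones.
    abandon_bboxes: List[BBox] = []
    results: List[List[LayoutItem]] = []
    for img in page_images:
        pred_res = custom_model(img)  # layout inference for one page
        # The filter relabels items in pred_res in place and returns the
        # updated list of footer boxes.
        abandon_bboxes = footer_filter(pred_res, abandon_bboxes, threshold)
        results.append(pred_res)
    return results
```

With the function from the post, the call would look like `analyze_pages(images, custom_model, footer_header_filter)`. Note that within a single page the outcome depends on item order: an item is only relabeled if a nearby category-2 box is already in the list when the item is visited.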

How to reproduce the bug

It can show up on an ordinary run.

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

1.0.x

Device mode

cuda

LovelyGkotta (Author) · 2025-01-22T09:17:28Z

The company network blocks image uploads; I'll post the images when I get home.

LovelyGkotta (Author) · 2025-01-23T02:03:00Z

Adding the images.

myhloli (Maintainer) · 2025-01-23T02:39:35Z

Could you share the PDF you tested with?

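LovelyGkotta (Author) · 2025-01-23T02:41:48Z

I'm afraid not: the data is confidential and sharing it would get me into serious trouble. You could try publicly available technical documents instead; the problem should show up there too.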

myhloli (Maintainer) · 2025-01-23T02:58:21Z

Could you extract just the two problematic pages and send them to my email?

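rockeodear · 2025-01-24T07:45:15Z

Bookmarking this. I have hit the same thing locally; my workaround is to pass in header/footer rules when parsing and strip the matches from the result with regular expressions.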
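
The regex workaround mentioned above could look roughly like the sketch below. The patterns and the `strip_header_footer` helper are hypothetical, not taken from the comment or from magic-pdf; real documents would need rules written for their own headers and footers.

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical header/footer rules; adapt them to the documents at hand.
HEADER_FOOTER_PATTERNS: List[Pattern[str]] = [
    re.compile(r"^\s*Page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE),
    re.compile(r"^\s*第\s*\d+\s*页\s*$"),
    re.compile(r"^\s*-\s*\d+\s*-\s*$"),
]


def strip_header_footer(
    text: str,
    patterns: Iterable[Pattern[str]] = HEADER_FOOTER_PATTERNS,
) -> str:
    """Drop every line that fully matches one of the header/footer patterns."""
    rules = list(patterns)  # allow any iterable, including generators
    kept = [
        line
        for line in text.splitlines()
        if not any(rule.match(line) for rule in rules)
    ]
    return "\n".join(kept)
```

Because every pattern is anchored to the whole line, a sentence in the body that merely mentions a page number is left alone.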

neverlatetolearn0 · 2025-03-25T03:19:23Z

Where does the custom function get called?
