Merge table and title level policy #4770

SpongeBob0318 · 2025-11-27T05:15:17Z

No description provided.

paddle-bot · 2025-11-27T05:15:25Z

Thanks for your contribution!

paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

SpongeBob0318 · 2025-12-08T08:38:44Z

paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

+                    layout_parsing_result[key].append(value)
+
+        if merge_talble:
+            layout_parsing_result["parsing_res_list"] = merge_tables_across_pages(pages)


Modify the same table ID.

SpongeBob0318 · 2025-12-08T08:47:06Z

paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

+        else:
+            layout_parsing_result["input_path"] = None
+
+        layout_parsing_result["page_index"] = None


Set width, height, and count to None.

SpongeBob0318 · 2025-12-08T09:07:25Z

paddlex/inference/pipelines/layout_parsing/result_v2.py

+    :return: Normalized chapter title string.
+    """
+    level = getattr(block, "title_level", 1)
+    title = getattr(block, "content", "").rstrip(".")


Do not strip the trailing `.`.
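For reference, a minimal sketch of what `rstrip(".")` does to a title. The sample titles are made up for illustration; the point is that every trailing period is removed, including ones that belong to the title.

```python
# Hypothetical titles, not taken from the PR's test data.
titles = ["1.1 Overview.", "To be continued..."]

# rstrip(".") removes all trailing periods, not just one.
stripped = [t.rstrip(".") for t in titles]
```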

SpongeBob0318 · 2025-12-08T09:08:44Z

paddlex/inference/pipelines/layout_parsing/result_v2.py

+    :param title: Original chapter title string.
+    :return: Normalized chapter title string.
+    """
+    level = getattr(block, "title_level", 1)


Change this to 2, and remove the # at the start of line 98.

SpongeBob0318 · 2025-12-08T09:27:34Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+
+        if D > 0:
+            bucket = "A"
+        else:


Make this more intuitive; don't use A/B/C.

Rework the if/else branches for A, B, and C.

SpongeBob0318 · 2025-12-08T09:32:21Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+                    B_level = lvl
+                    break
+
+        L_phys = phys_map.get(e["height"], 1)


phys_map[e["height"]]
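A minimal sketch of how the two lookups differ, assuming a height that is missing from the map. The map contents below are invented for illustration: `.get(..., 1)` silently falls back to level 1, while direct indexing raises `KeyError` and makes the inconsistency visible.

```python
phys_map = {24: 1, 18: 2}  # hypothetical height -> level mapping
e = {"height": 12}         # a height that is not in the map

# Form used in the diff: a missing height quietly becomes level 1.
level_with_default = phys_map.get(e["height"], 1)

# Form suggested in the review: a missing height raises KeyError.
try:
    level_strict = phys_map[e["height"]]
except KeyError:
    level_strict = None
```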

SpongeBob0318 · 2025-12-08T09:42:53Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+
+def assign_levels_to_parsing_res(parsing_res_list):
+    """
+    parsing_res_list is a list of LayoutBlock objects


SpongeBob0318 · 2025-12-08T09:54:45Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

+    return (tables_match or rows_match), soup_prev, soup_curr
+
+
+def perform_table_merge(soup_prev, soup_curr):


Merge these together and add comments.

SpongeBob0318 · 2025-12-08T09:58:21Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

@@ -0,0 +1,193 @@
+import json


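This borrows from MinerU (written with an LLM; add an acknowledgment and an explanation).

Rename variables and functions to something intuitive.
Add comments.
Run pre-commit.
Reconsider how the code is divided into functions.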

TingquanGao · 2025-12-09T07:25:40Z

paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

+            title_level: Whether to assign title levels
+
+        Returns:
+            LayoutParsingResultV2: Combined parsing result


LayoutParsingResultV2

TingquanGao · 2025-12-09T07:27:35Z

paddlex/inference/pipelines/layout_parsing/pipeline_v2.py

+            for key in [
+                "input_path",
+                "page_count",
+                "width",
+                "height",
+                "doc_preprocessor_res",
+                "layout_det_res",
+                "region_det_res",
+                "overall_ocr_res",
+                "table_res_list",
+                "seal_res_list",
+                "chart_res_list",
+                "formula_res_list",
+                "imgs_in_doc",
+                "model_settings",
+            ]:
+                value = single_img_res.get(key, [])


for res_key in single_img_res
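The suggestion above can be sketched as follows: iterate over whatever keys each per-page result carries instead of maintaining a hardcoded list. This is only a sketch; `combine_page_results` and the sample keys are hypothetical, and `page_results` is assumed to be a list of dicts, one per page.

```python
from collections import defaultdict


def combine_page_results(page_results):
    """Collect each key's per-page values into one list per key.

    Hypothetical helper, not the PR's actual code.
    """
    combined = defaultdict(list)
    for single_img_res in page_results:
        for res_key in single_img_res:
            combined[res_key].append(single_img_res[res_key])
    return dict(combined)


pages = [
    {"width": 100, "table_res_list": ["t1"]},
    {"width": 120, "table_res_list": []},
]
combined = combine_page_results(pages)
```

Unlike the `.get(key, [])` form in the diff, a key that is missing from one page contributes no entry here, so the per-key lists can end up with different lengths.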

TingquanGao · 2025-12-09T07:30:24Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

+    return "".join(result)
+
+
+# Calculate total columns including colspan and rowspan, accounting for merged cells


Use a consistent comment style.
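One way to do this, sketched under the assumption that docstrings are the preferred convention (the review does not say which style to adopt): move the leading `#` comment into the function. The function name is hypothetical, and `row_cells` is assumed to be a list of objects with a dict-like `.get`, such as the cells returned by `row.find_all(["td", "th"])`.

```python
def count_row_columns(row_cells):
    """Calculate the actual number of columns in a single row.

    Each cell contributes its colspan (default 1).
    """
    return sum(int(cell.get("colspan", 1)) for cell in row_cells)


# Plain dicts stand in for parsed <td>/<th> cells in this sketch.
cells = [{"colspan": "2"}, {}, {"colspan": "3"}]
total = count_row_columns(cells)
```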

TingquanGao · 2025-12-09T07:30:59Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

+    return max_cols
+
+
+# Calculate the actual number of columns in a single row


TingquanGao · 2025-12-09T07:31:02Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

+    return sum(int(cell.get("colspan", 1)) for cell in row.find_all(["td", "th"]))
+
+
+# Calculate the visual number of columns in a single row, excluding colspan (merged cells count as one)


TingquanGao · 2025-12-09T08:15:12Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+                if any(w in content_u for w in keywords):
+                    RelativeOrder_level = level
+                    break
+            bucket = "RelativeOrder"


This line should be unreachable?

bucket = "RelativeOrder"

TingquanGao · 2025-12-09T08:16:28Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+        else:
+            bucket = "Cluster"
+
+        Cluster_level = cluster_map[e["height"]]


Cluster_level -> cluster_level

TingquanGao · 2025-12-09T08:22:24Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+        if block.label not in ("paragraph_title", "doc_title"):
+            continue
+
+        content = getattr(block, "content", "")


block.content

TingquanGao · 2025-12-09T08:23:07Z

paddlex/inference/pipelines/layout_parsing/title_level.py

+    if len(entries) == 0:
+        return parsing_res_list


Put the edge-case handling code first.
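A minimal sketch of the requested ordering, with the empty-input guard placed before any further work. `Block` is a hypothetical stand-in for `LayoutBlock`, and the level assignment is a placeholder rather than the PR's logic.

```python
from dataclasses import dataclass


@dataclass
class Block:  # hypothetical stand-in for LayoutBlock
    label: str
    content: str = ""
    title_level: int = 0


def assign_levels_to_parsing_res(parsing_res_list):
    entries = [
        block
        for block in parsing_res_list
        if block.label in ("paragraph_title", "doc_title")
    ]
    # Edge case first: nothing to do when there are no titles.
    if len(entries) == 0:
        return parsing_res_list

    for block in entries:
        block.title_level = 1  # placeholder for the real level logic
    return parsing_res_list


no_titles = assign_levels_to_parsing_res([Block("text", "body")])
with_title = assign_levels_to_parsing_res([Block("doc_title", "Report")])
```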

TingquanGao · 2025-12-09T08:30:30Z

paddlex/inference/pipelines/layout_parsing/merge_table.py

+            nums += 1
+            merged_html = perform_table_merge(soup_prev, soup_curr)
+            prev_block.content = merged_html
+            curr_block.content = ""


paddle-bot bot added the contributor External developers label Nov 27, 2025

SpongeBob0318 force-pushed the merge_table_and_title_level_policy branch from f0fbd59 to 5596f01 Compare December 9, 2025 07:24

TingquanGao reviewed Dec 9, 2025

SpongeBob0318 force-pushed the merge_table_and_title_level_policy branch 3 times, most recently from b4d844c to d5683c0 Compare December 15, 2025 09:32

TingquanGao closed this Dec 18, 2025

TingquanGao reopened this Dec 18, 2025

SpongeBob0318 added 12 commits December 18, 2025 06:52

Merge remote-tracking branch 'upstream/develop' into develop

3752cff

Merge table policy and title level policy (needs revision)

d3693eb

1

713f583

Move concatenate_page() from pp_doctranslation pipeline to layout_parsing pipeline

1dca1ba

Translate Chinese annotations into English

e565a92

fix format

1d9a226

fix format V2

e923bcd

Add a method to retrieve multi-line headings

28af9fd

Add a method to merge continue blocks

6c55ac4

Add table-merge and title-level policy to pp-ocr-vl pipeline

2d2c5fc

allow two table merge that have 'continue' and number between them

61fe50f

make title_level comment from chinese into english

773a78d

SpongeBob0318 force-pushed the merge_table_and_title_level_policy branch from 6a8db90 to 773a78d Compare December 18, 2025 06:53
