Merge table and title level policy #4770
base: develop
Thanks for your contribution!
    layout_parsing_result[key].append(value)

    if merge_talble:
        layout_parsing_result["parsing_res_list"] = merge_tables_across_pages(pages)
Update these to share the same table ID.
    else:
        layout_parsing_result["input_path"] = None

    layout_parsing_result["page_index"] = None
Set width, height, and count to None.
    :return: Normalized chapter title string.
    """
    level = getattr(block, "title_level", 1)
    title = getattr(block, "content", "").rstrip(".")
Do not strip the trailing ".".
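For reference, a small sketch of what `.rstrip(".")` does to a few titles. The sample titles are made up, and the reading that numbering such as "1." is the concern here is an assumption, not something stated in the review.

```python
# Hypothetical title strings; only the .rstrip(".") call comes from the diff.
titles = ["1.", "2.1 Methods.", "Appendix A", "Results..."]

# str.rstrip(".") removes every trailing "." character, not just one.
stripped = [t.rstrip(".") for t in titles]

for before, after in zip(titles, stripped):
    print(repr(before), "->", repr(after))
```

Note that a numbered heading like "1." becomes "1", and a title ending in several dots loses all of them.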
    :param title: Original chapter title string.
    :return: Normalized chapter title string.
    """
    level = getattr(block, "title_level", 1)
Change this to 2, and remove the leading # on line 98.
    if D > 0:
        bucket = "A"
    else:
Make this more intuitive; don't use A/B/C.
Rework the if/else for A, B, and C.
            B_level = lvl
            break

    L_phys = phys_map.get(e["height"], 1)
phys_map[e["height"]]
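For reference, a minimal sketch of how the two lookups differ. The `phys_map` contents and the helper names are made up for illustration; only the `phys_map.get(e["height"], 1)` call comes from the diff.

```python
# Hypothetical data: maps a title block height to a level.
phys_map = {32: 1, 24: 2, 18: 3}


def level_with_default(entry):
    # Falls back to level 1 when the height was never registered.
    return phys_map.get(entry["height"], 1)


def level_strict(entry):
    # Raises KeyError when the height was never registered.
    return phys_map[entry["height"]]


known = {"height": 24}
unknown = {"height": 20}

print(level_with_default(known), level_strict(known))
print(level_with_default(unknown))
try:
    level_strict(unknown)
except KeyError as exc:
    print("missing height:", exc)
```

If every height in the entries is guaranteed to be a key of the map, direct indexing documents that guarantee and fails loudly if it is ever broken, while the default would quietly promote an unknown title to level 1.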
    def assign_levels_to_parsing_res(parsing_res_list):
        """
        parsing_res_list 是一个 LayoutBlock 对象列表
Write this in English.
        return (tables_match or rows_match), soup_prev, soup_curr


    def perform_table_merge(soup_prev, soup_curr):
Merge these together and add comments.
    @@ -0,0 +1,193 @@
    import json
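Borrowed from MinerU (written with an LLM; add an acknowledgement and an explanation).
Rename variables and functions to intuitive names.
Add comments.
Run pre-commit.
Rework how the code is split into functions.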
f0fbd59 to 5596f01
        title_level: Whether to assign title levels
    Returns:
        LayoutParsingResultV2: Combined parsing result
LayoutParsingResultV2
    for key in [
        "input_path",
        "page_count",
        "width",
        "height",
        "doc_preprocessor_res",
        "layout_det_res",
        "region_det_res",
        "overall_ocr_res",
        "table_res_list",
        "seal_res_list",
        "chart_res_list",
        "formula_res_list",
        "imgs_in_doc",
        "model_settings",
    ]:
        value = single_img_res.get(key, [])
for res_key in single_img_res
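A rough sketch of what iterating over the result's own keys could look like, assuming each per-page result behaves like a plain dict. The function name `combine_page_results` and the sample pages are hypothetical and not taken from the PR.

```python
from collections import defaultdict


def combine_page_results(page_results):
    """Collect every key found in the per-page dicts into per-key lists."""
    combined = defaultdict(list)
    for single_img_res in page_results:
        # Iterate over whatever keys the page result actually carries,
        # instead of maintaining a hard-coded key list.
        for res_key in single_img_res:
            combined[res_key].append(single_img_res[res_key])
    return dict(combined)


# Hypothetical per-page results.
pages = [
    {"input_path": "a.pdf", "width": 100, "table_res_list": []},
    {"input_path": "a.pdf", "width": 120, "table_res_list": ["t1"]},
]
combined = combine_page_results(pages)
print(combined["width"])
```

One trade-off to keep in mind: with the hard-coded list and `.get(key, [])`, a page that lacks a key still contributes a placeholder, whereas iterating over the keys skips it, so the per-key lists can end up with different lengths.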
        return "".join(result)


    # Calculate total columns including colspan and rowspan, accounting for merged cells
Use a consistent comment style.
        return max_cols


    # Calculate the actual number of columns in a single row
Same as above.
        return sum(int(cell.get("colspan", 1)) for cell in row.find_all(["td", "th"]))


    # Calculate the visual number of columns in a single row, excluding colspan (merged cells count as one)
Same as above.
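To make the two per-row counts concrete, here is a small sketch of the distinction between effective columns (summing `colspan`) and visual cells (each cell counts once). It uses the standard library `html.parser` rather than the soup objects in the diff, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RowCellCollector(HTMLParser):
    """Collect the colspan of every <td>/<th> in the first <tr> seen."""

    def __init__(self):
        super().__init__()
        self.colspans = []
        self._rows_seen = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._rows_seen += 1
        elif tag in ("td", "th") and self._rows_seen == 1:
            # A missing colspan attribute counts as a span of 1.
            self.colspans.append(int(dict(attrs).get("colspan") or 1))


def row_column_counts(row_html):
    """Return (effective columns, visual cells) for a single-row fragment."""
    collector = RowCellCollector()
    collector.feed(row_html)
    return sum(collector.colspans), len(collector.colspans)


row = '<tr><td colspan="2">Name</td><td>Qty</td><th>Total</th></tr>'
effective, visual = row_column_counts(row)
print(effective, visual)
```

For this row the effective count is 4 and the visual count is 3. The sketch ignores `rowspan`, which the total-column calculation in the diff says it accounts for.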
            if any(w in content_u for w in keywords):
                RelativeOrder_level = level
                break
    bucket = "RelativeOrder"
This line should be unreachable?
bucket = "RelativeOrder"
    else:
        bucket = "Cluster"

    Cluster_level = cluster_map[e["height"]]
Cluster_level -> cluster_level
    if block.label not in ("paragraph_title", "doc_title"):
        continue

    content = getattr(block, "content", "")
block.content
    if len(entries) == 0:
        return parsing_res_list
Put the edge-case handling code first.
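A minimal sketch of the guard-clause layout this suggests, assuming blocks are plain dicts instead of the PR's LayoutBlock objects. The function name and the height-based ranking rule are placeholders for illustration and are not the PR's title level policy.

```python
TITLE_LABELS = ("paragraph_title", "doc_title")


def assign_levels(blocks):
    """Placeholder level assignment that returns early on edge cases."""
    titles = [b for b in blocks if b["label"] in TITLE_LABELS]
    # Edge case handled up front: no titles means nothing to assign.
    if not titles:
        return blocks

    # Placeholder rule (an assumption, not the PR's policy):
    # taller titles get a smaller level number.
    heights = sorted({b["height"] for b in titles}, reverse=True)
    rank = {h: i + 1 for i, h in enumerate(heights)}
    for b in titles:
        b["title_level"] = rank[b["height"]]
    return blocks


# Hypothetical blocks.
blocks = [
    {"label": "doc_title", "height": 40},
    {"label": "text", "height": 12},
    {"label": "paragraph_title", "height": 24},
]
print(assign_levels(blocks))
```

With the early return at the top, the main logic below it can assume at least one title exists and does not need further emptiness checks.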
    nums += 1
    merged_html = perform_table_merge(soup_prev, soup_curr)
    prev_block.content = merged_html
    curr_block.content = ""
group_id
b4d844c to d5683c0
6a8db90 to 773a78d
No description provided.