Commit 22e666f

add support for multiple column documents & add read docs
1 parent dfd5b5b commit 22e666f

26 files changed (+701 -144 lines changed)

README.md

+116-12
@@ -17,7 +17,7 @@ from depdf import DePDF
1717
from depdf import DePage
1818

1919
# general
20-
with DePDF.load('test/test_general.pdf') as pdf
20+
with DePDF.load('test/test.pdf') as pdf
2121
pdf_html = pdf.to_html
2222
print(pdf_html)
2323

@@ -27,7 +27,7 @@ c = Config(
2727
verbose_flag=True,
2828
add_line_flag=True
2929
)
30-
pdf = DePDF.load('test/test_general.pdf', config=c)
30+
pdf = DePDF.load('test/test.pdf', config=c)
3131
page_index = 23 # start from zero
3232
page = pdf_file.pages[page_index]
3333
page_soup = page.soup
@@ -62,12 +62,12 @@ print(page_soup.text)
6262
| `bbox` | bounding box region |
6363
| `save_html` | write html tag to local file|
6464

65-
## DePDf HTML structure
65+
## DePDF HTML structure
6666
```html
6767
<div class="{pdf_class}">
6868
%for <!--page-{pid}-->
69-
<div id="page-{}" class="{}">
70-
%for {html_elements} endfor%
69+
<div id="page-{pid}" class="{page_class}">
70+
%for {in_page_elements} endfor%
7171
</div>
7272
endfor%
7373
</div>
@@ -78,7 +78,7 @@ print(page_soup.text)
7878
### Paragraph
7979
```html
8080
<p>
81-
{paragraph-content}
81+
{text-content}
8282
<span> {span-content} </span>
8383
...
8484
</p>
@@ -93,7 +93,7 @@ print(page_soup.text)
9393
...
9494
</tr>
9595
<tr colspan=2>
96-
<td> {cell_1_0} </td>
96+
<td> {merged_cell_1_0} </td>
9797
...
9898
</tr>
9999
...
@@ -104,15 +104,119 @@ print(page_soup.text)
104104
```
105105
<img src="temp_depdf/$prefix.png"></img>
106106
```
107+
108+
# Configuration encyclopedia
109+
110+
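## PDF parsing
111+
112+
| **keyword** | detail | default |
113+
|:---|---|---|
114+
| logo_flag | whether to detect watermark (logo) content shared across pages | `True` |
115+
| header_footer_flag | whether to detect headers and footers shared across pages | `True` |
116+
| temp_dir_prefix | prefix of the directory for temporary files | temp_depdf |
117+
| unique_prefix | file name of the generated temporary image files (usually generated automatically) | |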
118+
119+
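## Page parsing
120+
121+
| **keyword** | detail | default |
122+
|:---|---|---|
123+
| table_flag | whether to parse tables | `True` |
124+
| paragraph_flag | whether to parse paragraphs | `True` |
125+
| image_flag | whether to parse images | `True` |
126+
| resolution | resolution of the page preview images generated in debug mode | 300 |
127+
| main_frame_tolerance | threshold for identifying the main text region within a page | |
128+
| x_tolerance | horizontal threshold for identifying text lines within a page | |
129+
| y_tolerance | vertical threshold for identifying text lines within a page | |
130+
| page_num_top_fraction | ratio of the distance to the top boundary of the page-number region to the page height | |
131+
| page_num_left_fraction | used to identify page-number information within a page | |
132+
| page_num_right_fraction | used to identify page-number information within a page | |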
133+
134+
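## Page column detection
135+
136+
| **keyword** | detail | default |
137+
|:---|---|---|
138+
| multiple_columns_flag | whether to detect multi-column pages | `True` |
139+
| max_columns | maximum number of columns to detect on a multi-column page | 3 |
140+
| column_region_half_width | width of the column-boundary region used to detect multi-column pages | |
141+
| min_column_region_objects | upper limit on the number of objects inside a column-boundary region | |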
142+
143+
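## Character extraction
144+
145+
| **keyword** | detail | default |
146+
|:---|---|---|
147+
| char_overlap_size | threshold for deciding whether characters overlap | |
148+
| default_char_size | default character size | |
149+
| char_size_upper | upper bound on the detected character size | |
150+
| char_size_lower | lower bound on the detected character size | |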
151+
152+
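## Table extraction
153+
154+
| **keyword** | detail | default |
155+
|:---|---|---|
156+
| dotted_line_flag | whether to analyze dotted lines within a page | |
157+
| curved_line_flag | whether to analyze curves within a page | |
158+
| snap_flag | whether to merge table line segments | |
159+
| add_line_flag | whether to add horizontal and vertical lines to tables | |
160+
| min_double_line_tolerance | lower distance bound for treating lines as adjacent double lines | |
161+
| max_double_line_tolerance | upper distance bound for treating lines as adjacent double lines | |
162+
| vertical_double_line_tolerance | upper distance bound for treating lines as vertically adjacent double lines | |
163+
| table_cell_merge_tolerance | tolerance for width differences when merging cells | |
164+
| skip_empty_table | whether to skip empty tables | |
165+
| add_vertical_lines_flag | whether to add vertical lines | |
166+
| add_horizontal_lines_flag | whether to add horizontal lines | |
167+
| add_horizontal_line_tolerance | threshold for adding horizontal lines | |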
168+
169+
## Image extraction
170+
171+
| **keyword** | detail | default |
172+
|:---|---|---|
173+
| min_image_size | minimum side length, in pixels, for an image to be recognized | 80 |
174+
| image_resolution | resolution of extracted images | 300 |
175+
176+
## Header and footer detection
177+
178+
| **keyword** | detail | default |
179+
|:---|---|---|
180+
| default_head_tail_page_offset_percent | offset ratio allowed for misaligned headers and footers | |
181+
182+
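## Logging
183+
184+
| **keyword** | detail | default |
185+
|:---|---|---|
186+
| log_level | log level | `WARNING` |
187+
| verbose_flag | whether to output intermediate processing information | `False` |
188+
| debug_flag | whether to enable debugging (generates boundary information for parsed objects) | `False` |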
189+
190+
## Generated HTML tags
191+
192+
| **keyword** | detail | default |
193+
|:---|---|---|
194+
| span_class | class of the span nodes in the generated HTML | pdf-span |
195+
| paragraph_class | class of the p nodes in the generated HTML | pdf-paragraph |
196+
| table_class | class of the table nodes in the generated HTML | pdf-table |
197+
| pdf_class | class of the outermost pdf div node in the generated HTML | pdf-content |
198+
| image_class | class of the img nodes in the generated HTML | pdf-image |
199+
| page_class | class of the page div in the generated HTML | pdf-page |
200+
| mini_page_class | class of the mini-page div in the generated HTML | pdf-mini-page |
201+
202+
203+
# Update log
204+
205+
* `2020-03-18` add support for multiple-column PDFs
206+
* `2020-03-12` initial depdf release
207+
208+
107209
# Appendix
108210

211+
## todo
212+
213+
* [x] add support for multiple-column pdf page
214+
* [x] better table structure recognition
215+
* [x] recognize embedded objects inside page elements
216+
217+
109218
## DePage element denotations
110219
> Useful element properties within page
111220
112221
![page element](annotations.jpg)
113222

114-
## todo
115-
116-
* [ ] add support for multiple-column pdf page
117-
* [ ] better table structure recognition
118-
* [x] recognize embedded objects inside page elements
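
Note on the "DePDF HTML structure" template in the README diff above: the nesting it describes (an outer `pdf_class` div, a `<!--page-{pid}-->` comment, then one `page_class` div per page) can be sketched with plain string formatting. This is only an illustration, not depdf's implementation. The function name `render_pdf_html` is hypothetical, the default class names are taken from the configuration table, and numbering pages from 1 is an assumption.

```python
from html import escape


def render_pdf_html(pages, pdf_class="pdf-content", page_class="pdf-page"):
    """Wrap per-page element HTML in the nesting documented in the README.

    `pages` is a list with one inner list of HTML strings per page.
    Hypothetical helper; page ids are assumed to start at 1.
    """
    parts = ['<div class="{}">'.format(escape(pdf_class, quote=True))]
    for pid, elements in enumerate(pages, start=1):
        parts.append('<!--page-{}-->'.format(pid))
        parts.append('<div id="page-{}" class="{}">'.format(
            pid, escape(page_class, quote=True)))
        parts.extend(elements)  # the {in_page_elements} slot of the template
        parts.append('</div>')
    parts.append('</div>')
    return ''.join(parts)


html = render_pdf_html([['<p>first</p>'], ['<p>second</p>', '<table></table>']])
print(html)
```

Keeping the page id in both the comment and the `id` attribute means a single page can be located either by a text search or by a CSS selector such as `#page-2`.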

depdf/__init__.py

+9
@@ -1,3 +1,12 @@
1+
"""
2+
depdf
3+
====================================
4+
An ultimate pdf file disintegration tool.
5+
DePDF is designed to extract tables and paragraphs
6+
into structured markup language [eg. html] from embedding pdf pages.
7+
You can also use it to convert page/pdf to html.
8+
"""
9+
110
from depdf.api import *
211
from depdf.config import Config
312
from depdf.pdf import DePDF

depdf/api.py

+16-16
@@ -31,46 +31,46 @@ def wrapper(pdf_file_path, *args, **kwargs):
3131

3232

3333
@api_load_pdf
34-
def convert_pdf_to_html(pdf_file, **kwargs):
34+
def convert_pdf_to_html(pdf, **kwargs):
3535
"""
36-
:param pdf_file: pdf file absolute path
36+
:param pdf: pdf file path
3737
:param kwargs: config keyword arguments
38-
:return:
38+
:return: pdf html string
3939
"""
40-
return pdf_file.html
40+
return pdf.html
4141

4242

4343
@api_load_pdf
44-
def convert_page_to_html(pdf_file, pid, **kwargs):
44+
def convert_page_to_html(pdf, pid, **kwargs):
4545
"""
46-
:param pdf_file: pdf file absolute path
46+
:param pdf: pdf file path
4747
:param pid: page number start from 1
4848
:param kwargs: config keyword arguments
49-
:return:
49+
:return: page html string
5050
"""
51-
page = pdf_file.pages[pid - 1]
51+
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
5252
return page.html
5353

5454

5555
@api_load_pdf
56-
def extract_page_tables(pdf_file, pid, **kwargs):
56+
def extract_page_tables(pdf, pid, **kwargs):
5757
"""
58-
:param pdf_file: pdf file absolute path
58+
:param pdf: pdf file path
5959
:param pid: page number start from 1
6060
:param kwargs: config keyword arguments
61-
:return:
61+
:return: page tables list
6262
"""
63-
page = pdf_file.pages[pid - 1]
63+
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
6464
return page.tables
6565

6666

6767
@api_load_pdf
68-
def extract_page_paragraphs(pdf_file, pid, **kwargs):
68+
def extract_page_paragraphs(pdf, pid, **kwargs):
6969
"""
70-
:param pdf_file: pdf file absolute path
70+
:param pdf: pdf file path
7171
:param pid: page number start from 1
7272
:param kwargs: config keyword arguments
73-
:return:
73+
:return: page paragraphs list
7474
"""
75-
page = pdf_file.pages[pid - 1]
75+
page = DePage(pdf.pdf.pages[pid - 1], pid=pid, same=pdf.same, logo=pdf.logo, config=pdf.config)
7676
return page.paragraphs
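
Note on the `depdf/api.py` diff above: every helper receives a file path from the caller, while the decorated function body works on a loaded object and converts the 1-based `pid` into a 0-based index. The sketch below shows that decorator pattern with stand-in names. `FakePDF` and `load_pdf_first` are hypothetical and are not depdf classes; the real `api_load_pdf` body is not shown in this diff. The range check is an addition for illustration: with a plain Python list, `pid=0` would otherwise select `pages[-1]`, the last page, without raising.

```python
from functools import wraps


class FakePDF:
    """Stand-in for a loaded document (NOT the depdf class)."""

    def __init__(self, path, **config):
        self.path = path
        self.config = config
        self.pages = ['<p>page one</p>', '<p>page two</p>']

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # do not swallow exceptions


def load_pdf_first(func):
    """Hypothetical analogue of `api_load_pdf`: open the path, pass the object on."""
    @wraps(func)
    def wrapper(pdf_file_path, *args, **kwargs):
        with FakePDF(pdf_file_path, **kwargs) as pdf:
            return func(pdf, *args, **kwargs)
    return wrapper


@load_pdf_first
def convert_page_to_html(pdf, pid, **kwargs):
    """Return the HTML of page `pid`, where `pid` starts from 1."""
    if not 1 <= pid <= len(pdf.pages):
        raise IndexError('pid {} is outside 1..{}'.format(pid, len(pdf.pages)))
    return pdf.pages[pid - 1]


print(convert_page_to_html('demo.pdf', 2))
```

Callers keep passing a path, as in `convert_page_to_html('demo.pdf', 2)`, and the decorator takes care of opening and closing the document.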

depdf/base.py

+12-2
@@ -1,7 +1,7 @@
11
from decimal import Decimal
22

3-
from depdf.utils import convert_html_to_soup
43
from depdf.error import BoxValueError
4+
from depdf.utils import convert_html_to_soup, repr_str
55

66

77
class Box(object):
@@ -49,6 +49,9 @@ class Base(object):
4949
_cached_properties = ['_html']
5050
_html = ''
5151

52+
def __repr__(self):
53+
return '<depdf.Base: {}>'.format(repr_str(self.soup.text))
54+
5255
@property
5356
def html(self):
5457
return self._html
@@ -61,6 +64,9 @@ def html(self, html_value):
6164
def soup(self):
6265
return convert_html_to_soup(self._html)
6366

67+
def to_soup(self, parser):
68+
return convert_html_to_soup(self._html, parser=parser)
69+
6470
def write_to(self, file_name):
6571
with open(file_name, "w") as file:
6672
file.write(self.html)
@@ -69,7 +75,7 @@ def write_to(self, file_name):
6975
def to_dict(self):
7076
return {
7177
i: getattr(self, i, None) for i in dir(self)
72-
if not i.startswith('_') and i != 'to_dict'
78+
if not i.startswith('_') and i not in ['to_dict', 'refresh', 'reset', 'write_to', 'to_soup']
7379
}
7480

7581
def _get_cached_property(self, key, calculate_function, *args, **kwargs):
@@ -101,4 +107,8 @@ class InnerWrapper(Base):
101107

102108
@property
103109
def inner_objects(self):
110+
return self._inner_objects
111+
112+
@property
113+
def to_dict(self):
104114
return [obj.to_dict if hasattr(obj, 'to_dict') else obj for obj in self._inner_objects]
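
Note on the `to_dict` change in the `depdf/base.py` diff above: the comprehension walks `dir(self)` and skips private names plus a hand-written list of method names. The sketch below reproduces that filter on a small stand-in class and, as an alternative that is not in the commit, a variant that skips anything callable so the list does not need updating when a method is added. `Element` and `to_dict_skip_callables` are hypothetical names.

```python
class Element:
    """Stand-in class, not depdf's Base."""

    object_type = 'element'
    _EXCLUDED = ('to_dict', 'to_dict_skip_callables', 'write_to')

    def __init__(self, html):
        self._html = html

    @property
    def html(self):
        return self._html

    def write_to(self, file_name):
        with open(file_name, 'w') as file:
            file.write(self.html)

    def to_dict(self):
        # Same idea as the diff: public names minus an explicit exclusion list.
        return {
            name: getattr(self, name, None) for name in dir(self)
            if not name.startswith('_') and name not in self._EXCLUDED
        }

    def to_dict_skip_callables(self):
        # Alternative sketch: drop methods by checking callable() instead.
        result = {}
        for name in dir(self):
            if name.startswith('_'):
                continue
            value = getattr(self, name, None)
            if not callable(value):
                result[name] = value
        return result


element = Element('<p>hi</p>')
print(element.to_dict())
```

Both variants evaluate every public property while building the dictionary, so an expensive property is computed each time `to_dict` runs.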

depdf/components/image.py

+10-4
@@ -9,14 +9,20 @@ class Image(Base, Box):
99
object_type = 'image'
1010

1111
@check_config
12-
def __init__(self, bbox=None, src='', pid=1, img_idx=1, scan=False, config=None):
12+
def __init__(self, bbox=None, src='', percent=100, pid='1', img_idx=1, scan=False, config=None):
1313
self.bbox = bbox
1414
self.scan = scan
15-
width = bbox[2] - bbox[0]
15+
self.src = src
16+
self.img_idx = img_idx
17+
self.pid = pid
1618
img_id = 'page-{pid}-image-{img_idx}'.format(pid=pid, img_idx=img_idx)
1719
img_class = '{img_class} page-{pid}'.format(img_class=getattr(config, 'image_class'), pid=pid)
18-
html = '<img id="{img_id}" class="{img_class}" src="{src}" width="{width}">'.format(
19-
img_id=img_id, img_class=img_class, src=src, width=width
20+
html = '<img id="{img_id}" class="{img_class}" src="{src}" width="{percent}%">'.format(
21+
img_id=img_id, img_class=img_class, src=src, percent=min(round(percent), 100)
2022
)
2123
html += '</img>'
2224
self.html = html
25+
26+
def __repr__(self):
27+
scan_flag = '[scan]' if self.scan else ''
28+
return '<depdf.Image{}: ({}, {}) -> {}>'.format(scan_flag, self.pid, self.img_idx, self.src)
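
Note on the `depdf/components/image.py` diff above: the image width changes from an absolute value derived from the bounding box to a percentage capped at 100. A minimal sketch of that clamping follows. `image_tag` is a hypothetical free function, the default class name comes from the README table, and the closing `</img>` from the diff is left out here because `img` is a void element in HTML.

```python
def image_tag(src, percent=100, pid='1', img_idx=1, image_class='pdf-image'):
    """Build an <img> tag whose width is a whole percentage, never above 100."""
    width = min(round(percent), 100)  # same clamp as in the diff
    return (
        '<img id="page-{pid}-image-{img_idx}" class="{img_class} page-{pid}" '
        'src="{src}" width="{width}%">'
    ).format(pid=pid, img_idx=img_idx, img_class=image_class, src=src, width=width)


print(image_tag('temp_depdf/demo.png', percent=41.6))
print(image_tag('temp_depdf/demo.png', percent=133.7))
```

The clamp only bounds the value from above, so a negative `percent` would still pass through unchanged.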

depdf/components/paragraph.py

+5-3
@@ -1,7 +1,7 @@
11
from depdf.base import Box, InnerWrapper
22
from depdf.config import check_config
33
from depdf.log import logger_init
4-
from depdf.utils import calc_bbox, construct_style
4+
from depdf.utils import calc_bbox, construct_style, repr_str
55

66
log = logger_init(__name__)
77

@@ -10,7 +10,7 @@ class Paragraph(InnerWrapper, Box):
1010
object_type = 'paragraph'
1111

1212
@check_config
13-
def __init__(self, bbox=None, text='', pid=1, para_idx=1, config=None, inner_objects=None, style=None, align=None):
13+
def __init__(self, bbox=None, text='', pid='1', para_idx=1, config=None, inner_objects=None, style=None, align=None):
1414
para_id = 'page-{pid}-paragraph-{para_id}'.format(pid=pid, para_id=para_idx)
1515
para_class = '{para_class} page-{pid}'.format(para_class=getattr(config, 'paragraph_class'), pid=pid)
1616
style_text = construct_style(style=style)
@@ -35,7 +35,9 @@ def __init__(self, bbox=None, text='', pid=1, para_idx=1, config=None, inner_obj
3535
self.html = html
3636

3737
def __repr__(self):
38-
return '<depdf.Paragraph: ({}, {})>'.format(self.pid, self.para_id)
38+
if hasattr(self, 'text'):
39+
return '<depdf.Paragraph: ({}, {}) {}>'.format(self.pid, self.para_id, repr_str(self.text))
40+
return '<depdf.Paragraph[InnerObjects]: ({}, {})>'.format(self.pid, self.para_id)
3941

4042
def save_html(self):
4143
paragraph_file_name = '{}_page_{}_paragraph_{}.html'.format(self.config.unique_prefix, self.pid, self.para_id)
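
Note on the `__repr__` change in the `depdf/components/paragraph.py` diff above: the representation now includes a text preview when the paragraph has a `text` attribute and falls back to an `[InnerObjects]` label otherwise. The body of `repr_str` is not part of this diff, so the version below is an assumed implementation (collapse whitespace, truncate to `max_length`), and `Paragraph` here is a stand-in class, not depdf's.

```python
def repr_str(text, max_length=20):
    """Assumed behavior: collapse whitespace and truncate long text."""
    flat = ' '.join(str(text).split())
    if len(flat) <= max_length:
        return flat
    return flat[:max_length] + '...'


class Paragraph:
    """Stand-in showing the two __repr__ branches from the diff."""

    def __init__(self, pid, para_id, text=None):
        self.pid = pid
        self.para_id = para_id
        if text is not None:
            self.text = text  # only set when the paragraph is plain text

    def __repr__(self):
        if hasattr(self, 'text'):
            return '<Paragraph: ({}, {}) {}>'.format(
                self.pid, self.para_id, repr_str(self.text))
        return '<Paragraph[InnerObjects]: ({}, {})>'.format(self.pid, self.para_id)


print(repr(Paragraph('1', 2, 'hello   world')))
print(repr(Paragraph('1', 3)))
```

A short preview keeps log lines and interactive sessions readable when a page contains hundreds of paragraphs.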

depdf/components/span.py

+4-1
@@ -1,7 +1,7 @@
11
from depdf.base import Base, Box
22
from depdf.config import check_config
33
from depdf.log import logger_init
4-
from depdf.utils import construct_style
4+
from depdf.utils import construct_style, repr_str
55

66
log = logger_init(__name__)
77

@@ -18,3 +18,6 @@ def __init__(self, bbox=None, span_text='', config=None, style=None):
1818
self.html = '<span class="{span_class}"{style_text}>{span_text}</span>'.format(
1919
span_class=span_class, span_text=span_text, style_text=style_text
2020
)
21+
22+
def __repr__(self):
23+
return '<depdf.Span: {}>'.format(repr_str(self.text))
