An ultimate PDF file disintegration tool. DePDF is designed to extract the tables and paragraphs embedded in PDF pages into a structured markup language (e.g. HTML). You can also use it to convert a single page or an entire PDF to HTML.

Built on top of [`pdfplumber`](https://github.com/jsvine/pdfplumber).
# Table of Contents
[toc]
# Installation
`pip install depdf`
# Example
```python
from depdf import Config  # assumed to be exported at package level, like DePDF
from depdf import DePDF
from depdf import DePage

# general
with DePDF.load('test/test_general.pdf') as pdf:
    pdf_html = pdf.to_html
    print(pdf_html)

# with dedicated configurations
c = Config(
    debug_flag=True,
    verbose_flag=True,
    add_line_flag=True
)
pdf = DePDF.load('test/test_general.pdf', config=c)
page_index = 23  # indices start from zero, so this is the 24th page
page = pdf.pages[page_index]
page_soup = page.soup
print(page_soup.text)
```
# APIs
| **function** | usage |
|:---:|---|
| `extract_page_paragraphs` | extract paragraphs from a specific page |
| `extract_page_tables` | extract tables from a specific page |
| `convert_pdf_to_html` | convert the entire PDF to HTML |
| `convert_page_to_html` | convert a specific page to HTML |
# In-Depth
## In-page elements
* Paragraph
    + Text
    + Span
* Table
    + Cell
* Image
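
The hierarchy above can be pictured as a small tree of containers. The sketch below is only an illustrative model: the class names, fields, and the `(x0, top, x1, bottom)` bounding-box layout are assumptions made for this example and are not DePDF's actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical model of the element hierarchy; not DePDF's real classes.
BBox = Tuple[float, float, float, float]  # assumed order: (x0, top, x1, bottom)


@dataclass
class Text:
    bbox: BBox
    text: str


@dataclass
class Span:
    bbox: BBox
    text: str


@dataclass
class Paragraph:
    bbox: BBox
    children: List[Union[Text, Span]] = field(default_factory=list)


@dataclass
class Cell:
    bbox: BBox
    text: str = ""


@dataclass
class Table:
    bbox: BBox
    rows: List[List[Cell]] = field(default_factory=list)


@dataclass
class Image:
    bbox: BBox
    src: str


def paragraph_text(paragraph: Paragraph) -> str:
    """Join the text of a paragraph's children in document order."""
    return "".join(child.text for child in paragraph.children)


para = Paragraph(
    bbox=(0, 0, 100, 20),
    children=[Text((0, 0, 60, 20), "Hello, "), Span((60, 0, 100, 20), "world")],
)
print(paragraph_text(para))
```

Only paragraphs and tables hold child elements here; an image is a leaf that points at a file.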
## Common properties
| **property & method** | explanation |
|:---:|---|
| `html` | converted HTML string |
| `soup` | converted Beautiful Soup object |
| `bbox` | bounding box region |
| `save_html` | write the HTML tag to a local file |
## DePDF HTML structure
```html
<div class="{pdf_class}">
    %for <!--page-{pid}-->
    <div id="page-{}" class="{}">
        %for {html_elements} endfor%
    </div>
    endfor%
</div>
```
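
To make the nesting in the template concrete, here is a minimal sketch that assembles the same layout from per-page lists of HTML strings. The function name, the default class names, and the choice to number pages from 1 are assumptions made for illustration; this is not how DePDF itself builds its output.

```python
from html import escape
from typing import List


def render_pdf_html(pages: List[List[str]],
                    pdf_class: str = "pdf-content",
                    page_class: str = "pdf-page") -> str:
    """Wrap per-page element HTML in the nested <div> layout of the template."""
    parts = [f'<div class="{escape(pdf_class, quote=True)}">']
    for pid, elements in enumerate(pages, start=1):  # page ids assumed to start at 1
        parts.append(f"<!--page-{pid}-->")
        parts.append(f'<div id="page-{pid}" class="{escape(page_class, quote=True)}">')
        parts.extend(elements)  # each element is already an HTML string
        parts.append("</div>")
    parts.append("</div>")
    return "\n".join(parts)


html_doc = render_pdf_html([
    ["<p>first page</p>"],
    ["<p>second page</p>", "<table></table>"],
])
print(html_doc)
```

Each page becomes one inner `<div>` preceded by its comment marker, so a consumer can split the document back into pages by `id`.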
## DePage HTML element structure
### Paragraph
```html
<p>
    {paragraph-content}
    <span> {span-content} </span>
    ...
</p>
```
### Table
```html
<table>
    <tr>
        <td> {cell_0_0} </td>
        <td> {cell_0_1} </td>
        ...
    </tr>
    <tr colspan=2>
        <td> {cell_1_0} </td>
        ...
    </tr>
    ...
</table>
```
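
Markup in this shape is easy to turn back into rows of cell text. The following is a minimal sketch using only the standard library's `html.parser`; the `TableRows` class and `table_to_rows` helper are hypothetical names for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of each <td> into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td" and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the open cell


def table_to_rows(table_html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


sample = "<table><tr><td> a </td><td> b </td></tr><tr colspan=2><td> c </td></tr></table>"
print(table_to_rows(sample))
```

Attributes such as `colspan` are ignored in this sketch, so a merged row simply yields fewer cells than its neighbours.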
### Image
```html
<img src="temp_depdf/$prefix.png"></img>
```
# Appendix
## DePage element denotations
> Useful element properties within a page

## TODO
* [ ] add support for multi-column PDF pages
* [ ] better table structure recognition
* [x] recognize embedded objects inside page elements