Commit e0601fd

dhdaines and David Huggins-Daines authored
Make tables 60-100x faster with PLAYA-PDF 0.7.0 and its good page.structure (#14)
* feat: optimize table extraction with parent tree
* fix: elements will be hashable too
* feat: use new page structure and new playa
* docs: clarify and improve structure docs

---------

Co-authored-by: David Huggins-Daines <[email protected]>
1 parent 921edc0 commit e0601fd

3 files changed: 32 additions & 18 deletions

README.md

Lines changed: 15 additions & 4 deletions
@@ -51,16 +51,27 @@ structure tree, to look at the bounding boxes of the contents of those
 structure elements for a given page:

 ```python
-pi.box(pdf.structure.find_all(lambda el: el.page is page))
+pi.box(page.structure)
 ```

 ![Structure Elements](./docs/page3-elements.png)

-You can also look at the marked content sections, which are the
-leaf-nodes of the structure tree:
+Note however that this only gives you the elements associated with
+*marked content sections*, which are the leaf nodes of the structure
+tree. So, you can also search up the structure tree to find things
+like tables, figures, or list items:

 ```python
-pi.box(page.structure)
+pi.box(page.structure.find_all("Table"))
+pi.box(page.structure.find_all("Figure"))
+pi.box(page.structure.find_all("LI"))
+```
+
+You can even search with regular expressions, to find headers for
+instance:
+
+```python
+pi.box(page.structure.find_all(re.compile(r"H\d+")))
 ```

 Alternately, if you have annotations (such as links), you can look at
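
One way to picture the "search up the structure tree" behaviour described in the README change is the toy sketch below. It is not PLAYA's implementation: `Node`, `find_up`, and the rule that a regular expression is matched from the start of the role name are all assumptions made for illustration. It starts from the leaf elements of a page, walks parent links, and reports each matching ancestor once.

```python
import re
from typing import Callable, Iterator, List, Optional, Union


class Node:
    """Hypothetical stand-in for a structure element (not PLAYA's Element)."""

    def __init__(self, role: str, parent: Optional["Node"] = None) -> None:
        self.role = role
        self.parent = parent


Matcher = Union[str, re.Pattern, Callable[[Node], bool]]


def find_up(leaves: List[Node], matcher: Matcher) -> Iterator[Node]:
    """Yield each matching leaf or ancestor of the leaves exactly once."""
    if isinstance(matcher, str):
        test = lambda node: node.role == matcher
    elif isinstance(matcher, re.Pattern):
        # Assumption: the pattern is anchored at the start of the role name.
        test = lambda node: matcher.match(node.role) is not None
    else:
        test = matcher
    seen = set()  # nodes hash by identity, so shared ancestors appear once
    for leaf in leaves:
        node: Optional[Node] = leaf
        while node is not None:
            if node not in seen and test(node):
                seen.add(node)
                yield node
            node = node.parent


# A tiny hypothetical tree: Document -> H1, Table -> TR -> TD, TD
doc = Node("Document")
h1 = Node("H1", doc)
table = Node("Table", doc)
row = Node("TR", table)
cells = [Node("TD", row), Node("TD", row)]
leaves = [h1, *cells]

tables = list(find_up(leaves, "Table"))
headers = list(find_up(leaves, re.compile(r"H\d+")))
```

Both table cells share the same `Table` ancestor, so `tables` holds a single node. Deduplicating with a set requires the nodes to be hashable, which in this sketch comes from Python's default identity hash.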

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ classifiers = [
     "Programming Language :: Python :: Implementation :: PyPy",
 ]
 dependencies = [
-    "playa-pdf @ git+https://github.com/dhdaines/playa.git",
+    "playa-pdf>=0.7.0",
     "pillow",
 ]
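
The new `playa-pdf>=0.7.0` pin is enforced by the installer, but a script that depends on `page.structure` could also check it at runtime. The sketch below is a simplified illustration, not part of this commit: `release_tuple`, `meets_minimum`, and `playa_is_recent_enough` are hypothetical helpers, and the comparison looks only at the numeric release segment, so it ignores pre-release and other PEP 440 suffixes.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def release_tuple(text: str) -> Tuple[int, ...]:
    """Turn '0.7.0' into (0, 7, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.7.0") -> bool:
    """Compare numeric release segments, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def playa_is_recent_enough() -> bool:
    """Return False if playa-pdf is missing or older than the minimum."""
    try:
        return meets_minimum(version("playa-pdf"))
    except PackageNotFoundError:
        return False
```

For anything beyond a quick guard, a full PEP 440 parser (for example the third-party `packaging` library) is the safer choice.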

src/paves/tables.py

Lines changed: 16 additions & 13 deletions
@@ -1,6 +1,5 @@
 """
-Simple and not at all Java-damaged interface for table detection
-and structure prediction.
+Simple and not at all Java-damaged interface for table detection.
 """

 from copy import copy
@@ -16,7 +15,12 @@
 from playa.content import ContentObject, GraphicState, MarkedContent
 from playa.page import Annotation
 from playa.pdftypes import Matrix, Rect, BBOX_NONE
-from playa.structure import Element, ContentItem, ContentObject as StructContentObject
+from playa.structure import (
+    Element,
+    ContentItem,
+    ContentObject as StructContentObject,
+    Tree,
+)
 from playa.utils import get_bound_rects
 from playa.worker import _ref_page

@@ -139,7 +143,7 @@ def table_elements(
     pdf: Union[str, PathLike, Document, Page, PageList],
 ) -> Iterator[Element]:
     """Iterate over all text objects in a PDF, page, or pages"""
-    raise NotImplementedError
+    raise NotImplementedError(f"Not implemented for {type(pdf)}")


 @table_elements.register(str)
@@ -155,24 +159,23 @@ def table_elements_path(pdf: Union[str, PathLike]) -> Iterator[Element]:
 def table_elements_doc(pdf: Document) -> Iterator[Element]:
     structure = pdf.structure
     if structure is None:
-        raise TypeError
+        raise TypeError("Document has no logical structure")
     return structure.find_all("Table")


 @table_elements.register
 def table_elements_pagelist(pages: PageList) -> Iterator[Element]:
-    structure = pages.doc.structure
-    if structure is None:
-        raise TypeError
-    # FIXME: Accelerate this with the ParentTree too
-    return (table for table in structure.find_all("Table") if table.page in pages)
+    if pages.doc.structure is None:
+        raise TypeError("Document has no logical structure")
+    for page in pages:
+        yield from table_elements_page(page)


 @table_elements.register
 def table_elements_page(page: Page) -> Iterator[Element]:
-    # FIXME: Accelerate this with the ParentTree
-    pagelist = page.doc.pages[(page.page_idx,)]
-    return table_elements_pagelist(pagelist)
+    if page.structure is None:
+        raise TypeError("Page has no ParentTree")
+    return page.structure.find_all("Table")


 def table_elements_to_objects(
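
The dispatch pattern in this hunk can be tried in isolation with the sketch below. It uses `functools.singledispatch` with toy classes: `Page`, `PageList`, and `tables` here are hypothetical stand-ins, not the PLAYA or PAVES types. The page-list handler delegates to the per-page handler with `yield from`, mirroring the shape of the new code.

```python
from functools import singledispatch
from typing import Iterator, List, Optional


class Page:
    """Hypothetical page: `structure` is a list of role names, or None."""

    def __init__(self, structure: Optional[List[str]]) -> None:
        self.structure = structure


class PageList:
    """Hypothetical ordered collection of pages."""

    def __init__(self, pages: List[Page]) -> None:
        self.pages = pages

    def __iter__(self) -> Iterator[Page]:
        return iter(self.pages)


@singledispatch
def tables(src) -> Iterator[str]:
    # Fallback for unregistered types, with the offending type in the message.
    raise NotImplementedError(f"Not implemented for {type(src)}")


@tables.register
def tables_page(page: Page) -> Iterator[str]:
    if page.structure is None:
        raise TypeError("Page has no structure")
    return (role for role in page.structure if role == "Table")


@tables.register
def tables_pagelist(pages: PageList) -> Iterator[str]:
    # Delegate to the per-page handler instead of scanning everything.
    for page in pages:
        yield from tables_page(page)


found = list(tables(PageList([Page(["P", "Table"]), Page(["Table", "Table"])])))
```

One consequence of writing the page-list handler as a generator function: none of its body runs until the first item is requested, so any exception it raises surfaces during iteration and not at call time.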
