Commit 9c0e217

Start making all APIs use iterator protocol instead of bespoke methods/classes ad infinitum (#11)
* feat!: always in-memory parser and use iterator protocol (mostly)
* fix: avoid error if x was a tuple for some reason
* test: fix tests
* fix: minor tweaks
* ci: benchmark
* chore: ruff
* ci: make benchmark a separate job
* ci: make benchmark a separate workflow
* ci: report ccoverage
* refactor!: make lines/revlines behave the same way
* refactor!: remove the utterly useless PDFResourceManager
* chore: ruff
* fix: tolerate mangled PDF headers
* refactor!: nexttoken redundant for lexer
* refactor!: PDFEliminate PDFExtra PDFCharacters PDFEverwhere PDFWe PDFHave PDFNamespaces PDFAfter PDFAll
* refactor!: there can be only one (parser)
* refactor!: page indices (0-based), PDFRemove PDFMore PDFPrefixes
* docs: describe the desired API
* fix: seek 0 in iter
* feat: iterator-based layout API
* chore: ruff it up
* fix(tests): test layout against pdfminer.six
* fix: error consistent with pdfminer
* fix: ensure xobjects actually work
* fix: validate against pdfminer
* fix: STRICT breaks things
* fix(test): extra-dependencies
1 parent 6f80c3d commit 9c0e217

26 files changed: +1977 -2567 lines changed

.github/workflows/benchmarks.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+name: Benchmark
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Install Hatch
+        uses: pypa/hatch@install
+      - name: Run benchmarks
+        run: |
+          hatch run bench:all

.github/workflows/tests.yml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-name: Run all tests
+name: Test
 on:
   push:
     branches: [ "main" ]
@@ -17,4 +17,4 @@ jobs:
       - name: Install Hatch
         uses: pypa/hatch@install
       - name: Run tests
-        run: hatch test
+        run: hatch test --cover

README.md

Lines changed: 105 additions & 2 deletions
@@ -1,4 +1,4 @@
-# PLAYA Ain't a LAYout Analyzer 🏖️
+# **P**LAYA ain't a **LAY**out **A**nalyzer 🏖️

 ## About

@@ -28,7 +28,110 @@ Notably this does *not* include the largely undocumented heuristic
 to understand due to a Java-damaged API based on deeply nested class
 hierarchies, and because layout analysis is best done
 probabilistically/visually. Also, pdfplumber does its own, much
-nicer, layout analysis.
+nicer, layout analysis. Also, if you just want to extract text from a
+PDF, there are a lot of better and faster tools and libraries out
+there, see [benchmarks]() for a summary (TL;DR pypdfium2 is probably
+what you want, but pdfplumber does a nice job of converting PDF to
+ASCII art).
+
+## Usage
+
+Do you want to get stuff out of a PDF? You have come to the right
+place! Let's open up a PDF and see what's in it:
+
+```python
+pdf = playa.open("my_awesome_document.pdf")
+raw_byte_stream = pdf.buffer
+a_bunch_of_tokens = list(pdf.tokens)
+a_bunch_of_objects = list(pdf)
+a_particular_indirect_object = pdf[42]
+```
+
+The raw PDF tokens and objects are probably not terribly useful to
+you, but you might find them interesting.
+
+It probably has some pages. How many? What are their numbers/labels?
+(they could be things like "xviii", 'a", or "42", for instance)
+
+```python
+npages = len(pdf.pages)
+page_numbers = [page.label for page in pdf.pages]
+```
+
+What's in the table of contents?
+
+```python
+for entry in pdf.outlines:
+    ...
+```
+
+If you are lucky it has a "logical structure tree". The elements here
+might even be referenced from the table of contents! (or, they might
+not... with PDF you never know)
+
+```python
+structure = pdf.structtree
+for element in structure:
+    for child in element:
+        ...
+```
+
+Now perhaps we want to look at a specific page. Okay!
+```python
+page = pdf.pages[0] # they are numbered from 0
+page = pdf.pages["xviii"] # but you can get them by label
+page = pdf.pages["42"] # or "logical" page number (also a label)
+a_few_content_streams = list(page.contents)
+raw_bytes = b"".join(stream.buffer for stream in page.contents)
+```
+
+This page probably has text, graphics, etc, etc, in it. Remember that
+**P**LAYA ain't a **LAY**out **A**nalyzer! You can either look at the
+stream of tokens or mysterious PDF objects:
+```python
+for token in page.tokens:
+    ...
+for object in page:
+    ...
+```
+
+Or you can access individual characters, lines, curves, and rectangles
+(if you wanted to, for instance, do layout analysis):
+```python
+for item in page.layout:
+    ...
+```
+
+Do we make you spelunk in a dank class hierarchy to know what these
+items are? No, we do not! They are just NamedTuples with a very
+helpful field *telling* you what they are, as a string.
+
+In particular you can also extract all these items into a dataframe
+using the library of your choosing (I like [Polars]()) and I dunno do
+some Artifishul Intelligents or something with them:
+```python
+```
+
+Or just write them to a CSV file:
+```python
+```
+
+Note again that PLAYA doesn't guarantee that these characters come at
+you in anything other than the order they occur in the file (but it
+does guarantee that). It does, however, put them in (hopefully) the
+right absolute positions on the page, and keep track of the clipping
+path and the graphics state, so yeah, you *could* "render" them like
+`pdfminer.six` pretended to do.
+
+Certain PDF tools and/or authors are notorious for using "whiteout"
+(set the color to the background color) or "scissors" (the clipping
+path) to hide arbitrary text that maybe *you* don't want to see
+either. PLAYA gives you some rudimentary tools to detect this:
+```python
+```
+
+For everything else, there's pdfplumber, pdfium2, pikepdf, pypdf,
+borb, pydyf, etc, etc, etc.

 ## Acknowledgement
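
The README text in this diff says layout items are NamedTuples carrying a string field that names their kind, and it leaves the "write them to a CSV file" example empty. The following is only a rough sketch of how that could look, not PLAYA's actual API: `LayoutItem`, its field names, and `items_to_csv` are all hypothetical, and a plain list stands in for `page.layout`.

```python
import csv
import io
from typing import Iterable, NamedTuple


class LayoutItem(NamedTuple):
    """Hypothetical item shape; PLAYA's real field names may differ."""

    object_type: str  # e.g. "char", "line", "rect"
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def items_to_csv(items: Iterable[LayoutItem]) -> str:
    """Serialize items to CSV text, one row per item, in the order given."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(LayoutItem._fields)  # header row from the tuple's field names
    writer.writerows(items)  # a NamedTuple is already a row-shaped sequence
    return out.getvalue()


# Stand-in for `page.layout`, which is assumed to yield such tuples.
fake_layout = [
    LayoutItem("char", 72.0, 700.0, 78.0, 712.0, "P"),
    LayoutItem("rect", 70.0, 690.0, 300.0, 720.0, ""),
]
csv_text = items_to_csv(fake_layout)

# Dispatch on the string field instead of on a class hierarchy.
chars = [item for item in fake_layout if item.object_type == "char"]
```

Because the rows are flat tuples, the same iterable could probably be handed to a dataframe constructor with little extra work, which seems to be the use the README has in mind.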

playa/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 from os import PathLike
 from typing import Union

-from playa.pdfdocument import PDFDocument
+from playa.document import PDFDocument

 __version__ = "0.0.1"

playa/cmapdb.py

Lines changed: 4 additions & 7 deletions
@@ -32,8 +32,8 @@
 )

 from playa.encodingdb import name2unicode
-from playa.exceptions import PSEOF, PDFException, PDFTypeError, PSSyntaxError
-from playa.psparser import KWD, PSKeyword, PSLiteral, PSStackParser, literal_name
+from playa.exceptions import PDFException, PDFTypeError, PSSyntaxError
+from playa.parser import KWD, Parser, PSKeyword, PSLiteral, literal_name
 from playa.utils import choplist, nunpack

 log = logging.getLogger(__name__)
@@ -275,7 +275,7 @@ def get_unicode_map(cls, name: str, vertical: bool = False) -> UnicodeMap:
         return cls._umap_cache[name][vertical]


-class CMapParser(PSStackParser[PSKeyword]):
+class CMapParser(Parser[PSKeyword]):
     def __init__(self, cmap: CMapBase, data: bytes) -> None:
         super().__init__(data)
         self.cmap = cmap
@@ -284,10 +284,7 @@ def __init__(self, cmap: CMapBase, data: bytes) -> None:
         self._warnings: Set[str] = set()

     def run(self) -> None:
-        try:
-            self.nextobject()
-        except PSEOF:
-            pass
+        next(self, None)

     KEYWORD_BEGINCMAP = KWD(b"begincmap")
     KEYWORD_ENDCMAP = KWD(b"endcmap")
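
The `run()` hunk swaps a `try`/`except PSEOF` around a bespoke `nextobject()` call for `next(self, None)`. Here is a minimal sketch of why the shorter form is equivalent. It assumes (the diff does not show this) that the new `Parser` implements `__next__` and raises `StopIteration` at end of input. `ToyParser` and `first_token` are hypothetical stand-ins, not PLAYA classes.

```python
from typing import Iterator, List, Optional


class ToyParser:
    """Hypothetical stand-in for a parser that follows the iterator protocol."""

    def __init__(self, data: bytes) -> None:
        self._tokens: List[bytes] = data.split()
        self._pos = 0

    def __iter__(self) -> Iterator[bytes]:
        return self

    def __next__(self) -> bytes:
        # End of input is signalled with StopIteration, not a custom EOF error.
        if self._pos >= len(self._tokens):
            raise StopIteration
        token = self._tokens[self._pos]
        self._pos += 1
        return token


def first_token(data: bytes) -> Optional[bytes]:
    # The two-argument form of next() returns the default instead of raising
    # when the iterator is exhausted, so no try/except is needed.
    return next(ToyParser(data), None)


tokens = list(ToyParser(b"begincmap endcmap"))
empty_result = first_token(b"")
```

The same object also works with `for`, `list()`, and the rest of the standard iteration tools, which appears to be the motivation stated in the commit title.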

playa/pdfcolor.py renamed to playa/color.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 import collections
 from typing import Dict

-from playa.psparser import LIT
+from playa.parser import LIT

 LITERAL_DEVICE_GRAY = LIT("DeviceGray")
 LITERAL_DEVICE_RGB = LIT("DeviceRGB")
