Commit 83587cd

feat(agents): add pdf skill
1 parent 7312f00 commit 83587cd

11 files changed

Lines changed: 1887 additions & 0 deletions

agents/.agents/skills/pdf/SKILL.md

Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
---
name: pdf
description:
  Extract text/tables from PDFs, merge/split documents, create new PDFs, fill forms, and OCR scanned files. Use when
  working with PDF files or when user mentions PDFs, forms, or document extraction.
---

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced
features, JavaScript libraries, and detailed examples, see [reference.md](reference.md). If you need to fill out a PDF
form, read [forms.md](forms.md) and follow its instructions.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs

```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF

```python
from pypdf import PdfReader, PdfWriter

# Write each page of the input to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
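
Splitting one page per file is not always what is wanted. The sketch below shows one possible way to pull a selection of pages into a single file. The helper names `parse_page_ranges` and `extract_pages` and the `"1-3,5"` spec format are assumptions made for illustration and are not part of pypdf; only `PdfReader`, `PdfWriter`, `add_page`, and `write` come from the library.

```python
def parse_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        # A bare number such as "5" selects a single page.
        end = int(end_text) if sep else start
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Invalid page range {part!r} for a {total_pages}-page PDF")
        indices.extend(range(start - 1, end))
    return indices


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Write the pages selected by `spec` from `src` into a new file `dst`."""
    # Imported here so the range parser above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

For example, `extract_pages("input.pdf", "selection.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to `selection.pdf`.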

#### Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```
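
In pypdf, `reader.metadata` is `None` when the file has no document information dictionary, and individual fields are `None` when they are not set, so direct attribute access can fail or print `None`. A minimal defensive sketch follows; the helper name `metadata_summary` and the `"(not set)"` placeholder are illustrative choices, not pypdf features.

```python
def metadata_summary(meta) -> dict[str, str]:
    """Collect common metadata fields, tolerating a missing metadata object."""
    fields = ("title", "author", "subject", "creator")
    # getattr with a default also covers meta being None.
    return {name: (getattr(meta, name, None) or "(not set)") for name in fields}
```

It can be called as `metadata_summary(PdfReader("document.pdf").metadata)` and the resulting dictionary printed or logged.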

#### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Text with Layout

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction

```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx files requires openpyxl to be installed
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
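
Tables returned by `extract_tables()` are plain lists of rows in which empty cells come back as `None`, and header cells may be blank or repeated. That can produce awkward DataFrame columns. The following is a small cleanup sketch that could run before building the DataFrame. The helper `normalize_table` and the `column_N` naming scheme are assumptions for illustration only.

```python
def normalize_table(table):
    """Split a pdfplumber-style table into (header, rows) with cleaned cells."""

    def clean(cell):
        # Empty cells are None; multi-line cells contain newlines.
        return "" if cell is None else " ".join(str(cell).split())

    header_raw, *body = table
    header, seen = [], {}
    for position, cell in enumerate(header_raw):
        name = clean(cell) or f"column_{position + 1}"
        seen[name] = seen.get(name, 0) + 1
        # Suffix repeated names so they stay distinct as column labels.
        header.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

    width = len(header)
    rows = []
    for row in body:
        cells = [clean(cell) for cell in row]
        # Pad or truncate to the header width in case a row is ragged.
        rows.append((cells + [""] * width)[:width])
    return header, rows
```

With this helper, the DataFrame could be built as `header, rows = normalize_table(table)` followed by `pd.DataFrame(rows, columns=header)`.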

### reportlab - Create PDFs

#### Basic PDF Creation

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create PDF with Multiple Pages

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```

#### Subscripts and Superscripts

**IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The
built-in fonts do not include these glyphs, causing them to render as solid black boxes.

Instead, use ReportLab's XML markup tags in Paragraph objects:

```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])

# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
```

For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode
subscripts/superscripts.
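
One way the manual adjustment for canvas-drawn text might look is sketched here. The helper names `script_metrics` and `draw_with_script`, and the scale and shift ratios (0.65, 0.35, 0.15), are assumed values chosen for illustration and are not ReportLab constants. The canvas calls used (`setFont`, `drawString`, `stringWidth`) are standard `reportlab.pdfgen.canvas.Canvas` methods.

```python
def script_metrics(base_size: float, kind: str) -> tuple[float, float]:
    """Return (font_size, baseline_shift) for a "super" or "sub" run."""
    size = base_size * 0.65  # assumed scale factor; tune to taste
    if kind == "super":
        return size, base_size * 0.35  # raise above the baseline
    if kind == "sub":
        return size, -base_size * 0.15  # drop below the baseline
    raise ValueError(f"kind must be 'super' or 'sub', got {kind!r}")


def draw_with_script(c, x, y, base, script, kind="super", font="Helvetica", size=12):
    """Draw `base` followed by a raised or lowered `script` on canvas `c`."""
    c.setFont(font, size)
    c.drawString(x, y, base)
    # Start the script where the base text ends.
    script_x = x + c.stringWidth(base, font, size)
    script_size, shift = script_metrics(size, kind)
    c.setFont(font, script_size)
    c.drawString(script_x, y + shift, script)
    c.setFont(font, size)  # restore the base font for later drawing
```

For instance, `draw_with_script(c, 100, 700, "x", "2")` would draw an x with a raised 2, and `draw_with_script(c, 100, 680, "H", "2", kind="sub")` would draw the start of a chemical formula.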

## Command-Line Tools

### pdftotext (poppler-utils)

```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
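
When these commands need to be driven from Python, building the argument list explicitly avoids shell quoting problems with unusual file names. The wrapper below is only a sketch: the function names `build_pdftotext_cmd` and `run_pdftotext` are hypothetical, and it uses only the `-layout`, `-f`, and `-l` options shown above.

```python
import subprocess


def build_pdftotext_cmd(src, dst, first=None, last=None, layout=False):
    """Build the pdftotext argument list for the options shown above."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    cmd += [src, dst]
    return cmd


def run_pdftotext(src, dst, **options):
    """Run pdftotext; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_pdftotext_cmd(src, dst, **options), check=True)
```

Calling `run_pdftotext("input.pdf", "output.txt", first=1, last=5)` would correspond to the pages 1-5 command above, assuming `pdftotext` is on the `PATH`.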

### qpdf

```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

### pdftk (if available)

```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common Tasks

### Extract Text from Scanned PDFs

```python
# Requires: pip install pytesseract pdf2image
# Also requires the Tesseract and Poppler binaries to be installed
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```
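
OCR is slow, so it can help to check first whether ordinary text extraction already returns usable text. The heuristic below is a rough sketch only: the helper name `needs_ocr` and the 20-characters-per-page threshold are assumptions, and a real document set may need a different cutoff.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from its per-page extracted text."""
    # Count non-whitespace-trimmed characters; treat None as empty text.
    total = sum(len((text or "").strip()) for text in page_texts)
    # max(..., 1) avoids dividing by zero for an empty page list.
    return total / max(len(page_texts), 1) < min_chars_per_page
```

A possible call is `needs_ocr([page.extract_text() for page in PdfReader("scanned.pdf").pages])`, falling back to the OCR code above only when it returns `True`.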

### Add Watermark

```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Extract Images

```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# JPEG-encoded images are written as output_prefix-000.jpg, output_prefix-001.jpg, etc.;
# images in other encodings are written as PPM/PBM files.
```

### Password Protection

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick Reference

| Task               | Best Tool                       | Command/Code               |
| ------------------ | ------------------------------- | -------------------------- |
| Merge PDFs         | pypdf                           | `writer.add_page(page)`    |
| Split PDFs         | pypdf                           | One page per file          |
| Extract text       | pdfplumber                      | `page.extract_text()`      |
| Extract tables     | pdfplumber                      | `page.extract_tables()`    |
| Create PDFs        | reportlab                       | Canvas or Platypus         |
| Command line merge | qpdf                            | `qpdf --empty --pages ...` |
| OCR scanned PDFs   | pytesseract                     | Convert to image first     |
| Fill PDF forms     | pdf-lib or pypdf (see forms.md) | See forms.md               |

## Next Steps

- For advanced pypdfium2 usage, see [reference.md](reference.md)
- For JavaScript libraries (pdf-lib), see [reference.md](reference.md)
- If you need to fill out a PDF form, follow the instructions in [forms.md](forms.md)
- For troubleshooting guides, see [reference.md](reference.md)
