Commit 26c108b

add docs
1 parent 57dc6e4 commit 26c108b

5 files changed

+539
-0
lines changed

docs/api_reference.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# API Reference

This document provides an overview of the core packages in PDFKit.

## Core Packages

### `builder`
High-level fluent API for creating PDF documents.
- **Key Types**: `Builder`, `PageBuilder`
- **Usage**: Use this package to programmatically generate PDFs.
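
To illustrate what a fluent interface over `Builder` and `PageBuilder` might look like, here is a minimal sketch. Only the two type names come from this reference; every method (`NewBuilder`, `NewPage`, `Text`, `Done`, `Summary`) is a hypothetical stand-in, and the sketch records text instead of emitting PDF objects.

```go
package main

import (
	"fmt"
	"strings"
)

// Builder and PageBuilder reuse the type names listed above. All methods
// below are hypothetical illustrations, not PDFKit's actual API.
type Builder struct {
	pages []*PageBuilder
}

type PageBuilder struct {
	parent        *Builder
	width, height float64
	lines         []string
}

// NewBuilder returns an empty document builder.
func NewBuilder() *Builder { return &Builder{} }

// NewPage starts a page and returns its builder so calls can be chained.
func (b *Builder) NewPage(width, height float64) *PageBuilder {
	p := &PageBuilder{parent: b, width: width, height: height}
	b.pages = append(b.pages, p)
	return p
}

// Text records a line of text and returns the same page for chaining.
func (p *PageBuilder) Text(s string) *PageBuilder {
	p.lines = append(p.lines, s)
	return p
}

// Done hands control back to the document-level builder.
func (p *PageBuilder) Done() *Builder { return p.parent }

// Summary describes what was built. A real builder would produce PDF
// objects at this point instead of a string.
func (b *Builder) Summary() string {
	var sb strings.Builder
	for i, p := range b.pages {
		fmt.Fprintf(&sb, "page %d (%gx%g): %d line(s)\n", i+1, p.width, p.height, len(p.lines))
	}
	return sb.String()
}

func main() {
	doc := NewBuilder().
		NewPage(612, 792).Text("Hello").Text("World").Done().
		NewPage(612, 792).Text("Second page").Done()
	fmt.Print(doc.Summary())
}
```

Returning the receiver (or its parent) from every call is what makes the chain possible; the page dimensions used here are US Letter in PDF points.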

### `parser`
Responsible for parsing PDF files into the Intermediate Representation (IR).
- **Key Types**: `Parser`, `Config`
- **Usage**: Use this package to read existing PDFs.

### `writer`
Handles serialization of the IR back to PDF file format.
- **Key Types**: `Writer`, `Config`
- **Usage**: Use this package to save documents to disk or stream.

### `ir` (Intermediate Representation)
The core data structures representing the PDF.
- **`ir/raw`**: Low-level PDF objects (Dictionaries, Arrays, Streams).
- **`ir/decoded`**: Objects with streams decompressed and decrypted.
- **`ir/semantic`**: High-level semantic objects (Pages, Fonts, Annotations).
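
The three IR levels can be pictured as three types with an explicit function at each boundary. The sketch below is an illustration under assumptions: `RawStream`, `DecodedStream`, `Page`, `Decode`, and `Interpret` are hypothetical names, only the `ASCIIHexDecode` filter is handled, and the "semantic" step just splits the content into lines instead of running a real content-stream parser.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// RawStream is a Level 1 object: bytes exactly as stored in the file.
// (Hypothetical type, not taken from the ir/raw package.)
type RawStream struct {
	Filter string // value of the /Filter entry, e.g. "ASCIIHexDecode"
	Data   []byte
}

// DecodedStream is a Level 2 object: stream filters have been removed.
type DecodedStream struct {
	Data []byte
}

// Page is a Level 3 object: the bytes are now treated as page content.
type Page struct {
	Operations []string
}

// Decode crosses the Raw -> Decoded boundary.
func Decode(r RawStream) (DecodedStream, error) {
	switch r.Filter {
	case "":
		return DecodedStream{Data: r.Data}, nil
	case "ASCIIHexDecode":
		// The filter ignores whitespace and stops at the '>' end marker.
		s := string(r.Data)
		if i := strings.IndexByte(s, '>'); i >= 0 {
			s = s[:i]
		}
		s = strings.Join(strings.Fields(s), "")
		if len(s)%2 == 1 {
			s += "0" // an odd digit count is padded with a trailing zero
		}
		b, err := hex.DecodeString(s)
		if err != nil {
			return DecodedStream{}, err
		}
		return DecodedStream{Data: b}, nil
	}
	return DecodedStream{}, fmt.Errorf("unsupported filter %q", r.Filter)
}

// Interpret crosses the Decoded -> Semantic boundary. Splitting on newlines
// is a deliberate simplification standing in for a content-stream parser.
func Interpret(d DecodedStream) Page {
	return Page{Operations: strings.Split(string(d.Data), "\n")}
}

func main() {
	raw := RawStream{Filter: "ASCIIHexDecode", Data: []byte("42540A2848692920546A0A4554>")}
	decoded, err := Decode(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%q\n", Interpret(decoded).Operations)
}
```

Keeping each level as its own type means a caller cannot accidentally hand undecoded bytes to code that expects decoded content.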

## Support Packages

### `compliance`
Unified compliance engine for PDF/A, PDF/UA, PDF/VT, and PDF/X.
- **Subpackages**: `pdfa`, `pdfua`, `pdfvt`, `pdfx`

### `contentstream`
Parses and processes PDF content streams (drawing operators).
- **Key Types**: `Processor`, `GraphicsState`

### `filters`
Implements standard PDF stream filters (Flate, DCT, JPX, etc.).
- **Key Types**: `Decoder`, `Pipeline`
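
A filter pipeline can be modeled as a chain of readers, each wrapping the previous one. The sketch below assumes a hypothetical `Decoder` interface with a single `Decode` method and a `Pipeline` slice type; the real signatures in `filters` may differ. It relies on the Go standard library: `compress/zlib` for Flate (PDF's FlateDecode uses the zlib format) and `encoding/hex` as a simplified ASCIIHexDecode that does not handle whitespace or the `>` end marker.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/hex"
	"fmt"
	"io"
)

// Decoder is a hypothetical interface: it wraps an encoded reader and
// returns a reader that yields decoded bytes.
type Decoder interface {
	Decode(r io.Reader) (io.Reader, error)
}

// flateDecoder inflates zlib-wrapped data.
type flateDecoder struct{}

func (flateDecoder) Decode(r io.Reader) (io.Reader, error) {
	zr, err := zlib.NewReader(r)
	if err != nil {
		return nil, err
	}
	return zr, nil
}

// asciiHexDecoder is a simplified hex decoder (no whitespace, no '>' marker).
type asciiHexDecoder struct{}

func (asciiHexDecoder) Decode(r io.Reader) (io.Reader, error) {
	return hex.NewDecoder(r), nil
}

// Pipeline applies its decoders in order, first element first.
type Pipeline []Decoder

func (p Pipeline) Decode(r io.Reader) (io.Reader, error) {
	for _, d := range p {
		var err error
		if r, err = d.Decode(r); err != nil {
			return nil, err
		}
	}
	return r, nil
}

// decodeAll runs the pipeline over data and collects the result.
func decodeAll(p Pipeline, data []byte) ([]byte, error) {
	r, err := p.Decode(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	return io.ReadAll(r)
}

// encodeForDemo produces test input: deflate first, then hex-encode.
// Writes to a bytes.Buffer do not fail, so errors are ignored here.
func encodeForDemo(plain []byte) []byte {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	zw.Write(plain)
	zw.Close()
	return []byte(hex.EncodeToString(buf.Bytes()))
}

func main() {
	// Mirrors a stream with /Filter [/ASCIIHexDecode /FlateDecode].
	p := Pipeline{asciiHexDecoder{}, flateDecoder{}}
	out, err := decodeAll(p, encodeForDemo([]byte("BT (Hello) Tj ET")))
	fmt.Println(string(out), err)
}
```

Because each stage is an `io.Reader`, nothing is buffered until the caller reads, which fits the streaming goals described in the architecture document.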

### `fonts`
Handles font parsing, subsetting, and embedding.
- **Key Types**: `SubsettingPipeline`, `GlyphAnalyzer`

### `security`
Manages encryption, decryption, and digital signatures.
- **Key Types**: `Handler`, `Permissions`

### `xref`
Resolves cross-reference tables and streams.
- **Key Types**: `Table`, `Resolver`
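
As background for what a resolver has to read, a classic cross-reference table stores one fixed-width entry per object: a 10-digit number, a space, a 5-digit generation number, a space, and `n` (in use) or `f` (free), followed by a 2-byte end-of-line. The sketch below parses one such entry. The `Entry` type and `parseXRefEntry` function are hypothetical and are not the `Table` or `Resolver` types listed above; cross-reference streams are not covered.

```go
package main

import (
	"fmt"
	"strconv"
)

// Entry is a hypothetical in-memory form of one xref table entry.
type Entry struct {
	// Offset is the byte offset of the object for in-use entries. For free
	// entries the same field holds the object number of the next free object.
	Offset     int64
	Generation int
	InUse      bool
}

// parseXRefEntry parses the first 18 bytes of a fixed-width entry.
// The trailing 2-byte end-of-line is not inspected.
func parseXRefEntry(line string) (Entry, error) {
	if len(line) < 18 || line[10] != ' ' || line[16] != ' ' {
		return Entry{}, fmt.Errorf("malformed xref entry %q", line)
	}
	off, err := strconv.ParseInt(line[:10], 10, 64)
	if err != nil {
		return Entry{}, fmt.Errorf("bad offset: %w", err)
	}
	gen, err := strconv.Atoi(line[11:16])
	if err != nil {
		return Entry{}, fmt.Errorf("bad generation: %w", err)
	}
	switch line[17] {
	case 'n':
		return Entry{Offset: off, Generation: gen, InUse: true}, nil
	case 'f':
		return Entry{Offset: off, Generation: gen}, nil
	}
	return Entry{}, fmt.Errorf("unknown entry type %q", line[17])
}

func main() {
	for _, line := range []string{"0000000000 65535 f", "0000000017 00000 n"} {
		e, err := parseXRefEntry(line)
		fmt.Printf("%+v %v\n", e, err)
	}
}
```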

## Extension Packages

### `extensions`
Plugin system for inspecting, sanitizing, transforming, and validating PDFs.
- **Key Types**: `Hub`, `Extension`
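
One way to picture a phased plugin system is a hub that sorts registered extensions by phase before running them. The sketch below borrows the `Hub` and `Extension` names from the list above, but the `Phase` constants, the method set, and the `Document` type are assumptions made for illustration. The phase order follows the inspect, sanitize, transform, validate sequence named in the package description.

```go
package main

import (
	"fmt"
	"sort"
)

// Phase is a hypothetical ordering key for extensions.
type Phase int

const (
	Inspect Phase = iota
	Sanitize
	Transform
	Validate
)

// Document is a stand-in for the semantic IR that extensions operate on.
type Document struct {
	Log []string
}

// Extension is a hypothetical plugin contract.
type Extension interface {
	Name() string
	Phase() Phase
	Run(doc *Document) error
}

// Hub collects extensions and runs them in phase order.
type Hub struct {
	exts []Extension
}

func (h *Hub) Register(e Extension) { h.exts = append(h.exts, e) }

// Run executes extensions by ascending phase. The stable sort preserves
// registration order among extensions that share a phase.
func (h *Hub) Run(doc *Document) error {
	ordered := append([]Extension(nil), h.exts...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Phase() < ordered[j].Phase()
	})
	for _, e := range ordered {
		if err := e.Run(doc); err != nil {
			return fmt.Errorf("extension %q: %w", e.Name(), err)
		}
	}
	return nil
}

// funcExt adapts a plain function to the Extension interface.
type funcExt struct {
	name  string
	phase Phase
	run   func(*Document) error
}

func (f funcExt) Name() string          { return f.name }
func (f funcExt) Phase() Phase          { return f.phase }
func (f funcExt) Run(d *Document) error { return f.run(d) }

// logExt returns an extension that only records its own name.
func logExt(name string, p Phase) Extension {
	return funcExt{name: name, phase: p, run: func(d *Document) error {
		d.Log = append(d.Log, name)
		return nil
	}}
}

func main() {
	h := &Hub{}
	h.Register(logExt("transform", Transform))
	h.Register(logExt("inspect", Inspect))
	h.Register(logExt("validate", Validate))
	h.Register(logExt("sanitize", Sanitize))

	doc := &Document{}
	if err := h.Run(doc); err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Println(doc.Log)
}
```

Stopping at the first error is one possible policy; a hub could equally collect errors and continue, depending on the recovery strategy in use.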

### `layout`
Layout engine for converting Markdown and HTML to PDF.
- **Key Types**: `Engine`

### `scripting`
JavaScript execution environment for PDF forms and actions.
- **Key Types**: `Engine`

docs/architecture.md

Lines changed: 313 additions & 0 deletions
@@ -0,0 +1,313 @@
# PDF Parser & Creator Library — Architecture

## 1. Executive Summary

This document specifies the architecture, interfaces, and extensibility model for a **production-grade PDF parser and creator library** in Go.

The library is designed to support:

* **Streaming parsing and writing** for large PDFs with configurable backpressure
* **Three-tier IR (Intermediate Representation)** with explicit transformation boundaries
* **Full fidelity read/write** including vector graphics, text, images, forms, annotations, and metadata
* **Incremental updates** with append-only or full rewrite modes
* **Font subsetting and embedding** with complete pipeline specification
* **PDF/A compliance** (1/2/3 variants) with validation and auto-correction
* **Extensible plugin system** with defined execution phases and ordering
* **Robust security and error recovery**

The library provides both **high-level convenience APIs** for typical PDF tasks and **low-level control** for applications requiring fine-grained operations.

---

## 2. Goals

**Primary Goals**

1. Parse PDFs of any complexity with configurable memory limits
2. Provide three explicit IR levels (Raw, Decoded, Semantic) with clear transformation boundaries
3. Enable deterministic PDF generation with embedded font subsetting
4. Support incremental updates and append-only modifications
5. Ensure PDF/A compliance with validation and automatic fixes
6. Provide extensibility through well-defined plugin phases
7. Handle malformed PDFs with configurable error recovery
8. Support concurrent operations where safe

**Non-Goals**

* Full OCR or text recognition (may be delegated to plugins)
* PDF rendering engine (layout calculation for display)
* Built-in cloud storage integration (external I/O adapters)

---

## 3. Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                   High-Level Builder API                    │
│               (Convenience, Fluent Interface)               │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Semantic IR (Level 3)                    │
│         Pages, Fonts, Images, Annotations, Metadata         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Decoded IR (Level 2)                     │
│           Decompressed Streams, Decrypted Objects           │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                      Raw IR (Level 1)                       │
│        Dictionaries, Arrays, Streams, Names, Numbers        │
└────────────────────────┬────────────────────────────────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
          ▼              ▼              ▼
     ┌─────────┐    ┌─────────┐    ┌──────────┐
     │ Scanner │    │  XRef   │    │ Security │
     │Tokenizer│    │Resolver │    │ Handler  │
     └─────────┘    └─────────┘    └──────────┘
          │              │              │
          └──────────────┴──────────────┘
                         │
                         ▼
                   Input Stream

───────────────────────────────────────────────────────────────

┌─────────────────────────────────────────────────────────────┐
│                        Extension Hub                        │
│          Inspect → Sanitize → Transform → Validate          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Serialization Engine                     │
│       Full Writer, Incremental Writer, Linearization        │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
                   Output Stream
```

**Cross-Cutting Concerns (injected into all layers):**
- Error Recovery Strategy
- Context & Cancellation
- Security Limits
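
As a rough illustration of how two of these concerns could be injected into a layer, the sketch below bundles a `context.Context` and a byte limit into one value and consults both while reading. Everything here (`Env`, `Limits`, `NewEnv`, `ErrLimitExceeded`) is a hypothetical name chosen for the example; the error recovery strategy is left out for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
)

// ErrLimitExceeded is a hypothetical sentinel for a breached security limit.
var ErrLimitExceeded = errors.New("security limit exceeded")

// Limits is a hypothetical bundle of security limits.
type Limits struct {
	MaxDecodedBytes int64
}

// Env groups the cross-cutting values that would be passed into each layer.
type Env struct {
	Ctx    context.Context
	Limits Limits
}

// NewEnv returns an Env with a cancellable context and the given byte limit.
func NewEnv(maxBytes int64) (*Env, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	return &Env{Ctx: ctx, Limits: Limits{MaxDecodedBytes: maxBytes}}, cancel
}

// ReadAll reads r in chunks. It checks for cancellation before each read
// and stops as soon as the accumulated size exceeds the limit.
func (e *Env) ReadAll(r io.Reader) ([]byte, error) {
	var out []byte
	buf := make([]byte, 4096)
	for {
		if err := e.Ctx.Err(); err != nil {
			return nil, err
		}
		n, err := r.Read(buf)
		out = append(out, buf[:n]...)
		if int64(len(out)) > e.Limits.MaxDecodedBytes {
			return nil, ErrLimitExceeded
		}
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

// ReadString is a convenience wrapper used by the demo below.
func (e *Env) ReadString(s string) (string, error) {
	b, err := e.ReadAll(strings.NewReader(s))
	return string(b), err
}

func main() {
	env, cancel := NewEnv(8)
	defer cancel()

	s, err := env.ReadString("small")
	fmt.Println(s, err)

	_, err = env.ReadString("this input is too large")
	fmt.Println(errors.Is(err, ErrLimitExceeded))
}
```

Passing one value through every layer keeps function signatures stable when a new limit is added later.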

---

## 4. Module Architecture

### 4.1 Core Modules

| Module | Responsibility |
| ---------------- | ----------------------------------------------------------------- |
| `scanner` | Tokenizes raw PDF bytes, handles PDF syntax |
| `xref` | Resolves cross-reference tables and streams |
| `security` | Encryption/decryption, permissions, password handling |
| `parser` | Coordinates scanning, xref, security to build Raw IR |
| `ir/raw` | Raw PDF objects (Level 1): dictionaries, arrays, streams |
| `ir/decoded` | Decoded objects (Level 2): decompressed, decrypted |
| `ir/semantic` | Semantic objects (Level 3): pages, fonts, annotations |
| `filters` | Stream decoders (Flate, DCT, JPX, etc.) with pipeline composition |
| `contentstream` | Content stream parsing, graphics state, text positioning |
| `resources` | Resource resolution with inheritance and scoping |
| `fonts` | Font subsetting, embedding, ToUnicode generation |
| `coords` | Coordinate transformations, user space, device space |
| `writer` | PDF serialization: full, incremental, linearized |
| `pdfa` | PDF/A validation, XMP generation, ICC profiles, compliance fixes |
| `extensions` | Plugin system with phased execution model |
| `recovery` | Error recovery strategies for malformed PDFs |
| `builder` | High-level fluent API for PDF construction |
| `layout` | Layout engine for converting structured content (Markdown/HTML) to PDF |
| `scripting` | JavaScript execution environment and PDF DOM implementation |
| `contentstream/editor` | Spatial indexing (QuadTree) and content redaction/editing |
| `security/validation` | Digital signature validation, LTV, and revocation checking |
| `xfa` | XML Forms Architecture parsing and layout engine |
| `cmm` | Color Management Module (ICC, CxF) |
| `geo` | Geospatial PDF support |
| `compliance` | Unified compliance engine (PDF/A, PDF/X, PDF/UA) |

### 4.2 Module Dependencies

```
builder
  └─→ ir/semantic
        └─→ ir/decoded
              └─→ ir/raw
                    └─→ scanner, xref, security

layout
  └─→ builder

extensions
  └─→ ir/semantic (operates on semantic IR)

writer
  └─→ ir/semantic
        └─→ ir/decoded
              └─→ ir/raw

fonts
  └─→ ir/semantic (pages, text)

contentstream
  └─→ ir/decoded (stream bytes)
  └─→ coords (transformations)

filters
  └─→ ir/raw (stream dictionaries)

recovery
  └─→ (injected into all layers)
```

---

## 5. Three-Tier IR Architecture

### 5.1 Level 1: Raw IR

**Purpose:** Direct representation of primitive PDF objects (dictionaries, arrays, streams, names, numbers) as defined by the PDF specification.

### 5.2 Level 2: Decoded IR

**Purpose:** Objects after stream filters have been decoded and encrypted content has been decrypted.

### 5.3 Level 3: Semantic IR

**Purpose:** High-level PDF structures (pages, fonts, images, annotations, metadata) and the logic that operates on them.

### 5.4 IR Transformation Pipeline

---

## 6. Core Component Specifications

### 6.1 Scanner & Parser
### 6.2 XRef Resolution
### 6.3 Object Loader
### 6.4 Filter Pipeline
### 6.5 Security Handler

---

## 7. Content Stream Architecture

---

## 8. Resource Resolution Architecture

---

## 9. Coordinate System Architecture

---

## 10. Font Subsetting Architecture

---

## 11. Streaming Architecture

---

## 12. Extension System Architecture

---

## 13. Writer Architecture

---

## 14. PDF/A Compliance Architecture

---

## 15. Error Recovery Architecture

---

## 16. Advanced Features Architecture (v2.4+)

### 16.1 Scripting Engine
### 16.2 Content Editor & Spatial Indexing
### 16.3 Digital Signature Validation (LTV)
### 16.4 XFA Support
### 16.5 Color Management (CMM)
### 16.6 Geospatial Support
### 16.7 Compliance Engine

---

## 17. High-Level Builder API

---

## 18. Concurrency Model

### 18.1 Thread Safety
### 18.2 Parallel Processing Opportunities

---

## 19. Security Architecture

### 19.1 Security Limits
### 19.2 Input Validation

---

## 20. Layout Engine

### 20.1 Engine Architecture
### 20.2 Supported Features

---

## 21. Testing Strategy

### 21.1 Test Corpus
### 21.2 Test Categories

---

## 22. Performance Targets

---

## 23. Roadmap

---

## 24. Dependencies

---

## 25. API Stability

---

## 26. References

---

## 27. Appendix: Example Workflows

### Example 1: Parse and Extract Text
### Example 2: Create Simple PDF
### Example 3: Font Subsetting
### Example 4: PDF/A Conversion

---

**End of Design Document v2.0**
