Commit d9617b6

change readPdfText inputs to an object
add types for the new input setup
add readPdfPages for converting a PDF into an array of pages
1 parent 899ae03 commit d9617b6

16 files changed: +7700 -3408 lines changed

README.md

Lines changed: 35 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -12,26 +12,42 @@ npm install pdf-text-reader
1212

1313
# Usage
1414

15-
<!-- example-link: src/readme-examples/read-pdf-text.example.ts -->
15+
- Read all pages into a single string with `readPdfText`:
1616

17-
```TypeScript
18-
import {readPdfText} from 'pdf-text-reader';
17+
<!-- example-link: src/readme-examples/read-pdf-text.example.ts -->
1918

20-
async function run() {
21-
const pages = await readPdfText('path/to/pdf/file.pdf');
22-
console.log(pages[0]?.lines);
23-
}
19+
```TypeScript
20+
import {readPdfText} from 'pdf-text-reader';
2421

25-
run();
26-
```
22+
async function main() {
23+
const pdfText: string = await readPdfText({url: 'path/to/pdf/file.pdf'});
24+
console.info(pdfText);
25+
}
26+
27+
main();
28+
```
29+
30+
- Read a PDF into individual pages with `readPdfPages`:
31+
<!-- example-link: src/readme-examples/read-pdf-pages.example.ts -->
2732

28-
See [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) (in the git repo) or [dist/index.d.ts](dist/index.d.ts) (in the npm package when installed locally) for detailed argument and return value typing.
33+
```TypeScript
34+
import {readPdfPages} from 'pdf-text-reader';
35+
36+
async function main() {
37+
const pages = await readPdfPages({url: 'path/to/pdf/file.pdf'});
38+
console.info(pages[0]?.lines);
39+
}
40+
41+
main();
42+
```
43+
44+
See [the types](https://github.com/electrovir/pdf-text-reader/tree/master/src/read-pdf.ts) for detailed argument and return value types.
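
Note on the two entry points: the sketch below shows one way per-page output could be flattened into a single string. It assumes only what the README examples show, namely that each page exposes a `lines` array of strings. `PageLike` and `joinPages` are hypothetical names used for illustration, and the real `readPdfText` may join pages differently.

```typescript
// Hypothetical stand-in for the page shape; only `lines` is taken from the README example.
type PageLike = {lines: string[]};

// Hypothetical helper: flatten pages into one string, one text line per output line.
function joinPages(pages: ReadonlyArray<PageLike>): string {
    return pages.map((page) => page.lines.join('\n')).join('\n');
}

const samplePages: PageLike[] = [
    {lines: ['Title', 'first paragraph']},
    {lines: ['second page']},
];

console.info(joinPages(samplePages));
```

If per-page structure matters (page numbers, per-page search), `readPdfPages` is likely the better starting point; a flattening step like this one can always be applied afterwards.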
2945

3046
# Details
3147

32-
This uses Mozilla's [`pdf.js`](https://github.com/mozilla/pdf.js/) package through its [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) distribution on npm. As such, any valid input to `pdf.js`'s `getDocument` function are valid inputs to _this_ package's `readPdfText` function. See [`pdfjs-dist`'s types folder](https://github.com/mozilla/pdfjs-dist/blob/master/types/display/api.d.ts) for more info on that or, for just the type information, read [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) in this repo.
48+
This uses Mozilla's [`pdf.js`](https://github.com/mozilla/pdf.js/) package through its [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) distribution on npm.
3349

34-
This package simply reads the output of `pdfjs.getDocument` and sorts it into lines based on the text vertical position in the document. It also inserts spaces for text on the same line that is far apart horizontally and new lines in between lines that are far apart vertically.
50+
This package simply reads the output of `pdfjs.getDocument` and sorts it into lines based on text position in the document. It also inserts spaces for text on the same line that is far apart horizontally and new lines in between lines that are far apart vertically.
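
Note on the line sorting described here: the following is a minimal sketch of the general idea, not the package's actual implementation. It assumes a simplified item shape (`SimpleItem`) with explicit `x`, `y`, and `width` fields, whereas real `pdf.js` text items carry their position inside a transform matrix. The `yTolerance` and `gapThreshold` values are illustrative, only a single space is inserted per gap, and the extra blank lines for large vertical gaps are omitted.

```typescript
// Hypothetical, simplified text item; not the pdf.js TextItem type.
type SimpleItem = {str: string; x: number; y: number; width: number};

// Group items into lines by vertical position, then order each line left to right.
function groupIntoLines(
    items: ReadonlyArray<SimpleItem>,
    yTolerance = 2,
    gapThreshold = 5,
): string[] {
    // PDF coordinates grow upward, so sort by descending y to read top to bottom.
    const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
    const lines: SimpleItem[][] = [];

    for (const item of sorted) {
        const currentLine = lines[lines.length - 1];
        // Items whose y is close to the first item of the current line join that line.
        if (currentLine && Math.abs(currentLine[0]!.y - item.y) <= yTolerance) {
            currentLine.push(item);
        } else {
            lines.push([item]);
        }
    }

    return lines.map((line) => {
        const ordered = [...line].sort((a, b) => a.x - b.x);
        let text = '';
        let previousEnd: number | undefined;
        for (const item of ordered) {
            // Insert one space when two items on the same line are far apart horizontally.
            if (previousEnd !== undefined && item.x - previousEnd > gapThreshold) {
                text += ' ';
            }
            text += item.str;
            previousEnd = item.x + item.width;
        }
        return text;
    });
}

const sampleItems: SimpleItem[] = [
    {str: 'world', x: 50, y: 700.5, width: 30},
    {str: 'Next', x: 10, y: 680, width: 25},
    {str: 'Hello', x: 10, y: 700, width: 30},
];

console.info(groupIntoLines(sampleItems));
```

Comparing against the first item of each line keeps the sketch short; a more careful version might compare against a running average of `y` to tolerate slightly slanted baselines.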
3551

3652
Example:
3753

@@ -45,27 +61,25 @@ The number of spaces to insert is calculated by an extremely naive but very simp
4561

4662
# Low Level Control
4763

48-
If you need lower level parsing control, you can also use the exported `parsePageItems` function. This only reads one page at a time as seen below. This function is used by `readPdfText` so the output will be identical for the same pdf page.
64+
If you need lower level parsing control, you can also use the exported `parsePageItems` function. This only reads one page at a time as seen below. This function is used by `readPdfPages` so the output will be identical for the same pdf page.
4965

50-
You may need the [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) npm package independently installed to do this.
66+
You may need to independently install the [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) npm package for this to work.
5167

5268
<!-- example-link: src/readme-examples/lower-level-controls.example.ts -->
5369

5470
```TypeScript
5571
import * as pdfjs from 'pdfjs-dist';
56-
import {TextItem} from 'pdfjs-dist/types/src/display/api';
72+
import type {TextItem} from 'pdfjs-dist/types/src/display/api';
5773
import {parsePageItems} from 'pdf-text-reader';
5874
59-
async function run() {
75+
async function main() {
6076
const doc = await pdfjs.getDocument('myDocument.pdf').promise;
61-
const page = await doc.getPage(1);
77+
const page = await doc.getPage(1); // 1-indexed
6278
const content = await page.getTextContent();
6379
const items: TextItem[] = content.items.filter((item): item is TextItem => 'str' in item);
6480
const parsedPage = parsePageItems(items);
65-
console.log(parsedPage.lines);
81+
console.info(parsedPage.lines);
6682
}
6783
68-
run();
84+
main();
6985
```
70-
71-
See [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) (in the git repo) or [dist/index.d.ts](dist/index.d.ts) (in the npm package when installed locally) for detailed argument and return value typing.
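
Note on the `'str' in item` filter in the lower-level example: `getTextContent()` can return items that are not plain text, so the example narrows the array with a type guard before calling `parsePageItems`. The sketch below reproduces that narrowing with local stand-in types (`TextItemLike`, `MarkedContentLike`), which are assumptions made so the snippet runs without `pdfjs-dist` installed; the real types have more fields.

```typescript
// Local stand-ins for the two item shapes; the real types live in pdfjs-dist.
type TextItemLike = {str: string};
type MarkedContentLike = {type: string; id: string};

// Type guard: an item counts as text only if it has a `str` property.
function isTextItem(item: TextItemLike | MarkedContentLike): item is TextItemLike {
    return 'str' in item;
}

const mixedItems: Array<TextItemLike | MarkedContentLike> = [
    {str: 'Hello'},
    {type: 'beginMarkedContent', id: 'mc0'},
    {str: 'world'},
];

// After filtering with the guard, TypeScript knows every element has `str`.
const textOnly: TextItemLike[] = mixedItems.filter(isTextItem);
console.info(textOnly.map((item) => item.str));
```

Without the `item is TextItemLike` return type, `filter` would keep the union type and the assignment to `TextItemLike[]` would fail to compile.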

configs/ncu.config.ts

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -6,6 +6,11 @@ export const ncuConfig: RunOptions = {
66
// exclude these
77
reject: [
88
...baseNcuConfig.reject,
9+
/**
10+
* Different versions of this have global pollution issues; we're currently on a version that
11+
* doesn't.
12+
*/
13+
'pdfjs-dist',
914
],
1015
// include only these
1116
filter: [],
