See [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) (in the git repo) or [dist/index.d.ts](dist/index.d.ts) (in the npm package when installed locally) for detailed argument and return value typing.
See [the types](https://github.com/electrovir/pdf-text-reader/tree/master/src/read-pdf.ts) for detailed argument and return value types.
# Details
-
This uses Mozilla's [`pdf.js`](https://github.com/mozilla/pdf.js/) package through its [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) distribution on npm. As such, any valid input to `pdf.js`'s `getDocument` function are valid inputs to _this_ package's `readPdfText` function. See [`pdfjs-dist`'s types folder](https://github.com/mozilla/pdfjs-dist/blob/master/types/display/api.d.ts) for more info on that or, for just the type information, read [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) in this repo.
+
This uses Mozilla's [`pdf.js`](https://github.com/mozilla/pdf.js/) package through its [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) distribution on npm.
-
This package simply reads the output of `pdfjs.getDocument` and sorts it into lines based on the text vertical position in the document. It also inserts spaces for text on the same line that is far apart horizontally and new lines in between lines that are far apart vertically.
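As a rough illustration of the line-sorting idea described above, here is a minimal TypeScript sketch. The `PositionedText` shape, the `groupIntoLines` name, the assumption that `y` grows downward, and every threshold value are hypothetical and chosen only for illustration; this is not the package's actual implementation, and real `pdf.js` text items expose their position differently.

```typescript
// Hypothetical, simplified text item. Real pdf.js items carry a transform
// matrix rather than plain x/y fields.
interface PositionedText {
    text: string;
    x: number; // left edge of the text run
    y: number; // vertical position, assumed here to grow downward
    width: number; // horizontal extent of the text run
}

interface Row {
    y: number;
    items: PositionedText[];
}

// All default thresholds are illustrative assumptions.
function groupIntoLines(
    items: PositionedText[],
    lineTolerance = 2, // max vertical distance for two items to share a line
    charWidth = 5, // assumed width of one space character
    lineHeight = 12, // assumed distance between consecutive lines
): string[] {
    // Sort top to bottom so items on the same line end up adjacent.
    const sorted = [...items].sort((a, b) => a.y - b.y);

    const rows: Row[] = [];
    let current: Row | undefined;
    for (const item of sorted) {
        if (current && Math.abs(item.y - current.y) <= lineTolerance) {
            current.items.push(item);
        } else {
            current = {y: item.y, items: [item]};
            rows.push(current);
        }
    }

    const lines: string[] = [];
    let previousY: number | undefined;
    for (const row of rows) {
        // Insert empty lines when rows are far apart vertically.
        if (previousY !== undefined) {
            const blank = Math.max(0, Math.round((row.y - previousY) / lineHeight) - 1);
            for (let i = 0; i < blank; i++) {
                lines.push('');
            }
        }

        // Order items left to right and pad horizontal gaps with spaces.
        row.items.sort((a, b) => a.x - b.x);
        let text = '';
        let cursor: number | undefined;
        for (const item of row.items) {
            if (cursor !== undefined) {
                const gap = item.x - cursor;
                text += ' '.repeat(Math.max(0, Math.round(gap / charWidth)));
            }
            text += item.text;
            cursor = item.x + item.width;
        }

        lines.push(text);
        previousY = row.y;
    }
    return lines;
}
```

Dividing the gap by a fixed character width is deliberately naive; a real document would need thresholds derived from the font size of each item.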
@@ -45,27 +61,25 @@ The number of spaces to insert is calculated by an extremely naive but very simp
# Low Level Control
-
If you need lower level parsing control, you can also use the exported `parsePageItems` function. This only reads one page at a time as seen below. This function is used by `readPdfText` so the output will be identical for the same pdf page.
+
If you need lower level parsing control, you can also use the exported `parsePageItems` function. This only reads one page at a time as seen below. This function is used by `readPdfPages` so the output will be identical for the same pdf page.
-
You may need the [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) npm package independently installed to do this.
+
You may need to independently install the [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) npm package for this to work.
const items: TextItem[] = content.items.filter((item): item is TextItem => 'str' in item);
const parsedPage = parsePageItems(items);
-
console.log(parsedPage.lines);
+
console.info(parsedPage.lines);
}
-
run();
+
main();
```
-
-
See [src/index.ts](https://github.com/electrovir/pdf-text-reader/tree/master/src/index.ts) (in the git repo) or [dist/index.d.ts](dist/index.d.ts) (in the npm package when installed locally) for detailed argument and return value typing.