Standalone version of https://www.npmjs.com/package/ppu-pdf. Provide your own instance of pdfjs-dist and pass the getDocument and Util.
Easily extract text from digital PDF files with coordinate and font size included, and optionally group text by lines.
- Text Extraction: Retrieve all text content from a PDF.
- Coordinate Data: Get precise bounding box and dimension information for each text element.
- Line Grouping: Merge individual text tokens into coherent lines.
- Scanned PDF Detection: Determine if a PDF appears to be scanned or digitally generated.
Note: Using Bun is recommended. Install the package via npm:
npm install pdfjs-dist ppu-pdf-headlessOr using Yarn:
yarn add pdfjs-dist ppu-pdf-headlessBun:
bun add pdfjs-dist ppu-pdf-headlessBelow is an example of how to use the library with Bun:
import { PdfReader } from "ppu-pdf-headless";
import { getDocument, Util } from "pdfjs-dist";
import { PdfReader } from "./pdf-reader";
const pdfReader = new PdfReader(getDocument, Util, { verbose: false });
const file = Bun.file("./src/assets/opposite-expectation.pdf");
const buffer = await file.arrayBuffer();
const pdf = await pdfReader.open(buffer);
// remember it's a map
const texts = await pdfReader.getTexts(pdf);
const page1texts = texts.get(1);
console.log("texts: ", page1texts);
const isScanned = pdfReader.isScanned(texts);
console.log("is pdf scanned: ", isScanned);Configuration options for PdfReader, allowing customization of PDF text extraction behavior.
| Option | Type | Default Value | Description |
|---|---|---|---|
verbose |
boolean |
false |
Enables logging for debugging purposes. |
excludeFooter |
boolean |
true |
Excludes detected footer text from the extracted content. |
excludeHeader |
boolean |
true |
Excludes detected header text from the extracted content. |
raw |
boolean |
false |
If true, returns raw text without additional processing. |
headerFromHeightPercentage |
number |
0.02 |
Defines the height percentage from the top used to identify header text. |
footerFromHeightPercentage |
number |
0.95 |
Defines the height percentage from the bottom used to identify footer text. |
mergeCloseTextNeighbor |
boolean |
true |
Merges text elements that are close to each other into a single entity. |
simpleSortAlgorithm |
boolean |
false |
Uses a simplified sorting algorithm for text positioning. |
const reader = new PdfReader({ verbose: true, excludeFooter: false });These options allow fine-tuned control over how text is extracted and processed from PDFs.
Creates an instance of PdfReader.
- Parameters:
options(optional): Partial options to override the defaults. Refer to thePdfReaderOptionsinterface for available options.
Opens a PDF document.
-
Parameters:
filename: The path to the PDF file or anArrayBuffercontaining the PDF data.
-
Returns: A promise that resolves with the
PDFDocumentProxy.
Extracts the text content from the PDF document.
-
Parameters:
pdf: ThePDFDocumentProxyinstance.
-
Returns: A promise that resolves with a
Mapof page numbers to their correspondingPdfTexts.
Sample return:
// Map (1)
{
"1": {
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
},
"dimension": {
"width": 389.4900053795799,
"height": 13.9999995
},
"metadata": {
"direction": "ltr",
"fontName": "g_d0_f1",
"fontSize": 14,
"hasEOL": false,
"pageNum": 1
},
"id": 0
}
]
}
}Retrieves line information from the page texts.
-
Parameters:
pageTexts: AMapof page numbers to their correspondingPdfTexts.
-
Returns: A
Mapof page numbers to an array ofPdfLineobjects.
Sample return:
// Map (1)
{
"1": [
{
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
},
"averageFontSize": 14,
"dimension": {
"width": 389.4900053795799,
"height": 13.999999500000001
},
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
},
"dimension": {
"width": 389.4900053795799,
"height": 13.9999995
},
"metadata": {
"direction": "ltr",
"fontName": "g_d0_f1",
"fontSize": 14,
"hasEOL": false,
"pageNum": 1
},
"id": 0
}
],
"text": "Opposite Expectation: How to See the World as Two-Sided"
}
]
}Method: getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines
Retrieves a compact representation of line information from the page texts using the specified algorithm.
-
Parameters:
pageTexts: AMapof page numbers to their correspondingPdfTexts.algorithm: An optionalPdfCompactLineAlgorithmspecifying the method for grouping lines. Defaults tomiddleY.
-
Returns: A
Mapof page numbers to an array ofCompactPdfLineobjects, where the line extraction method depends on the chosen algorithm.
Sample return:
// Map (1)
{
"1": [
{
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
},
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
}
}
],
"text": "Opposite Expectation: How to See the World as Two-Sided"
}
]
}Determines whether the PDF appears to be a scanned document.
-
Parameters:
pageTexts: AMapof page numbers to their correspondingPdfTexts.options(optional): Thresholds for scanned detection. Defaults toCONSTANT.WORDS_PER_PAGE_THRESHOLDandCONSTANT.TEXT_LENGTH_THRESHOLD.
-
Returns:
trueif the PDF is considered scanned; otherwise,false.
Contributions are welcome! If you would like to contribute, please follow these steps:
- Fork the Repository: Create your own fork of the project.
- Create a Feature Branch: Use a descriptive branch name for your changes.
- Implement Changes: Make your modifications, add tests, and ensure everything passes.
- Submit a Pull Request: Open a pull request to discuss your changes and get feedback.
This project uses Bun for testing. To run the tests locally, execute:
bun build:testEnsure that all tests pass before submitting your pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
If you encounter any issues or have suggestions, please open an issue in the repository.
Happy coding!
Recommended development environment is in linux-based environment. Library template: https://github.com/aquapi/lib-template
All script sources and usage.
Emit .js and .d.ts files to lib.
Move package.json, README.md to lib and publish the package.
Run files that ends with .bench.ts extension.
To run a specific file.
bun bench index # Run bench/index.bench.tsTo run the benchmark in node, add a --node parameter
bun bench --node
bun bench --node index # Run bench/index.bench.ts with node