A Ballerina library that makes it easy to parse and extract content from documents of many formats — similar to Apache Tika, but designed for Ballerina.

import xlibb/docreader;
import ballerina/io;

public function main() returns error? {
    // Read a document file
    docreader:DocumentInfo|docreader:Error result = docreader:readDocument("./sample.pdf");

    if result is docreader:DocumentInfo {
        io:println("MIME Type: ", result.mimeType);
        io:println("Extension: ", result.extension);

        // Access document metadata
        io:println("Metadata:");
        foreach string key in result.metadata.keys() {
            io:println(string ` ${key}: ${result.metadata[key] ?: ""}`);
        }

        io:println("Content: ", result.content);
    } else {
        io:println("Error: ", result.message());
    }
}

The library supports various file types, including TXT, DOCX, PDF, XLS, PPT, HTML, CSV, XML, JSON, and RTF.

The readDocument function reads a document file and extracts its metadata and content.

Parameters:
- filePath - The absolute or relative path to the document file

Returns:
- DocumentInfo - A record containing:
  - mimeType - The MIME type of the document
  - extension - The file extension without the dot
  - metadata - A map containing document metadata (author, title, creation date, etc.)
  - content - The extracted text content
- Error - If the file cannot be read or parsed

- Download and install Java SE Development Kit (JDK) version 21 (from one of the following locations).
- Export your GitHub personal access token with the read package permissions as follows.

  export packageUser=<Username>
  export packagePAT=<Personal access token>
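
Because the build presumably reads these two variables, a quick pre-flight check can catch a forgotten export before Gradle starts. The following is a minimal sketch in POSIX shell. The check_credentials function is a hypothetical helper, not part of this project, and it only verifies that both variables are non-empty, not that the token is valid.

```shell
# Hypothetical pre-flight helper (an assumption, not shipped with the project).
# Returns 0 when packageUser and packagePAT are both set and non-empty,
# and 1 otherwise, naming each missing variable on stderr.
check_credentials() {
    missing=0
    if [ -z "${packageUser:-}" ]; then
        echo "packageUser is not set" >&2
        missing=1
    fi
    if [ -z "${packagePAT:-}" ]; then
        echo "packagePAT is not set" >&2
        missing=1
    fi
    return "$missing"
}

# Example usage: run the build only when both variables are present.
# check_credentials && ./gradlew clean build
```

The function is kept free of side effects other than the messages on stderr, so it can be pasted into a shell profile or sourced from a wrapper script.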

Execute the commands below to build from the source.

- To build the library:

  ./gradlew clean build

- To run the integration tests:

  ./gradlew clean test

- To build the module without the tests:

  ./gradlew clean build -x test

- To debug the module implementation:

  ./gradlew clean build -Pdebug=<port>
  ./gradlew clean test -Pdebug=<port>

- To debug the module with the Ballerina language:

  ./gradlew clean build -PbalJavaDebug=<port>
  ./gradlew clean test -PbalJavaDebug=<port>

- To publish the ZIP artifact to the local .m2 repository:

  ./gradlew clean build publishToMavenLocal

- To publish the generated artifacts to the local Ballerina central repository:

  ./gradlew clean build -PpublishToLocalCentral=true

- To publish the generated artifacts to the Ballerina central repository:

  ./gradlew clean build -PpublishToCentral=true

We welcome contributions! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.