In this section of the ODH Data Processing repository, we provide reference data-processing pipelines for Open Data Hub / Red Hat OpenShift AI, packaged as Kubeflow Pipelines (KFP).
Two KFP pipelines are included:
- docling-standard: Standard Docling pipeline for standard conversions, OCR, table structure, enrichments.
- docling-vlm: Vision-Language-Model (VLM) pipeline supporting local or remote models.
- Docling-based PDF conversion at scale using KFP ParallelFor batch splits and parallel executions
- Two customizable pipelines to suit different needs:
  - Standard PDF pipeline (backends, OCR engines, table structure, image export)
  - VLM pipeline (Docling VLM or Granite-Vision pipeline options; remote VLM service supported)
- Multiple input sources: HTTP/S URLs or S3/S3-compatible APIs like MinIO
- Secret-based configuration:
  - Remote VLM API configuration via a single mounted Kubernetes Secret
  - S3 endpoint and credentials via a single mounted Kubernetes Secret
- Tunable performance and quality: threads, timeouts, OCR forcing, table mode, PDF backends, enrichments
- Works on OpenShift AI/Kubeflow Pipelines
kubeflow-pipelines
|
|- docling-standard
| |- docling_convert_components.py
| |- docling_convert_pipeline.py
| |- docling_convert_pipeline_compiled.yaml (generated)
| |- requirements.txt
|
|- docling-vlm
  |- docling_convert_components.py
  |- docling_convert_pipeline.py
  |- docling_convert_pipeline_compiled.yaml (generated)
  |- requirements.txt

- Access to a KFP 2.x instance (e.g., Red Hat OpenShift AI)
- Optionally, Kubernetes access to create Secrets in your project/namespace
- Python 3.11+ if you'd like to compile the pipeline for your own needs
Start by importing the compiled YAML file for the desired Docling pipeline (standard or VLM) in the KFP UI.
For the Standard Pipeline:
Download the compiled YAML file and upload it on the Import pipeline screen, or import it by URL by pointing it to https://github.com/opendatahub-io/odh-data-processing/raw/refs/heads/main/kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml.
For the VLM Pipeline:
Download the compiled YAML file and upload it on the Import pipeline screen, or import it by URL by pointing it to https://github.com/opendatahub-io/odh-data-processing/raw/refs/heads/main/kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml.
Optionally, compile from source to generate the pipeline YAML yourself:
# Standard pipeline
cd odh-data-processing/kubeflow-pipelines/docling-standard
python standard_convert_pipeline.py
# VLM pipeline
cd odh-data-processing/kubeflow-pipelines/docling-vlm
python vlm_convert_pipeline.py

With the imported pipeline, use the Create run option to configure parameters like the source where your documents are stored and start a conversion.
Once conversion finishes, the converted documents are stored in two output formats: a human-readable Markdown representation of the original document, and a JSON file containing a lossless serialization of the Docling Document. To find where the converted files were stored, open the Graph of your pipeline Run and click the docling-convert box, which should be green, indicating that the conversion was successful. In the details of docling-convert, check the Output artifacts section for the link to the S3-compatible storage where the files were stored. If you need the object storage configuration, such as the access and secret keys, check the Pipeline server actions > View pipeline server configuration option.
Both standard and VLM pipelines provide default conversion options that should be a good starting point for most document conversions. For more advanced conversion options, both pipelines expose a set of runtime parameters that can be changed to tweak the conversion strategy used by Docling:
By default, both pipelines will consume documents stored in an HTTP/S source. To configure the source of the documents you'd like to convert, set the pdf_base_url and the pdf_filenames (comma-separated list of the file names) parameters. The default values of these parameters point to a sample set of PDFs frequently used to test Docling conversions.
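As a rough illustration of how these two parameters relate, the sketch below joins a base URL with each name in a comma-separated list. The helper name `build_pdf_urls` and the example URL are hypothetical, and the pipeline's own importer may combine the values differently.

```python
from urllib.parse import quote


def build_pdf_urls(pdf_base_url: str, pdf_filenames: str) -> list[str]:
    """Join a base URL with each name in a comma-separated filename list.

    Hypothetical helper: it only illustrates the shape of the two parameters.
    """
    base = pdf_base_url.rstrip("/")
    # Ignore empty entries and surrounding whitespace, e.g. "a.pdf, b.pdf,".
    names = [name.strip() for name in pdf_filenames.split(",") if name.strip()]
    # Percent-encode characters such as spaces so each URL stays valid.
    return [f"{base}/{quote(name)}" for name in names]


urls = build_pdf_urls("https://example.com/pdfs/", "report.pdf, annual summary.pdf")
for url in urls:
    print(url)
```

Checking the resulting URLs in a browser or with an HTTP client before starting a run is a cheap way to catch a typo in either parameter.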
- Standard pipeline defaults include pdf_backend=dlparse_v4, image_export_mode=embedded, table_mode=accurate, num_threads=4, timeout_per_document=300, ocr=True, force_ocr=False, ocr_engine=tesseract.
- VLM pipeline defaults include num_threads=4, timeout_per_document=300, image_export_mode=embedded, and remote_model_enabled=False.
- Standard pipeline parameters:
  - docling_image_export_mode: embedded (default), placeholder, or referenced. In embedded mode, the image is embedded as a base64-encoded string. With placeholder, only the position of the image is marked in the output. In referenced mode, the image is exported in PNG format and referenced from the main exported document.
  - docling_table_mode: e.g., accurate (default) or fast. The mode to use in the table structure model.
- Standard pipeline:
  - docling_ocr=True: if enabled, the bitmap content will be processed using OCR.
  - docling_force_ocr=True: forces full-page OCR regardless of input.
  - docling_ocr_engine: tesseract (default), tesserocr, or rapidocr. The OCR engine to use.
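Because several of these parameters accept only a small set of values, it can help to check a run configuration before submitting it. The following is a minimal sketch, not part of the pipeline itself: `check_run_parameters` is a hypothetical helper, and the allowed values are copied from the lists above (the table mode values are only the examples given, so the real set may be larger).

```python
# Allowed values as listed in this README; an assumption, not read from Docling.
ALLOWED_VALUES = {
    "docling_image_export_mode": {"embedded", "placeholder", "referenced"},
    "docling_table_mode": {"accurate", "fast"},
    "docling_ocr_engine": {"tesseract", "tesserocr", "rapidocr"},
}


def check_run_parameters(params: dict[str, object]) -> list[str]:
    """Return one message per parameter whose value is outside the listed set."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if name in params and params[name] not in allowed:
            problems.append(
                f"{name}={params[name]!r} is not one of {sorted(allowed)}"
            )
    return problems


run_params = {
    "docling_image_export_mode": "referenced",
    "docling_table_mode": "fast",
    "docling_ocr_engine": "easyocr",  # deliberately not in the list above
}
for problem in check_run_parameters(run_params):
    print(problem)
```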
- VLM pipeline (docling-vlm): set docling_remote_model_enabled=True to route processing through a VLM model service.
- Configuration for remote VLM models comes from a Kubernetes Secret mounted at /mnt/secrets instead of individual KFP parameters.
  - The secret should be present in the same namespace as the pipeline.
  - Secret name: data-processing-docling-pipeline.
  - Required keys in the Secret data:
    - REMOTE_MODEL_ENDPOINT_URL
    - REMOTE_MODEL_API_KEY
    - REMOTE_MODEL_NAME
  - The component validates the presence of these files under the mount path when docling_remote_model_enabled=True.
  - Example secret creation:

      kubectl create secret generic data-processing-docling-pipeline \
        --from-literal=REMOTE_MODEL_ENDPOINT_URL="https://example/v1/inference" \
        --from-literal=REMOTE_MODEL_API_KEY="REDACTED" \
        --from-literal=REMOTE_MODEL_NAME="granite-vision-3-2"
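When a Secret is mounted as a volume, each key appears as a file under the mount path. The presence check described above could look roughly like the sketch below. The function names are hypothetical and the actual component may implement the validation differently.

```python
from pathlib import Path

# Key names taken from the list above.
REQUIRED_REMOTE_MODEL_KEYS = (
    "REMOTE_MODEL_ENDPOINT_URL",
    "REMOTE_MODEL_API_KEY",
    "REMOTE_MODEL_NAME",
)


def missing_secret_keys(mount_path, required_keys=REQUIRED_REMOTE_MODEL_KEYS):
    """Return the required keys that have no file under the mount path."""
    root = Path(mount_path)
    return [key for key in required_keys if not (root / key).is_file()]


def load_remote_model_config(mount_path="/mnt/secrets"):
    """Read each required key from its file, failing early if any is absent."""
    missing = missing_secret_keys(mount_path)
    if missing:
        raise FileNotFoundError(
            f"Missing secret keys under {mount_path}: {', '.join(missing)}"
        )
    root = Path(mount_path)
    # Strip trailing newlines that often end up in file-based secrets.
    return {key: (root / key).read_text().strip() for key in REQUIRED_REMOTE_MODEL_KEYS}
```

Calling `load_remote_model_config()` inside a pod that mounts the Secret would return the three values as a dictionary; outside the cluster, pass a local directory containing files with the same names to try it out.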
If you'd like to consume documents stored in S3-compatible object storage rather than at an HTTP/S URL:
- Set pdf_from_s3=True to download and convert documents from S3 or an S3-compatible service. Set the names of the files to convert in pdf_filenames, separated by commas.
- Configuration for the S3 connection comes from a Kubernetes Secret mounted at /mnt/secrets instead of individual KFP parameters.
  - The secret should be present in the same namespace as the pipeline.
  - Secret name must be data-processing-docling-pipeline.
  - Required keys in the Secret data:
    - S3_ENDPOINT_URL
    - S3_ACCESS_KEY
    - S3_SECRET_KEY
    - S3_BUCKET
    - S3_PREFIX
  - The pipeline mounts the secret, and the importer reads and uses those values when pdf_from_s3=True.
  - Example secret creation:

      kubectl create secret generic data-processing-docling-pipeline \
        --from-literal=S3_ENDPOINT_URL="https://s3.us-east-2.amazonaws.com" \
        --from-literal=S3_ACCESS_KEY="REDACTED" \
        --from-literal=S3_SECRET_KEY="REDACTED" \
        --from-literal=S3_BUCKET="my-bucket" \
        --from-literal=S3_PREFIX="my-pdfs"
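To make the relationship between the Secret keys and pdf_filenames concrete, here is a minimal sketch that reads the mounted values and derives one object key per file. Both function names are hypothetical, and joining S3_PREFIX and the file name with a slash is an assumption about how the importer builds object keys.

```python
from pathlib import Path

# Key names taken from the list above.
S3_KEYS = ("S3_ENDPOINT_URL", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET", "S3_PREFIX")


def read_s3_settings(mount_path="/mnt/secrets"):
    """Read every S3 key from its file under the Secret mount path."""
    root = Path(mount_path)
    return {key: (root / key).read_text().strip() for key in S3_KEYS}


def object_keys(s3_prefix: str, pdf_filenames: str) -> list[str]:
    """Build one object key per file name (assumed layout: <prefix>/<name>)."""
    prefix = s3_prefix.strip("/")
    names = [name.strip() for name in pdf_filenames.split(",") if name.strip()]
    return [f"{prefix}/{name}" if prefix else name for name in names]


print(object_keys("my-pdfs", "a.pdf, b.pdf"))
```

The settings returned by `read_s3_settings` would then be passed to whichever S3 client the importer uses. That client is deliberately left out here to keep the sketch dependency-free.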
Toggle enrichments via boolean parameters: docling_enrich_code, docling_enrich_formula, docling_enrich_picture_classes, docling_enrich_picture_description.
- Increase num_splits to parallelize across more workers (uses KFP ParallelFor).
- Tune num_threads and timeout_per_document.
- Adjust container resources per component, e.g. set_memory_limit("6G"), set_cpu_limit("4"), in docling-standard/standard_convert_pipeline.py or docling-vlm/vlm_convert_pipeline.py.
- Change the value of the base_image component parameter if you'd like to set a custom container image to be used in the pipeline run.
- Recompile the pipeline YAML after code or parameter interface changes.
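To give an intuition for what num_splits controls, the sketch below divides a list of file names into roughly equal batches, one per parallel worker. It is an illustration only: `split_into_batches` is a hypothetical helper, and the pipeline's own splitting logic may distribute files differently.

```python
def split_into_batches(items: list[str], num_splits: int) -> list[list[str]]:
    """Divide items into at most num_splits contiguous, near-equal batches."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    # Never create more batches than there are items (but always at least one).
    num_batches = min(num_splits, len(items)) or 1
    base, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for index in range(num_batches):
        # The first `extra` batches take one additional item each.
        size = base + (1 if index < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
for batch in split_into_batches(files, num_splits=2):
    print(batch)
```

With five files and num_splits=2, one worker would receive three files and the other two. Raising num_splits beyond the number of files gives no further parallelism in this sketch.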