Commit c3de9b8

Merge pull request #33 from alimaredia/dp-wb-init
[RHAIENG-1043] Add instructions on creating a custom workbench image
2 parents 0be6daf + 2c0abc2 commit c3de9b8

4 files changed

Lines changed: 254 additions & 2 deletions

README.md

Lines changed: 6 additions & 2 deletions
@@ -6,6 +6,8 @@

 This repository provides reference **data-processing pipelines and examples** for [Open Data Hub](https://github.com/opendatahub-io) / [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai). It focuses on **document conversion** and **chunking** using the [Docling](https://docling-project.github.io/docling/) toolkit, packaged as [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/) (KFP), example [Jupyter Notebooks](https://jupyter.org/), and helper scripts.

+The `custom-workbench-image` directory also provides a guide on how to create a custom [workbench image](https://github.com/opendatahub-io-contrib/workbench-images) to run Docling and the example notebooks in this repository.
+
 ## 📦 Repository Structure

 ```bash
@@ -16,8 +18,10 @@ odh-data-processing
 | |- docling-vlm
 |
 |- notebooks
-|- model-customization-data-preprocessing
-|- model-customization-data-postprocessing
+|- tutorials
+|- use-cases
+|
+|- custom-workbench-image
 ```

 ## ✨ Getting Started
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
FROM quay.io/modh/odh-workbench-jupyter-minimal-cuda-py312-ubi9@sha256:f7126e237f1dfe3a4cda89b60c4e8d9e45afabb247030765edb1c5532a7010fc

USER root

# Enable EPEL and install build dependencies
RUN rpm -Uvh --nosignature https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm || true && \
    dnf -y install \
    git gcc gcc-c++ make autoconf automake libtool pkgconfig \
    zlib-devel libjpeg-turbo-devel libpng-devel libtiff-devel libwebp-devel openjpeg2-devel

# Build and install Leptonica from source
ENV LEPTONICA_VERSION=1.83.1
RUN curl -L -o /tmp/leptonica.tar.gz https://github.com/DanBloomberg/leptonica/archive/refs/tags/${LEPTONICA_VERSION}.tar.gz && \
    tar -xzf /tmp/leptonica.tar.gz -C /tmp && \
    cd /tmp/leptonica-${LEPTONICA_VERSION} && \
    ./autogen.sh && \
    ./configure && \
    make -j"$(nproc)" && \
    make install && \
    ldconfig && \
    rm -rf /tmp/leptonica*

# Ensure build and runtime can find Leptonica from /usr/local
ENV PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig:/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH}"
ENV LD_LIBRARY_PATH="/usr/local/lib64:/usr/local/lib:${LD_LIBRARY_PATH}"
ENV LIBLEPT_HEADERSDIR="/usr/local/include"

# Build and install Tesseract from source
ENV TESSERACT_VERSION=5.4.1
RUN git clone --depth=1 --branch ${TESSERACT_VERSION} https://github.com/tesseract-ocr/tesseract /tmp/tesseract && \
    cd /tmp/tesseract && \
    ./autogen.sh && \
    ./configure --disable-static && \
    make -j"$(nproc)" && \
    make install && \
    ldconfig && \
    rm -rf /tmp/tesseract

# Install language data
ENV TESSDATA_VERSION=4.1.0
ENV TESSDATA_PREFIX="/usr/share/tesseract/tessdata"
RUN mkdir -p ${TESSDATA_PREFIX} && \
    git clone --depth=1 --branch ${TESSDATA_VERSION} https://github.com/tesseract-ocr/tessdata ${TESSDATA_PREFIX}

# Build and install tesserocr from source (uses pkg-config to find tesseract/leptonica)
RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools wheel && \
    python3 -m pip install --no-binary=:all: tesserocr

USER 1001

# Clone the repository to a temporary, non-mounted directory
RUN git clone https://github.com/opendatahub-io/odh-data-processing.git /opt/app-root/tmp/odh-data-processing

# Copy a custom entrypoint script into the container
COPY --chown=1001:1 odh-dp-entrypoint.sh /opt/app-root/bin/odh-dp-entrypoint.sh

# Ensure script is executable at the user and group level
RUN chmod ug+x /opt/app-root/bin/odh-dp-entrypoint.sh

# Check to make sure entry point script can be executed by random high number user
RUN ls -l /opt/app-root/bin/odh-dp-entrypoint.sh

# Set the NOTEBOOK_ROOT_DIR to the final destination in the user's home directory
ENV NOTEBOOK_ROOT_DIR="/opt/app-root/src/odh-data-processing/notebooks/use-cases"

# Set the custom script as the new entrypoint
ENTRYPOINT ["/opt/app-root/bin/odh-dp-entrypoint.sh"]

custom-workbench-image/README.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Custom Workbench for Data Processing

Open Data Hub lets users add and run [custom workbench images](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.24/html/managing_openshift_ai/creating-custom-workbench-images).

The guidelines below describe how to create a custom workbench image that starts up with the Jupyter notebooks in this repository.

## Base Images

Depending on hardware resources (e.g., GPU access), it is recommended to start with a jupyter-minimal image on Python 3.12.

Examples include:
```Containerfile
quay.io/modh/odh-workbench-jupyter-minimal-cuda-py312-ubi9
quay.io/modh/odh-workbench-jupyter-minimal-cpu-py312-ubi9
```

The following can be added in a `Containerfile`:

```Containerfile
FROM quay.io/modh/odh-workbench-jupyter-minimal-cuda-py312-ubi9
```
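
As a small illustration of the choice described above, the sketch below maps a hardware variant to one of the two example images. The function name `base_image_for` and the `cuda`/`cpu` keywords are hypothetical helpers invented for this example; they are not part of the repository.

```bash
#!/bin/bash
# Hypothetical helper: map a hardware variant to one of the example base images.
base_image_for() {
    case "$1" in
        cuda) echo "quay.io/modh/odh-workbench-jupyter-minimal-cuda-py312-ubi9" ;;
        cpu)  echo "quay.io/modh/odh-workbench-jupyter-minimal-cpu-py312-ubi9" ;;
        *)    echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Example: print the FROM line for a CPU-only workbench.
echo "FROM $(base_image_for cpu)"
```

The printed line can be pasted into a `Containerfile` in place of the CUDA variant when no GPU is available.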

## Starting up Workbench with Example Notebooks at Runtime

A custom workbench can open directly into a specific set of Jupyter notebooks when it starts.

To configure this, download the notebooks while building the workbench image and set `NOTEBOOK_ROOT_DIR` to the path of the notebooks under `/opt/app-root/src`.

The variable `NOTEBOOK_ROOT_DIR` is used by [start-notebook.sh](https://github.com/opendatahub-io/notebooks/blob/main/jupyter/minimal/ubi9-python-3.12/start-notebook.sh) as the value of the `--ServerApp.root_dir` argument to `jupyter lab`.
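
To make the role of the variable concrete, here is a simplified sketch of how a startup script might turn `NOTEBOOK_ROOT_DIR` into a `jupyter lab` argument. It is not the actual `start-notebook.sh`; the function name `root_dir_arg` and the fallback to `$HOME` when the variable is unset are assumptions made for illustration.

```bash
#!/bin/bash
# Hypothetical simplification; the real start-notebook.sh may differ.
root_dir_arg() {
    # Fall back to $HOME when NOTEBOOK_ROOT_DIR is unset or empty (assumed default).
    echo "--ServerApp.root_dir=${NOTEBOOK_ROOT_DIR:-$HOME}"
}

# Show the argument that would be passed for the path used in this guide.
NOTEBOOK_ROOT_DIR="/opt/app-root/src/odh-data-processing/notebooks/use-cases"
echo "jupyter lab $(root_dir_arg)"
```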

Because `/opt/app-root/src` is mounted by the notebook controller when a workbench starts, and any content the image placed in that directory [will be cleared](https://github.com/opendatahub-io/notebooks/tree/main/examples#opendatahub-dashboard), a script is required to move the notebooks into place at run time.

An example script that accomplishes this is in `odh-dp-entrypoint.sh` and is reproduced below:

```bash
#!/bin/bash
set -e

# Define source and destination paths
SRC_DIR="/opt/app-root/tmp/odh-data-processing"
DEST_DIR="/opt/app-root/src/odh-data-processing"

# Copy the notebooks to the user's persistent home directory if they don't exist
# This ensures the notebooks are present on the first and subsequent launches
if [ ! -d "$DEST_DIR" ]; then
    echo "Notebooks not found in home directory. Copying from $SRC_DIR"
    cp -r "$SRC_DIR" /opt/app-root/src/
else
    echo "$DEST_DIR already exists. $SRC_DIR will not be copied"
fi

# Execute the original command to start the Jupyter server
# This will use the NOTEBOOK_ROOT_DIR env var you set in the Containerfile
# Source code for start-notebook.sh can be seen at:
# https://github.com/opendatahub-io/notebooks/blob/main/jupyter/minimal/ubi9-python-3.12/start-notebook.sh
exec /opt/app-root/bin/start-notebook.sh
```
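
The copy-if-absent behavior of the script can be tried outside a workbench. The sketch below wraps the same logic in a function, here called `seed_notebooks` (a hypothetical name), and runs it twice against temporary directories that stand in for `/opt/app-root/tmp` and `/opt/app-root/src`.

```bash
#!/bin/bash
set -e

# seed_notebooks SRC_DIR DEST_PARENT
# Copies SRC_DIR into DEST_PARENT unless a directory of the same name already exists there.
seed_notebooks() {
    local src_dir="$1" dest_parent="$2"
    local dest_dir
    dest_dir="$dest_parent/$(basename "$src_dir")"
    if [ ! -d "$dest_dir" ]; then
        cp -r "$src_dir" "$dest_parent/"
        echo "copied"
    else
        echo "skipped"
    fi
}

# Temporary stand-ins for the image directory and the mounted home directory.
work="$(mktemp -d)"
mkdir -p "$work/tmp/odh-data-processing/notebooks" "$work/src"
echo "{}" > "$work/tmp/odh-data-processing/notebooks/example.ipynb"

seed_notebooks "$work/tmp/odh-data-processing" "$work/src"   # first launch
seed_notebooks "$work/tmp/odh-data-processing" "$work/src"   # later launch
rm -rf "$work"
```

The first call copies the directory and the second leaves the existing copy untouched, so notebooks a user has already edited in the persistent home directory are not overwritten on a later launch.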

This script is used in the following `Containerfile` snippet to make a repository of notebooks available at startup.

```Containerfile
USER 1001

# Clone the repository to a temporary, non-mounted directory
RUN git clone https://github.com/opendatahub-io/odh-data-processing.git /opt/app-root/tmp/odh-data-processing

# Copy a custom entrypoint script into the container
COPY --chown=1001:1 odh-dp-entrypoint.sh /opt/app-root/bin/odh-dp-entrypoint.sh

# Check to make sure entry point script can be executed by random high number user
RUN ls -l /opt/app-root/bin/odh-dp-entrypoint.sh

# Set the NOTEBOOK_ROOT_DIR to the final destination in the user's home directory
ENV NOTEBOOK_ROOT_DIR="/opt/app-root/src/odh-data-processing/notebooks/use-cases"

# Set the custom script as the new entrypoint
ENTRYPOINT ["/opt/app-root/bin/odh-dp-entrypoint.sh"]
```

## Building Tesseract OCR Dependencies

One of the OCR engines available in Docling is [Tesseract OCR](https://docling-project.github.io/docling/examples/tesseract_lang_detection/).

To use Tesseract OCR within Docling, Tesseract OCR and some of its dependencies need to be downloaded, built from source, and linked in the custom workbench container before Docling is installed. Below is part of a `Containerfile` that does this.

```Containerfile
USER root

# Enable EPEL and install build dependencies
RUN rpm -Uvh --nosignature https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm || true && \
    dnf -y install \
    git gcc gcc-c++ make autoconf automake libtool pkgconfig \
    zlib-devel libjpeg-turbo-devel libpng-devel libtiff-devel libwebp-devel openjpeg2-devel

# Build and install Leptonica from source
ENV LEPTONICA_VERSION=1.83.1
RUN curl -L -o /tmp/leptonica.tar.gz https://github.com/DanBloomberg/leptonica/archive/refs/tags/${LEPTONICA_VERSION}.tar.gz && \
    tar -xzf /tmp/leptonica.tar.gz -C /tmp && \
    cd /tmp/leptonica-${LEPTONICA_VERSION} && \
    ./autogen.sh && \
    ./configure && \
    make -j"$(nproc)" && \
    make install && \
    ldconfig && \
    rm -rf /tmp/leptonica*

# Ensure build and runtime can find Leptonica from /usr/local
ENV PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig:/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH}"
ENV LD_LIBRARY_PATH="/usr/local/lib64:/usr/local/lib:${LD_LIBRARY_PATH}"
ENV LIBLEPT_HEADERSDIR="/usr/local/include"

# Build and install Tesseract from source
ENV TESSERACT_VERSION=5.4.1
RUN git clone --depth=1 --branch ${TESSERACT_VERSION} https://github.com/tesseract-ocr/tesseract /tmp/tesseract && \
    cd /tmp/tesseract && \
    ./autogen.sh && \
    ./configure --disable-static && \
    make -j"$(nproc)" && \
    make install && \
    ldconfig && \
    rm -rf /tmp/tesseract

# Install language data
ENV TESSDATA_VERSION=4.1.0
ENV TESSDATA_PREFIX="/usr/share/tesseract/tessdata"
RUN mkdir -p ${TESSDATA_PREFIX} && \
    git clone --depth=1 --branch ${TESSDATA_VERSION} https://github.com/tesseract-ocr/tessdata ${TESSDATA_PREFIX}

# Build and install tesserocr from source (uses pkg-config to find tesseract/leptonica)
RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools wheel && \
    python3 -m pip install --no-binary=:all: tesserocr
```
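
After the image is built, it can help to confirm that the pieces above are visible inside the container. The following sketch is one way to do that; the helper name `report_cmd` is made up for this example, and the checks assume the build placed `tesseract` on `PATH` and installed `tesserocr` for `python3`.

```bash
#!/bin/bash
# Hypothetical helper: report whether a command is available on PATH.
report_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

for tool in tesseract pkg-config python3; do
    report_cmd "$tool"
done

# Only query versions when the tools are actually present.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
    tesseract --list-langs
fi
if command -v python3 >/dev/null 2>&1; then
    python3 -c "import tesserocr; print(tesserocr.tesseract_version())" 2>/dev/null \
        || echo "tesserocr is not importable"
fi
```

Running it in a terminal inside the workbench gives a quick signal of whether the source builds and the language data were picked up before any Docling conversion is attempted.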

## Building and Pushing the Image

An example `Containerfile` and a shell script (`odh-dp-entrypoint.sh`) that copies notebooks at run time are included in this directory.

The shell script must have execute permission set at the user and group level before it is copied into the container at build time so that it can be executed by the container runtime.

If Tesseract OCR and its dependencies are being built using `podman`, the podman virtual machine should have 4 CPUs and 4096 MB of RAM.

Below are commands that can be run from this directory to build a custom workbench container image.

```bash
chmod ug+x odh-dp-entrypoint.sh
podman machine stop
podman machine set --cpus 4 --memory 4096
podman machine start
podman build -t custom-dp-wb-image:latest .
podman push localhost/custom-dp-wb-image:latest <CONTAINER REGISTRY>
```
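
Before running the build, the permission requirement can be checked with a short preflight. This is only a sketch: `has_ug_exec` is a hypothetical helper, and it relies on `stat -c`, which is the GNU coreutils form (BSD and macOS `stat` use different flags).

```bash
#!/bin/bash
# has_ug_exec FILE
# Succeeds when FILE is executable by both its owner and its group.
has_ug_exec() {
    local mode
    mode="$(stat -c '%A' "$1")" || return 1   # e.g. -rwxr-xr--
    # Character 3 is the owner execute bit, character 6 the group execute bit (0-based).
    case "${mode:3:1}${mode:6:1}" in
        xx|xs|sx|ss) return 0 ;;
        *) return 1 ;;
    esac
}

script="odh-dp-entrypoint.sh"
if [ -e "$script" ] && ! has_ug_exec "$script"; then
    echo "run: chmod ug+x $script"
fi
```

The check prints a reminder only when the script exists and is missing one of the two execute bits; otherwise it stays silent.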

## Workbench Size Recommendations

When selecting the container size for the workbench in the Workbench GUI, users are advised to choose at least the `Medium` size.
This provides 3-6 CPUs and at least 24 GB of memory for the underlying container.

## Consuming Data from S3

To consume data from S3 in your workbench, refer to this [tutorial](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/pdf/working_with_data_in_an_s3-compatible_object_store/Red_Hat_OpenShift_AI_Cloud_Service-1-Working_with_data_in_an_S3-compatible_object_store-en-US.pdf) on how to include files from S3 in a Jupyter notebook.
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
#!/bin/bash
set -e

# Define source and destination paths
SRC_DIR="/opt/app-root/tmp/odh-data-processing"
DEST_DIR="/opt/app-root/src/odh-data-processing"

# Copy the notebooks to the user's persistent home directory if they don't exist
# This ensures the notebooks are present on the first and subsequent launches
if [ ! -d "$DEST_DIR" ]; then
    echo "Notebooks not found in home directory. Copying from $SRC_DIR"
    cp -r "$SRC_DIR" /opt/app-root/src/
else
    echo "$DEST_DIR already exists. $SRC_DIR will not be copied"
fi

# Execute the original command to start the Jupyter server
# This will use the NOTEBOOK_ROOT_DIR env var you set in the Containerfile
# Source code for start-notebook.sh can be seen at:
# https://github.com/opendatahub-io/notebooks/blob/main/jupyter/minimal/ubi9-python-3.12/start-notebook.sh
exec /opt/app-root/bin/start-notebook.sh
