Skip to content

Commit 241c0dc

Browse files
Merge pull request red-hat-data-services#74 from MelissaFlinn/rhai-3596-kfp-kft-flow-edit
Documentation edit for the examples/knowledge-tuning/Kubeflow_Pipelines example
2 parents 6ec9a23 + e17962b commit 241c0dc

10 files changed

Lines changed: 294 additions & 263 deletions

File tree

Lines changed: 291 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,291 @@
1+
# Knowledge Tuning Pipeline for Kubeflow Pipelines
2+
3+
## Overview
4+
5+
This Kubeflow Pipelines (KFP) example shows how to use Kubeflow pipelines to automate the steps in the [Knowledge Tuning example](../README.md). By using pipelines, you can run long training jobs or retrain your models on a schedule without having to manually run them in a notebook.
6+
7+
The Knowledge Tuning example uses notebooks to implement a workflow that processes documents, generates synthetic training data using multiple knowledge generation strategies, mixes the generated datasets, and fine-tunes a student model.
8+
9+
This Kubeflow Pipelines example converts the steps into KFP (Kubeflow Pipelines) components for production use. The example is structured with independent components at the step level for easier debugging. The example provides defaults for the flows and parameters. Optionally, you can customize the parameter values and the model.
10+
11+
### About the example pipeline workflow
12+
13+
The pipeline follows the Knowledge Tuning example workflow as shown in the following figure:
14+
15+
Figure 1. End-to-end workflow overview
16+
17+
![End-to-end workflow overview diagram](../../../assets/usecase/knowledge-tuning/Overall%20Flow.png)
18+
19+
### Pipeline components
20+
21+
The `Kubeflow_Pipline/components` subfolder has python files for each component in the Knowledge Tuning workflow:
22+
23+
**Data Processing**
24+
25+
Downloads Docling models (cached) and processes documents from web URLs or local files:
26+
27+
- Converts PDF/HTML to Markdown
28+
- Chunks documents with configurable token limits
29+
- Adds domain-specific context and ICL (In-Context Learning) examples
30+
31+
Example python files: `document_processing.py`, `download_docling_models.py`
32+
33+
Source repository: `opendatahub-io/data-processing`
34+
35+
Base image: `quay.io/fabianofranz/docling-ubi9:2.54.0`
36+
37+
Packages: `torch`, `datasets`, `docling`, `tiktoken`
38+
39+
**Knowledge Generation**
40+
41+
Generates four types of synthetic training data in parallel:
42+
43+
- Detailed summaries: Comprehensive summaries with Q&A pairs
44+
- Extractive summaries: Direct extracts from documents with Q&A (runs sequentially)
45+
- Key facts summary: Focuses on key facts and concepts
46+
- Document-based Q&A: Question-answer pairs based on document content
47+
48+
Merges all datasets after generation.
49+
50+
Example python file: `knowledge_generation.py`
51+
52+
Source repository: `red-hat-data-services/red-hat-ai-examples`
53+
54+
Base image: `quay.io/fabianofranz/docling-ubi9:2.54.0`
55+
56+
Packages: `nest-asyncio`, `sdg-hub`, `datasets`
57+
58+
**Knowledge Mixing**
59+
60+
Processes and combines the generated datasets:
61+
62+
- Samples Q&A pairs based on configurable cut sizes
63+
- Tokenizes content using the student model tokenizer
64+
- Validates and filters data
65+
- Creates training-ready JSONL files in chat format
66+
- Selects the optimal dataset (largest feasible cut size)
67+
68+
Example python file: `knowledge_mixing.py`
69+
70+
Source repository: Not applicable. This is a custom component.
71+
72+
Base image: `quay.io/opendatahub/odh-training-th04-cpu-torch29-py312-rhel9:cpu-3.3`
73+
74+
Packages: `polars`, `transformers`, `torch`
75+
76+
**Model Fine-tuning**
77+
78+
Fine-tunes a student model by using the mixed knowledge dataset:
79+
80+
- Supervised Fine-Tuning (SFT)
81+
- Configurable GPU/memory resources
82+
- Multi-epoch training with batch size control
83+
84+
This is a reusable prebuilt component that is managed by the Kubeflow pipelines team.
85+
86+
Source repository: `red-hat-data-services/pipelines-components`
87+
88+
## Prepare the example pipeline
89+
90+
### Procedure
91+
92+
1. Clone the example repository.
93+
94+
a. To clone the example repository to your local environment, run the following command in a terminal window:
95+
96+
```bash
97+
git clone https://github.com/red-hat-data-services/red-hat-ai-examples.git
98+
```
99+
100+
b. Set your working directory to the `Kubeflow_Pipeline` folder:
101+
102+
```bash
103+
cd red-hat-ai-examples/examples/knowledge-tuning/Kubeflow_Pipeline
104+
```
105+
106+
2. Set up the Python environment.
107+
108+
Run the following commands to install these required packages:
109+
110+
- `kfp==2.15.2`
111+
- `kfp-kubernetes>=2.15.2`
112+
- `kfp-components @ git+https://github.com/red-hat-data-services/pipelines-components@main`
113+
114+
```bash
115+
python -m venv .venv
116+
source .venv/bin/activate # On Windows use: .venv\Scripts\activate
117+
pip install -e .
118+
```
119+
120+
3. To use the example pipeline with its default configuration, skip to Step 4.
121+
122+
If you want to customize the pipeline configuration, edit the `pipeline.py` file and change the following values:
123+
124+
- Document URLs
125+
- Model endpoints and credentials
126+
- Default parameters
127+
128+
See _Customize the pipeline configuration_ for details on the parameter values that you can change.
129+
130+
4. Compile the pipeline:
131+
132+
Before you can define your pipeline in the cluster, you must convert your Python-defined pipeline into YAML format. You can use the Kubeflow Pipelines Software Development Kit to compile your pipeline code into a deployable YAML file for declarative GitOps deployment:
133+
134+
```bash
135+
python pipeline.py
136+
```
137+
138+
### Verification
139+
140+
The result of the python `pipeline.py` command is the `knowledge_tuning_pipeline.yaml` file.
141+
142+
## Import and run the example pipeline
143+
144+
After you compile the pipeline, you can import the YAML file and deploy it in OpenShift AI. The imported YAML file visualizes the pipeline flow.
145+
146+
### Prerequisites
147+
148+
- Make sure that the `nfs-csi` storage class is available on your cluster.
149+
150+
- For workspace storage, KFP uses the pipeline configuration to automatically create a PVC with the following configuration:
151+
152+
| Configuration | Value |
153+
|--------------|-------|
154+
| **Size** | 80Gi |
155+
| **Storage Class** | nfs-csi |
156+
| **Access Modes** | ReadWriteMany |
157+
158+
In the OpenShift console, select **Storage** > **StorageClasses**. Verify that `nfs-csi` is listed.
159+
160+
- Confirm that your cluster has the following resources required by the kubeflow pipeline components:
161+
162+
| Stage | CPU | Memory | GPU | Storage |
163+
|-------|-----|--------|-----|---------|
164+
| Document Processing | 2-4 cores | 8-16 GB | 0 | ~5 GB |
165+
| Knowledge Generation | 2-4 cores | 8-16 GB | 0 (uses API) | ~10 GB |
166+
| Knowledge Mixing | 4-8 cores | 16-32 GB | 0 | ~20 GB |
167+
| Model Training | 8-16 cores | 40+ GB | 8 | ~30 GB |
168+
169+
- Create a Kubernetes secret named `kubernetes-credentials` with the following keys:
170+
171+
| Secret Key | Description | Required |
172+
|------------|-------------|----------|
173+
| `KUBERNETES_SERVER_URL` | Kubernetes API server URL | Yes |
174+
| `KUBERNETES_AUTH_TOKEN` | Authentication token for Kubernetes API | Yes |
175+
| `HF_TOKEN` | HuggingFace token for model downloads | Yes |
176+
177+
```bash
178+
kubectl create secret generic kubernetes-credentials \
179+
--from-literal=KUBERNETES_SERVER_URL="https://api.your-cluster.com:6443" \
180+
--from-literal=KUBERNETES_AUTH_TOKEN="your-k8s-token" \
181+
--from-literal=HF_TOKEN="your-huggingface-token" \
182+
-n <your-namespace>
183+
```
184+
185+
- You have configured a pipeline server in OpenShift AI, as described in [Configuring a pipeline server in the Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#configuring-a-pipeline-server_ai-pipelines).
186+
187+
### Procedure
188+
189+
1. Import the pipeline, as described in [Importing a pipeline](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#importing-a-pipeline_ai-pipelines).
190+
191+
2. Run the pipeline in OpenShift AI, as described in [Executing a pipeline run](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#executing-a-pipeline-run_ai-pipelines).
192+
193+
## Troubleshoot
194+
195+
Use the following information to help troubleshoot problems that you might encounter when you run the example pipeline:
196+
197+
**PVC not created or insufficient storage**
198+
199+
- **Solution:** Verify that the storage class `nfs-csi` exists and that it supports the `ReadWriteMany` access mode.
200+
201+
- **Alternative:** Modify `PVC_STORAGE_CLASS` in the `pipeline.py` file.
202+
203+
**Inference timeouts during knowledge generation**
204+
205+
- **Solution:** Increase the `inference_timeout` parameter (default: `2500s`) for the Knowledge Generation component.
206+
207+
- **Alternative:** Reduce the value of the `max_concurrency` parameter to lower the API load.
208+
209+
**Out of memory during training**
210+
211+
- **Solution:** Increase the value of the `training_resource_memory_per_worker` parameter for the Model Training component.
212+
213+
- **Alternative:** Reduce the value of the `training_effective_batch_size` parameter for the Model Training component.
214+
215+
**Cut size validation warnings**
216+
217+
- **Solution:** Reduce the value of the `cut_size` parameter for the Knowledge Mixing component.
218+
219+
- **Details:** Pipeline validates that sufficient summaries exist per raw document
220+
221+
**Missing HuggingFace models**
222+
223+
- **Solution:** Verify that the `HF_TOKEN` is correct in the `kubernetes-credentials` secret
224+
225+
- **Alternative:** Use a publicly-accessible model.
226+
227+
## Customize the pipeline
228+
229+
You can customize the pipeline by changing the values of the parameters and environment variables listed in the following tables.
230+
231+
Here are some optimization tips:
232+
233+
- **Caching:** The Docling model download is cached. You can reuse artifacts across runs.
234+
- **Concurrency:** Adjust `max_concurrency` based on inference server capacity.
235+
- **Subsample:** Use `seed_data_subsample` for testing with smaller datasets.
236+
- **Cut Sizes:** Start with smaller cut sizes (1,5) before using larger values (10+).
237+
- **Reasoning:** Disable `enable_reasoning` for faster generation with simpler outputs.
238+
239+
### Document Processing Parameters
240+
241+
| Parameter | Type | Default | Description |
242+
|-----------|------|---------|-------------|
243+
| `chunk_max_tokens` | int | 512 | Maximum tokens per document chunk |
244+
| `chunk_overlap_tokens` | int | 50 | Overlapping tokens between consecutive chunks |
245+
| `web_urls` | str | "None" | List of web urls separated by , |
246+
| `domain` | str | "None" | Domain context for the documents |
247+
| `domain_outline` | str | "None" | Outline or structure of the domain |
248+
| `icl_document` | str | "None" | In-context learning example document |
249+
| `icl_query1` | str | "None" | In-context learning example query 1 |
250+
| `icl_query2` | str | "None" | In-context learning example query 2 |
251+
| `icl_query3` | str | "None" | In-context learning example query 3 |
252+
253+
### Knowledge Generation Parameters
254+
255+
| Parameter | Type | Default | Description |
256+
|-----------|------|---------|-------------|
257+
| `model_name` | str | "openai/gpt-oss-20b" | Teacher model for synthetic data generation |
258+
| `api_key` | str | (JWT token) | API key/token for model inference |
259+
| `api_base` | str | (OpenShift URL) | Base URL for the inference API endpoint |
260+
| `seed_data_subsample` | int | 0 | Number of documents to subsample (0 = all) |
261+
| `enable_reasoning` | bool | True | Enable reasoning/thinking in generated responses |
262+
| `number_of_summaries` | int | 1 | Number of summary variations per document |
263+
| `max_concurrency` | int | 5 | Maximum concurrent API requests |
264+
| `inference_timeout` | int | 2500 | API request timeout in seconds |
265+
266+
### Knowledge Mixing Parameters
267+
268+
| Parameter | Type | Default | Description |
269+
|-----------|------|---------|-------------|
270+
| `tokenizer_model_name` | str | "Qwen/Qwen2.5-1.5B-Instruct" | Tokenizer model for token counting |
271+
| `cut_size` | str | "1,5,10" | Comma-separated cut sizes (summaries per raw doc) |
272+
| `qa_per_doc` | int | 3 | Maximum Q&A pairs per document/summary |
273+
| `save_gpt_oss_format` | bool | False | Apply GPT-OSS specific filtering |
274+
275+
### Model Training Parameters
276+
277+
| Parameter | Type | Default | Description |
278+
|-----------|------|---------|-------------|
279+
| `student_model_name` | str | "Qwen/Qwen2.5-1.5B-Instruct" | Base model to fine-tune |
280+
| `training_resource_gpu_per_worker` | int | 8 | Number of GPUs per training worker |
281+
| `training_num_epochs` | int | 1 | Number of training epochs |
282+
| `training_effective_batch_size` | int | 32 | Effective batch size for training |
283+
| `training_resource_memory_per_worker` | str | "40Gi" | Memory allocation per worker |
284+
285+
### Environment Variables
286+
287+
| Variable | Set Automatically by this component| Purpose |
288+
|----------|--------|---------|
289+
| `LITELLM_REQUEST_TIMEOUT` | Knowledge Generation | API request timeout configuration |
290+
| `HF_HOME` | Knowledge Mixing | HuggingFace cache directory |
291+
| `DOCLING_CACHE_DIR` | Document Processing | Docling model cache location |

examples/knowledge-tuning/kfp/components/__init__.py renamed to examples/knowledge-tuning/Kubeflow_Pipeline/components/__init__.py

File renamed without changes.

examples/knowledge-tuning/kfp/components/document_processing.py renamed to examples/knowledge-tuning/Kubeflow_Pipeline/components/document_processing.py

File renamed without changes.

examples/knowledge-tuning/kfp/components/download_docling_models.py renamed to examples/knowledge-tuning/Kubeflow_Pipeline/components/download_docling_models.py

File renamed without changes.

examples/knowledge-tuning/kfp/components/knowledge_generation.py renamed to examples/knowledge-tuning/Kubeflow_Pipeline/components/knowledge_generation.py

File renamed without changes.

examples/knowledge-tuning/kfp/components/knowledge_mixing.py renamed to examples/knowledge-tuning/Kubeflow_Pipeline/components/knowledge_mixing.py

File renamed without changes.
File renamed without changes.

examples/knowledge-tuning/kfp/pyproject.toml renamed to examples/knowledge-tuning/Kubeflow_Pipeline/pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
[project]
22
name = "kfp-knowledge-tuning"
33
version = "0.1.0"
4-
description = "Kubeflow pipeline for the instruct lab knowledge tuning workflow"
4+
description = "Kubeflow pipeline for the Knowledge Tuning workflow"
55
readme = "README.md"
66
requires-python = ">=3.11"
77
dependencies = [

examples/knowledge-tuning/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,8 @@ In this example workflow, you complete the following modules sequentially in you
1616
5. Model Training — Fine-tune a model by using the training mixes.
1717
6. Evaluation — Run the trained model and generated datasets against held-out test data.
1818

19+
Optionally, the Kubeflow Pipeline module automates the Data Processing, Knowledge Generation, Knowledge Mixing, and Model Training stages.
20+
1921
*Figure 1. End-to-end workflow overview*
2022

2123
![End-to-end workflow overview diagram](../../assets/usecase/knowledge-tuning/Overall%20Flow.png)

0 commit comments

Comments
 (0)