|
| 1 | +# Knowledge Tuning Pipeline for Kubeflow Pipelines |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This Kubeflow Pipelines (KFP) example shows how to use Kubeflow pipelines to automate the steps in the [Knowledge Tuning example](../README.md). By using pipelines, you can run long training jobs or retrain your models on a schedule without having to manually run them in a notebook. |
| 6 | + |
| 7 | +The Knowledge Tuning example uses notebooks to implement a workflow that processes documents, generates synthetic training data using multiple knowledge generation strategies, mixes the generated datasets, and fine-tunes a student model. |
| 8 | + |
| 9 | +This Kubeflow Pipelines example converts the steps into KFP (Kubeflow Pipelines) components for production use. The example is structured with independent components at the step level for easier debugging. The example provides defaults for the flows and parameters. Optionally, you can customize the parameter values and the model. |
| 10 | + |
| 11 | +### About the example pipeline workflow |
| 12 | + |
| 13 | +The pipeline follows the Knowledge Tuning example workflow as shown in the following figure: |
| 14 | + |
| 15 | +Figure 1. End-to-end workflow overview |
| 16 | + |
| 17 | + |
| 18 | + |
| 19 | +### Pipeline components |
| 20 | + |
| 21 | +The `Kubeflow_Pipline/components` subfolder has python files for each component in the Knowledge Tuning workflow: |
| 22 | + |
| 23 | +**Data Processing** |
| 24 | + |
| 25 | +Downloads Docling models (cached) and processes documents from web URLs or local files: |
| 26 | + |
| 27 | +- Converts PDF/HTML to Markdown |
| 28 | +- Chunks documents with configurable token limits |
| 29 | +- Adds domain-specific context and ICL (In-Context Learning) examples |
| 30 | + |
| 31 | +Example python files: `document_processing.py`, `download_docling_models.py` |
| 32 | + |
| 33 | +Source repository: `opendatahub-io/data-processing` |
| 34 | + |
| 35 | +Base image: `quay.io/fabianofranz/docling-ubi9:2.54.0` |
| 36 | + |
| 37 | +Packages: `torch`, `datasets`, `docling`, `tiktoken` |
| 38 | + |
| 39 | +**Knowledge Generation** |
| 40 | + |
| 41 | +Generates four types of synthetic training data in parallel: |
| 42 | + |
| 43 | +- Detailed summaries: Comprehensive summaries with Q&A pairs |
| 44 | +- Extractive summaries: Direct extracts from documents with Q&A (runs sequentially) |
| 45 | +- Key facts summary: Focuses on key facts and concepts |
| 46 | +- Document-based Q&A: Question-answer pairs based on document content |
| 47 | + |
| 48 | +Merges all datasets after generation. |
| 49 | + |
| 50 | +Example python file: `knowledge_generation.py` |
| 51 | + |
| 52 | +Source repository: `red-hat-data-services/red-hat-ai-examples` |
| 53 | + |
| 54 | +Base image: `quay.io/fabianofranz/docling-ubi9:2.54.0` |
| 55 | + |
| 56 | +Packages: `nest-asyncio`, `sdg-hub`, `datasets` |
| 57 | + |
| 58 | +**Knowledge Mixing** |
| 59 | + |
| 60 | +Processes and combines the generated datasets: |
| 61 | + |
| 62 | +- Samples Q&A pairs based on configurable cut sizes |
| 63 | +- Tokenizes content using the student model tokenizer |
| 64 | +- Validates and filters data |
| 65 | +- Creates training-ready JSONL files in chat format |
| 66 | +- Selects the optimal dataset (largest feasible cut size) |
| 67 | + |
| 68 | +Example python file: `knowledge_mixing.py` |
| 69 | + |
| 70 | +Source repository: Not applicable. This is a custom component. |
| 71 | + |
| 72 | +Base image: `quay.io/opendatahub/odh-training-th04-cpu-torch29-py312-rhel9:cpu-3.3` |
| 73 | + |
| 74 | +Packages: `polars`, `transformers`, `torch` |
| 75 | + |
| 76 | +**Model Fine-tuning** |
| 77 | + |
| 78 | +Fine-tunes a student model by using the mixed knowledge dataset: |
| 79 | + |
| 80 | +- Supervised Fine-Tuning (SFT) |
| 81 | +- Configurable GPU/memory resources |
| 82 | +- Multi-epoch training with batch size control |
| 83 | + |
| 84 | +This is a reusable prebuilt component that is managed by the Kubeflow pipelines team. |
| 85 | + |
| 86 | +Source repository: `red-hat-data-services/pipelines-components` |
| 87 | + |
| 88 | +## Prepare the example pipeline |
| 89 | + |
| 90 | +### Procedure |
| 91 | + |
| 92 | +1. Clone the example repository. |
| 93 | + |
| 94 | + a. To clone the example repository to your local environment, run the following command in a terminal window: |
| 95 | + |
| 96 | + ```bash |
| 97 | + git clone https://github.com/red-hat-data-services/red-hat-ai-examples.git |
| 98 | + ``` |
| 99 | + |
| 100 | + b. Set your working directory to the `Kubeflow_Pipeline` folder: |
| 101 | + |
| 102 | + ```bash |
| 103 | + cd red-hat-ai-examples/examples/knowledge-tuning/Kubeflow_Pipeline |
| 104 | + ``` |
| 105 | + |
| 106 | +2. Set up the Python environment. |
| 107 | + |
| 108 | + Run the following commands to install these required packages: |
| 109 | + |
| 110 | + - `kfp==2.15.2` |
| 111 | + - `kfp-kubernetes>=2.15.2` |
| 112 | + - `kfp-components @ git+https://github.com/red-hat-data-services/pipelines-components@main` |
| 113 | + |
| 114 | + ```bash |
| 115 | + python -m venv .venv |
| 116 | + source .venv/bin/activate # On Windows use: .venv\Scripts\activate |
| 117 | + pip install -e . |
| 118 | + ``` |
| 119 | + |
| 120 | +3. To use the example pipeline with its default configuration, skip to Step 4. |
| 121 | + |
| 122 | + If you want to customize the pipeline configuration, edit the `pipeline.py` file and change the following values: |
| 123 | + |
| 124 | + - Document URLs |
| 125 | + - Model endpoints and credentials |
| 126 | + - Default parameters |
| 127 | + |
| 128 | + See _Customize the pipeline configuration_ for details on the parameter values that you can change. |
| 129 | + |
| 130 | +4. Compile the pipeline: |
| 131 | + |
| 132 | + Before you can define your pipeline in the cluster, you must convert your Python-defined pipeline into YAML format. You can use the Kubeflow Pipelines Software Development Kit to compile your pipeline code into a deployable YAML file for declarative GitOps deployment: |
| 133 | + |
| 134 | + ```bash |
| 135 | + python pipeline.py |
| 136 | + ``` |
| 137 | + |
| 138 | +### Verification |
| 139 | + |
| 140 | +The result of the python `pipeline.py` command is the `knowledge_tuning_pipeline.yaml` file. |
| 141 | + |
| 142 | +## Import and run the example pipeline |
| 143 | + |
| 144 | +After you compile the pipeline, you can import the YAML file and deploy it in OpenShift AI. The imported YAML file visualizes the pipeline flow. |
| 145 | + |
| 146 | +### Prerequisites |
| 147 | + |
| 148 | +- Make sure that the `nfs-csi` storage class is available on your cluster. |
| 149 | + |
| 150 | +- For workspace storage, KFP uses the pipeline configuration to automatically create a PVC with the following configuration: |
| 151 | + |
| 152 | + | Configuration | Value | |
| 153 | + |--------------|-------| |
| 154 | + | **Size** | 80Gi | |
| 155 | + | **Storage Class** | nfs-csi | |
| 156 | + | **Access Modes** | ReadWriteMany | |
| 157 | + |
| 158 | + In the OpenShift console, select **Storage** > **StorageClasses**. Verify that `nfs-csi` is listed. |
| 159 | + |
| 160 | +- Confirm that your cluster has the following resources required by the kubeflow pipeline components: |
| 161 | + |
| 162 | + | Stage | CPU | Memory | GPU | Storage | |
| 163 | + |-------|-----|--------|-----|---------| |
| 164 | + | Document Processing | 2-4 cores | 8-16 GB | 0 | ~5 GB | |
| 165 | + | Knowledge Generation | 2-4 cores | 8-16 GB | 0 (uses API) | ~10 GB | |
| 166 | + | Knowledge Mixing | 4-8 cores | 16-32 GB | 0 | ~20 GB | |
| 167 | + | Model Training | 8-16 cores | 40+ GB | 8 | ~30 GB | |
| 168 | + |
| 169 | +- Create a Kubernetes secret named `kubernetes-credentials` with the following keys: |
| 170 | + |
| 171 | + | Secret Key | Description | Required | |
| 172 | + |------------|-------------|----------| |
| 173 | + | `KUBERNETES_SERVER_URL` | Kubernetes API server URL | Yes | |
| 174 | + | `KUBERNETES_AUTH_TOKEN` | Authentication token for Kubernetes API | Yes | |
| 175 | + | `HF_TOKEN` | HuggingFace token for model downloads | Yes | |
| 176 | + |
| 177 | + ```bash |
| 178 | + kubectl create secret generic kubernetes-credentials \ |
| 179 | + --from-literal=KUBERNETES_SERVER_URL="https://api.your-cluster.com:6443" \ |
| 180 | + --from-literal=KUBERNETES_AUTH_TOKEN="your-k8s-token" \ |
| 181 | + --from-literal=HF_TOKEN="your-huggingface-token" \ |
| 182 | + -n <your-namespace> |
| 183 | + ``` |
| 184 | + |
| 185 | +- You have configured a pipeline server in OpenShift AI, as described in [Configuring a pipeline server in the Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#configuring-a-pipeline-server_ai-pipelines). |
| 186 | + |
| 187 | +### Procedure |
| 188 | + |
| 189 | +1. Import the pipeline, as described in [Importing a pipeline](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#importing-a-pipeline_ai-pipelines). |
| 190 | + |
| 191 | +2. Run the pipeline in OpenShift AI, as described in [Executing a pipeline run](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/3.3/html-single/working_with_ai_pipelines/index#executing-a-pipeline-run_ai-pipelines). |
| 192 | + |
| 193 | +## Troubleshoot |
| 194 | + |
| 195 | +Use the following information to help troubleshoot problems that you might encounter when you run the example pipeline: |
| 196 | + |
| 197 | +**PVC not created or insufficient storage** |
| 198 | + |
| 199 | +- **Solution:** Verify that the storage class `nfs-csi` exists and that it supports the `ReadWriteMany` access mode. |
| 200 | + |
| 201 | +- **Alternative:** Modify `PVC_STORAGE_CLASS` in the `pipeline.py` file. |
| 202 | + |
| 203 | +**Inference timeouts during knowledge generation** |
| 204 | + |
| 205 | +- **Solution:** Increase the `inference_timeout` parameter (default: `2500s`) for the Knowledge Generation component. |
| 206 | + |
| 207 | +- **Alternative:** Reduce the value of the `max_concurrency` parameter to lower the API load. |
| 208 | + |
| 209 | +**Out of memory during training** |
| 210 | + |
| 211 | +- **Solution:** Increase the value of the `training_resource_memory_per_worker` parameter for the Model Training component. |
| 212 | + |
| 213 | +- **Alternative:** Reduce the value of the `training_effective_batch_size` parameter for the Model Training component. |
| 214 | + |
| 215 | +**Cut size validation warnings** |
| 216 | + |
| 217 | +- **Solution:** Reduce the value of the `cut_size` parameter for the Knowledge Mixing component. |
| 218 | + |
| 219 | +- **Details:** Pipeline validates that sufficient summaries exist per raw document |
| 220 | + |
| 221 | +**Missing HuggingFace models** |
| 222 | + |
| 223 | +- **Solution:** Verify that the `HF_TOKEN` is correct in the `kubernetes-credentials` secret |
| 224 | + |
| 225 | +- **Alternative:** Use a publicly-accessible model. |
| 226 | + |
| 227 | +## Customize the pipeline |
| 228 | + |
| 229 | +You can customize the pipeline by changing the values of the parameters and environment variables listed in the following tables. |
| 230 | + |
| 231 | +Here are some optimization tips: |
| 232 | + |
| 233 | +- **Caching:** The Docling model download is cached. You can reuse artifacts across runs. |
| 234 | +- **Concurrency:** Adjust `max_concurrency` based on inference server capacity. |
| 235 | +- **Subsample:** Use `seed_data_subsample` for testing with smaller datasets. |
| 236 | +- **Cut Sizes:** Start with smaller cut sizes (1,5) before using larger values (10+). |
| 237 | +- **Reasoning:** Disable `enable_reasoning` for faster generation with simpler outputs. |
| 238 | + |
| 239 | +### Document Processing Parameters |
| 240 | + |
| 241 | +| Parameter | Type | Default | Description | |
| 242 | +|-----------|------|---------|-------------| |
| 243 | +| `chunk_max_tokens` | int | 512 | Maximum tokens per document chunk | |
| 244 | +| `chunk_overlap_tokens` | int | 50 | Overlapping tokens between consecutive chunks | |
| 245 | +| `web_urls` | str | "None" | List of web urls separated by , | |
| 246 | +| `domain` | str | "None" | Domain context for the documents | |
| 247 | +| `domain_outline` | str | "None" | Outline or structure of the domain | |
| 248 | +| `icl_document` | str | "None" | In-context learning example document | |
| 249 | +| `icl_query1` | str | "None" | In-context learning example query 1 | |
| 250 | +| `icl_query2` | str | "None" | In-context learning example query 2 | |
| 251 | +| `icl_query3` | str | "None" | In-context learning example query 3 | |
| 252 | + |
| 253 | +### Knowledge Generation Parameters |
| 254 | + |
| 255 | +| Parameter | Type | Default | Description | |
| 256 | +|-----------|------|---------|-------------| |
| 257 | +| `model_name` | str | "openai/gpt-oss-20b" | Teacher model for synthetic data generation | |
| 258 | +| `api_key` | str | (JWT token) | API key/token for model inference | |
| 259 | +| `api_base` | str | (OpenShift URL) | Base URL for the inference API endpoint | |
| 260 | +| `seed_data_subsample` | int | 0 | Number of documents to subsample (0 = all) | |
| 261 | +| `enable_reasoning` | bool | True | Enable reasoning/thinking in generated responses | |
| 262 | +| `number_of_summaries` | int | 1 | Number of summary variations per document | |
| 263 | +| `max_concurrency` | int | 5 | Maximum concurrent API requests | |
| 264 | +| `inference_timeout` | int | 2500 | API request timeout in seconds | |
| 265 | + |
| 266 | +### Knowledge Mixing Parameters |
| 267 | + |
| 268 | +| Parameter | Type | Default | Description | |
| 269 | +|-----------|------|---------|-------------| |
| 270 | +| `tokenizer_model_name` | str | "Qwen/Qwen2.5-1.5B-Instruct" | Tokenizer model for token counting | |
| 271 | +| `cut_size` | str | "1,5,10" | Comma-separated cut sizes (summaries per raw doc) | |
| 272 | +| `qa_per_doc` | int | 3 | Maximum Q&A pairs per document/summary | |
| 273 | +| `save_gpt_oss_format` | bool | False | Apply GPT-OSS specific filtering | |
| 274 | + |
| 275 | +### Model Training Parameters |
| 276 | + |
| 277 | +| Parameter | Type | Default | Description | |
| 278 | +|-----------|------|---------|-------------| |
| 279 | +| `student_model_name` | str | "Qwen/Qwen2.5-1.5B-Instruct" | Base model to fine-tune | |
| 280 | +| `training_resource_gpu_per_worker` | int | 8 | Number of GPUs per training worker | |
| 281 | +| `training_num_epochs` | int | 1 | Number of training epochs | |
| 282 | +| `training_effective_batch_size` | int | 32 | Effective batch size for training | |
| 283 | +| `training_resource_memory_per_worker` | str | "40Gi" | Memory allocation per worker | |
| 284 | + |
| 285 | +### Environment Variables |
| 286 | + |
| 287 | +| Variable | Set Automatically by this component| Purpose | |
| 288 | +|----------|--------|---------| |
| 289 | +| `LITELLM_REQUEST_TIMEOUT` | Knowledge Generation | API request timeout configuration | |
| 290 | +| `HF_HOME` | Knowledge Mixing | HuggingFace cache directory | |
| 291 | +| `DOCLING_CACHE_DIR` | Document Processing | Docling model cache location | |
0 commit comments