components/data_processing/autorag/test_data_loader/README.md
Lines changed: 4 additions & 3 deletions
@@ -4,17 +4,18 @@
 
 ## Overview 🧾
 
-Download test data json file from S3 into a KFP artifact.
+Download test data JSON from S3 and sample it for benchmarking.
 
-The component reads S3-compatible credentials from environment variables (injected by the pipeline from a Kubernetes secret) and downloads a JSON test data file from the provided bucket and path to the output artifact.
+The component reads S3-compatible credentials from environment variables (injected by the pipeline from a Kubernetes secret), downloads a JSON test data file, and randomly samples up to ``benchmark_sample_size`` records to limit evaluation cost in downstream components.
 
 ## Inputs 📥
 
 | Parameter | Type | Default | Description |
 | --------- | ---- | ------- | ----------- |
 |`test_data_bucket_name`|`str`|`None`| S3 (or compatible) bucket that contains the test data file. |
 |`test_data_path`|`str`|`None`| S3 object key to the JSON test data file. |
-|`test_data`|`dsl.Output[dsl.Artifact]`|`None`| Output artifact that receives the downloaded file. |
+|`benchmark_sample_size`|`int`|`25`| Maximum number of records to keep from the test data. When the dataset exceeds this limit, a reproducible random sample is drawn (seed 42). Set to 0 to disable sampling and keep all records. |
+|`test_data`|`dsl.Output[dsl.Artifact]`|`None`| Output artifact that receives the (possibly sampled) file. |
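
The sampling behavior described for `benchmark_sample_size` could look roughly like the sketch below. This is an illustration, not the component's actual code: the helper name `sample_records` is hypothetical, the test data is assumed to be a JSON list of records, and the use of `random.Random` and the pass-through for negative values are assumptions. Only the default of 25, the seed of 42, and the meaning of 0 come from the README text.

```python
import random


def sample_records(records, benchmark_sample_size=25, seed=42):
    """Return at most `benchmark_sample_size` records, reproducibly.

    Hypothetical helper: the real component may be structured differently.
    """
    # 0 disables sampling (negative values are treated the same way here,
    # which is an assumption). Datasets at or below the limit pass through.
    if benchmark_sample_size <= 0 or len(records) <= benchmark_sample_size:
        return list(records)

    # A dedicated RNG with a fixed seed makes the sample reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    return rng.sample(records, benchmark_sample_size)


if __name__ == "__main__":
    data = [{"id": i} for i in range(100)]
    print(len(sample_records(data)))     # sampled down to the default limit
    print(len(sample_records(data, 0)))  # sampling disabled, all records kept
```

With a fixed seed, repeated runs over the same input select the same records, which keeps downstream evaluation results comparable between pipeline runs.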