Tabular Data Loader ✨

⚠️ Stability: alpha — This asset is not yet stable and may change.

Overview 🧾

AutoML data loader component.

Loads tabular (CSV) data from S3 in batches, sampling up to 100 MB of data, then splits the sampled data into test, selection-train, and extra-train sets.

The component reads data in chunks to efficiently handle large files without loading the entire dataset into memory at once. After sampling, it performs a two-stage split:

Primary split (default 80/20): separates a test set (20%, written to the sampled_test_dataset S3 artifact) from the train portion (80%).
Secondary split (default 30/70 of the train portion): produces models_selection_train_dataset.csv (30%, used for model selection) and extra_train_dataset.csv (70%, passed to refit_full as extra data). Both are written to the PVC workspace under {workspace_path}/datasets/.
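The two-stage split above can be illustrated with a small standard-library sketch. This is not the component's implementation (the README does not show it); the function name `two_stage_split` and the use of `random.Random` are assumptions made purely to show how the default fractions carve up a dataset.

```python
import random


def two_stage_split(rows, test_size=0.2, selection_train_size=0.3, random_state=42):
    """Shuffle rows, then return (test, selection_train, extra_train).

    Hypothetical helper: mirrors the documented default fractions only.
    """
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)

    # Primary split: hold out the test fraction of all rows.
    n_test = round(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Secondary split: take the selection-train fraction of the train portion.
    n_selection = round(len(train) * selection_train_size)
    return test, train[:n_selection], train[n_selection:]


test, selection_train, extra_train = two_stage_split(range(1000))
print(len(test), len(selection_train), len(extra_train))  # prints: 200 240 560
```

With 1,000 sampled rows and the defaults, 200 rows go to the test set, 240 to model selection, and 560 to the extra-train set.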

For regression tasks the split is random; for binary and multiclass tasks the split is stratified by the label column by default.
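To make "stratified by the label column" concrete, here is a minimal standard-library sketch. The helper name `stratified_split` and the per-label rounding are assumptions for illustration; the component may well rely on a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_column, test_size=0.2, random_state=42):
    """Split rows so each label keeps roughly the same share in train and test."""
    rng = random.Random(random_state)

    # Group rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        # Shuffle within the label group, then hold out the test fraction of it.
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_size)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


# Imbalanced toy data: 10% positives, 90% negatives.
rows = [{"x": i, "target": int(i % 10 == 0)} for i in range(1000)]
train, test = stratified_split(rows, "target")
```

In this toy run the test set holds 200 rows, 20 of them positive, so the 10% positive rate of the full data is preserved. A purely random split gives no such guarantee on small or imbalanced data.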

After sampling, +/- infinity values in the frame are replaced with NaN (the same idea as AutoAI loadXy), and full-row duplicates are then dropped.

Rows with a missing label (NaN or empty in label_column) are dropped next, before splitting, so regression runs do not propagate null targets into the splits or the sample_row JSON. Stratified sampling already drops such rows per chunk; this step applies the same rule to the random and first-n-rows paths.

After this cleansing (infinity replacement, duplicate removal, and label drop), at least 100 valid records must remain; otherwise the component fails with a clear error so that downstream AutoML training does not run on a dataset too small to split reliably.
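The cleansing rules described above (infinity replacement, full-row duplicate removal, label drop, and the 100-record minimum) can be sketched on plain dictionaries. Everything here is illustrative: the function name `cleanse`, the use of `None` as a stand-in for NaN, and the error type are assumptions, not the component's code.

```python
import math

MIN_VALID_ROWS = 100  # minimum stated in this README


def cleanse(rows, label_column, min_rows=MIN_VALID_ROWS):
    """Apply the documented cleansing order to a list of row dicts."""
    seen, cleaned = set(), []
    for row in rows:
        # 1. Infinity replacement: non-finite floats become missing (None here).
        row = {
            key: None if isinstance(value, float) and not math.isfinite(value) else value
            for key, value in row.items()
        }
        # 2. Full-row duplicate removal: keep the first occurrence only.
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # 3. Label drop: skip rows whose label is missing or empty.
        if row[label_column] in (None, ""):
            continue
        cleaned.append(row)

    # 4. Refuse to continue with too few valid records.
    if len(cleaned) < min_rows:
        raise ValueError(f"Only {len(cleaned)} valid rows; at least {min_rows} required.")
    return cleaned


base = [{"x": float(i), "price": float(i)} for i in range(120)]
noisy = base + [
    dict(base[0]),                        # full-row duplicate
    {"x": 1.5, "price": float("inf")},    # infinite label becomes missing
    {"x": 2.5, "price": None},            # missing label
]
cleaned = cleanse(noisy, "price")
```

Here the three noisy rows are removed and the 120 clean rows survive. Passing only ten rows would raise the error instead.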

Authentication uses AWS-style credentials provided via environment variables (e.g. from a Kubernetes secret).

Inputs 📥

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `file_key` | `str` | `None` | S3 object key of the CSV file. |
| `bucket_name` | `str` | `None` | S3 bucket name containing the file. |
| `workspace_path` | `str` | `None` | PVC workspace directory where train CSVs will be written. |
| `label_column` | `str` | `None` | Name of the label/target column in the dataset. |
| `sampled_test_dataset` | `dsl.Output[dsl.Dataset]` | `None` | Output dataset artifact for the test split. |
| `component_status` | `dsl.Output[dsl.Artifact]` | `None` | Output artifact containing stage-level progress tracking for this component. |
| `sampling_method` | `Optional[str]` | `None` | "first_n_rows", "stratified", or "random"; if None, derived from task_type. |
| `task_type` | `str` | `regression` | "binary", "multiclass", or "regression" (default); used when sampling_method is None. |
| `split_config` | `Optional[dict]` | `None` | Split configuration dictionary. Available keys: "test_size" (float), "random_state" (int), "stratify" (bool). |
| `selection_train_size` | `float` | `0.3` | Fraction of the train portion used for model selection (default 0.3). |

Outputs 📤

| Name | Type | Description |
| --- | --- | --- |
| Output | `NamedTuple('outputs', sample_config=dict, split_config=dict, sample_row=str, models_selection_train_data_path=str, extra_train_data_path=str)` | Contains sample config, split config, a sample row, and paths to selection-train and extra-train CSVs. |

Usage Examples 🧪

"""Example pipelines demonstrating usage of automl_data_loader."""

from kfp import dsl
from kfp_components.components.data_processing.automl.tabular_data_loader import automl_data_loader


@dsl.pipeline(name="tabular-data-loader-example")
def example_pipeline(
    file_key: str = "data/train.csv",
    bucket_name: str = "my-bucket",
    workspace_path: str = "/tmp/workspace",
    label_column: str = "target",
    task_type: str = "regression",
    selection_train_size: float = 0.3,
):
    """Example pipeline using automl_data_loader.

    Args:
        file_key: S3 key of the data file.
        bucket_name: S3 bucket name.
        workspace_path: Path to the workspace directory.
        label_column: Name of the label column.
        task_type: Type of ML task.
        selection_train_size: Fraction of the train portion used for model selection.
    """
    automl_data_loader(
        file_key=file_key,
        bucket_name=bucket_name,
        workspace_path=workspace_path,
        label_column=label_column,
        task_type=task_type,
        selection_train_size=selection_train_size,
    )

Metadata 🗂️

Name: tabular_data_loader
Stability: alpha
Dependencies:
- Kubeflow:
  - Name: Pipelines, Version: >=2.15.2
Tags:
- data-processing
Last Verified: 2026-05-22 00:00:00+00:00
Owners:
- No Parent Owners: Yes
- Approvers:
  - LukaszCmielowski
  - DorotaDR
- Reviewers:
  - Mateusz-Switala
  - DorotaDR

Sampling strategies

Available values for the sampling_method parameter are:

"first_n_rows": Reads the first N rows from the file up to the component's memory limit (default 100 MB).
"stratified": Samples the dataset in a way that preserves the distribution of the label_column. Only available if label_column is specified and task type is classification.
"random": Randomly samples rows from the dataset up to the size limit.

If sampling_method is not set, it is automatically derived from task_type ("random" for regression, "stratified" for classification).
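The derivation rule above is small enough to write down. The sketch below is an assumption about how such a rule could look; `resolve_sampling_method` is a hypothetical helper, and raising `ValueError` for unsupported combinations is a choice made here, not documented component behaviour.

```python
CLASSIFICATION_TASKS = {"binary", "multiclass"}
SAMPLING_METHODS = {"first_n_rows", "stratified", "random"}


def resolve_sampling_method(sampling_method=None, task_type="regression"):
    """Return the sampling method, deriving it from task_type when unset."""
    if sampling_method is None:
        # Documented defaults: stratified for classification, random for regression.
        return "stratified" if task_type in CLASSIFICATION_TASKS else "random"
    if sampling_method not in SAMPLING_METHODS:
        raise ValueError(f"Unknown sampling_method: {sampling_method!r}")
    if sampling_method == "stratified" and task_type not in CLASSIFICATION_TASKS:
        # Assumed reaction: stratified sampling is documented as classification-only.
        raise ValueError("stratified sampling requires a classification task_type")
    return sampling_method


print(resolve_sampling_method())                               # random
print(resolve_sampling_method(task_type="multiclass"))         # stratified
print(resolve_sampling_method("first_n_rows", "binary"))       # first_n_rows
```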

Split Configuration

The split_config dictionary parameter supports:

{
    "test_size": 0.2,       # Proportion of dataset for test split (default: 0.2)
    "random_state": 42,     # Random seed for reproducibility (default: 42)
    "stratify": True        # Use stratified split for binary/multiclass (default: True)
}

Regression: stratify is ignored; the split is always random.
Binary / multiclass: If stratify is True (default), the split is stratified by label_column; if False, the split is random.
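As a rough illustration of how the defaults and the regression rule combine, the following sketch merges a partial split_config with the documented defaults. `resolve_split_config` is a hypothetical name, and forcing "stratify" to False for regression is one way to model "ignored"; the component may handle this differently.

```python
DEFAULT_SPLIT_CONFIG = {"test_size": 0.2, "random_state": 42, "stratify": True}


def resolve_split_config(split_config=None, task_type="regression"):
    """Fill in documented defaults and apply the regression rule."""
    # User-supplied keys override the defaults; missing keys fall back to them.
    resolved = {**DEFAULT_SPLIT_CONFIG, **(split_config or {})}
    if task_type == "regression":
        # Regression splits are always random, so stratify has no effect.
        resolved["stratify"] = False
    return resolved


print(resolve_split_config({"test_size": 0.25, "random_state": 123}))
# {'test_size': 0.25, 'random_state': 123, 'stratify': False}
print(resolve_split_config({"stratify": False}, task_type="binary"))
# {'test_size': 0.2, 'random_state': 42, 'stratify': False}
```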

The selection_train_size parameter (default: 0.3) controls the secondary split of the train portion:

A selection_train_size fraction of the train data (30% by default) goes to models_selection_train_dataset.csv (used for model selection).
The remainder (70% by default) goes to extra_train_dataset.csv (passed to refit_full as extra training data).

Credentials

S3 access uses environment variables (e.g. from a Kubernetes secret):

AWS_S3_ENDPOINT, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — required for S3.
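A quick pre-flight check for these variables might look like the sketch below. The helper `read_s3_settings` and the choice of `RuntimeError` are assumptions for illustration; only the three variable names come from this README.

```python
import os

REQUIRED_S3_VARIABLES = ("AWS_S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def read_s3_settings(environ=os.environ):
    """Return the required S3 settings, failing early if any is unset or empty."""
    missing = [name for name in REQUIRED_S3_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing S3 environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_S3_VARIABLES}


# Example with an explicit mapping (placeholder values) instead of the real environment.
settings = read_s3_settings({
    "AWS_S3_ENDPOINT": "https://s3.example.com",
    "AWS_ACCESS_KEY_ID": "example-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
})
```

Accepting the environment as a parameter keeps the check easy to test without touching real credentials.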

Usage Examples 💡

Basic usage (regression)

With default parameters, sampling_method is derived from task_type (e.g. regression -> random sampling). The data is sampled from S3 and split into train/test sets:

from kfp import dsl
from kfp_components.components.data_processing.automl.tabular_data_loader import automl_data_loader

@dsl.pipeline(name="automl-training-pipeline")
def my_pipeline():
    load_task = automl_data_loader(
        bucket_name="my-ml-bucket",
        file_key="data/train.csv",
        workspace_path=dsl.WORKSPACE_PATH_PLACEHOLDER,
        label_column="price",
        task_type="regression",
    )
    # load_task.outputs["models_selection_train_data_path"] - PVC path for model selection training
    # load_task.outputs["extra_train_data_path"] - PVC path for extra training data (refit_full)
    # load_task.outputs["sampled_test_dataset"] - S3 artifact for test evaluation
    # load_task.outputs["sample_row"] - JSON string with one sample row from test set

Classification with stratified split

load_task = automl_data_loader(
    bucket_name="my-ml-bucket",
    file_key="data/train.csv",
    workspace_path=dsl.WORKSPACE_PATH_PLACEHOLDER,
    label_column="target",
    task_type="binary",
    split_config={"test_size": 0.2, "stratify": True},
)

Custom split configuration

load_task = automl_data_loader(
    bucket_name="my-ml-bucket",
    file_key="data/train.csv",
    workspace_path=dsl.WORKSPACE_PATH_PLACEHOLDER,
    label_column="target",
    task_type="regression",
    split_config={"test_size": 0.25, "random_state": 123},
)

Explicit sampling method

load_task = automl_data_loader(
    bucket_name="my-ml-bucket",
    file_key="data/train.csv",
    workspace_path=dsl.WORKSPACE_PATH_PLACEHOLDER,
    label_column="target",
    sampling_method="first_n_rows",
)

Stratified sampling (classification)

load_task = automl_data_loader(
    bucket_name="my-ml-bucket",
    file_key="data/train.csv",
    workspace_path=dsl.WORKSPACE_PATH_PLACEHOLDER,
    sampling_method="stratified",
    label_column="target",
    task_type="binary",
)

Component status artifact

In the tabular training pipeline, this component writes component_status.json under the component_status output artifact. The file includes component_id (automl_data_loader), started_at, completed_at, a stages list (ids such as prepare_data, split_and_export), and optional metadata. Match stage ids to the tabular pipeline entry in component_stage_map.json from the publish-component-stage-map task.
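For readers who want to inspect the status file programmatically, here is a hedged sketch. The top-level field names come from the description above, but the per-stage key "id" and the sample values are assumptions; check an actual component_status.json before relying on them.

```python
import json


def summarize_status(status_json):
    """Extract a few fields from a component_status.json document."""
    status = json.loads(status_json)
    return {
        "component_id": status["component_id"],
        "finished": status.get("completed_at") is not None,
        # Assumption: each stage entry exposes its identifier under "id".
        "stage_ids": [stage["id"] for stage in status.get("stages", [])],
    }


# Illustrative document only; timestamps and stage layout are made up.
example = json.dumps({
    "component_id": "automl_data_loader",
    "started_at": "2026-01-01T00:00:00Z",
    "completed_at": "2026-01-01T00:05:00Z",
    "stages": [{"id": "prepare_data"}, {"id": "split_and_export"}],
    "metadata": {},
})
summary = summarize_status(example)
```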

Supported formats and limits 📋

Format: CSV only.
Size limit: Up to 100 MB of data in memory (sampled if larger).
Streaming: Data is read in batches (10k rows per chunk) to handle large files.
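The combination of chunked reading and a size cap can be sketched with the standard library alone. This is an illustration, not the component's loader: the function name `read_first_rows`, the character-count size estimate, and reading 100 MB as 100 * 1024 * 1024 are all assumptions.

```python
import csv
import io
from itertools import islice

CHUNK_ROWS = 10_000                    # rows per chunk, as stated above
SIZE_LIMIT = 100 * 1024 * 1024         # assumed interpretation of "100 MB"


def read_first_rows(text_stream, chunk_rows=CHUNK_ROWS, size_limit=SIZE_LIMIT):
    """Read CSV rows chunk by chunk until the size budget is exhausted."""
    reader = csv.DictReader(text_stream)
    rows, used = [], 0
    while True:
        # Pull at most chunk_rows rows so the whole file is never held at once.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return rows
        for row in chunk:
            # Rough size estimate: total characters across the row's cell values.
            used += sum(len(value or "") for value in row.values())
            if used > size_limit:
                return rows
            rows.append(row)


data = "x,price\n" + "".join(f"{i},{i * 2}\n" for i in range(25))
all_rows = read_first_rows(io.StringIO(data), chunk_rows=10)
capped_rows = read_first_rows(io.StringIO(data), chunk_rows=10, size_limit=10)
```

In this toy run the uncapped call returns all 25 rows across three chunks, while the 10-character budget stops after the first five rows.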

Logging 📝

The component logs at INFO level:

Which sampling method is used (including when derived from task_type).
Number of rows read and the S3 location (bucket_name, file_key).
