[BUG] Data leakage in _LifelinesAdapter when y has multiple columns #947

@MurtuzaShaikh26

Describe the bug:
A data leakage bug exists in _LifelinesAdapter. When y (the target) has multiple columns, the adapter concatenates all of y into the training DataFrame but designates only the first column as duration_col. lifelines treats any DataFrame column that is not explicitly marked as duration_col or event_col as a covariate (feature). As a result, the model uses the secondary target columns of y as predictors for the primary duration, which artificially inflates its apparent performance.

To Reproduce:
Install dependencies:

!pip install skpro lifelines

Run the following script:

import pandas as pd
from skpro.survival.adapters.lifelines import _LifelinesAdapter
from unittest.mock import MagicMock

class MockAdapter(_LifelinesAdapter):
    def _get_lifelines_class(self):
        return MagicMock()
    def get_params(self, deep=True):
        return {}

X = pd.DataFrame({"feature1": [1.0, 2.0, 3.0]})
y = pd.DataFrame({
    "duration": [10, 20, 30],
    "leakage_col": [10, 20, 30] # This ground truth should NOT be a feature
})
C = pd.DataFrame({"event": [0] * 3})

adapter = MockAdapter()
mock_est = MagicMock()
adapter._init_lifelines_object = MagicMock(return_value=mock_est)
adapter._fit(X, y, C)

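# Inspect the keyword arguments the adapter passed to the lifelines fit call
_, kwargs = mock_est.fit.call_args
df_passed = kwargs.get("df")
duration_col = kwargs.get("duration_col")
event_col = kwargs.get("event_col")
# Every column that is neither the duration nor the event column acts as a covariate
covariates = [c for c in df_passed.columns if c != duration_col and c != event_col]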

print(f"Duration column: {duration_col}")
print(f"Columns passed as covariates: {covariates}")

if "leakage_col" in covariates:
    print("\nBUG CONFIRMED: 'leakage_col' from y is being used as a feature!")

Expected behavior:
Only columns from X should be used as covariates. Any extra columns in y should be excluded from the DataFrame passed to the lifelines fit method.
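
The expected behavior can be phrased as a check suitable for a regression test. The sketch below is only an illustration: `implied_covariates` and `check_no_target_leakage` are hypothetical helpers, not part of skpro or lifelines, and the check ignores other reserved lifelines columns such as weights or strata.

```python
import pandas as pd


def implied_covariates(df, duration_col, event_col=None):
    """Columns that are neither the duration nor the event column."""
    reserved = {duration_col, event_col}
    return [c for c in df.columns if c not in reserved]


def check_no_target_leakage(df, X, duration_col, event_col=None):
    """Raise AssertionError if any implied covariate does not come from X."""
    extra = [
        c for c in implied_covariates(df, duration_col, event_col)
        if c not in X.columns
    ]
    assert not extra, f"non-feature columns passed as covariates: {extra}"


X = pd.DataFrame({"feature1": [1.0, 2.0, 3.0]})

# Frame shaped like the one in the report: the second y column is still present
leaky = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0],
    "duration": [10, 20, 30],
    "leakage_col": [10, 20, 30],
    "event": [1, 1, 1],
})

print(implied_covariates(leaky, "duration", "event"))
```

Calling `check_no_target_leakage(leaky, X, "duration", "event")` raises for this frame and passes once `leakage_col` is dropped.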

Output:

[Screenshot of the script output]

Environment:
OS: Windows
Python: 3.14.2
skpro: 2.11.0
lifelines: 0.30.3

Additional context:
The issue is in the _fit method, where X and y are concatenated. lifelines treats every column in the resulting DataFrame as a covariate unless it is explicitly named as duration_col or event_col.

Proposed Fix:
The _fit method should ensure that the DataFrame passed to lifelines contains only the features from X, the duration column, and (if applicable) the event column. This can be achieved by subsetting the concatenated DataFrame before calling lifelines_est.fit. If confirmed, I can open a PR to fix this issue.
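
A minimal sketch of the subsetting idea, using pandas only. It is not the actual skpro code: `build_lifelines_frame` and the event column name `"__C"` are hypothetical, and it assumes C marks censoring (1 = censored) while lifelines expects an event indicator (1 = event observed).

```python
import pandas as pd


def build_lifelines_frame(X, y, C=None, event_col="__C"):
    """Assemble the frame for a lifelines fit without leaking extra y columns.

    Hypothetical helper: keeps all columns of X, only the first column of y
    (the duration), and an optional event indicator derived from C.
    """
    duration_col = y.columns[0]
    parts = [X, y[[duration_col]]]  # secondary target columns are dropped here

    if C is not None:
        # Assumption: C uses 1 for censored, so flip it into an event flag
        event = 1 - C.iloc[:, [0]]
        event.columns = [event_col]
        parts.append(event)
    else:
        event_col = None

    return pd.concat(parts, axis=1), duration_col, event_col


X = pd.DataFrame({"feature1": [1.0, 2.0, 3.0]})
y = pd.DataFrame({"duration": [10, 20, 30], "leakage_col": [10, 20, 30]})
C = pd.DataFrame({"event": [0, 0, 0]})

df, duration_col, event_col = build_lifelines_frame(X, y, C)
print(list(df.columns))  # ['feature1', 'duration', '__C']
```

Selecting the duration column before concatenation, rather than dropping columns afterwards, avoids name clashes between X and the extra y columns.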
