Conversation

@hbaniecki
Collaborator

@hbaniecki hbaniecki commented Nov 26, 2025

Introducing CTEImputer

closes #225

Adds the CTEImputer, which follows the compress then explain (CTE) methodology. It replaces missing features of the explanation point with values sampled from the background data, which is first subsampled using a distribution compression algorithm, specifically Compress++ with Kernel Thinning. CTE has been shown to provide accurate and stable estimates of explanations while being computationally efficient. It becomes the new default imputer in TabularExplainer, removing the need to set sample_size.
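
The compress-then-impute idea in this description can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the CTEImputer implementation: `subsample_stand_in` is a hypothetical placeholder that draws a uniform random subset where the PR uses Compress++ with Kernel Thinning, and `impute_value` and the toy model are hypothetical names introduced only for this example.

```python
import numpy as np


def subsample_stand_in(data, m, seed=None):
    """Hypothetical stand-in for distribution compression.

    Draws m row indices uniformly without replacement. The PR uses
    Compress++ with Kernel Thinning here; this is NOT that algorithm.
    """
    rng = np.random.default_rng(seed)
    return rng.choice(data.shape[0], size=m, replace=False)


def impute_value(model, x, coalition, background):
    """Average model output with absent features taken from background rows."""
    # Start from one copy of x per background row.
    imputed = np.tile(x, (background.shape[0], 1))
    # Features outside the coalition are replaced by background values.
    imputed[:, ~coalition] = background[:, ~coalition]
    return float(np.mean(model(imputed)))


model = lambda X: X.sum(axis=1)  # toy model (assumption)
rng = np.random.default_rng(0)
data = rng.random((64, 3))

# Step 1: compress the background data once.
idx = subsample_stand_in(data, m=8, seed=0)
compressed = data[idx]

# Step 2: impute using every row of the compressed set.
x = data[0]
coalition = np.array([True, False, True])  # feature 1 is treated as missing
value = impute_value(model, x, coalition, compressed)
```

Because the compressed set is small, the sketch evaluates all of its rows instead of drawing a separate sample, which is the sense in which a sample_size setting becomes unnecessary.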

TODO

  • add tests
  • fix Windows CI to include some missing C++ packages

@hbaniecki hbaniecki requested a review from mmschlk November 26, 2025 13:17
@mmschlk mmschlk requested review from Copilot and removed request for mmschlk December 28, 2025 15:28
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces the CTEImputer (Compress Then Explain imputer) as a new imputation strategy for the shapiq package, following the methodology from Baniecki et al. (2025). The imputer uses distribution compression (Compress++ with Kernel Thinning) to subsample background data before imputing missing features, providing accurate and stable explanations while being computationally efficient.

Key Changes:

  • Adds new CTEImputer class implementing the compress-then-explain methodology with background data compression
  • Makes CTEImputer the new default imputer in TabularExplainer, replacing MarginalImputer
  • Adds goodpoints library as a new dependency for distribution compression functionality

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/shapiq/imputer/cte_imputer.py | New implementation of CTEImputer class with distribution compression for efficient feature imputation |
| src/shapiq/imputer/__init__.py | Exports CTEImputer to make it available in the imputer module |
| src/shapiq/explainer/tabular.py | Updates TabularExplainer to use "cte" as the default imputer instead of "marginal" |
| pyproject.toml | Adds goodpoints dependency and shapiq keyword to project metadata |
| CHANGELOG.md | Documents the addition of CTEImputer feature for the development version |

id_compressed = compress.compresspp_kt(data, kernel_type=b"gaussian", k_params=np.array([sigma**2]), g=4, seed=self.random_state)
self._replacement_data = data[id_compressed]
self.calc_empty_prediction() # reset the empty prediction to the new background data
return self

Copilot AI Dec 28, 2025

There's a trailing space at the end of this line that should be removed to maintain code cleanliness.

Suggested change
- return self
+ return self

"""
d = data.shape[1]
sigma = np.sqrt(2 * d)
id_compressed = compress.compresspp_kt(data, kernel_type=b"gaussian", k_params=np.array([sigma**2]), g=4, seed=self.random_state)

Copilot AI Dec 28, 2025

The seed parameter passed to compresspp_kt may not properly handle the case when self.random_state is None. The function expects an integer seed, but random_state can be None according to the __init__ signature. This could lead to unexpected behavior or errors when random_state=None.

Suggested change
- id_compressed = compress.compresspp_kt(data, kernel_type=b"gaussian", k_params=np.array([sigma**2]), g=4, seed=self.random_state)
+ if self.random_state is None:
+     id_compressed = compress.compresspp_kt(
+         data,
+         kernel_type=b"gaussian",
+         k_params=np.array([sigma**2]),
+         g=4,
+     )
+ else:
+     id_compressed = compress.compresspp_kt(
+         data,
+         kernel_type=b"gaussian",
+         k_params=np.array([sigma**2]),
+         g=4,
+         seed=self.random_state,
+     )
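
One way to avoid duplicating the call in both branches is to build the keyword arguments conditionally. The sketch below is an illustration only: `compress_stub` is a hypothetical stand-in for `compress.compresspp_kt` so that the example runs without the goodpoints dependency, and `compress_indices` is a hypothetical helper name. It makes no assumption about whether the real function accepts `seed=None`.

```python
import numpy as np


def compress_stub(data, *, g=4, seed=0):
    """Hypothetical stand-in for the compression call; returns row indices."""
    rng = np.random.default_rng(seed)
    m = max(1, data.shape[0] // 4)
    return np.sort(rng.choice(data.shape[0], size=m, replace=False))


def compress_indices(data, random_state=None):
    # Forward `seed` only when the caller set one, so that the callee's
    # own default applies otherwise.
    kwargs = {} if random_state is None else {"seed": random_state}
    return compress_stub(data, g=4, **kwargs)


data = np.arange(32, dtype=float).reshape(16, 2)
default_idx = compress_indices(data)                 # uses the stub's default seed
seeded_idx = compress_indices(data, random_state=7)  # forwards seed=7
```

The single call site keeps the remaining arguments in one place, so a later change to `g` or the kernel parameters cannot drift between the two branches.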

``["cte", "marginal", "baseline", "conditional"]``. Defaults to ``"cte"``, which
initializes the default
- :class:`~shapiq.games.imputer.marginal_imputer.MarginalImputer` with its default
+ :class:`~shapiq.games.imputer.marginal_imputer.CTEImputer` with its default

Copilot AI Dec 28, 2025

The documentation references the wrong module path. It should point to "cte_imputer.CTEImputer" instead of "marginal_imputer.CTEImputer", because CTEImputer is defined in the cte_imputer module, not the marginal_imputer module.

Suggested change
- :class:`~shapiq.games.imputer.marginal_imputer.CTEImputer` with its default
+ :class:`~shapiq.games.imputer.cte_imputer.CTEImputer` with its default

Comment on lines +108 to +109
imputed_data = np.tile(self.x, (n_coalitions, 1))
for i in range(sample_size):

Copilot AI Dec 28, 2025

The imputed_data array is initialized outside the loop but modified inside it, so after the first iteration it still contains the modifications from the previous iteration. The initialization should be moved inside the loop so that each iteration starts with a fresh copy of self.x tiled for all coalitions; otherwise the results will be incorrect for iterations after the first one.

Suggested change
- imputed_data = np.tile(self.x, (n_coalitions, 1))
- for i in range(sample_size):
+ for i in range(sample_size):
+     imputed_data = np.tile(self.x, (n_coalitions, 1))
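
The loop shape in this suggestion can be sketched as follows. This is an illustration under assumptions, not the PR's code: `impute_all` and the toy model are hypothetical, and the sketch assumes that each iteration fills the features outside a coalition from a single background row.

```python
import numpy as np


def impute_all(model, x, coalitions, background):
    """Return the mean prediction per coalition over all background rows."""
    n_coalitions = coalitions.shape[0]
    sample_size = background.shape[0]
    outputs = np.zeros((sample_size, n_coalitions))
    for i in range(sample_size):
        # Fresh copy of x for every coalition, rebuilt on each iteration.
        imputed_data = np.tile(x, (n_coalitions, 1))
        # Broadcast background row i to every coalition.
        replacements = np.tile(background[i], (n_coalitions, 1))
        # Overwrite only the features that are absent from each coalition.
        imputed_data[~coalitions] = replacements[~coalitions]
        outputs[i] = model(imputed_data)
    return outputs.mean(axis=0)


model = lambda X: X.sum(axis=1)  # toy model (assumption)
x = np.array([1.0, 2.0, 3.0])
background = np.array([[10.0, 20.0, 30.0], [0.0, 0.0, 0.0]])
coalitions = np.array(
    [[True, True, True], [True, False, False], [False, False, False]]
)
values = impute_all(model, x, coalitions, background)  # -> [6.0, 26.0, 30.0]
```

In this particular sketch every iteration overwrites exactly the positions selected by `~coalitions`, so reusing a single buffer would yield the same values. Rebuilding it per iteration makes the independence between iterations explicit at the cost of one extra allocation per step.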

@@ -0,0 +1,164 @@
"""Implementation of the marginal imputer."""

Copilot AI Dec 28, 2025

The module docstring incorrectly states "Implementation of the marginal imputer" when it should say "Implementation of the CTE imputer" or "Implementation of the compress then explain imputer".

normalize: bool = True,
random_state: int | None = None,
) -> None:
"""Initializes the marginal imputer.

Copilot AI Dec 28, 2025

The docstring incorrectly states "Initializes the marginal imputer" when it should say "Initializes the CTE imputer" or "Initializes the compress then explain imputer".

Examples:
>>> model = lambda x: np.sum(x, axis=1)
>>> data = np.random.rand(10, 3)
>>> imputer = MarginalImputer(model=model, data=data, x=data[0])

Copilot AI Dec 28, 2025

The example in the docstring incorrectly uses "MarginalImputer" instead of "CTEImputer". This should be updated to match the class being documented.

Development

Successfully merging this pull request may close these issues.

Include Compress Then Explain (CTE) in the Explainer / Imputer interface

2 participants