Hi all,
I am intrigued by the experimental result of Aas, Jullum, and Løland (https://arxiv.org/abs/1903.10464) that TreeSHAP fails to capture covariate dependence in any meaningful way. Do you have any insight into why that might be?
I ask because the conditional dependence estimation procedures shapr implements, particularly the empirical method, seem very similar to the adaptive nearest-neighbour interpretation of random forests, e.g. the causal forests used by the grf package (Athey and Wager https://arxiv.org/abs/1902.07409). A TreeSHAP-like algorithm might be an effective way of calculating the conditional expectation for a subset of variables using the adaptive neighbourhoods already learned by an underlying random forest model.
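To make the nearest-neighbour analogy concrete, here is a minimal sketch of the grf-style forest weights: each tree contributes weight to the training points sharing the query's leaf, and a conditional expectation is the weighted average of training responses. Everything here is a hypothetical toy (leaf assignments are hard-coded rather than learned, and `y` stands in for model outputs on the training sample); it is meant only to illustrate the adaptive-neighbourhood reading, not grf's actual implementation.

```python
# Adaptive nearest-neighbour view of a random forest (toy sketch).
# Forest weight of training point i at query x:
#   w_i(x) = (1/B) * sum_b 1{i in leaf_b(x)} / |leaf_b(x)|

def forest_weights(leaf_of_train, leaf_of_query):
    """leaf_of_train: per tree, a list of leaf ids for each training point.
    leaf_of_query: per tree, the leaf id the query point falls into."""
    n = len(leaf_of_train[0])
    B = len(leaf_of_train)
    w = [0.0] * n
    for b in range(B):
        # Training points sharing the query's leaf in tree b.
        members = [i for i in range(n) if leaf_of_train[b][i] == leaf_of_query[b]]
        for i in members:
            w[i] += 1.0 / (B * len(members))
    return w

def weighted_expectation(w, y):
    # Conditional expectation approximated as the forest-weighted
    # average of training responses in the adaptive neighbourhood.
    return sum(wi * yi for wi, yi in zip(w, y))

# Toy forest of two trees over four training points.
leaf_of_train = [[0, 0, 1, 1],   # tree 1 separates {0,1} from {2,3}
                 [0, 1, 0, 1]]   # tree 2 separates {0,2} from {1,3}
leaf_of_query = [0, 1]           # query lands in leaf 0 of tree 1, leaf 1 of tree 2
y = [1.0, 2.0, 3.0, 4.0]

w = forest_weights(leaf_of_train, leaf_of_query)
print(w)                          # weights sum to 1; point 1 (in both leaves) gets the most
print(weighted_expectation(w, y))
```

The connection to shapr's empirical method is that both reduce to a data-driven weighting of training observations near the query; the forest learns those weights adaptively rather than via a fixed kernel.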
I wonder if part of the reason TreeSHAP failed the tests is that it was run on a boosted model rather than a random forest (as far as I know, boosted trees don't have a nearest-neighbour interpretation). Would this be worth investigating further?
Update:
The intuition might be that, because of the shrinking scale of successive trees in a boosted ensemble, removing some covariates tends to make the resulting expectations rather unpredictable (dominated by the high-variance, large-scale initial trees), whereas the redundancy of a bagged tree ensemble means the remaining covariates may still partition the space informatively.
If TreeSHAP on random forests does work for estimating conditional expectations, it might be a viable option to build into the package even for non-forest underlying models. The focus would then not be on estimating p(S' | S) (the distribution of the missing variables given the present ones) but on estimating the required conditional expectation directly: E(f(S', S) | S).
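For reference, the tree-traversal recursion that estimates E(f(S', S) | S) directly can be sketched as follows. At a split on an observed feature the recursion follows the data; at a split on a missing feature it averages the two children, weighted by the fraction of training points routed each way (the "cover"). The node layout, field names, and values below are hypothetical toy inputs, not the treeshap or shapr API; this is the path-dependent style of expectation, whose cover weights only capture dependence along the tree's own split structure, which is plausibly why it struggles under strong covariate dependence.

```python
# Hedged sketch: direct conditional expectation from one regression tree.

def cond_exp(node, x, S):
    """node: either {'value': leaf prediction} or
    {'feature': j, 'threshold': t, 'cover_left': p, 'left': ..., 'right': ...}.
    x: dict mapping observed feature index -> value.
    S: set of observed feature indices."""
    if 'value' in node:
        return node['value']
    j = node['feature']
    if j in S:
        # Observed feature: follow the split deterministically.
        child = node['left'] if x[j] <= node['threshold'] else node['right']
        return cond_exp(child, x, S)
    # Missing feature: marginalize over both children using training cover.
    p = node['cover_left']
    return p * cond_exp(node['left'], x, S) + (1 - p) * cond_exp(node['right'], x, S)

# Toy tree: root splits on feature 0; its left child splits on feature 1.
tree = {'feature': 0, 'threshold': 0.5, 'cover_left': 0.6,
        'left':  {'feature': 1, 'threshold': 0.0, 'cover_left': 0.5,
                  'left': {'value': 1.0}, 'right': {'value': 3.0}},
        'right': {'value': 10.0}}

# Feature 0 observed (goes left), feature 1 missing: average those leaves.
print(cond_exp(tree, {0: 0.2}, {0}))            # 0.5*1.0 + 0.5*3.0 = 2.0
# Both features observed: a single leaf is reached.
print(cond_exp(tree, {0: 0.2, 1: 0.5}, {0, 1}))  # 3.0
```

Summing this over the trees of a forest (rather than a boosted ensemble) would be the experiment proposed above.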
I would be happy to run some tests with random forests, using the R treeshap package, on the test data sets used in the paper, if you could provide them.