We present an implementation of the Model-X knockoff framework using flexible, nonparametric machine learning models, including gradient boosted machines and neural networks. The goal is to perform valid feature selection with false discovery rate (FDR) control in high-dimensional settings by testing whether each predictor is conditionally independent of the response given the remaining predictors.
Letting $X = (X_1, \dots, X_p)$ denote the vector of predictors and $Y$ the response, the knockoff procedure first characterizes the conditional distribution of each feature given the remaining features, and then generates a knockoff copy $\tilde{X} = (\tilde{X}_1, \dots, \tilde{X}_p)$ satisfying the exchangeability property

$$(X, \tilde{X})_{\mathrm{swap}(S)} \stackrel{d}{=} (X, \tilde{X}) \quad \text{for any } S \subseteq \{1, \dots, p\},$$

where $\tilde{X}$ denotes the knockoff (generated) data, and $\mathrm{swap}(S)$ exchanges $X_j$ with $\tilde{X}_j$ for every $j \in S$.
As $\tilde{X}$ is constructed irrespective of $Y$ (that is, $\tilde{X} \perp Y \mid X$), the predictor-response relationships between the knockoffs and $Y$ serve as a negative control for those between the original features and $Y$. For each feature we compute an antisymmetric statistic

$$W_j = Z_j - \tilde{Z}_j,$$

where $Z_j$ and $\tilde{Z}_j$ are importance scores for $X_j$ and $\tilde{X}_j$ obtained from a model fit on the augmented design $[X, \tilde{X}]$.
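A minimal sketch of this step, assuming scikit-learn's `GradientBoostingRegressor` as the fitted model and its impurity-based `feature_importances_` as the importance scores; the helper name `knockoff_statistics` and the difference form of $W_j$ are illustrative choices rather than the only valid ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def knockoff_statistics(X, X_tilde, y, random_state=0):
    """Compute W_j = Z_j - Z_tilde_j from a gradient boosted model
    fit on the augmented design [X, X_tilde]."""
    n, p = X.shape
    model = GradientBoostingRegressor(random_state=random_state)
    model.fit(np.hstack([X, X_tilde]), y)
    # Impurity-based importances serve as the per-column scores Z.
    importances = model.feature_importances_
    Z, Z_tilde = importances[:p], importances[p:]
    # Large positive W_j suggests X_j carries signal beyond its knockoff.
    return Z - Z_tilde
```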
If we assume the null hypothesis that $X_j$ is conditionally independent of the response given the remaining features, $X_j \perp Y \mid X_{-j}$, then swapping $X_j$ with $\tilde{X}_j$ leaves the joint distribution of the data unchanged. Given this information, the distribution of $W_j$ under the null is symmetric about zero, and we select the features $\{j : W_j \ge \tau\}$ with the data-dependent threshold

$$\tau = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\} \vee 1} \le q \right\},$$

where $q$ is the target FDR level.
In the above calculation, we take advantage of the fact that null statistics are symmetric about 0, while statistics for truly relevant features tend to be large and positive. The number of null features crossing a threshold $t$ can therefore be estimated by the number of features with $W_j \le -t$.
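The selection step might be implemented as follows; this is a sketch in which `knockoff_select` is a hypothetical helper, `q` is the target FDR level, and `offset=1` corresponds to the more conservative knockoff+ threshold:

```python
import numpy as np

def knockoff_select(W, q=0.1, offset=1):
    """Return indices of features whose statistic exceeds the knockoff threshold."""
    W = np.asarray(W)
    # Candidate thresholds are the nonzero magnitudes of the observed statistics.
    for t in np.sort(np.unique(np.abs(W[W != 0]))):
        # Estimated FDP: (negatives + offset) over positives at threshold t.
        fdp = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)  # nothing passes at the requested level
```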
The latter procedure relies on our ability to accurately model the joint distribution of $X$. To do this, we apply the sequential conditional independent pairs (SCIP) procedure outlined in Candes et al., 2018.
(1) Sampling of the first variable: for $j = 1$, sample

$$\tilde{X}_1 \sim \mathcal{L}(X_1 \mid X_{-1}),$$

where $X_{-1} = (X_2, \dots, X_p)$ denotes all features other than $X_1$, and where the conditional law $\mathcal{L}(\cdot \mid \cdot)$ is estimated from the observed data with the flexible models described above.
(2) Sequential sampling of remaining variables: for $j = 2, \dots, p$, sample

$$\tilde{X}_j \sim \mathcal{L}(X_j \mid X_{-j}, \tilde{X}_{1:j-1}),$$

where $X_{-j}$ denotes all original features other than $X_j$ and $\tilde{X}_{1:j-1}$ denotes the knockoffs generated in the preceding steps.
We now have a Model-X knockoff $\tilde{X}$ satisfying the exchangeability property above, ready for use in the selection procedure; a simplified sampling sketch follows.
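The sketch below approximates each conditional distribution by a Gaussian whose mean comes from a fitted gradient boosted regression and whose scale is estimated from the residuals; the function name `scip_knockoffs` and the Gaussian approximation are illustrative assumptions, and any flexible conditional density estimator could be substituted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def scip_knockoffs(X, random_state=0):
    """Sequential conditional independent pairs, approximating each
    conditional law with regression-plus-Gaussian-noise (illustrative only)."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    X_tilde = np.empty((n, p))
    for j in range(p):
        # Condition on all original features except X_j, plus the
        # knockoffs X_tilde_1, ..., X_tilde_{j-1} generated so far.
        conditioning = np.hstack([np.delete(X, j, axis=1), X_tilde[:, :j]])
        model = GradientBoostingRegressor(random_state=random_state)
        model.fit(conditioning, X[:, j])
        mu = model.predict(conditioning)
        sigma = np.std(X[:, j] - mu)          # residual scale estimate
        X_tilde[:, j] = mu + sigma * rng.standard_normal(n)
    return X_tilde
```

Chaining the sketches `scip_knockoffs`, `knockoff_statistics`, and `knockoff_select` gives an end-to-end pipeline: generate $\tilde{X}$, compute $W$, and select features at the desired FDR level.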