Description
Hello,
First of all: thanks for the great package! I have gotten a lot of good use out of it, especially the sequential feature selection.
SFS becomes problematic as the number of features d increases, since the number of candidate models that must be fit grows as O(d^2). I have found that one way to deal with this is to take a random subset of the remaining features to check at each step instead of trying all of them. If the random subset has size k, the number of model fits drops to O(dk).
Take an example of sequential forward selection with d=1000 and k=25.
During the first step, we can either try all 1000 univariate models or pick a random subset of 25 univariate models, and then take the best of them. It makes sense to try them all so as to start with a good baseline.
During the second step, instead of trying 999 bivariate models, we try only 25 of them.
In the third step, 25 instead of 998 trivariate models, and so on, until only 25 candidate features remain, at which point we revert to trying them all.
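
To make the idea concrete, here is a minimal sketch of what I have in mind. It is not based on the package's internals; the function name `randomized_sfs` and its parameters are hypothetical, and it just uses scikit-learn's `cross_val_score` for scoring each candidate subset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def randomized_sfs(estimator, X, y, n_features_to_select, k=25, cv=5, random_state=None):
    """Greedy forward selection that scores only a random subset of the
    remaining candidate features at each step (hypothetical sketch)."""
    rng = np.random.default_rng(random_state)
    selected = []
    remaining = list(range(X.shape[1]))

    while len(selected) < n_features_to_select and remaining:
        # Try every remaining feature on the first step and once few are
        # left; otherwise score only a random subset of size k.
        if not selected or len(remaining) <= k:
            candidates = remaining
        else:
            candidates = rng.choice(remaining, size=k, replace=False)

        best_feature, best_score = None, -np.inf
        for f in candidates:
            cols = selected + [int(f)]
            score = cross_val_score(estimator, X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_feature, best_score = int(f), score

        selected.append(best_feature)
        remaining.remove(best_feature)

    return selected
```

In practice this could presumably be exposed as an option on the existing sequential selector (e.g. a parameter controlling the candidate subsample size) rather than a separate function.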
If you're interested in some empirical results, I wrote a blog post about this a while back: http://blog.explainmydata.com/2012/07/speeding-up-greedy-feature-selection.html
This would be a great feature to have!