Sometimes it would be useful to specify how parameters are changed inside a Learner / PipeOp, e.g. as in mlr-org/mlr3pipelines#24. A typical example is the mtry parameter of a random forest, which should range from 1 to task$ncol. It would be nice if one could introduce an mtry.pexp parameter ranging from 0 to 1, so that the actual mtry is set to round(task$ncol ^ mtry.pexp).
The $trafo function, as it currently stands, is not a good fit for this, because it (1) operates before the Learner even sees the Task, so it would not know about task$ncol, and (2) could not introduce a new parameter mtry.pexp; it could only re-scale the existing mtry, which is an integer between 1 and Inf, not a real number between 0 and 1.
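To make the intended mapping concrete, here is a small base R sketch of the arithmetic only. The helper name `mtry_from_pexp` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: map an exponent in [0, 1] to an integer mtry in [1, ncol].
mtry_from_pexp = function(ncol, pexp) {
  stopifnot(ncol >= 1, pexp >= 0, pexp <= 1)
  as.integer(round(ncol ^ pexp))
}

mtry_from_pexp(100, 0)    # 1: a single feature per split
mtry_from_pexp(100, 0.5)  # 10: the square root of the number of columns
mtry_from_pexp(100, 1)    # 100: all columns
```

The point is that the result depends on `ncol`, which is only known once the task is available.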
I think the following UI would be quite nice:
lrn = mlr_learners$get("classif.ranger")
ps = lrn$param_set$clone()
ps$subset(setdiff(ps$ids(), "mtry"))
ps$add(ParamDbl$new("mtry.pexp", 0, 1))
ps$trafo = function(x, env, param_set) {
  x$mtry = round(env$task$ncol ^ x$mtry.pexp)
  x$mtry.pexp = NULL
  x
}
lrn$param_set$add_interface(ps) # !!
# set effective `mtry` to `round(ncol(task) ^ 0.7)` when training happens
lrn$param_set$values$mtry.pexp = 0.7
lrn$param_set$values$mtry = 3 # ERROR
This would change lrn$param_set to "look and feel" like the ps constructed and modified above, while internally the Learner (or e.g. a PipeOp) would receive the parameter values produced by the $trafo function.
A way to implement this would be the following:
- Add a private$.learnerside = NULL slot that points to the ParamSet that the Learner / PipeOp should see.
- Add a $has_interface active binding:
has_interface = function() !is.null(private$.learnerside)
- Add a self$learnerside(last = TRUE) function that gives the ParamSet that the Learner / PipeOp should see. Because private$.learnerside could point to a ParamSet that itself has a private$.learnerside set, it should be recursive if last is TRUE, and only give the "next" learnerside if last is FALSE.
learnerside = function(last = TRUE) {
  if (!self$has_interface)
    return(self)
  if (last) {
    private$.learnerside$learnerside(last = TRUE)
  } else {
    private$.learnerside
  }
}
- Implement a private$copy_param_set() helper function. It copies all relevant items from its argument to the ParamSet itself, to turn self into an effective copy of that argument:
copy_param_set = function(param_set) {
  private$.params = param_set$params
  private$.deps = param_set$deps
  private$.values = param_set$values
  private$.trafo = param_set$trafo
  invisible(self)
}
- Implement the public $add_interface() function:
add_interface = function(param_set) {
  private$.learnerside = self$clone(deep = TRUE)
  private$copy_param_set(param_set)
}
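- Implement a public $remove_interface() function:
remove_interface = function(all = FALSE) {
  if (!self$has_interface)
    stop("no interface to remove")
  replace_with = self$learnerside(last = all)
  # private fields of another object are not accessible via `$`,
  # so fetch the next learnerside through the public API
  next_side = if (replace_with$has_interface) replace_with$learnerside(last = FALSE) else NULL
  private$copy_param_set(replace_with)
  private$.learnerside = next_side
  invisible(self)
}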
- How does the Learner / PipeOp get its value out of this? There probably should be a $get_values() function that gets the values for the operation, which should also have the filter functionality that ids currently has.
get_values = function(class = NULL, tags = NULL, learnerside = FALSE, env) {
  if (learnerside && self$has_interface) {
    private$.learnerside$values = self$trafo(self$values, env)
    return(private$.learnerside$get_values(
      class = class, tags = tags, learnerside = learnerside, env = env
    ))
  }
  values = self$values
  values[intersect(names(values), self$ids(class = class, tags = tags))]
}
- Change the trafo active binding to also accept functions of the form function(x, env).
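The recursion in learnerside(last = ...) from the steps above can be tried out in isolation. The following is a toy sketch in base R that uses plain environments instead of R6, so has_interface is an ordinary function here rather than an active binding. The constructor `new_set` and the layer names are made up for illustration.

```r
# Toy stand-in for a ParamSet that only models the learnerside chain.
new_set = function(name, learnerside = NULL) {
  self = new.env()
  self$name = name
  self$.learnerside = learnerside
  self$has_interface = function() !is.null(self$.learnerside)
  self$learnerside = function(last = TRUE) {
    if (!self$has_interface())
      return(self)
    if (last) {
      # follow the chain until a set without an interface is reached
      self$.learnerside$learnerside(last = TRUE)
    } else {
      # only step one layer down
      self$.learnerside
    }
  }
  self
}

# Two interfaces stacked in front of the set the learner actually sees.
learner_set = new_set("learner")
middle = new_set("interface 1", learnerside = learner_set)
top = new_set("interface 2", learnerside = middle)

top$learnerside(last = TRUE)$name   # "learner"
top$learnerside(last = FALSE)$name  # "interface 1"
```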
This implementation has the advantage that multiple interfaces can be "stacked" on top of each other: a user who gets a Learner does not need to know or care whether something put an interface in front of its ParamSet. When the user sets a parameter using param_set$values$param = x, the value gets checked against the constraints of the interface parameter set. When the user calls lrn$train(), the train() function calls get_values(tags = "train", learnerside = TRUE, env = list(task = task)), which recurses through the different interfaces that were added and sets $values in each one of them after transforming. This automatically checks that the trafo function returns a feasible value for the original ParamSet.
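As a rough illustration of this flow, the sketch below models each layer as a plain list with a check function and an optional trafo, and walks from the outermost interface to the learner's own set. It is not the proposed ParamSet code; `resolve_values`, `check`, and the layer objects are hypothetical names, and the task is faked as a list with an ncol entry.

```r
# Outer layer: what the user sees (mtry.pexp in [0, 1]).
interface = list(
  check = function(x) stopifnot(x$mtry.pexp >= 0, x$mtry.pexp <= 1),
  trafo = function(x, env) {
    x$mtry = round(env$task$ncol ^ x$mtry.pexp)
    x$mtry.pexp = NULL
    x
  }
)

# Inner layer: what the learner sees (integer-valued mtry >= 1), no trafo.
learner = list(
  check = function(x) stopifnot(x$mtry >= 1, x$mtry == round(x$mtry)),
  trafo = NULL
)

# Check values against each layer, then transform them for the next one.
resolve_values = function(layers, values, env) {
  for (layer in layers) {
    layer$check(values)
    if (!is.null(layer$trafo))
      values = layer$trafo(values, env)
  }
  values
}

vals = resolve_values(
  list(interface, learner),
  values = list(mtry.pexp = 0.5),
  env = list(task = list(ncol = 100))
)
vals$mtry  # 10
```

Because every layer checks its input, an infeasible trafo result is caught when the values are resolved, which matches the behaviour described above.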
This change would also be completely transparent to everything ParamSet is doing so far.
Things that I am not sure about:
- It is a bit inelegant to have the env parameter depend on what kind of object the ParamSet belongs to: Some PipeOps (e.g. PipeOpModelAvg) have parameters in a different context, where no task is present (and instead maybe a prediction). One would probably want to agree on an interface (always task in a Learner / preprocessing PipeOp, always prediction in a "post-processing" PipeOp, other contexts..?)
- There are no checks on the feasibility of the trafo function output until the actual training / predicting happens.
- Maybe one still wants to use the "train" / "predict" tags from the outside, e.g. maybe a tuning algorithm wants to train a model with one set of "train" parameters and then evaluate it with different "predict" parameters to get multiple performance datapoints with only a single train() call for efficiency. In that case it would be nice if the trafo could also respect the "train" / "predict" tags and work when only a subset of parameter values is present. The get_values function would then need to be adapted to only give self$values[intersect(names(self$values), self$ids(tags = tags))] to self$trafo.
- I don't know if it would be useful to do this for ParamSetCollection. Maybe a GraphLearner would want to have an interface as well? I wouldn't know what the UI for that would look like, however. In that case it would probably be easiest to intervene with the individual PipeOps' ParamSet.
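The tag-aware filtering raised in the list above could look roughly like the following base R sketch. The parameter ids, their tags, and the helpers `ids_with_tag` and `values_for` are invented for illustration; `ids_with_tag` only stands in for a tag-filtered ids() call.

```r
# Hypothetical parameter ids and tags, chosen only for illustration.
tags_by_id = list(
  alpha = "train",
  beta  = "predict",
  gamma = c("train", "predict")
)
values = list(alpha = 0.5, beta = 3, gamma = TRUE)

# Stand-in for a tag-filtered ids() call: ids that carry the given tag.
ids_with_tag = function(tag) {
  names(Filter(function(tags) tag %in% tags, tags_by_id))
}

# Only hand the tag-matching subset of values to the trafo.
values_for = function(tag) {
  values[intersect(names(values), ids_with_tag(tag))]
}

# A trafo that tolerates missing entries, since it may see only a subset.
trafo = function(x, env) {
  if (!is.null(x$alpha))
    x$alpha = round(env$task$ncol ^ x$alpha)
  x
}

env = list(task = list(ncol = 100))
train_values = trafo(values_for("train"), env)
predict_values = trafo(values_for("predict"), env)

names(train_values)    # "alpha" "gamma"
names(predict_values)  # "beta" "gamma"
```

The cost of this design is that every trafo has to guard against parameters that were filtered out before it ran.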