feat: add mirai parallelization #1314
base: main
Conversation
Pull Request Overview

This PR adds support for running experiments in parallel using the `mirai` package and exposes a new `"mirai"` option in the learner encapsulation workflow.
- Introduces a `mirai` branch in `future_map` to dispatch work via `mirai_map`/`collect_mirai`.
- Updates `Learner$encapsulate()` to accept `"mirai"` as a method choice.
- Adds `mirai` to `DESCRIPTION` Suggests, configures a GitHub remote, and updates `.Rbuildignore`.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| R/helper_exec.R | Added `mirai` branch in `future_map` with `mirai_map`/`collect_mirai`. |
| R/Learner.R | Extended `encapsulate()` to accept the `"mirai"` method. |
| DESCRIPTION | Added `mirai` to Suggests and configured a Remotes entry. |
| .Rbuildignore | Added `^attic$` to the ignore list. |
Comments suppressed due to low confidence (5)
R/helper_exec.R:26

- There's no test for the new `mirai` branch in `future_map`. Add a unit test that simulates an available `mirai` environment and verifies that `mirai_map`/`collect_mirai` is invoked.

```r
} else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
```
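A minimal sketch of such a test, assuming the package's usual testthat setup (where internal helpers like `future_map()` are visible to tests). It exercises the branch end-to-end through real daemons and checks the result; a stricter version could additionally mock `mirai_map()` to assert it is actually called:

```r
test_that("future_map() dispatches to mirai when daemons are running", {
  skip_if_not_installed("mirai")
  mirai::daemons(2)
  on.exit(mirai::daemons(0), add = TRUE)

  # the mirai branch adds is_sequential = FALSE via .args, so FUN must accept it
  res = future_map(2, function(i, is_sequential) i * 10, i = 1:2)
  expect_equal(unname(unlist(res)), c(10, 20))
})
```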
R/Learner.R:556

- The roxygen block for `encapsulate()` wasn't updated to mention the new `mirai` method. Include a description of what `mirai` does and any additional arguments it requires.

```r
#' @return `self` (invisibly).
```
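For illustration, the added documentation could look something like the sketch below; the wording is only an assumption about the intended behavior and should be checked against the actual implementation:

```r
#' @param method (`character(1)`)\cr
#'   One of `"none"`, `"try"`, `"evaluate"`, `"callr"`, or `"mirai"`.
#'   With `"mirai"`, `$train()` and `$predict()` are executed in a separate
#'   R session managed by the \CRANpkg{mirai} package, similar to `"callr"`
#'   but reusing a persistent daemon instead of starting a new process per call.
#' @return `self` (invisibly).
```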
R/helper_exec.R:28

- The code calls `workhorse` inside `mirai_map`, but `FUN` is the user-supplied function parameter. Replace `workhorse` with `FUN` (or clarify where `workhorse` is defined) to ensure the intended function is executed.

```r
mirai::collect_mirai(mirai::mirai_map(data.table(...), workhorse, .args = c(MoreArgs, list(is_sequential = FALSE))))
```
R/Learner.R:557

- You've added `"mirai"` as a valid choice, but there is no corresponding branch handling `method == "mirai"` in `encapsulate()`. Implement the execution logic for the new method or remove it until ready.

```r
assert_choice(method, c("none", "try", "evaluate", "callr", "mirai"))
```
DESCRIPTION:78

- The Remotes entry points to `mlr-org/mlr3misc@mirai`, which appears unrelated to the `mirai` package. Update this to reference the actual `mirai` repository (or remove it if `mirai` is on CRAN).

```
mlr-org/mlr3misc@mirai
```
R/helper_exec.R (outdated)
```diff
@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
   if (getOption("mlr3.debug", FALSE)) {
     lg$info("Running experiments sequentially in debug mode with %i iterations", n)
     mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  } else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
```
so this will automatically prefer mirai over future? Maybe there should be an option for this
Only if the user also started a daemon with `mirai::daemons()`.
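So the mirai path is opt-in: with no daemons running, `future_map()` behaves as before. A rough usage sketch of how a user would enable it (task, learner, and resampling choices are just examples):

```r
library(mlr3)
library(mirai)

daemons(4)  # start 4 persistent workers; without this, mlr3 falls back to the future-based branch

task = tsk("penguins")
learner = lrn("classif.rpart")
rr = resample(task, learner, rsmp("cv", folds = 4))

daemons(0)  # shut the workers down again
```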
```diff
@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
   if (getOption("mlr3.debug", FALSE)) {
     lg$info("Running experiments sequentially in debug mode with %i iterations", n)
     mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  } else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
+    lg$debug("Running resample() via mirai with %i iterations", n)
+    mirai::collect_mirai(mirai::mirai_map(data.table(...), FUN, .args = c(MoreArgs, list(is_sequential = FALSE))))
```
Why does it call `data.table(...)`? And does it have any chance of getting into conflict with any of the special args of `data.table()` (e.g. `key`, `keep.rownames`)?
`mirai::mirai_map()` only accepts a `data.frame` or `matrix` when you want to map over multiple inputs, and `data.table` handles list columns better. `...` contains the iteration and learner when called by `resample()`, and additionally the task and resampling when called by `benchmark()`.

> And does it have any chance of getting into conflict with any of the special args

I think no. `future_map` is only called internally.
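For illustration, multi-input mapping over a `data.table` looks roughly like this. In `future_map()` the columns are list columns holding learners, tasks, etc., but the mechanics are the same; this sketch uses plain columns for brevity:

```r
library(mirai)
library(data.table)

daemons(2)

# each row of the data.table becomes one call of the function;
# .args supplies constant arguments shared by all calls
grid = data.table(iteration = 1:4, fold = rep(1:2, 2))

res = mirai_map(grid, function(iteration, fold, is_sequential) {
  sprintf("iteration %i, fold %i (is_sequential = %s)", iteration, fold, is_sequential)
}, .args = list(is_sequential = FALSE))[]

daemons(0)
```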
R/helper_exec.R (outdated)
```diff
@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
   if (getOption("mlr3.debug", FALSE)) {
     lg$info("Running experiments sequentially in debug mode with %i iterations", n)
     mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
+  } else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
```
This only works with the default compute profile
Yes, I didn't want to make it too complicated. What would be the best way to pass the compute profile? We also have to pass it to `mirai_map()`.
I suggest we expose a `.compute` argument to `benchmark()` and `resample()`.
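For context, mirai compute profiles are addressed by name, so the suggestion boils down to letting the user pick which profile the mapped calls target. A sketch, assuming `mirai_map()`'s `.compute` argument and using `"mlr3"` as an illustrative profile name:

```r
library(mirai)

daemons(2, .compute = "mlr3")  # a dedicated profile, separate from "default"

res = mirai_map(1:4, function(i) i^2, .compute = "mlr3")[]

daemons(0, .compute = "mlr3")  # shut down only this profile's workers
```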
I'm not so happy with that because we are introducing parameters for a special backend. Moreover, this argument would then have to be added to all tuning and feature selection functions.
Most importantly: we need to ensure that runs with mirai are reproducible. One way to achieve this is to use L'Ecuyer-CMRG and generate seeds using:

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
s <- .Random.seed
for (i in 1:10) {
  s <- parallel::nextRNGStream(s)
  # send s to worker i as .Random.seed
}
```

This is taken from the parallel documentation.
Yes, for now I pointed out in the documentation that it does not deliver the same results as …
Yes, I did the same in …
Actually, Charlie is also doing it. But every mirai worker instantiates the L'Ecuyer seed once at startup, which causes e.g. the following behavior:

```r
library(mirai)

set.seed(1)
daemons(2)
#> [1] 2
x1 = mirai_map(1, function(i) rnorm(1))[][[1]]

set.seed(1)
x2 = mirai_map(1, function(i) rnorm(1))[][[1]]

x1 == x2
#> [1] FALSE
```

Created on 2025-05-23 with reprex v2.1.1

I think it would be more user friendly for us to also set the seeds again before starting the parallel work.
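One way mlr3 could do that (a sketch of the idea only, not code from this PR): derive one L'Ecuyer substream per iteration in the main session and restore it on the worker, so results depend only on the initial `set.seed()` and not on daemon startup or on which daemon picks up which task.

```r
library(mirai)
library(parallel)

daemons(2)

RNGkind("L'Ecuyer-CMRG")
set.seed(1)

# one independent substream per iteration, derived in the main session
seeds = vector("list", 4)
s = .Random.seed
for (i in seq_along(seeds)) {
  s = nextRNGStream(s)
  seeds[[i]] = s
}

res = mirai_map(seeds, function(seed) {
  # restoring a full .Random.seed also switches the worker to L'Ecuyer-CMRG
  assign(".Random.seed", seed, envir = globalenv())
  rnorm(1)
})[]

daemons(0)
```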
Another idea for optimization: if we had more control over the daemons, e.g. via a special compute profile … We could offer both options: …
Does this mean that the substreams are not attached to the tasks but to the daemons? It somehow sounds like that in https://mirai.r-lib.org/reference/daemons.html#arg-seed. Otherwise, the order in which tasks are sent would have no influence.
First draft for the book section: https://github.com/mlr-org/mlr3book/pull/880/files#diff-e7d626b13fe29f0532326141ae93476419df13a6f5850becfb1359d4bc86a611
Yes, I think so.