
feat: add mirai parallelization #1314


Open · wants to merge 23 commits into main

Conversation

@be-marc (Member) commented May 22, 2025

No description provided.

@be-marc requested a review from Copilot on May 22, 2025 10:56
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR adds support for running experiments in parallel using the mirai package and exposes a new "mirai" option in the learner encapsulation workflow.

  • Introduces a mirai branch in future_map to dispatch work via mirai_map/collect_mirai.
  • Updates Learner$encapsulate() to accept "mirai" as a method choice.
  • Adds mirai to Suggests in DESCRIPTION and configures a GitHub remote; updates .Rbuildignore.
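
A minimal usage sketch of the intended workflow (hedged: this assumes the dispatch behavior described above; the task, learner, and resampling are just illustrative mlr3 objects):

library(mlr3)
library(mirai)

# user-started daemons; with connections present, future_map()
# dispatches via mirai_map() instead of the future backend
daemons(2)

rr = resample(tsk("penguins"), lrn("classif.rpart"), rsmp("cv", folds = 3))

daemons(0)  # shut the daemons down again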

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File descriptions:

  • R/helper_exec.R: Added mirai branch in future_map with mirai_map/collect_mirai.
  • R/Learner.R: Extended encapsulate() to accept "mirai" method.
  • DESCRIPTION: Added mirai to Suggests and configured a Remotes entry.
  • .Rbuildignore: Added ^attic$ to ignore list.
Comments suppressed due to low confidence (5)

R/helper_exec.R:26

  • There's no test for the new mirai branch in future_map. Add a unit test that simulates an available mirai environment and verifies that mirai_map/collect_mirai is invoked.
} else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {

R/Learner.R:556

  • The roxygen block for encapsulate() wasn't updated to mention the new mirai method. Include a description of what mirai does and any additional arguments it requires.
#' @return `self` (invisibly).

R/helper_exec.R:28

  • The code calls workhorse inside mirai_map, but FUN is the user-supplied function parameter. Replace workhorse with FUN (or clarify where workhorse is defined) to ensure the intended function is executed.
mirai::collect_mirai(mirai::mirai_map(data.table(...), workhorse, .args = c(MoreArgs, list(is_sequential = FALSE))))

R/Learner.R:557

  • You’ve added "mirai" as a valid choice, but there is no corresponding branch handling method == "mirai" in encapsulate(). Implement the execution logic for the new method or remove it until ready.
assert_choice(method, c("none", "try", "evaluate", "callr", "mirai"))

DESCRIPTION:78

  • The Remotes entry points to mlr-org/mlr3misc@mirai, which appears unrelated to the mirai package. Update this to reference the actual mirai repository (or remove if mirai is on CRAN).
    mlr-org/mlr3misc@mirai

R/helper_exec.R Outdated
@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
if (getOption("mlr3.debug", FALSE)) {
lg$info("Running experiments sequentially in debug mode with %i iterations", n)
mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
} else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
Collaborator commented:

So this will automatically prefer mirai over future? Maybe there should be an option for this.

@be-marc (Member, Author) replied May 22, 2025:

Only if the user has also started daemons with mirai::daemons().
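
The check itself appears in the diff below. A hedged sketch of the suggested opt-out (the option name mlr3.prefer_mirai is hypothetical and not part of this PR):

# use mirai only when the package is installed, daemons are connected,
# and the user has not opted out via a (hypothetical) option
use_mirai = isTRUE(getOption("mlr3.prefer_mirai", TRUE)) &&
  requireNamespace("mirai", quietly = TRUE) &&
  mirai::status()$connections > 0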

@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
if (getOption("mlr3.debug", FALSE)) {
lg$info("Running experiments sequentially in debug mode with %i iterations", n)
mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
} else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
lg$debug("Running resample() via mirai with %i iterations", n)
mirai::collect_mirai(mirai::mirai_map(data.table(...), FUN, .args = c(MoreArgs, list(is_sequential = FALSE))))
Collaborator commented:

Why does it call data.table(...)? And does it have any chance of getting into conflict with any of the special args of data.table() (e.g. key, keep.rownames)?

@be-marc (Member, Author) replied May 22, 2025:

mirai::mirai_map() only accepts a data.frame or matrix when you want to map over multiple inputs, and data.table handles list columns better. ... contains the iteration and learner when called by resample(), and additionally the task and resampling when called by benchmark().

> And does it have any chance of getting into conflict with any of the special args

I think not; future_map is only called internally.
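
A standalone sketch of this mapping pattern (illustrative column names, not the PR's actual grid):

library(mirai)
library(data.table)

daemons(2)

# mirai_map() maps over the rows of a data.frame or matrix, passing the
# columns as named arguments to the function; data.table is used because
# list columns can carry arbitrary R objects per row
grid = data.table(iteration = 1:3, payload = list("a", "b", "c"))
res = collect_mirai(mirai_map(grid, function(iteration, payload) {
  paste(iteration, payload)
}))

daemons(0)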

R/helper_exec.R Outdated
@@ -23,6 +23,9 @@ future_map = function(n, FUN, ..., MoreArgs = list()) {
if (getOption("mlr3.debug", FALSE)) {
lg$info("Running experiments sequentially in debug mode with %i iterations", n)
mapply(FUN, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE, USE.NAMES = FALSE)
} else if (requireNamespace("mirai", quietly = TRUE) && mirai::status()$connections) {
Member commented:

This only works with the default compute profile.

@be-marc (Member, Author) replied:

Yes, I didn't want to make it too complicated. What would be the best way to pass the compute profile? We also have to pass it to mirai_map().

Member replied:

I suggest we expose a .compute argument to benchmark() and resample().

@be-marc (Member, Author) replied:

I'm not so happy with that because we are introducing parameters for a special backend. Moreover, this argument would then have to be added to all tuning and feature selection functions.
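
For reference, a hypothetical sketch of the compute-profile alternative (the profile name "mlr3" and the forwarding into mirai_map() are assumptions from this thread, not code in this PR):

library(mirai)

# daemons can be started under a named compute profile
daemons(4, .compute = "mlr3")

# future_map() would then have to forward the same profile, e.g.:
# mirai_map(grid, FUN, .args = MoreArgs, .compute = "mlr3")

daemons(0, .compute = "mlr3")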

@sebffischer (Member) commented May 22, 2025

Most importantly: we need to ensure that using mirai_map is reproducible and gets proper RNG streams.

One way to achieve this is to use L'Ecuyer-CMRG and generate seeds using:

library(parallel)  # nextRNGStream() is in the parallel package

RNGkind("L'Ecuyer-CMRG")
set.seed(1)
s <- .Random.seed
for (i in 1:10) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}

This is taken from the parallel documentation.
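
A sketch of how this could be combined with mirai_map (an assumption about how the PR might wire it up, not code from this PR): pre-compute one stream per task and restore it on the worker before evaluating.

library(parallel)
library(mirai)

daemons(2)

RNGkind("L'Ecuyer-CMRG")
set.seed(1)
seeds = list(.Random.seed)
for (i in 2:4) seeds[[i]] = nextRNGStream(seeds[[i - 1]])

res = collect_mirai(mirai_map(
  data.frame(i = 1:4),
  function(i, seeds) {
    # restore the pre-computed stream for this task; the seed vector
    # encodes the RNG kind, so no RNGkind() call is needed on the worker
    assign(".Random.seed", seeds[[i]], envir = globalenv())
    rnorm(1)
  },
  .args = list(seeds = seeds)
))

daemons(0)

Because each stream is attached to a task index rather than to a daemon, the results do not depend on which daemon picks up which task.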

@be-marc (Member, Author) commented May 22, 2025

> we need to ensure that using mirai_map is reproducible and gets proper RNG streams.

Yes, for now I pointed out in the documentation that it does not deliver the same results as future.

> One way to achieve this is to use L'Ecuyer-CMRG and generate seeds using

Yes, I did the same in rush. Not sure why Charlie is doing it differently.

@sebffischer (Member) commented May 23, 2025

> Yes, I did the same in rush. Not sure why Charlie is doing it differently.

Actually, Charlie is also doing it. But every mirai worker instantiates the L'Ecuyer seed once at startup, so set.seed() then has no effect on the parallel workers.

But this causes e.g. the following behavior:

library(mirai)

set.seed(1)
daemons(2)
#> [1] 2
x1 = mirai_map(1, function(i) rnorm(1))[][[1]]
set.seed(1)
x2 = mirai_map(1, function(i) rnorm(1))[][[1]]
x1 == x2
#> [1] FALSE

Created on 2025-05-23 with reprex v2.1.1

I think it would be more user-friendly for us to also set the seeds again before starting the parallel work.

@sebffischer (Member) commented May 23, 2025

Another idea for optimization:

If we had more control over the daemons, e.g. via a special compute profile .mlr3, we could also globally load some packages on the workers (e.g. mlr3 itself).
Not sure whether this results in a noticeable runtime improvement, but we could explore it.

We could offer both options:
a) The user uses an existing compute profile.
b) We spawn the daemons ourselves, which is where we can add such optimizations.
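
A sketch of option b) with pre-loaded packages (mirai::everywhere() evaluates an expression on every daemon of a compute profile):

library(mirai)

daemons(4)
everywhere(library(mlr3))  # load mlr3 once per worker up front

# ... resample()/benchmark() calls here run on the warm workers ...

daemons(0)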

@be-marc (Member, Author) commented May 23, 2025

> Actually, Charlie is also doing it. But every mirai worker instantiates the L'Ecuyer seed once at startup, so set.seed() then has no effect on the parallel workers.

Does this mean that the substreams are not attached to the tasks but to the daemons? It somehow sounds like that in https://mirai.r-lib.org/reference/daemons.html#arg-seed. Otherwise, the order in which tasks are sent would have no influence.

@sebffischer (Member) commented May 23, 2025

> Does this mean that the substreams are not attached to the tasks but to the daemons? It somehow sounds like that in https://mirai.r-lib.org/reference/daemons.html#arg-seed. Otherwise, the order in which tasks are sent would have no influence.

Yes, I think so.
