Handling dead workers in future.apply #732

felixschweigkofler · 2024-07-04T09:14:48Z

I want to parallelize a model-fitting process across many dozens of cores/workers, but the fitting code reliably crashes the R session at random. I could not locate the cause of the crashes: it is not my own code, and the crashes are not systematic, meaning a participant that crashes in one attempt is modeled perfectly fine in the next. I therefore want to work around the issue: let a worker die, resurrect it, and then run the same participant's data on the same worker node again. However, as of now there seems to be no option to do so, because future.apply::future_lapply() throws an error and the entire function stops when it cannot connect to the worker.

Or so it seems. I have noticed that after such an error the function appears to have stopped: my R console is free again and I can enter commands and work normally. Yet the logs of my fitting function continue to be written, with decreasing frequency, until at some point that stops as well. It seems that future_lapply() continues to run in the background while ignoring killed workers, until all workers are killed. I did not call future::plan(sequential) during that time.

Does this behaviour mean that I could use the parallelly package to create a cluster of workers, use this cluster with future_lapply(), detect with tryCatch() when the function throws an error (which might or might not happen while future_lapply() runs in the background), and use parallelly::isNodeAlive() and parallelly::cloneNode() to remove the dead worker and add a living worker to the cluster dynamically while future_lapply() is using it?

Catching the error of future_lapply() and restarting the entire cluster with future::plan(multisession) is not a good option, because with 100 cores, one of the workers would die frequently and restarting the entire cluster every time would massively slow down the procedure.

My current code looks something like this:

future::plan(future::multisession, workers = 4)
future.apply::future_lapply(unique(data$participant), function(x) myfitfunc(dplyr::filter(data, participant == x), additional_args = 'many'))

Thanks for any help

HenrikBengtsson · 2024-07-04T09:31:00Z
Maintainer

I only have a few minutes, but: future.callr::callr works similarly to multisession, except that a fresh R process is spun up for each future. That means that, regardless of whether the future finished successfully or not, no R process is left behind. That takes you the first step. You still have to handle the FutureError errors thrown when a future fails because its R process crashed. Such errors are considered so severe that they are not handled by higher-level APIs, e.g. future.apply, furrr, and doFuture. Instead, you need to roll your own, e.g.

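library(future)

f <- future({
  ## the expression to evaluate goes here; 42 is a placeholder
  42
})
## Return a FutureError (e.g. from a crashed worker) as a value instead of signalling it
v <- tryCatch(value(f), FutureError = identity)
if (inherits(v, "FutureError")) {
  ## handle the failure here, e.g. log it or relaunch the future
  message("Future failed: ", conditionMessage(v))
}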

FWIW, the multisession backend uses parallelly::isNodeAlive() to give a more informative error message when it fails to communicate with the worker process. I added parallelly::cloneNode() to get one step closer to handling failed workers and possibly recreating them. However, recreating crashed workers and relaunching futures is potentially very risky, because it can keep bringing down processes and the operating system.
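The per-future pattern above extends to a whole list of inputs. The following is a minimal sketch, assuming the future package is installed. `collect_values()` is a hypothetical helper name (not part of future or future.apply), and `square()` is a stand-in for the real fitting function. No plan is set in the sketch, so it runs under whatever plan is active.

```r
library(future)

## Hypothetical helper (not part of future or future.apply):
## launch one future per element of X and collect the values,
## keeping a FutureError object in place of the value when a worker crashed.
collect_values <- function(X, FUN) {
  fs <- lapply(X, function(x) future(FUN(x)))
  lapply(fs, function(f) tryCatch(value(f), FutureError = identity))
}

## Stand-in for the real per-participant fitting function
square <- function(x) x^2

## Runs under the active plan, e.g. after plan(future.callr::callr)
res <- collect_values(1:3, square)

## TRUE for each element whose future failed with a FutureError
failed <- vapply(res, inherits, "FutureError", FUN.VALUE = NA)
```

Elements flagged in `failed` can then be inspected or resubmitted, at the cost of writing the bookkeeping that future.apply would otherwise do.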

3 replies

felixschweigkofler Jul 5, 2024
Author

Hi Henrik, I am trying to implement this now, but I am stuck and can't find much advice online.

In my example I calculate the mean of each column in a for loop. First question: does callr parallelize the iterations of the for loop, or am I misunderstanding this? Would I need to use lapply() or future.apply::future_lapply(), for example?

Column 2 simulates a crashing process by quitting the session with q(), which I think is the closest I can get to a crashing session. If I use stop() instead, it works fine and just returns NULL for the mean of column 2, but with q() I get the following error message when trying to get the value of f (or when printing v):

<UnexpectedFutureResultError: Unexpected result (of class ‘NULL’ != ‘FutureResult’) retrieved for CallrFuture future (label = ‘’, expression = ‘exampleFunc(df)’): >

df <- data.frame(
  V1 = rnorm(100),
  V2 = rnorm(100),
  V3 = rnorm(100))

exampleFunc <- function(df) {
  results <- list()
  for (i in seq_along(df)) {
    tryCatch({
      if (i == 2) q() #  stop()
      results[[i]] <- mean(df[[i]])}, error = function(e) {
      results[[i]] <- 'aborted'
    })
  }
  return(results)
}

future::plan(future.callr::callr)
f <- future::future(exampleFunc(df))
v <- tryCatch(future::value(f), FutureError = identity)
if (inherits(v, "FutureError")) print(v)
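A side note on the stop() case in this example: the result for column 2 is NULL rather than 'aborted' because the assignment inside the error handler only modifies a copy of `results` that is local to the handler function. A minimal sketch of one way to keep the marker is to assign the value returned by tryCatch() instead. `column_means()` and its `fail_at` parameter are hypothetical names used only for illustration.

```r
column_means <- function(df, fail_at = 2) {
  results <- list()
  for (i in seq_along(df)) {
    ## tryCatch() returns the handler's value on error, so assign its result
    results[[i]] <- tryCatch({
      if (i == fail_at) stop("simulated failure")
      mean(df[[i]])
    }, error = function(e) "aborted")
  }
  results
}

df <- data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9)
res <- column_means(df)
```

This only helps with regular R errors. A call to q() or a genuine crash terminates the worker process, which no tryCatch() inside the worker can intercept.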

HenrikBengtsson Jul 7, 2024
Maintainer

Yes, you should expect to get a FutureError for each future launched on a crashed parallel worker.

Here's one way to relaunch parallel workers:

library(future)

cl <- parallelly::makeClusterPSOCK(4)
plan(cluster, workers = cl)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))

fs <- list()

fs[[1]] <- future(42)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))

fs[[2]] <- future(tools::pskill(Sys.getpid()))  ## emulates a crashed worker
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))

fs[[3]] <- future(3.14)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))


vs <- lapply(fs, function(f) tryCatch(value(f), FutureError = identity))
str(vs)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))

## Relaunch crashed workers?
## (1) A crashed worker results in futures failing with an FutureError
failed <- vapply(vs, inherits, "FutureError", FUN.VALUE = NA)
if (any(failed)) {
  ## (2) Scan for cluster nodes no longer alive?
  crashed <- !parallelly::isNodeAlive(cl)
  if (any(crashed)) {
    ## Restart crashed workers
    cl[crashed] <- parallelly::cloneNode(cl[crashed])
    plan(cluster, workers = cl)
    message(sprintf("Relaunched %d crashed workers", sum(crashed)))
  }
}

message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))

But, you're on your own if you use this approach; it's not an official approach (we don't have an official approach to restart crashed workers).

felixschweigkofler Aug 19, 2024
Author

In the end this was too complicated for the project and I stuck with a simpler approach. Thanks nevertheless :)

HenrikBengtsson · 2025-05-16T17:55:32Z
Maintainer

UPDATE: As of future 1.40.0 (2025-04-10), there's no longer a need to orchestrate this manually. Futureverse detects when cluster and multisession workers have crashed and automatically relaunches them in the background. For example, if we launch three futures where one causes the parallel worker to terminate abruptly:

library(future)

cl <- parallelly::makeClusterPSOCK(4)
plan(cluster, workers = cl)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))
#> Number of free workers: 4/4

fs <- list()

fs$A <- future(42)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))
#> Number of free workers: 3/4

fs$B <- future(tools::pskill(Sys.getpid()))  ## emulates a crashed worker
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))
#> Number of free workers: 2/4

fs$C <- future(3.14)
message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))
#> Number of free workers: 1/4

First of all, we get a FutureInterruptError for the future that did not resolve because its worker crashed:

vs <- lapply(fs, function(f) tryCatch(value(f), FutureError = identity))
str(vs)
#> List of 3
#>  $ A: num 42
#>  $ B:List of 2
#>   ..$ message: chr "Future (NULL) of class ClusterFuture interrupted, while running on 'localhost' (pid 249671)"
#>   ..$ call   : NULL
#>   ..- attr(*, "class")= chr [1:5] "FutureInterruptError" "FutureError" "error" "FutureCondition" ...
#>   ..- attr(*, "uuid")= chr [1:2] "861bcd4389679da99fc75f8e73d1face" "5"
#>   ..- attr(*, "future")=Classes 'ClusterFuture', 'MultiprocessFuture', 'Future' <environment: 0x5cc117467260> 
#>  $ C: num 3.14

Note how the value of future B is a FutureInterruptError, which we caught with tryCatch(). What's even more useful is that the crashed worker was automatically restarted, so we still have access to four (4) workers after the incident:

message(sprintf("Number of free workers: %d/%d", nbrOfFreeWorkers(), nbrOfWorkers()))
#> Number of free workers: 4/4

Moreover, if we use:

vs <- value(fs)

then value() will look for any type of errors, e.g. a regular stop() or a severe error such as FutureInterruptError. As soon as one is detected, it will, if the backend supports it, cancel any running futures, which frees up those workers sooner.
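With crashed workers relaunched automatically, retrying only the failed elements no longer needs any cluster bookkeeping. The following is a minimal sketch under that assumption, not an official API: `future_lapply_retry()` and its `max_retries` parameter are hypothetical names. The retry cap is there because, as noted earlier in the thread, relaunching work that keeps crashing can be risky.

```r
library(future)

## Hypothetical helper: evaluate FUN(x) for each element of X, resubmitting
## elements whose future failed with a FutureError, at most `max_retries` times.
future_lapply_retry <- function(X, FUN, max_retries = 1L) {
  res <- vector("list", length(X))
  todo <- seq_along(X)
  for (attempt in seq_len(max_retries + 1L)) {
    fs <- lapply(X[todo], function(x) future(FUN(x)))
    vs <- lapply(fs, function(f) tryCatch(value(f), FutureError = identity))
    res[todo] <- vs
    ## Keep only the indices that still failed for the next attempt
    failed <- vapply(vs, inherits, "FutureError", FUN.VALUE = NA)
    todo <- todo[failed]
    if (length(todo) == 0L) break
  }
  res
}

## Stand-in task; replace with the real per-participant fitting function
res <- future_lapply_retry(1:3, function(x) x + 1)
```

Elements that still fail after the last attempt remain in the result as FutureError objects, so the caller can tell them apart from successful fits with inherits().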

0 replies
