`multisession` resets number of threads for OpenMP and OpenBLAS reported by `RhpcBLASctl` #779

fspecque · 2025-04-17T16:54:38Z

fspecque
Apr 17, 2025

Hi!
This doesn't seem to be a bug; I just wanted to report the behavior. Here is a reproducible example (with BLAS, but the same happens with OpenMP):

library(future)
RhpcBLASctl::blas_get_num_procs()
# 8
RhpcBLASctl::blas_set_num_threads(5L)
RhpcBLASctl::blas_get_num_procs()
# 5

plan(multicore, workers = 3)
n %<-% { RhpcBLASctl::blas_get_num_procs() }
n
# 5 -> the value is preserved with multicore

plan(multisession, workers = 3)
n %<-% { RhpcBLASctl::blas_get_num_procs() }
n
# 8 -> default value

plan(multisession, workers = 3)
n %<-% { RhpcBLASctl::blas_set_num_threads(5L) ; RhpcBLASctl::blas_get_num_procs() }
n
# 5

So unless I am missing something, the only way to preserve the value of RhpcBLASctl::blas_get_num_procs() in a multisession future is to set it inside the future, whereas it is preserved automatically inside a multicore future. (As a side note, the callr backend behaves like multisession.)
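The workaround does not need a hard-coded value: the setting can be read in the main session and passed into the future as an ordinary global. A minimal sketch, assuming RhpcBLASctl is installed in the library the workers use; `n_threads` is just an illustrative name:

```r
library(future)

plan(multisession, workers = 3)

# Read the current setting in the main session. `n_threads` (illustrative
# name) is exported to the worker like any other global variable.
n_threads <- RhpcBLASctl::blas_get_num_procs()

n %<-% {
  # Re-apply the main session's setting inside the worker
  RhpcBLASctl::blas_set_num_threads(n_threads)
  RhpcBLASctl::blas_get_num_procs()
}
n

plan(sequential)  # shut down the background workers
```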

Why is there such a difference between the back-ends? Would it be possible to harmonize the behaviors?

I think it would be preferable to preserve the number of threads in both cases, because of nested futures:

library(future)
library(future.apply)
RhpcBLASctl::blas_set_num_threads(5L)

foo <- function()     # just to demonstrate
  unique(future_sapply(1:10, function(x) RhpcBLASctl::blas_get_num_procs()))

plan(list(tweak(multisession, workers = 2L), tweak(multisession, workers = I(3L))))
n %<-% {
  RhpcBLASctl::blas_set_num_threads(5L) 
  foo()
  }
n
# 8

Now, let's imagine that foo() is a function from a package parallelized with future.apply, with each iteration calling a function that is multi-threaded through BLAS. A Windows or RStudio user (for whom multicore is not available) could not properly manage the number of threads when calling this function inside a future, for example to run foo() in the background. They would have to replace the nested multisession with a sequential plan to avoid an excessive number of threads.
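As things stand, one option open to a package author is to expose the thread count as an argument and set it inside each iteration. This is only a sketch of that idea: the `threads` argument is hypothetical, and the value reported depends on the BLAS library R is linked against.

```r
library(future)
library(future.apply)

# `threads` is a hypothetical argument added for illustration
foo <- function(threads = 1L) {
  unique(future_sapply(1:10, function(x) {
    # Set the thread count in whichever worker runs this iteration
    RhpcBLASctl::blas_set_num_threads(threads)
    RhpcBLASctl::blas_get_num_procs()
  }))
}

plan(list(tweak(multisession, workers = 2L), tweak(multisession, workers = I(3L))))
n %<-% foo(threads = 2L)
n

plan(sequential)  # shut down the background workers
```

The cost is that every package has to invent its own argument for this, which is why a setting at the level of plan() would be more convenient.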

I've seen this old comment, and I think it would be nice to have as a developer.

Sorry for the verbosity, I hope it was clear. Any help, feedback or comment would be appreciated.

Thanks!
Best

sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 21

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] future.apply_1.11.3 future_1.40.0      

loaded via a namespace (and not attached):
[1] compiler_4.4.3      parallelly_1.43.0   tools_4.4.3        
[4] parallel_4.4.3      listenv_0.9.1       codetools_0.2-20   
[7] digest_0.6.37       globals_0.16.3      RhpcBLASctl_0.23-42

HenrikBengtsson · 2025-04-17T17:23:42Z

HenrikBengtsson
Apr 17, 2025
Maintainer

Thanks for this. This is an important topic. It's a discussion that is still open and applies to all parallelization frameworks - not just futureverse.

> Why is there such a difference between the back-ends? Would it be possible to harmonize the behaviors?

The multicore backend uses forked parallelization, which means the worker processes inherit everything from their parent, including settings for multi-threading. Forking is done by the operating system, i.e. futureverse is not involved here. The other backend that inherits these settings from the main R session is obviously sequential, which runs in the main R session.

All other backends run parallel workers in standalone R processes running in the background, either on the local machine or on external machines. Those workers are spun up inheriting only so much from the parent process, because they may run on hosts completely unrelated to the machine where the main R session runs. When we know we are running on the same machine ("the local host"), we can carry forward some more settings. Specifically, localhost parallel workers are configured to use the same R package library (.libPaths()) as the main R session. That is something we cannot assume will work if we run on another machine. So, in summary, it's not that the multi-threading settings made in the main R session are reset on workers. Instead, nothing is set on the workers, which therefore fall back to the default settings that the operating system gives R.
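The "nothing is set on the workers" point can be seen with base R alone. The sketch below uses a PSOCK cluster from the parallel package as a stand-in for a localhost worker, on the assumption that it behaves like a multisession worker in this respect. The option and environment variable names are made up for the demonstration.

```r
library(parallel)

# Two settings made in the main session only (both names are made up):
options(demo.threads = 5L)        # in-process R option
Sys.setenv(DEMO_THREADS = "5")    # environment variable of this R process

cl <- makePSOCKcluster(1L)        # one fresh background R process
worker_option <- clusterEvalQ(cl, getOption("demo.threads"))[[1]]
worker_envvar <- clusterEvalQ(cl, Sys.getenv("DEMO_THREADS"))[[1]]
stopCluster(cl)

is.null(worker_option)  # the in-process setting never reached the worker
worker_envvar           # the environment variable was inherited at launch
```

A call such as RhpcBLASctl::blas_set_num_threads() changes in-process state, so it falls into the first category. Whether a given BLAS or OpenMP build honours an inherited variable such as OMP_NUM_THREADS is a separate question that depends on the library.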

> I think it would be preferable to preserve the number of threads in both cases, because of nested futures: ...

For a similar reason, I don't think it is safe for a parallel worker to inherit multi-threading settings from the main R session. One might be able to argue it could be done for localhost parallelization (e.g. multisession, mirai_multisession, callr, batchtools_local, ...), but I don't know. OTOH, I would argue that the safest would be if we instead disabled multi-threading by default in parallel workers (think RhpcBLASctl::blas_set_num_threads(1L)). That would prevent the total load (processes times threads) from blowing up when there is (a risk of) nested parallelization taking place.

While writing this, other than #388, which you've found, and futureverse/parallelly#47, I realize I don't have a public issue on the idea of forcing single-threaded(*) processing by default in parallel workers. With the major internal redesign in future 1.40.0, the proposal to support something like:

plan(multisession, workers = 3, threads_per_worker = 4)

should not be that much work anymore, and maybe the default could be threads_per_worker = 1, or slightly better, threads_per_worker = availableThreadsPerCore() [https://github.com/futureverse/parallelly/issues/111, https://github.com/futureverse/parallelly/issues/47].

(*) single or dual, depending on CPU architecture.
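Since `threads_per_worker` is only a proposal, a similar effect can be approximated today by launching the workers manually and configuring each one once before handing the cluster to plan(). This is a rough sketch, assuming RhpcBLASctl is installed for the workers and that nothing resets the setting between futures; the variable name merely mirrors the proposed argument.

```r
library(future)

threads_per_worker <- 1L   # mirrors the proposed argument; not a future option

# Launch the workers ourselves so they can be configured once, up front
cl <- parallelly::makeClusterPSOCK(3L)
invisible(parallel::clusterCall(cl, function(n) {
  RhpcBLASctl::blas_set_num_threads(n)
  RhpcBLASctl::omp_set_num_threads(n)
}, threads_per_worker))

plan(cluster, workers = cl)

n %<-% RhpcBLASctl::blas_get_num_procs()
n

plan(sequential)
parallel::stopCluster(cl)
```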

I agree that we should strive for a cross-platform solution, and ideally this can be encapsulated internally by the futureverse.

I should also clarify that I've punted on this topic and design decision for quite a while. This is simply because I haven't had the bandwidth, but also due to limitations in cross-platform solutions. I'm also hoping there's some precedent for this out there in R or elsewhere, and that we just haven't stumbled upon it yet.

Please feel free to continue this discussion and add thoughts and comments here, because it is important, and it would be nice to settle on best practices around this.


fspecque · 2025-04-18T18:42:50Z

fspecque
Apr 18, 2025
Author

Hi!

Thanks a lot for your detailed response, it shed light on many things! I'll try to help as much as I can, but I'm not too savvy on the subject.

> I would argue that the safest would be if we instead disabled multi-threading by default in parallel workers

I believe you're right (but I have other suggestions at the end).

Detection of number of physical cores and logical threads

I used parallel::detectCores() in the past, switching the value of the logical parameter (TRUE or FALSE). I just realized that parallel is part of R core and that it's used in parallelly::availableCores(), which passes the logical parameter to detectCores(). I've tested detectCores() again, and according to my tests on the few machines I can access:

- On Linux: it invariably returns the number of logical threads: 8 on my Linux Mint laptop, which has an Intel i7 with 4 cores * 2 threads, and 40 on a cluster login node running CentOS 7 with 2 sockets * 10 physical cores * 2 threads (Intel Xeon).
- On Windows: I tested on Windows Server 2025 with 2 sockets * 6 cores * 2 threads (also Intel Xeon). The values returned by parallel::detectCores() are correct (12 and 24 with logical = FALSE and logical = TRUE respectively).
- On macOS: I asked two Mac owners to test the command. Both have M chips with 8 single-threaded cores, and parallel::detectCores() always returns 8, so it's not really informative.

I checked the source code of detectCores(). It actually relies on system command calls (except for Windows). I've noticed that the value of the logical argument is not taken into account on Linux OS (nor on BSD). I've stumbled across that piece of documentation but I don't know how relevant it is in this regard.

For macOS, it seems to use the canonical command.
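The comparison I ran on each machine boils down to the following base R snippet, which anyone can run to check their own platform:

```r
# Query both modes; per the source code, the Linux branch runs the same
# command either way, so the two values are expected to be identical there.
logical_cpus  <- parallel::detectCores(logical = TRUE)
physical_cpus <- parallel::detectCores(logical = FALSE)

cat("OS:             ", R.version$os, "\n")
cat("logical = TRUE: ", logical_cpus, "\n")
cat("logical = FALSE:", physical_cpus, "\n")
```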

Source code of detectCores() (R 4.5.0):

detectCores <-
    if(.Platform$OS.type == "windows") {
        function(all.tests = FALSE, logical = TRUE) {
            ## result is # cores, logical processors.
            res <- .Call(C_ncpus, FALSE)
	    res[if(logical) 2L else 1L]
        }
    } else {
        function(all.tests = FALSE, logical = TRUE) {
            ## Commoner OSes first
            ## for Linux systems, physical id is 1 for second hyperthread
            ## Irix support removed in R 4.1.0
            systems <-
                ## quoting needed for a Bourne shell
                list(linux = 'grep "^processor" /proc/cpuinfo 2>/dev/null | wc -l',
                     ## hw.physicalcpu is not documented for 10.9, but works
                     darwin = if(logical) "/usr/sbin/sysctl -n hw.logicalcpu 2>/dev/null" else "/usr/sbin/sysctl -n hw.physicalcpu 2>/dev/null",
                     solaris = if(logical) "/usr/sbin/psrinfo -v | grep 'Status of.*processor' | wc -l" else "/bin/kstat -p -m cpu_info | grep :core_id | cut -f2 | uniq | wc -l",
                     freebsd = "/sbin/sysctl -n hw.ncpu 2>/dev/null",
                     openbsd = "/sbin/sysctl -n hw.ncpuonline 2>/dev/null")
            nm <- names(systems)
            m <- pmatch(nm, R.version$os); m <- nm[!is.na(m)]
            if (length(m)) {
                cmd <- systems[[m]]
                if(!is.null(a <- tryCatch(suppressWarnings(system(cmd, TRUE)),
                                         error = function(e) NULL))) {
                    a <- gsub("^ +","", a[1])
                    if (grepl("^[1-9]", a)) return(as.integer(a))
                }
            }
            if (all.tests) {
                for (i in seq(systems))
                    for (cmd in systems[i]) { # Irix had two commands
			if(is.null(a <- tryCatch(suppressWarnings(system(cmd, TRUE)),
						 error = function(e) NULL)))
			    next
                        a <- gsub("^ +","", a[1])
                        if (grepl("^[1-9]", a)) return(as.integer(a))
                    }
            }
            NA_integer_
        }
    }

The simplest alternative on Linux would be to use lscpu as in futureverse/parallelly#47 (comment) (+ a regex like sed 's/[^:]*:[ ]*//' or sed -E 's/[^:]+:[[:blank:]]*//' to precisely extract the integer):

lscpu | grep -i ^socket        # sockets
lscpu | grep -i ^core          # cores (per socket)
lscpu | grep -i ^thread        # threads per core
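From R, the same three fields could be extracted along these lines. This is a sketch only: the function names are made up, and it assumes the English field labels "Socket(s)", "Core(s) per socket" and "Thread(s) per core". Leading whitespace is trimmed because, as far as I can tell, some lscpu versions indent these fields, in which case an anchored pattern like ^socket would not match.

```r
# Extract one integer field from lscpu-style "Label:   value" lines
lscpu_field <- function(lines, label) {
  lines <- trimws(lines)  # tolerate indented fields
  hit <- lines[startsWith(tolower(lines), tolower(label))]
  if (length(hit) == 0L) return(NA_integer_)
  as.integer(sub("^[^:]*:[[:blank:]]*", "", hit[1]))
}

# By default this calls lscpu; pass `lines` to parse captured output instead
cpu_topology_lscpu <- function(lines = system2("lscpu", stdout = TRUE)) {
  c(sockets          = lscpu_field(lines, "Socket(s)"),
    cores_per_socket = lscpu_field(lines, "Core(s) per socket"),
    threads_per_core = lscpu_field(lines, "Thread(s) per core"))
}

# Canned output, so the example runs even where lscpu is not installed
canned <- c("Architecture:        x86_64",
            "CPU(s):              8",
            "  Thread(s) per core:  2",
            "  Core(s) per socket:  4",
            "  Socket(s):           1")
topo <- cpu_topology_lscpu(canned)
topo
```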

But I don't know if lscpu is installed on all Linux machines. This should work more robustly:

grep -i 'physical id' /proc/cpuinfo | sort -u | wc -l               # sockets
grep -i 'cpu core' /proc/cpuinfo | sort -u | sed 's/[^:]*:[ ]*//'   # cores (might not be sufficient, see *)

grep -c '^processor' /proc/cpuinfo                                  # threads (total); anchored so "model name ... Processor" lines are not counted
# OR
grep -ci 'cpu core' /proc/cpuinfo

* Special case (?): multiple sockets with different numbers of cores on the same machine. Is that possible? If yes, is it possible that some sockets carry dual-threaded cores and others single-threaded cores on the same machine?
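One way to sidestep that special case is to count unique (physical id, core id) pairs instead of multiplying sockets by cores per socket. Below is a sketch in R, assuming an x86-style /proc/cpuinfo that has "processor", "physical id" and "core id" lines (other architectures may lack some of them); the function name is made up.

```r
cpu_topology_cpuinfo <- function(lines = readLines("/proc/cpuinfo")) {
  # Values of all "key : value" lines whose key matches exactly
  field <- function(key) {
    hit <- grepl(paste0("^", key, "[[:blank:]]*:"), lines)
    trimws(sub("^[^:]*:", "", lines[hit]))
  }
  socket <- field("physical id")
  core   <- field("core id")
  c(sockets = length(unique(socket)),
    cores   = length(unique(paste(socket, core))),  # unique (socket, core) pairs
    threads = length(field("processor")))
}

# Canned input: 1 socket, 2 cores, 2 threads per core (4 logical CPUs)
canned <- unlist(lapply(0:3, function(i) c(
  sprintf("processor\t: %d", i),
  "model name\t: Example CPU Processor",
  "physical id\t: 0",
  sprintf("core id\t\t: %d", i %% 2),
  "cpu cores\t: 2")))
topo <- cpu_topology_cpuinfo(canned)
topo
```

Because cores are counted as pairs, sockets with different core counts are handled without any extra logic.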

Thoughts

But I don't know to what extent this distinction between cores and logical threads (as specified by the OS) is relevant on Linux machines. I can perfectly well set up a plan with workers = 5L on my 4 cores * 2 threads machine, and it will create 5 processes, each running a future. It looks similar on Windows.

And regardless of the OS, I'm not sure how relevant it is to link the putative threads_per_worker to the hardware structure. What would be the rationale here? Efficiency? Am I missing something? Because workers seem to correspond to the number of threads reported by htop rather than cores. And I am not limited to workers = 4 (nor 8) on my Mint machine with 8 logical threads.

My feeling is that the default values could be:

On the local host:
- threads_per_worker = 1 (simplest), or
- something like max(1, floor(available_workers / workers)), with available_workers the total number of logical cores as output by detectCores(logical = TRUE) and workers the number of workers specified for the plan. detectCores() can probably be replaced by availableCores() if called outside of the future.
  In case of nested futures, the only situation that makes sense to me is that only the innermost futures are multi-threaded. But I don't know if that is safe to assume. It would result in something like this:

# assume 40 threads
# not nested
plan(
  tweak(multisession, workers = 10, threads_per_worker = 40 / 10)      # = 4 per worker -> 4 * 10 = 40
)

# nested
plan(
  tweak(multisession, workers = 10, threads_per_worker = 1),           # single-threaded -> 30 threads available
  tweak(multisession, workers = 2, threads_per_worker = 40 / (10 * 1)) # = 4 per worker
)

But I don't know 1) whether it is possible to make a future aware that it is calling another future, and 2) whether the futures of the first multisession layer use CPU resources while the nested futures are being resolved. Not necessarily, right? Some futures are non-blocking.
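To make the arithmetic concrete, here is one possible way to compute such a default. The function name is made up, and the rule for nesting is my assumption: it divides by the product of the workers at all levels, so that the total number of threads never exceeds what is available. That is stricter than the nested example above, for which it gives 2 rather than 4.

```r
# `workers` holds one entry per nesting level, outermost first
default_threads_per_worker <- function(workers,
                                       available = parallel::detectCores(logical = TRUE)) {
  if (is.na(available)) available <- 1L  # detectCores() may return NA
  max(1L, as.integer(floor(available / prod(workers))))
}

flat   <- default_threads_per_worker(10, available = 40)        # 40 / 10
nested <- default_threads_per_worker(c(10, 2), available = 40)  # 40 / (10 * 2)
over   <- default_threads_per_worker(16, available = 8)         # floored at 1
c(flat = flat, nested = nested, over = over)
```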

With HPC job scheduler back-ends:
- threads_per_worker = parallelly::availableCores(), using the method that matches the scheduler. And if a future with a local backend is nested inside, we again fall back to the previous rule (parallelly::availableCores() / workers).
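For reference, this is how the value could be inspected inside a job. It is a small sketch, assuming parallelly is installed; outside a scheduler it simply reports what the local machine offers.

```r
library(parallelly)

# Cores available to this R process; inside a job of a supported scheduler
# this should reflect the allocation rather than the node's hardware
n_cores <- availableCores()
n_cores

# Per-method breakdown, to see which setting produced that value
availableCores(which = "all")
```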

I'll stop here because this is long-winded enough and I feel like I'm starting to go off on a tangent (I hope not).
I'm looking forward to your comments!
Best
