Allow to change default behavior for num.thread = NULL in ranger() #513

cderv · 2020-05-19T15:20:09Z

Hi,

Some users on our teams use ranger on our shared RStudio Server Pro cluster. Many R users are not familiar with threading and parallelization, so they rely on ranger's default behavior.
This means that they use all available hardware threads:

ranger/src/Forest.cpp

Lines 198 to 208 in d1ecade

    
    // Set number of threads
    if (num_threads == DEFAULT_NUM_THREADS) {
    #ifdef OLD_WIN_R_BUILD
      this->num_threads = 1;
    #else
      this->num_threads = std::thread::hardware_concurrency();
    #endif
    } else {
      this->num_threads = num_threads;
    }

This is not ideal on shared servers where several data scientists need to share resources.
Currently, we have some documentation warning them not to forget the num.threads argument when calling ranger().

However, it would be nice if, as an analytics admin of the service we provide to our users, I could change the default behavior so that ranger() does not use the full available capacity of the server for a single user.

I think it could be done:

either in the R package only, in the ranger() function, by reading an R option or an environment variable that would be used if num.threads is NULL (the default).
Something like this (with more checks, I guess):

    # The environment variable takes precedence, then the R option, then 0
    if (is.null(num.threads)) {
        num.threads <- as.integer(Sys.getenv("R_RANGER_NUM_THREAD", getOption("ranger.num.thread", 0L)))
    }

or on the C++ side, through an environment variable that, if set, would be used when the thread count is left at its default value of 0.
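
To make the C++-side option more concrete, here is a minimal sketch of how the default could be resolved. It is not ranger's actual code: the helper names `threads_from_env` and `resolve_num_threads` are hypothetical, the environment variable name `R_RANGER_NUM_THREAD` is simply reused from the R snippet above, and `DEFAULT_NUM_THREADS` is assumed to be 0.

```cpp
#include <cstdlib>
#include <thread>

// Assumption: 0 is the "not set" sentinel, matching the default value of 0
// mentioned in this issue.
const unsigned int DEFAULT_NUM_THREADS = 0;

// Hypothetical helper (not part of ranger): read a positive thread count from
// an environment variable. Returns 0 if the variable is unset, empty, or not a
// plain positive integer in a sane range.
unsigned int threads_from_env(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') {
    return 0;
  }
  char* end = nullptr;
  long parsed = std::strtol(value, &end, 10);
  if (end == value || *end != '\0' || parsed <= 0 || parsed > 65535) {
    return 0;
  }
  return static_cast<unsigned int>(parsed);
}

// Hypothetical helper: an explicit request always wins; otherwise fall back to
// the environment variable, and only then to all hardware threads.
unsigned int resolve_num_threads(unsigned int requested,
                                 const char* env_name = "R_RANGER_NUM_THREAD") {
  if (requested != DEFAULT_NUM_THREADS) {
    return requested;
  }
  unsigned int from_env = threads_from_env(env_name);
  if (from_env > 0) {
    return from_env;
  }
  // hardware_concurrency() may return 0 when the value cannot be determined.
  unsigned int hardware = std::thread::hardware_concurrency();
  return hardware > 0 ? hardware : 1;
}
```

With something like this, an admin could export the variable once in the server-wide environment, and a user who passes an explicit thread count would still override it.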

This type of configuration is already available in other R packages, such as:

data.table => see ?getDTthreads and the associated C file. They use a combination of data.table-specific environment variables on the C side and OpenMP's own control features.
xgboost => uses OpenMP, which allows changing the value returned by the omp_get_* functions through an environment variable.

This may be a specific use case, but it would help a lot in shared environments.

Would you consider something like that?

Thank you very much.


mnwright · 2020-05-20T08:17:37Z

Thanks, that's a very good idea. I like your first idea (R side) and will have a look at how other packages solve this.

mgoplerud · 2021-01-19T02:24:22Z

Hi! Has there been any update on this? I found this really helpful in figuring out some issues I was having when using ranger in conjunction with doParallel.

mnwright · 2021-01-20T20:56:07Z

Sorry, no update yet. A PR would be very welcome!

mnwright · 2023-12-06T20:31:00Z

Done in #713.

kcgthb mentioned this issue Mar 6, 2023

HPC node starting many threads despite setting num.threads = 1 grf-labs/grf#1266

Closed

mnwright added enhancement contributions welcome labels Sep 25, 2023

mnwright mentioned this issue Dec 6, 2023

Default to 2 threads but show a startup message #713

Merged

twest820 mentioned this issue Sep 16, 2024

apparent typo in ranger(num.threads)'s default value #743

Closed
