do not limit DTthreads to number of cores? #2914

Closed
jangorecki opened this issue May 30, 2018 · 5 comments

@jangorecki
Member

jangorecki commented May 30, 2018

For the maximum number of threads used in parallel OpenMP code we use getDTthreads, which uses omp_get_max_threads(). That makes sense.
But today I read on http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-basics.html#Aboutthreadsandcores

On some modern processors there are hardware threads, meaning that a core can actually let more than thread be executed, with some speedup over the single thread. To use such a processor efficiently you would let the number of OpenMP threads be 2x or 4x the number of cores, depending on the hardware.

I checked my processor with a dummy sleep call and, despite having only 2 cores, setting more threads actually ran more work in parallel than it "should".
The program below writes, for each loop iteration, the number of the thread that evaluated it into an integer array, then prints that array.

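#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int n=10;
  int ans[n];  // ans[i] records which thread ran iteration i
  // static schedule: iterations are split into contiguous chunks, at most one per thread
  #pragma omp parallel for schedule(static)
  for (int i=0; i<n; i++) {
    sleep(1);  // each iteration blocks for one second
    ans[i] = omp_get_thread_num();
  }
  for (int i=0; i<n; i++) {
    printf("ans[%d]=%d\n", i, ans[i]);
  }
  return 0;
}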
gcc -fopenmp threads.c -o threads

I have 2 cores, so this is the current default. It works as expected: it took 5s instead of 10s, with half of the iterations run by thread 0 and the other half by thread 1:

export OMP_NUM_THREADS=2
time ./threads
#ans[0]=0
#ans[1]=0
#ans[2]=0
#ans[3]=0
#ans[4]=0
#ans[5]=1
#ans[6]=1
#ans[7]=1
#ans[8]=1
#ans[9]=1
#
#real	0m5.005s
#user	0m0.000s
#sys 	0m0.005s

Now forcing a single thread: as expected it took 10s, and all iterations were run by thread 0:

export OMP_NUM_THREADS=1
time ./threads
#ans[0]=0
#ans[1]=0
#ans[2]=0
#ans[3]=0
#ans[4]=0
#ans[5]=0
#ans[6]=0
#ans[7]=0
#ans[8]=0
#ans[9]=0
#
#real	0m10.008s
#user	0m0.000s
#sys 	0m0.003s

And now the tricky one: more threads than cores. I can imagine why each iteration has its own thread number: there are more threads, and they are simply evaluated in turn by the 2 cores, each retaining its own number. But why is the total time only 1s, as if I had 10 cores?

export OMP_NUM_THREADS=10
time ./threads
#ans[0]=0
#ans[1]=1
#ans[2]=2
#ans[3]=3
#ans[4]=4
#ans[5]=5
#ans[6]=6
#ans[7]=7
#ans[8]=8
#ans[9]=9
#
#real	0m1.008s
#user	0m0.000s
#sys 	0m0.008s

Maybe during sleep the cores can switch to another thread? Maybe sleep is just not a good test for this case. I don't know, hence the question: what are the drawbacks of setting more threads than cores? I assume they are queued on all available cores anyway, so it doesn't look like it does any harm. Why does my 2-core CPU behave like a 10-core CPU in the case above?

My lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 92
Model name:            Intel(R) Celeron(R) CPU N3350 @ 1.10GHz
Stepping:              9
CPU MHz:               1420.726
CPU max MHz:           2400.0000
CPU min MHz:           800.0000
BogoMIPS:              2188.80
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti retpoline intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts

gcc:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
@mattdowle
Member

Maybe during sleep cores can switch to another thread?

Yes. The OS has thousands of threads doing this all the time.

The downside to setting nth > omp_get_max_threads() is that data.table algos often allocate thread-private buffers. The more threads, the more buffers sloshing through cache, although buffMB can be reduced too. It's this balancing act that needs to be tackled on a per-algo per-dataset per-hardware basis.

Apparently omp_get_max_threads() can simply be wrong sometimes (too low). It's only supposed to be a best-guess. So that would be a good reason to, yes, allow setDTthreads(nth>omp_get_max_threads()) but with warning and a hard limit of 4X? Then it could be used to experiment on different hardware, yes.

@jangorecki
Member Author

Any suggestions for how to make threads sleep in OpenMP? It would be useful for debugging when OpenMP is used in a function that is called from another function that also uses OpenMP.

@mattdowle
Member

I'm not following this question: just calling sleep() in a thread will sleep that thread.

@jangorecki
Member Author

related to #3300

mattdowle added this to the 1.12.2 milestone Mar 5, 2019
@mattdowle
Member

PR #3435 solved this. setDTthreads() now allows up to omp_get_num_procs(), which is the number of logical CPUs on the machine. On my laptop with 4 cores, this limit is 8, for example. Given the cache-bound algorithms we have, we shouldn't allow more.
