Suspiciously low User and Sys time #597

moncefmechri · 2022-12-29T15:15:08Z

I am benchmarking the mold linker on an in-house application, using hyperfine 1.15.0 (from conda-forge) to drive the benchmarks.

While the wall-clock times reported seem correct, the "User" and "Sys" times seem way too low (and also are not impacted by the number of threads used). Example:

Benchmark 1: mold180 (1 threads)
  Time (mean ± σ):     24.302 s ±  0.477 s    [User: 0.392 s, System: 0.234 s]
  Range (min … max):   23.836 s … 24.789 s    3 runs

Benchmark 2: mold180 (2 threads)
  Time (mean ± σ):     14.871 s ±  0.597 s    [User: 0.385 s, System: 0.246 s]
  Range (min … max):   14.351 s … 15.523 s    3 runs

The benchmark invokes the linker by calling ninja MyBinary (with --setup and --prepare used to make sure ninja does not much else than invoking the linker)

Specs of the machine used to run the benchmark:

OS: Ubuntu 20.04 with kernel 5.4
CPU: Dual socket Intel Xeon Gold 5218
RAM: 128GB

The text was updated successfully, but these errors were encountered:

sharkdp · 2023-01-08T07:18:26Z

Please see https://stackoverflow.com/a/556411/704831. User and system times only refer to the time that the benchmarked process itself spent in user/system mode. Ninja calls out to subprocesses (the linker) which takes the majority of the time. So what you see above is only the time that Ninja itself spends in user/system mode.

If you are interested in the user/system times for the linker, maybe you can benchmark it directly?

I realize that this can be confusing. We should probably document it. Now that I'm thinking about it some more, I'm also not sure if that works correctly with and without the intermediate shell (--shell=…). And if it is consistent across operating systems. This should be investigated, (consolidated) and documented.

moncefmechri · 2023-01-08T22:51:29Z

I think there might be something more subtle at play here.

First of all, for some reason, when linking using lld instead of mold, the user and system times look a lot more reasonable, even when going through ninja.

If I skip ninja and invoke g++ explicitly, I see again a similar pattern. The following runs both use 1 thread:

Benchmark 1: lld
  Time (mean ± σ):     10.933 s ±  0.029 s    [User: 9.437 s, System: 1.493 s]
  Range (min … max):   10.912 s … 10.953 s    2 runs
  
Benchmark 1: mold
  Time (mean ± σ):     12.026 s ±  0.028 s    [User: 0.013 s, System: 0.002 s]
  Range (min … max):   12.006 s … 12.045 s    2 runs

The only difference in the command line is how the linker is specified. For lld, I pass to the compiler -fuse-ld=lld, while for mold I pass -B/path/to/mold/folder. I can't think of how this would impact the results.

I also ran both commands with strace -f. Both traces look almost identical. With both linkers:

With a vfork() + execve(), the compiler invokes collect2
With another vfork() + execve(), collect2 invokes the linker

Once collect2 has invoked the linker, the main difference I see in the strace -f outputs between lld and mold is that mold ends up calling clone(), and the actual work seems to be done in that clone()-d process, while lld does the work in its main process.

The call to clone() in the strace -f output for mold:

[pid 36238] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f33768d0ad0) = 36239

Maybe that's a clue toward the real answer.

Note that I am not the only one to have seen this behavior. In this blog post, one of the lld devs uses hyperfine to compare lld and mold. You can see the same pattern there.

Finally, I will add that time behaves similarly for both linkers:

With lld:

real    0m10.922s
user    0m9.469s
sys     0m1.452s

With mold:

real    0m12.083s
user    0m0.008s
sys     0m0.008s

sharkdp · 2023-01-09T07:25:00Z

Thank you for the detailed answer. I sent you that stackoverflow link but didn't read it carefully myself - sorry. This also explains my confusion ("I'm also not sure if that works correctly with and without the intermediate shell").

The reality is a bit more complicated, as you said. Typically, user and sys times of child processes are collected. Which is why hyperfines measurement usually works with and without the intermediate shell.

However, user and sys times of child processes that did not finish in time (when the main processes ended) are not counted. This is described in this comment:

The time spent by child processes and their descendants count only when the time could have been collected by wait(2) or waitpid(2), and recursively with descendants. This implies that the child processes must have terminated. For instance, compare time sh -c 'foo & sleep 1' and time sh -c 'foo & sleep 2', where foo is a command that takes CPU time between 1 and 2 seconds. The former outputs something around 0.

We can demonstrate this using stress --cpu 1 --timeout 2, which spends most of its time in "user" mode.. for two seconds:

▶ time sh -c 'stress --cpu 1 --timeout 2 --quiet & sleep 1'
[…]  0,00s user 0,00s system 0% cpu 1,004 total

~                                                                                                      
▶ time sh -c 'stress --cpu 1 --timeout 2 --quiet & sleep 3'
[…]  2,00s user 0,00s system 66% cpu 3,004 total

The same thing can also be demonstrated with hyperfine:

▶ hyperfine \
    'stress --cpu 1 --timeout 2 --quiet & sleep 1' \
    'stress --cpu 1 --timeout 2 --quiet & sleep 3'

Benchmark 1: stress --cpu 1 --timeout 2 --quiet & sleep 1
  Time (mean ± σ):      1.001 s ±  0.000 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.001 s …  1.001 s    10 runs
 
Benchmark 2: stress --cpu 1 --timeout 2 --quiet & sleep 3
  Time (mean ± σ):      3.001 s ±  0.000 s    [User: 1.996 s, System: 0.001 s]
  Range (min … max):    3.001 s …  3.002 s    10 runs

Note that the 2 seconds in user mode are only reported in the lower example, where the waiting time is long enough. In the first example, a user time of 0 is reported.

Now the question remains why this happens in your example. Could it be that one of the tools spawns children but then doesn't wait for them?

moncefmechri · 2023-01-10T21:40:04Z

Thank you for your reply. I will try to figure out what mold is doing in the next few days, and will report back

sharkdp · 2023-04-17T20:34:43Z

@moncefmechri Any update on this?

sharkdp added the documentation label Apr 17, 2023

sharkdp mentioned this issue Jun 19, 2023

Exporting detailed user/system times (in the JSON output)? #660

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suspiciously low User and Sys time #597

Suspiciously low User and Sys time #597

moncefmechri commented Dec 29, 2022 •

edited

Loading

sharkdp commented Jan 8, 2023

moncefmechri commented Jan 8, 2023 •

edited

Loading

sharkdp commented Jan 9, 2023 •

edited

Loading

moncefmechri commented Jan 10, 2023

sharkdp commented Apr 17, 2023

Suspiciously low User and Sys time #597

Suspiciously low User and Sys time #597

Comments

moncefmechri commented Dec 29, 2022 • edited Loading

sharkdp commented Jan 8, 2023

moncefmechri commented Jan 8, 2023 • edited Loading

sharkdp commented Jan 9, 2023 • edited Loading

moncefmechri commented Jan 10, 2023

sharkdp commented Apr 17, 2023

moncefmechri commented Dec 29, 2022 •

edited

Loading

moncefmechri commented Jan 8, 2023 •

edited

Loading

sharkdp commented Jan 9, 2023 •

edited

Loading