Quality of native perf profiling on x64 #105690
@SingleAccretion pointed me to these two places:
On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?) ^ arm64; the same sub-graph on x64:
It's much more complicated; frankly I'm pleased it's only 2x formatting an Int32. See runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/DateTimeFormat.cs, lines 1439 to 1513 at 35b94da. But if you have ideas for significantly improving it, that'd be great.
I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar in effect to inlining or tail-call optimizations), but it should not be random.
Any ideas what that could be? We do rely on randomness for e.g. PGO, but I'd expect arm64 to be affected as well if that were the major factor. Here is a diff example between two x64 runs: I wonder why we see OnTryWrite twice here, is it recursive?
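(A diff like this can also be produced mechanically with the stock `perf diff` tool; the sketch below is just one way to do it, assuming two separate recordings `run0.data` and `run1.data` that are each merged with their jitdump first — not necessarily how the diff above was generated.)

```bash
# Hypothetical file names for two independent recordings of the same benchmark.
perf inject --input run0.data --jit --output run0.jit.data
perf inject --input run1.data --jit --output run1.jit.data

# Report per-symbol deltas between the two runs; large swings on a stable workload
# would confirm the sampling/unwinding itself is unstable rather than the code.
perf diff run0.jit.data run1.jit.data | head -40
```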
A bit of context on how the perf data is collected: I ask BDN to run longer, wait 15 seconds, and then attach:

```bash
# Run the benchmark app with perf maps enabled so JIT'd frames can be symbolized.
DOTNET_PerfMapEnabled=1 DOTNET_EnableWriteXorExecute=0 nohup $DIR_WORK/core_root_$1/corerun $TFM_PATH/linux-$ARCH/publish/benchapp.dll -i --filter "*" --noForcedGCs --noOverheadEvaluation --disableLogFile --maxWarmupCount 8 --minIterationCount 150000 --maxIterationCount 200000 -a perfarts 1> $DIR_LOGS/profiler.out 2> $DIR_LOGS/profiler.err &
sleep 15
# Sample the running process for 8 seconds with call graphs.
perf record $PERF_RECORD_ARGS -k 1 -g -F 49999 -p $(pgrep corerun) sleep 8
# Merge the jitdump data so JIT-compiled methods resolve to names.
perf inject --input perf.data --jit --output perfjit.data
perf report --input perfjit.data --no-children --percent-limit 5 --stdio > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_functions.txt
perf annotate --stdio -i perfjit.data --percent-limit 5 -M intel > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1.asm
pkill corerun || true
```
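(Not shown above: the flamegraphs themselves. One typical way to render them from the same `perfjit.data`, assuming Brendan Gregg's FlameGraph scripts are available locally, would be:)

```bash
# Assumes stackcollapse-perf.pl and flamegraph.pl from the FlameGraph repo are on PATH.
perf script --input perfjit.data | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
```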
I don't have any immediate speculation on what the cause is. My initial thought is that, while the architecture difference is a little surprising, this might be a "correlation != causation" scenario and it will turn out the culprit is something quite different. A few suggestions that might help make progress:
Hope that helps!
I am not sure if this is the same issue, but when I was working on adding …, I noticed that some frames in the profile were detached. By detached I mean: for method A calling method B and then C, the profile showed "A calling B" at the root level and "C" just next to them (rather than B calling C). I've not tried it with .NET 9, so the issue might be gone. IIRC @janvorli knew the reason behind this particular difference.
@EgorBo just out of curiosity, what is https://adamsitnik.com/PerfCollectProfiler/ missing that pushed you to implement your own solution?
@adamsitnik I tried that first, but decided to have more explicit control over raw `perf`. Overall, there are 2 problems currently:
I have built a toy bot to run arbitrary benchmarks on Azure Linux VMs and then report the results back to the PRs/issues the bot was invoked from. It also runs the native Linux `perf` tool to collect traces/flamegraphs on demand. What I noticed from the beginning is that the quality of flamegraphs on x64 is often poor and a bit random between runs, compared to exactly the same config on arm64. Example: #105593 (comment)
So here is how flamegraphs look between sequential runs on arm64 (which is Ampere Altra):
If you open these graphs and switch between the tabs in your browser, you will notice almost no difference between "run 0" and "run 1", as expected.
On x64 (AMD Milan) the picture is a bit different:
(presumably, with less aggressive/disabled inlining it's a lot worse)
The graphs are a lot more "random" between the runs. We wonder if this could be caused by the x64-specific optimization that omits frame pointers for simple methods (on arm64 we currently always emit them, although we have issues to eventually fix that: #88823 and #35274). Should we never use that optimization when PerfMap is enabled? (Or even introduce a new knob, since it's not only `perf`-specific.)

cc @jkotas @dotnet/dotnet-diag