Quality of native perf profiling on x64 #105690

Open · EgorBo opened this issue Jul 30, 2024 · 10 comments

EgorBo (Member) commented Jul 30, 2024

I have built a toy bot that runs arbitrary benchmarks on Azure Linux VMs and then reports the results back to the PRs/issues the bot was invoked from. It also runs the native Linux perf tool to collect traces/flamegraphs on demand. What I noticed from the beginning is that the quality of flamegraphs on x64 is often poor and somewhat random between runs, compared to exactly the same config on arm64.

Example: #105593 (comment)

So here is how the flamegraphs look between sequential runs on arm64 (Ampere Altra):

If you open these graphs and switch between tabs in your browser, you will notice almost no difference between "run 0" and "run 1", as expected.

On x64 (AMD Milan) the picture is a bit different:

(presumably, with less aggressive/disabled inlining it's a lot worse)

The graphs vary a lot ("randomly") between runs. We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them, although there are issues tracking changing that eventually: #88823 and #35274). Should we stop using that optimization when PerfMap is enabled? (Or even introduce a new knob, since it's not only perf-specific.)
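
To make the frame-pointer hypothesis testable, here is a minimal sketch of two collection variants (assumptions: DOTNET_JitFramed is the JIT debug knob that forces frame pointers in jitted code, and perf's DWARF unwinder may only partially cover jitted frames):

# Variant 1: force frame pointers in jitted code. DOTNET_JitFramed is a JIT
# debug knob and is assumed here to still be honored on x64.
DOTNET_PerfMapEnabled=1 DOTNET_JitFramed=1 corerun benchapp.dll &
sleep 15
perf record -g -F 49999 -p "$(pgrep corerun)" sleep 8

# Variant 2: sidestep frame pointers and unwind from stack snapshots instead
# (much larger traces; coverage of jitted frames is not guaranteed).
perf record --call-graph dwarf -F 49999 -p "$(pgrep corerun)" sleep 8

If the graphs stabilize under variant 1, frame-pointer omission is the likely culprit.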

cc @jkotas @dotnet/dotnet-diag

The dotnet-issue-labeler bot added the needs-area-label label Jul 30, 2024
@EgorBo added the area-Diagnostics-coreclr label and removed the needs-area-label label Jul 30, 2024
EgorBo (Member, Author) commented Jul 30, 2024

@SingleAccretion pointed me to these two places:

  1. How the JIT decides when to set up a frame: link
  2. The JIT ignores the VM's JIT_FLAG_FRAMED on x64: link

EgorBo (Member, Author) commented Jul 30, 2024

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)
[screenshot: flamegraph sub-graph, arm64]

same sub-graph on x64:

[screenshot: flamegraph sub-graph, x64]

stephentoub (Member) commented Jul 30, 2024

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)

It's much more complicated; frankly, I'm pleased it's only 2x the cost of formatting an Int32.

internal static unsafe bool TryFormatO<TChar>(DateTime dateTime, TimeSpan offset, Span<TChar> destination, out int charsWritten) where TChar : unmanaged, IUtfChar<TChar>
{
    int charsRequired = FormatOMinLength;
    DateTimeKind kind = DateTimeKind.Local;

    if (offset.Ticks == NullOffset)
    {
        kind = dateTime.Kind;
        if (kind == DateTimeKind.Local)
        {
            offset = TimeZoneInfo.Local.GetUtcOffset(dateTime);
            charsRequired += 6;
        }
        else if (kind == DateTimeKind.Utc)
        {
            charsRequired++;
        }
    }
    else
    {
        charsRequired += 6;
    }

    if (destination.Length < charsRequired)
    {
        charsWritten = 0;
        return false;
    }
    charsWritten = charsRequired;

    dateTime.GetDate(out int year, out int month, out int day);
    dateTime.GetTimePrecise(out int hour, out int minute, out int second, out int tick);

    fixed (TChar* dest = &MemoryMarshal.GetReference(destination))
    {
        Number.WriteFourDigits((uint)year, dest);
        dest[4] = TChar.CastFrom('-');
        Number.WriteTwoDigits((uint)month, dest + 5);
        dest[7] = TChar.CastFrom('-');
        Number.WriteTwoDigits((uint)day, dest + 8);
        dest[10] = TChar.CastFrom('T');
        Number.WriteTwoDigits((uint)hour, dest + 11);
        dest[13] = TChar.CastFrom(':');
        Number.WriteTwoDigits((uint)minute, dest + 14);
        dest[16] = TChar.CastFrom(':');
        Number.WriteTwoDigits((uint)second, dest + 17);
        dest[19] = TChar.CastFrom('.');
        Number.WriteDigits((uint)tick, dest + 20, 7);

        if (kind == DateTimeKind.Local)
        {
            int offsetTotalMinutes = (int)(offset.Ticks / TimeSpan.TicksPerMinute);
            char sign = '+';
            if (offsetTotalMinutes < 0)
            {
                sign = '-';
                offsetTotalMinutes = -offsetTotalMinutes;
            }
            (int offsetHours, int offsetMinutes) = Math.DivRem(offsetTotalMinutes, 60);
            dest[27] = TChar.CastFrom(sign);
            Number.WriteTwoDigits((uint)offsetHours, dest + 28);
            dest[30] = TChar.CastFrom(':');
            Number.WriteTwoDigits((uint)offsetMinutes, dest + 31);
        }
        else if (kind == DateTimeKind.Utc)
        {
            dest[27] = TChar.CastFrom('Z');
        }
    }
    return true;
}

But if you have ideas for significantly improving it, that'd be great.

jkotas (Member) commented Jul 30, 2024

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

EgorBo (Member, Author) commented Jul 30, 2024

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

Any ideas what that could be? We do rely on randomness for e.g. PGO, but I'd expect arm64 to be affected as well if that were the major factor.

Here is a diff example between two x64 runs:

[screenshot: flamegraph diff between two x64 runs]

I wonder why we see OnTryWrite twice here - is it recursive?

EgorBo (Member, Author) commented Jul 30, 2024

A bit of context on how the perf data is collected: I ask BDN to run longer, wait 15 seconds, attach perf to the process, collect samples for 8 seconds, and quit. Hence, the process is expected to be in a steady state for such trivial benchmarks.

# Launch the benchmark app with PerfMap enabled so perf can resolve jitted code
DOTNET_PerfMapEnabled=1 DOTNET_EnableWriteXorExecute=0 nohup $DIR_WORK/core_root_$1/corerun \
    $TFM_PATH/linux-$ARCH/publish/benchapp.dll -i --filter "*" --noForcedGCs \
    --noOverheadEvaluation --disableLogFile --maxWarmupCount 8 \
    --minIterationCount 150000 --maxIterationCount 200000 -a perfarts \
    1> $DIR_LOGS/profiler.out 2> $DIR_LOGS/profiler.err &

# Let the benchmark reach a steady state, then sample call stacks for 8 seconds
sleep 15
perf record $PERF_RECORD_ARGS -k 1 -g -F 49999 -p $(pgrep corerun) sleep 8

# Merge the jitdump into the trace, then emit per-function and annotated-asm reports
perf inject --input perf.data --jit --output perfjit.data
perf report --input perfjit.data --no-children --percent-limit 5 --stdio > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_functions.txt
perf annotate --stdio -i perfjit.data --percent-limit 5 -M intel > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1.asm

# Tear down the benchmark process
pkill corerun || true
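
A quick sanity check that the jitdump pipeline actually ran (the locations below are assumed defaults; the runtime's jitdump path is configurable):

# Assumed default artifact locations - adjust if overridden in your setup.
ls /tmp/jit-*.dump 2>/dev/null    # jitdump files written by the runtime when PerfMap is enabled
ls ~/.debug/jit/ 2>/dev/null      # ELF images produced by 'perf inject --jit'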

@tommcdon added this to the 10.0.0 milestone Aug 4, 2024
noahfalk (Member) commented Aug 4, 2024

I don't have any immediate speculation on what the cause is. My initial thought is that while the architecture difference is a little surprising, this might be a "correlation != causation" scenario, and the culprit will turn out to be something quite different. A few suggestions that might help make progress:

  • You may want to try some benchmarks that exercise non-jitted code - it's possible the noise has more to do with the hardware, background load, or measurement approach than anything in the codegen.
  • I find the flame graphs nice for seeing that a difference exists, but hard to do deeper analysis on. For that I usually turn to a table of method name + exclusive time. PerfView also has a regression-analysis feature that takes two traces and compares them, pointing out the methods whose relative times have changed the most.
  • Performing more trial runs could give us a better idea of what the distribution of results looks like and how it varies with respect to other variables: for example, by host machine, background processes, or choice of benchmark app, and whether it is localized to specific methods or smeared across all of them (a repeat-collection sketch follows this list).
  • Folks like @adamsitnik and @brianrob may have more suggestions.
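
For the more-trial-runs suggestion, here is a repeat-collection sketch along the lines of the script earlier in this thread (file names are placeholders):

# Collect N back-to-back traces from the same steady-state process and keep
# the top of each report, to quantify run-to-run variation.
for i in $(seq 1 10); do
  perf record -g -F 49999 -o "run_$i.data" -p "$(pgrep corerun)" sleep 8
  perf inject --input "run_$i.data" --jit --output "run_${i}_jit.data"
  perf report --input "run_${i}_jit.data" --no-children --stdio | head -40 > "run_${i}_top.txt"
done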

Hope that helps!

adamsitnik (Member) commented
I am not sure if this is the same issue, but when I was working on adding --profiler perf support to BDN (which is just a wrapper for perfcollect), I noticed that on x64 the profiles often looked detached, while on arm64 everything worked as expected.

By detached I mean: for method A calling method B, which in turn calls method C, the profile showed "A calling B" at the root level with "C" as a separate root frame next to them, rather than under B.

Example:

[screenshot: detached call stacks in the profile]

I've not tried it with .NET 9, so the issue might be gone.

IIRC @janvorli knew the reason behind this particular difference.

adamsitnik (Member) commented
@EgorBo just out of curiosity, what is https://adamsitnik.com/PerfCollectProfiler/ missing that pushed you to implement your own solution?

EgorBo (Member, Author) commented Aug 7, 2024

@EgorBo just out of curiosity, what is https://adamsitnik.com/PerfCollectProfiler/ missing that pushed you to implement your own solution?

@adamsitnik I tried that first, but decided I wanted more explicit control over raw perf, e.g. testing different events and frequencies, and getting annotated asm code. For example, I am not sure your profiler can extract raw asm annotated with sample percentages for the top 5 hottest functions in a profile.

Overall, there are two problems currently:

  1. Some frames look "detached" on x64. Today I ran a quick benchmark for the Write-Barrier issue; compare these two flamegraphs: x64 and arm64 - for some reason, the write barrier is not "on top" of the void WB() function as expected (it is on arm64) - see the original C# code.

  2. General noisy differences between runs - I think these are indeed caused by tiered compilation. After I enabled DOTNET_PerfMapShowOptimizationTiers=1 I noticed mismatches between tier names, so I bumped the timeout perf waits before collecting samples (to 30 sec). Alternatively, users can just disable tiered compilation (a sketch of both knobs follows).
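
For reference, a sketch of the two knobs mentioned above (both are documented CoreCLR environment settings):

# Annotate jitted symbols with their optimization tier in the perf map/jitdump
export DOTNET_PerfMapShowOptimizationTiers=1

# Or take tier transitions out of the picture entirely for profiling runs
export DOTNET_TieredCompilation=0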
