Quality of native perf profiling on x64 #105690

Open · EgorBo opened this issue Jul 30, 2024 · 10 comments

EgorBo (Member) commented Jul 30, 2024

I have built a toy bot that runs arbitrary benchmarks on Azure Linux VMs and then reports the results back to the PRs/issues the bot was invoked from. It also runs the native Linux perf tool to collect traces/flamegraphs on demand. What I noticed from the beginning is that the quality of flamegraphs on x64 is often poor and somewhat random between runs, compared to exactly the same config on arm64.

Example: #105593 (comment)

So here is how the flamegraphs look between sequential runs on arm64 (Ampere Altra):

If you open these graphs and switch between tabs in your browser, you will notice almost no difference between "run 0" and "run 1", as expected.

On x64 (AMD Milan) the picture is a bit different:

(presumably, with less aggressive/disabled inlining it's a lot worse)

The graphs vary a lot ("randomly") between runs. We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them, although there are issues tracking changing that eventually: #88823 and #35274). Should we stop using that optimization when PerfMap is enabled? (Or even introduce a new knob, since it's not only perf-specific.)
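
To make the frame-pointer hypothesis testable, here is a minimal sketch of two collection variants (assumptions: DOTNET_JitFramed is the JIT debug knob that forces frame pointers in jitted code, and perf's DWARF unwinder may only partially cover jitted frames):

# Variant 1: force frame pointers in jitted code. DOTNET_JitFramed is a JIT
# debug knob and is assumed here to still be honored on x64.
DOTNET_PerfMapEnabled=1 DOTNET_JitFramed=1 corerun benchapp.dll &
sleep 15
perf record -g -F 49999 -p "$(pgrep corerun)" sleep 8

# Variant 2: sidestep frame pointers and unwind from stack snapshots instead
# (much larger traces; coverage of jitted frames is not guaranteed).
perf record --call-graph dwarf -F 49999 -p "$(pgrep corerun)" sleep 8

If the graphs stabilize under variant 1, frame-pointer omission is the likely culprit.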

cc @jkotas @dotnet/dotnet-diag

The dotnet-issue-labeler bot added the needs-area-label label Jul 30, 2024
@EgorBo added the area-Diagnostics-coreclr label and removed the needs-area-label label Jul 30, 2024
EgorBo (Member, Author) commented Jul 30, 2024

@SingleAccretion pointed me to these two places:

  1. How the JIT decides when to set up a frame: link
  2. The JIT ignores the VM's JIT_FLAG_FRAMED on x64: link

EgorBo (Member, Author) commented Jul 30, 2024

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)
[screenshot: flamegraph sub-graph, arm64]

same sub-graph on x64:

[screenshot: flamegraph sub-graph, x64]

stephentoub (Member) commented Jul 30, 2024

On a completely unrelated note, it's interesting that DateTime formatting is 2x slower than Guid and Int32; I wonder if there is some optimization potential here 🙂 cc @stephentoub (or is it expected because the rules for dates are a lot more complicated?)

It's much more complicated; frankly, I'm pleased it's only 2x the cost of formatting an Int32.

internal static unsafe bool TryFormatO<TChar>(DateTime dateTime, TimeSpan offset, Span<TChar> destination, out int charsWritten) where TChar : unmanaged, IUtfChar<TChar>
{
    int charsRequired = FormatOMinLength;
    DateTimeKind kind = DateTimeKind.Local;

    if (offset.Ticks == NullOffset)
    {
        kind = dateTime.Kind;
        if (kind == DateTimeKind.Local)
        {
            offset = TimeZoneInfo.Local.GetUtcOffset(dateTime);
            charsRequired += 6;
        }
        else if (kind == DateTimeKind.Utc)
        {
            charsRequired++;
        }
    }
    else
    {
        charsRequired += 6;
    }

    if (destination.Length < charsRequired)
    {
        charsWritten = 0;
        return false;
    }
    charsWritten = charsRequired;

    dateTime.GetDate(out int year, out int month, out int day);
    dateTime.GetTimePrecise(out int hour, out int minute, out int second, out int tick);

    fixed (TChar* dest = &MemoryMarshal.GetReference(destination))
    {
        Number.WriteFourDigits((uint)year, dest);
        dest[4] = TChar.CastFrom('-');
        Number.WriteTwoDigits((uint)month, dest + 5);
        dest[7] = TChar.CastFrom('-');
        Number.WriteTwoDigits((uint)day, dest + 8);
        dest[10] = TChar.CastFrom('T');
        Number.WriteTwoDigits((uint)hour, dest + 11);
        dest[13] = TChar.CastFrom(':');
        Number.WriteTwoDigits((uint)minute, dest + 14);
        dest[16] = TChar.CastFrom(':');
        Number.WriteTwoDigits((uint)second, dest + 17);
        dest[19] = TChar.CastFrom('.');
        Number.WriteDigits((uint)tick, dest + 20, 7);

        if (kind == DateTimeKind.Local)
        {
            int offsetTotalMinutes = (int)(offset.Ticks / TimeSpan.TicksPerMinute);
            char sign = '+';
            if (offsetTotalMinutes < 0)
            {
                sign = '-';
                offsetTotalMinutes = -offsetTotalMinutes;
            }
            (int offsetHours, int offsetMinutes) = Math.DivRem(offsetTotalMinutes, 60);
            dest[27] = TChar.CastFrom(sign);
            Number.WriteTwoDigits((uint)offsetHours, dest + 28);
            dest[30] = TChar.CastFrom(':');
            Number.WriteTwoDigits((uint)offsetMinutes, dest + 31);
        }
        else if (kind == DateTimeKind.Utc)
        {
            dest[27] = TChar.CastFrom('Z');
        }
    }
    return true;
}

But if you have ideas for significantly improving it, that'd be great.

jkotas (Member) commented Jul 30, 2024

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

EgorBo (Member, Author) commented Jul 30, 2024

We wonder if this could be caused by an x64-specific optimization to omit frame pointers for simple methods (on arm64 we currently always emit them

I do not see how this can explain randomness. I can believe that this optimization can cause the flamegraph to be less representative in some cases (similar effects as inlining or tail-call optimizations), but it should not be random.

Any ideas what that could be? We do rely on randomness for e.g. PGO, but I'd expect arm64 to be affected as well if that were the major factor.

Here is a diff example between two x64 runs:

[screenshot: flamegraph diff between two x64 runs]

I wonder why we see OnTryWrite twice here - is it recursive?

EgorBo (Member, Author) commented Jul 30, 2024

A bit of context on how the perf data is collected: I ask BDN to run longer, wait 15 seconds, attach perf to the process, collect samples for 8 seconds, and quit. Hence, the process is expected to be in a steady state for such trivial benchmarks.

# Launch the benchmark app with PerfMap enabled so perf can resolve jitted code
DOTNET_PerfMapEnabled=1 DOTNET_EnableWriteXorExecute=0 nohup $DIR_WORK/core_root_$1/corerun \
    $TFM_PATH/linux-$ARCH/publish/benchapp.dll -i --filter "*" --noForcedGCs \
    --noOverheadEvaluation --disableLogFile --maxWarmupCount 8 \
    --minIterationCount 150000 --maxIterationCount 200000 -a perfarts \
    1> $DIR_LOGS/profiler.out 2> $DIR_LOGS/profiler.err &

# Let the benchmark reach a steady state, then sample call stacks for 8 seconds
sleep 15
perf record $PERF_RECORD_ARGS -k 1 -g -F 49999 -p $(pgrep corerun) sleep 8

# Merge the jitdump into the trace, then emit per-function and annotated-asm reports
perf inject --input perf.data --jit --output perfjit.data
perf report --input perfjit.data --no-children --percent-limit 5 --stdio > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1_functions.txt
perf annotate --stdio -i perfjit.data --percent-limit 5 -M intel > $DIR_BENCHAPP/BenchmarkDotNet.Artifacts/$1.asm

# Tear down the benchmark process
pkill corerun || true
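
A quick sanity check that the jitdump pipeline actually ran (the locations below are assumed defaults; the runtime's jitdump path is configurable):

# Assumed default artifact locations - adjust if overridden in your setup.
ls /tmp/jit-*.dump 2>/dev/null    # jitdump files written by the runtime when PerfMap is enabled
ls ~/.debug/jit/ 2>/dev/null      # ELF images produced by 'perf inject --jit'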

@tommcdon added this to the 10.0.0 milestone Aug 4, 2024
noahfalk (Member) commented Aug 4, 2024

I don't have any immediate speculation on what the cause is. My initial thought is that while the architecture difference is a little surprising, this might be a "correlation != causation" scenario, and the culprit will turn out to be something quite different. A few suggestions that might help make progress:

  • You may want to try some benchmarks that exercise non-jitted code - it's possible the noise has more to do with the hardware, background load, or measurement approach than anything in the codegen.
  • I find the flame graphs nice for seeing that a difference exists, but hard to do deeper analysis on. For that I usually turn to a table of method name + exclusive time. PerfView also has a regression-analysis feature that takes two traces and compares them, pointing out the methods whose relative times have changed the most.
  • Performing more trial runs could give us a better idea of what the distribution of results looks like and how it varies with respect to other variables: for example, by host machine, background processes, or choice of benchmark app, and whether it is localized to specific methods or smeared across all of them (a repeat-collection sketch follows this list).
  • Folks like @adamsitnik and @brianrob may have more suggestions.
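
For the more-trial-runs suggestion, here is a repeat-collection sketch along the lines of the script earlier in this thread (file names are placeholders):

# Collect N back-to-back traces from the same steady-state process and keep
# the top of each report, to quantify run-to-run variation.
for i in $(seq 1 10); do
  perf record -g -F 49999 -o "run_$i.data" -p "$(pgrep corerun)" sleep 8
  perf inject --input "run_$i.data" --jit --output "run_${i}_jit.data"
  perf report --input "run_${i}_jit.data" --no-children --stdio | head -40 > "run_${i}_top.txt"
done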

Hope that helps!

adamsitnik (Member) commented
I am not sure if this is the same issue, but when I was working on adding --profiler perf support to BDN (which is just a wrapper for perfcollect), I noticed that on x64 the profiles often looked detached, while on arm64 everything worked as expected.

By detached I mean: for method A calling method B, which in turn calls method C, the profile showed "A calling B" at the root level with "C" as a separate root frame next to them, rather than under B.

Example:

[screenshot: detached call stacks in the profile]

I've not tried it with .NET 9, so the issue might be gone.

IIRC @janvorli knew the reason behind this particular difference.

adamsitnik (Member) commented
@EgorBo just out of curiosity, what is https://adamsitnik.com/PerfCollectProfiler/ missing that pushed you to implement your own solution?

EgorBo (Member, Author) commented Aug 7, 2024

@EgorBo just out of curiosity, what is https://adamsitnik.com/PerfCollectProfiler/ missing that pushed you to implement your own solution?

@adamsitnik I tried that first, but decided I wanted more explicit control over raw perf, e.g. testing different events and frequencies, and getting annotated asm code. For example, I am not sure your profiler can extract raw asm annotated with sample percentages for the top 5 hottest functions in a profile.

Overall, there are two problems currently:

  1. Some frames look "detached" on x64. Today I ran a quick benchmark for the Write-Barrier issue; compare these two flamegraphs: x64 and arm64 - for some reason, the write barrier is not "on top" of the void WB() function as expected (it is on arm64) - see the original C# code.

  2. General noisy differences between runs - I think these are indeed caused by tiered compilation. After I enabled DOTNET_PerfMapShowOptimizationTiers=1 I noticed mismatches between tier names, so I bumped the timeout perf waits before collecting samples (to 30 sec). Alternatively, users can just disable tiered compilation (a sketch of both knobs follows).
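
For reference, a sketch of the two knobs mentioned above (both are documented CoreCLR environment settings):

# Annotate jitted symbols with their optimization tier in the perf map/jitdump
export DOTNET_PerfMapShowOptimizationTiers=1

# Or take tier transitions out of the picture entirely for profiling runs
export DOTNET_TieredCompilation=0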
