
Significant Performance Disparity Between Arm64 and x64 Write Barriers #106051

Open · ebepho opened this issue Aug 6, 2024 · 24 comments

@ebepho (Contributor) commented Aug 6, 2024
Description

We observed a significant performance disparity between the Arm64 and x64 write barriers. When running a program without the write barrier, Arm64 was 3x slower than x64. However, with the write barrier enabled, Arm64 became 10x slower. This suggests that Arm64's handling of the write barrier is less optimized compared to x64.

Data

Performance Counter Stats without the Write Barrier

To test the performance of the write barrier, we used Crank to run a simple program 10 times on each of the two machines. Notice that even when we do not invoke the write barrier, the program is approximately 3x slower on the Arm64 machine.

This is the simple program, which does not invoke the write barrier, that we measured with Crank:

int[] foo = new int[1];
for (long i = 0; i < 100_000_000; i++)
{
   foo[0]++;
}

Table 1: Average Performance Counter Stats without the write barrier.

| Counter | x64 (100M iters) | x64 (200M iters) | Arm64 (100M iters) | Arm64 (200M iters) |
|---|---|---|---|---|
| cache-references | 7,199,555 | 7,210,098 | 266,711,905 | 467,403,412.6 |
| cache-misses | 1,673,444 | 1,673,888 | 1,021,946.5 | 1,042,045.5 |
| cycles | 812,275,185 | 1,513,438,858 | 831,957,725 | 1,517,325,563 |
| instructions | 656,685,121 | 1,156,933,373.4 | 881,350,905 | 1,583,055,913 |
| branches | 131,173,961 | 231,219,510.1 | 121,014,944 | 221,181,620.1 |
| faults | 2,123.4 | 2,123.2 | 3,290.1 | 3,290.9 |
| migrations | 50.9 | 51.7 | 71.1 | 84.8 |
| Time elapsed (s) | 0.26562 | 0.47812 | 0.82561 | 1.4412 |
| User (s) | 0.24808 | 0.46158 | 0.74556 | 1.3178 |
| Sys (s) | 0.00801 | 0.00946 | 0.16161 | 0.20523 |

Performance Counter Stats with the Write Barrier

When we do invoke the write barrier, performance degrades further, with the Arm64 machine becoming 10x slower.

This is the simple program, which does invoke the write barrier, that we measured with Crank:

Foo foo = new Foo();
for (long i = 0; i < (# of iterations); i++)
{
    foo.x = foo;
}
internal class Foo
{
    public volatile Foo x;
}

Table 2: Performance Counter Stats with the write barrier.

| Counter | x64 (100M iters) | x64 (200M iters) | Arm64 (100M iters) | Arm64 (200M iters) |
|---|---|---|---|---|
| cache-references | 7,252,140 | 7,178,833 | 568,014,397 | 1,068,659,425 |
| cache-misses | 1,697,333 | 1,684,188 | 1,025,013 | 1,012,689 |
| cycles | 713,364,359 | 1,313,245,706 | 2,756,710,296 | 5,360,611,600 |
| instructions | 1,456,194,567 | 2,756,823,577 | 1,983,627,681 | 3,785,656,008 |
| branches | 431,088,498 | 831,198,368 | 621,239,460 | 1,221,448,774 |
| faults | 2,116 | 2,124 | 3,291 | 3,296 |
| migrations | 50.9 | 52.3 | 72.7 | 61.6 |
| Time elapsed (s) | 0.23283 | 0.41492 | 2.6058 | 4.2126 |
| User (s) | 0.21495 | 0.39656 | 2.5438 | 4.0788 |
| Sys (s) | 0.01169 | 0.01188 | 0.14361 | 0.1984 |
@ebepho ebepho added the tenet-performance Performance related issue label Aug 6, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 6, 2024
@ebepho ebepho added arch-arm64 area-GC-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 6, 2024

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@teo-tsirpanis (Contributor)

Is it certain that the write barrier is to blame? volatile writes have release semantics which I think adds an overhead on ARM architectures.

@ebepho (Contributor, Author) commented Aug 6, 2024

> Is it certain that the write barrier is to blame? volatile writes have release semantics which I think adds an overhead on ARM architectures.

The volatile overhead is not significant enough to explain the performance regressions observed. The numbers were roughly the same with and without it.

@neon-sunset (Contributor) commented Aug 7, 2024

9.0.100-rc.1.24406.4, M1 Pro, osx-arm64 compiled with dotnet publish -p:PublishAot=true

var foo = new Foo();
for (long i = 0; i < 200_000_000; i++) {
    foo.x = foo;
}

class Foo {
    public volatile Foo? x;
}
time ./wbcost
________________________________________________________
Executed in  425.01 millis    fish           external
   usr time  404.48 millis    0.07 millis  404.41 millis
   sys time   18.57 millis    1.02 millis   17.55 millis

@EgorBo (Member) commented Aug 7, 2024

@EgorBot -arm64 -amd -perf -commit 5598791 vs 5598791 --envvars DOTNET_TieredCompilation:0 DOTNET_ReadyToRun:0

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200000000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}

@EgorBot commented Aug 7, 2024

Benchmark results on Amd
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 8 logical and 4 physical cores
  Job-LUJGBA : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-XLQIIV : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_TieredCompilation=0,DOTNET_ReadyToRun=0
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| WB | Main | 370.8 ms | 0.04 ms | 1.00 |
| WB | PR | 370.9 ms | 0.07 ms | 1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot commented Aug 7, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-HCAGWK : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-RPUMUX : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_TieredCompilation=0,DOTNET_ReadyToRun=0
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| WB | Main | 467.4 ms | 0.07 ms | 1.00 |
| WB | PR | 467.4 ms | 0.05 ms | 1.00 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBo (Member) commented Aug 7, 2024

I cannot reproduce your numbers, I suspect you might be measuring OSR pace difference (consider running with DOTNET_TieredCompilation=0).

Although, arm64 is still slower due to:

  • I see a jump-stub in the traces; it's rarely an issue on x64
  • Since the VM has to patch several pointer-sized constants - we have to introduce memory loads in arm64 WB, while on x64 we can patch movabs r10, 0xF0F0F0F0F0F0F0F0 directly (they have to be aligned), etc. Looks like Arm64's WB performs 5 memory loads (wbs_sw_ww_table, wbs_ephemeral_low, wbs_ephemeral_high, wbs_card_table + card table value load) while x64 has just one. Annotated asm: arm64 vs x64
  • x64 seems to have additional implementations of the WB for pre/post ephemeral heap growth, to eliminate some checks dynamically (although it doesn't look like that saves a lot)
  • minor: the WB implements the ByrefWB contract, so it ends with a redundant add x14, x14, #0x8

Also, we might want to have a more complicated benchmark where objects aren't ephemeral as well?

@EgorBo (Member) commented Aug 8, 2024

@jkotas @cshung If you're not busy - do you have any idea why "is card table already updated" check is so expensive on arm64? 🙂

can it be some false sharing etc?

Another thing I noticed is that the arm64 WB is so expensive that we can add yet another branch ("is the object reference null? Exit") and the regression will be <1% (while giving us a 2x improvement when we actually write null)

@cshung (Member) commented Aug 8, 2024

> Also, we might want to have a more complicated benchmark where objects aren't ephemeral as well?

Yes, we should totally understand the performance of the write barrier function under other execution paths - for example, on a cache miss, or when branching away because of heap ranges, generations, and so on. The initial benchmark was designed to be easy to understand: I wanted to make sure we always hit the cache and read exactly the same location, so that we don't run into any cache effects. As we can see, even in this trivial scenario the data is showing surprising results; making it more varied will only make it harder to interpret.

> can it be some false sharing etc?

I doubt it is false sharing. Since we aren't allocating, the GC should not be running, and no other thread should be accessing the card table, so the core should have exclusive access to the cache entry.

Besides the obvious fact that this "slow load" uses a different instruction, it is also loading from a computed address. Does the ARM architecture do anything special with respect to loading from a hard-coded address? I don't know.

I wonder if tools like this can give us more insight into what is going on:
https://learn.arm.com/learning-paths/servers-and-cloud-computing/top-down-n1/analysis-1/

@jkotas (Member) commented Aug 9, 2024

My bet would be sampling bias or some micro-architecture issue. I think it would be best to ask Arm hw engineers to replicate this on a simulator and tell us what's actually going on.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Aug 14, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Aug 14, 2024
@a74nh (Contributor) commented Nov 14, 2024

> • Since the VM has to patch several pointer-sized constants - we have to introduce memory loads in arm64 WB, while on x64 we can patch movabs r10, 0xF0F0F0F0F0F0F0F0 directly (they have to be aligned), etc. Looks like Arm64's WB performs 5 memory loads (wbs_sw_ww_table, wbs_ephemeral_low, wbs_ephemeral_high, wbs_card_table + card table value load) while x64 has just one. Annotated asm: arm64 vs x64

On Arm64 this could be done by having 4 MOVK/MOVN instructions to load 0xF0F0F0F0F0F0F0F0 and then patching those up. Has that been considered? It should be cheaper than doing the load.


Probably minor, but I also spotted that this bit of code does if (val != 0xff) val = 0xff:

#ifdef FEATURE_MANUALLY_MANAGED_CARD_BUNDLES
    // Check if we need to update the card bundle table
    ldr  x12, LOCAL_LABEL(wbs_card_bundle_table)
    add  x15, x12, x14, lsr #21
    ldrb w12, [x15]
    cmp  x12, 0xFF
    beq  LOCAL_LABEL(Exit)

    // Update the card bundle
    mov  x12, 0xFF
    strb w12, [x15]
#endif

LOCAL_LABEL(Exit):

...which can be simplified to val=0xff

#ifdef FEATURE_MANUALLY_MANAGED_CARD_BUNDLES
    // Update the card bundle table
    ldr  x12, LOCAL_LABEL(wbs_card_bundle_table)
    add  x15, x12, x14, lsr #21
    mov  x12, 0xFF
    strb w12, [x15]
#endif

LOCAL_LABEL(Exit):

@EgorBo (Member) commented Nov 14, 2024

> On Arm64 this could be done by having 4 MOVK/MOVN instructions to load 0xF0F0F0F0F0F0F0F0 and then patching those up.

I am not sure whether they need to be patched atomically, presumably not since the constants are updated during GC stop, but it needs to be checked. Also, the data it loads is located right next to the function, so it is supposed to not be terribly slow?

> ...which can be simplified to val=0xff

I've tried that and it either didn't improve anything or even regressed; I don't remember exactly.

@jkotas (Member) commented Nov 14, 2024

> I am not sure whether they need to be patched atomically, presumably not since the constants are updated during GC stop,

The constants are not always updated during GC stop. Notice that the x64 implementation of the write barrier has padding nops to make the constants aligned so that they can be patched atomically:

NOP_2_BYTE ; padding for alignment of constant

@jkotas (Member) commented Nov 14, 2024

> Probably minor but also spotted is this bit of code is doing if (val != 0xff) val=0xff
> ...which can be simplified to val=0xff

This is an intentional optimization to avoid cache contention. Simplifying it to val=0xff would likely show up as a regression.

@a74nh (Contributor) commented Nov 19, 2024

I recreated the two tests from the top comment using BenchmarkDotNet. The C# and Arm64 asm code is here: https://gist.github.com/a74nh/c8e06132b2d7c33a373a88f567ef8ef8

I ran it on a bunch of machines, all with Ubuntu 22.04.

x86 Gold 5120T

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 170.5 ms | 0.37 ms | 0.35 ms |
| DoBarrier | 312.6 ms | 1.56 ms | 1.38 ms |

x86 Platinum 8480

| Method | Mean | Error | StdDev | Median |
|---|---|---|---|---|
| NoBarrier | 26.41 ms | 0.138 ms | 0.115 ms | 26.37 ms |
| DoBarrier | 280.65 ms | 5.606 ms | 11.576 ms | 286.65 ms |

Altra (Neoverse N1)

| Method | Mean | Error | StdDev | Median |
|---|---|---|---|---|
| NoBarrier | 171.7 ms | 3.42 ms | 5.90 ms | 168.2 ms |
| DoBarrier | 367.2 ms | 0.02 ms | 0.01 ms | 367.2 ms |

Altra Max

| Method | Mean | Error | StdDev | Median |
|---|---|---|---|---|
| NoBarrier | 172.8 ms | 3.40 ms | 6.21 ms | 168.2 ms |
| DoBarrier | 233.7 ms | 0.13 ms | 0.12 ms | 233.7 ms |

Cobalt 100 (Neoverse N2)

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 177.5 ms | 0.01 ms | 0.01 ms |
| DoBarrier | 147.3 ms | 0.01 ms | 0.01 ms |

Graviton 3 c7g.2xlarge (Neoverse V1)

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 192.9 ms | 0.02 ms | 0.02 ms |
| DoBarrier | 154.4 ms | 0.08 ms | 0.07 ms |

Graviton 3 c7g.16xlarge (Neoverse V1)

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 192.6 ms | 0.01 ms | 0.01 ms |
| DoBarrier | 116.4 ms | 1.22 ms | 1.14 ms |

Grace (Neoverse V2)

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 181.8 ms | 0.04 ms | 0.03 ms |
| DoBarrier | 121.2 ms | 0.08 ms | 0.07 ms |

Curiously, the NoBarrier case isn't optimal. It could use indexed addressing for the load, and the bounds check could be hoisted (as it always uses the same constants).

Running perf on some of these (see the gist), for JIT_WriteBarrier() I see the samples spread out across each of the loads.

@EgorBo (Member) commented Nov 19, 2024

> Curiously the NoBarrier case isn't optimal. It could use index addressing for the load and the bounds check could be hoisted (as it's always the same constants).

You can move the assignments to separate no-inline methods to make sure the rest of the codegen around the loops is the same.

To be fair, I also don't think there is a problem here. My benchmarks also don't show terrible differences between arm64 and comparable x64. I observe high contention on card-table loads in OrchardCMS (a high level of concurrency), but it doesn't look to be a bottleneck.

@a74nh (Contributor) commented Nov 21, 2024

Added Cobalt 100 figures to the set of results.

@EgorBo (Member) commented Nov 21, 2024

@EgorBot -linux_azure_cobalt100 -linux_azure_milano -profiler -commit main

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public class Bench
{
    static object Src = new ();
    static object Dst;

    static Bench() => GC.Collect();

    [Benchmark]
    public void WB_Gen0()
    {
        var src = new object();
        for (int i = 0; i < 1_000_000; i++)
            Dst = src;
    }

    [Benchmark]
    public void WB_Gen2()
    {
        var obj = Src;
        for (int i = 0; i < 1_000_000; i++)
            Dst = obj;
    }
}

@kunalspathak (Member)

On a Windows 16-core Cobalt machine, here are the results, which match the Linux numbers:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 177.4 ms | 0.27 ms | 0.24 ms |
| DoBarrier | 118.0 ms | 0.30 ms | 0.23 ms |

@kunalspathak (Member)

@a74nh - After looking at the disassembly, the NoBarrier case was slower because of the bounds checks added before accessing the array element.

To compare against the DoBarrier() version, this should be the right benchmark, and with it I can clearly see the slowness in DoBarrier:

internal class DataClass
{
    public DataClass x;
    public int y;
}

[Benchmark]
public void NoBarrier_v2() => TestNoBarrier_v2(data, 5);

[MethodImpl(MethodImplOptions.NoInlining)]
static unsafe DataClass TestNoBarrier_v2(DataClass d, int _y)
{
    for (long i = 0; i < iter; i++)
    {
        d.y = _y;
    }
    return d;
}

| Method | Mean | Error | StdDev |
|---|---|---|---|
| NoBarrier | 177.32 ms | 0.195 ms | 0.152 ms |
| NoBarrier_v2 | 29.49 ms | 0.003 ms | 0.002 ms |
| DoBarrier | 117.97 ms | 0.023 ms | 0.019 ms |
