Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small tuning for write-barrier on arm64 #106191

Merged
merged 5 commits into from
Aug 18, 2024
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Aug 9, 2024

A small contribution towards #106051

A small clean up in debug-only code (we don't need to push/pop r13)
Optimize "is gen0" check

@EgorBo
Copy link
Member Author

EgorBo commented Aug 9, 2024

@EgorBot -arm64 -perf

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    [Benchmark]
    public void WB()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200000000; i++)
            foo.x = foo;
    }
}

internal class Foo
{
    public volatile Foo x;
}

@dotnet dotnet deleted a comment from EgorBot Aug 9, 2024
@EgorBot
Copy link

EgorBot commented Aug 9, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-XNVLRI : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-EYDRAF : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
WB Main 468.1 ms 0.46 ms 1.00
WB PR 403.6 ms 0.61 ms 0.86

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@mangod9
Copy link
Member

mangod9 commented Aug 9, 2024

A small clean up in debug-only code (we don't need to push/pop r13)

doesnt look like this change is debug builds only?

@EgorBo
Copy link
Member Author

EgorBo commented Aug 9, 2024

A small clean up in debug-only code (we don't need to push/pop r13)

doesnt look like this change is debug builds only?

Ah, it seems to be never enabled currently, judging by this

@EgorBo

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

@EgorBo

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

@EgorBo
Copy link
Member Author

EgorBo commented Aug 9, 2024

@EgorBot -arm64 -intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<Bench>(args: args);

public class Bench
{
    [Benchmark]
    public void WB_gen0()
    {
        Foo foo = new Foo();
        for (long i = 0; i < 200000000; i++)
            foo.x = foo;
    }

    static readonly Foo Gen2Obj = GetGen2Obj();
    static Foo GetGen2Obj()
    {
        var obj = new Foo();
        GC.Collect(2);
        GC.Collect(2);
        if (GC.GetGeneration(obj) != 2)
            throw new Exception();
        return obj;
    }

    [Benchmark]
    public void WB_gen2()
    {
        var gen2obj = Gen2Obj;
        for (long i = 0; i < 200000000; i++)
            gen2obj.x = gen2obj;
    }

    [Benchmark]
    public void CheckedWB_NotInHeap()
    {
        MyStruct ms = new MyStruct();
        string str = new string('x', 1);
        for (long i = 0; i < 200000000; i++)
            CheckedWriteBarrier(ref ms, str);
    }

    // struct in heap
    MyStruct MS = new();

    [Benchmark]
    public void CheckedWB_InHeap()
    {
        string str = new string('x', 1);
        for (long i = 0; i < 200000000; i++)
            CheckedWriteBarrier(ref MS, str);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    void CheckedWriteBarrier(ref MyStruct ms, string str) => ms.Str = str;
}

public record struct MyStruct(string Str);

internal class Foo
{
    public volatile Foo x;
}

@EgorBot

This comment was marked as resolved.

@EgorBot
Copy link

EgorBot commented Aug 9, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-NWBPJG : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-VPCNIX : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
WB_gen0 Main 468.2 ms 0.30 ms 1.00
WB_gen0 PR 402.7 ms 0.65 ms 0.86
WB_gen2 Main 468.3 ms 0.46 ms 1.00
WB_gen2 PR 402.1 ms 0.69 ms 0.86
CheckedWB_NotInHeap Main 600.9 ms 0.05 ms 1.00
CheckedWB_NotInHeap PR 601.3 ms 0.44 ms 1.00
CheckedWB_InHeap Main 801.2 ms 0.05 ms 1.00
CheckedWB_InHeap PR 801.7 ms 0.67 ms 1.00

BDN_Artifacts.zip

@EgorBo
Copy link
Member Author

EgorBo commented Aug 10, 2024

@cshung @jkotas when you have a moment, please review - I am seeing stable 14% improvement in the benchmarks. I did some study with topdown-tool and perf stat, from my view, branches are the most expensive thing on arm64. In my PR I am removing the cbz x12, LOCAL_LABEL(SkipEphemeralCheck) which is responsible for the wins (NAOT doesn't have this check). My understanding that x64 is able to dynamically change the write barrier to simplify the "is object ephemeral" check when GC knows that there is no need in upper bound check, etc - so we might want to repeat that logic for arm64 as well in future.

Untidddtled

Was: https://gist.github.com/EgorBot/cbbd053c7f0362cd6f315a9ee6089f39
Now: https://gist.github.com/EgorBot/2121eba10f5728cb30a5249b9fbe14e7

PS: it is unfortunate that the write barrier is accessed via jump-stubs (you can even see it in the link above ^ and notice that the address matches the address of the write barrier)

blo 0f

PREPARE_EXTERNAL_VAR_INDIRECT g_ephemeral_high, x12
PREPARE_EXTERNAL_VAR_INDIRECT g_ephemeral_low, x12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw you have used ldp at one point. Is it not profitable here?

Copy link
Member Author

@EgorBo EgorBo Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw no visible benefits from it, while it's possible to use ldp for NativeAOT easily, it's a bit tricky on CoreCLR with its:

    ldr  x12, LOCAL_LABEL(wbs_ephemeral_low)
    ldr  x17, LOCAL_LABEL(wbs_ephemeral_high)

I wasn't sure how to properly use ldp with LOCAL_LABEL 🤔
I tried:

    ldp  x12, x17, LOCAL_LABEL(wbs_ephemeral_low) -- not compiling
    ldp  x12, x17, [sp, LOCAL_LABEL(wbs_ephemeral_low)] -- not compiling

I guess ldp expects some real sp delta rather than label imm. I guess I can calculate & hard-code it, but I simply didn't see any improvements in the benchmarks from that so I gave up

From my understanding, the difference between CoreCLR and NativeAOT is that NativeAOT accesses GC variables directly, while on CoreCLR we have a mechanism that dumps them all to a "patchable pool" next to the Write-barrier function for better cache access so it can access them by label in code

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@mangod9
Copy link
Member

mangod9 commented Aug 11, 2024

just being a little risk averse, wonder whether we should check if this has any measurable improvement in real scenarios? Else we could differ till the 9 fork?

@EgorBo
Copy link
Member Author

EgorBo commented Aug 11, 2024

just being a little risk averse, wonder whether we should check if this has any measurable improvement in real scenarios? Else we could differ till the 9 fork?

Yeah I already put the 10.0.0 milestone to it

@EgorBo
Copy link
Member Author

EgorBo commented Aug 11, 2024

PS: it is unfortunate that the write barrier is accessed via jump-stubs (you can even see it in the link above ^ and notice that the address matches the address of the write barrier)

I presume we should put the write barrier code to the loader heapr

@cshung cshung changed the title Small tunning for write-barrier on arm64 Small tuning for write-barrier on arm64 Aug 14, 2024
@jkotas jkotas merged commit 6b7bfb9 into dotnet:main Aug 18, 2024
90 checks passed
@amanasifkhalid
Copy link
Member

cc @ebepho

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants