Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

SaltyChiang · 2024-01-08T14:53:50Z

Use QudaGaugeSmearParam::dir_ignore to tell QUDA not to perform smearing in this direction.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

We might should set different default values of dir_ignore for different smearing types to make the interface compatible with before.

Fix bugs in 3D HYP smearing.

maddyscientist · 2024-01-10T00:54:44Z

@cpviolator any chance you can review this?

maddyscientist · 2024-01-10T00:55:04Z

@Jenkins test this please

cpviolator · 2024-01-10T00:57:33Z

@maddyscientist Sure thing!

This solves lattice#1425.

to get the default tol for SU(3) projection.

cpviolator · 2024-01-12T00:47:15Z

@SaltyChiang

This is excellent work! I spent some time revising the HYP algorithm and your implementation here in QUDA is very much in line with QUDA coding style. I very much like the implementation of arbitrary orthogonal dimension for smearing.

I tried to find ways to reduce code repetition by using the existing staple computation, but the differences between APE style and Stout style means there's very little overlap in the code. The only code change I would like to request for the moment is that you place the templates with name structure computeStaple(3D)Level(N) in `gauge_utils.cuh. In the future we will likely implement analytic smearing for HMC or other smearing algorithms that could make use of the general HYP templates.

Before merging with develop I'd like to do a performance test of the kernels to see if we're not missing any obvious speed-up we can take advantage of with minimal code changes. I'll post here with the results.

cpviolator · 2024-01-12T03:17:00Z

OK, using the command:

for SMEAR in ape stout ovrimp-stout hyp wilson symanzik ; do

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 5 > log_${SMEAR}_perf.log                                                            
    wait                                                                                                                                                                                                           

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 500 > log_${SMEAR}_WC.log
    wait
done

I collected the following data:

SMEAR TYPE	Kernel Perf (Gflop/s)	Bandwidth (GB/s)	Compute Wall Clock (secs)
APE	1505.78	1564.45	4.729
Stout	1860.01	1932.48	4.402
OvrImp Stout	2956.79	1850.34	13.422
Wilson Flow Step 1	1421.67	1590.68	-
Wilson Flow Step 2	1271.89	1565.40	-
Wilson Flow Step 3	1573.99	1849.16	-
Wilson Flow time	-	-	13.137
Symanzik Flow Step 1	2625.85	1662.57	-
Symanzik Flow Step 2	2569.70	1670.99	-
Symanzik Flow Step 3	2720.21	1745.59	-
Symanzik Flow time	-	-	42.275
HYP Step 1	838.20	745.07	-
HYP Step 2	702.43	681.14	-
HYP Step 3	933.52	969.89	-
HYP time	-	-	33.840

So I think there may be some improvements for the future, but certainly nothing holding up a PR. It may require a slightly different approach to kernel launches and memory transfer, or something as simple as revising the flop/byte count in statistics.

Either way, this is a great addition to the QUDA library!

SaltyChiang · 2024-01-12T04:14:36Z

@cpviolator Thank you for the testing. I'll check the performance issue.

SaltyChiang · 2024-01-12T07:05:21Z

The performance regression comes from (using HYP<Arg::level=1> version as the example):

1. cnt in computeStapleLevel(N) cannot be unrolled by the compiler. Using staple[4] instead of staple[3] as the input argument (then we can use staple[nu] instead of staple[cnt]) can significantly improve the performance of the computeStapleLevel1 part.
1. We have to store 3 staples per dimension for HYP smearing, which makes the output part not very well optimized.

If I use staple[4] to get staples and then just save one staple (not real HYP, but for testing) in the end, I see the performance of HYP<Arg::level=1> ~2x that of stout. ~~Maybe we could use 4 vector gauges instead of 2 tensor gauges to save these staples? (though we would waste 1/4 of the allocated memory)~~

Please let me know if you have any ideas!

cpviolator · 2024-01-13T02:33:04Z

I had the same I idea and I've flattened out the loops as best as I can, but now I'm chasing a numerical error.

@SaltyChiang may I ask that you join us on slack? @maddyscientist can arrange the invite if so.

maddyscientist · 2024-01-16T19:15:13Z

@SaltyChiang can you share the Chroma code that you are using to call this? It would be great to have an example of how to call the smearing routines in QUDA with Chroma. Thanks.

SaltyChiang · 2024-01-17T04:35:04Z

@maddyscientist Oh, I don't use Chroma to call the smearing routines in QUDA.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

The checking follows the steps:

Generating hyp_smear.lime by Chroma inline measurement LINK_SMEAR with <LinkSmearingType>HYP_SMEAR</LinkSmearingType>
Call smear routine in my Python wrapper of QUDA
Read hyp_smear.lime and make diff of the two smeared field
The norm of the diff is about 1e-7, which should be caused by different SU(3) projection algorithm

weinbe2 · 2024-01-31T23:38:48Z

@SaltyChiang there are a few convention changes I want to put into your PR, plus reconcile the merge conflicts and perform a clang-format. I think I see how to address some of the perf issues as well. I'll create a local copy of your branch, put this in, and then you can merge it into your branch and check it---if all's good, we can then merge it into develop. Does that sound like a good plan to you?

SaltyChiang · 2024-02-01T03:29:49Z

@weinbe2 Sure, this sounds good to me. Glad to hear some progress on the performance issue.

…/gauge-smear

…e/gauge-smear

…me...

…gister count getting maxed out

weinbe2 · 2024-02-01T20:53:07Z

Using the same command as Dean in my branch---on a different GPU (A100-80GB-PCIe), I got:

SMEAR TYPE	Kernel Perf (Gflop/s)	Bandwidth (GB/s)	Compute Wall Clock (secs)
APE	2351.23	2442.84	2.162
Stout	3401.47	3534.00	1.700
OvrImp Stout	5344.44	3344.51	6.556
Wilson Flow Step 1	3349.50	3747.69	-
Wilson Flow Step 2	3172.13	3915.25	-
Wilson Flow Step 3	3290.95	3872.04	-
Wilson Flow time	-	-	5.559
Symanzik Flow Step 1	5285.13	3346.31	-
Symanzik Flow Step 2	5156.00	3355.53	-
Symanzik Flow Step 3	5184.04	3328.04	-
Symanzik Flow time	-	-	20.648
HYP Step 1	1744.10	1550.31	-
HYP Step 2	1580.78	1532.87	-
HYP Step 3	2205.11	2291.02	-
HYP time	-	-	13.763

HYP time is now ~66% of Symanzik time as opposed to ~75%, but it's hard to say if that's due to optimization or because A100 has a larger cache... in any case, it's relatively faster.

weinbe2 · 2024-02-01T22:02:38Z

@SaltyChiang I'm done with my updates---can you please merge my branch (lattice:feature/gauge-smear) into your local branch and give it a fresh comparison against Chroma? As far as my tests have gone (both single GPU and partitioned) I've preserved correctness, but an independent check is always good.

Assuming it passes your Chroma comparison and everything else looks good, I imagine we can get this merged in.

SaltyChiang · 2024-02-05T15:02:20Z

@weinbe2 I checked the performance again before/after merging your additions on Tesla P100.

SMEAR TYPE	Kernel Perf (Gflop/s)	Bandwidth (GB/s)	Compute Wall Clock (secs)
APE	973.97	1011.91	7.85193
Stout	1343.91	1396.27	7.08628
Ovrimp-stout	1717.53	1074.82	23.5821
Wilson flow 1	1206.77	1350.23	-
Wilson flow 2	1084.00	1337.95	-
Wilson flow 3	1194.42	1405.32	-
Wilson flow time	-	-	15.9663
Symanzik flow 1	1601.43	1013.95	-
Symanzik flow 2	1582.12	1029.65	-
Symanzik flow 3	1619.81	1039.88	-
Symanzik flow time	-	-	68.0921
HYP 1 before	609.19	541.50	-
HYP 2 before	522.48	506.65	-
HYP 3 before	721.51	749.62	-
HYP time before	-	-	47.2621
HYP 1 after	703.21	625.07	-
HYP 2 after	559.32	542.37	-
HYP 3 after	750.97	780.22	-
HYP time after	-	-	38.4669

And yes, the new code returns the same result as Chroma. Again, they differ by 1e-6~1e-7, which comes from the different SU(3) projection algorithms. I plan to implement a similar projection as Chroma in quda just for checking, do you think this is acceptable to push such a thing upstream?

@cpviolator Do you have any further ideas about the performance issue?

weinbe2 · 2024-02-05T16:42:53Z

Thanks @SaltyChiang , I'm glad it passes the correctness check on your end. I'd like to get this merged in sooner as opposed to later, further optimizations can go into a future PR.

As for the Chroma algorithm---I like the idea of checking it locally, and if you see a deviation file after that please file an issue and we can go from there.

weinbe2

Approved (pending Jenkins), we'll still need a second review before a merge which @maddyscientist promised she'd get to soon.

maddyscientist · 2024-02-05T16:45:49Z

@Jenkins ok to test

@SaltyChiang I expect once we fix #1436, this will give a boost to double-precision performance of all the smearing routines that use SU(3) projection. We'll handle this in a follow up PR though.

maddyscientist

Approving, subject to CI completing successfully.

mathiaswagner

approve cmake changes

…/gauge-smear

weinbe2 · 2024-02-05T17:36:58Z

@SaltyChiang I had to put in a few quick updates to address the CI issues---can you please merge my branch back into yours again?

lib/gauge_hyp.cu

cpviolator · 2024-02-06T05:59:26Z

@SaltyChiang @weinbe2 I think these improvements are great and an excellent contribution! As we can all see from the algorithm, there are some unavoidable aspects when it comes to memory that prohibit HYP from performing as well as STOUT based algorithms or vanilla APE.

They might not happy with implicit conversion to `int2`.

SaltyChiang · 2024-02-06T08:11:59Z

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

weinbe2 · 2024-02-06T15:34:42Z

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

It might be best to preserve that behavior, yes. Thoughts @cpviolator ? You're the real-world smearing workflow expert here

cpviolator · 2024-02-06T15:39:26Z

@weinbe2 yeah, I think that most users of non analytical smearing in QUDA are using it for spectroscopic measurement so defaulting to excluding the temporal dimension would be best in my opinion. A wiki addition on this behaviour would be good.

weinbe2 · 2024-02-06T18:25:19Z

Thanks @cpviolator --- @SaltyChiang , you should change the default to 3 per that recommendation. Creating a wiki page can be done orthogonal to this PR.

3D for APE/STOUT and 4D for OVRIMP_STOUT/HYP.

SaltyChiang · 2024-02-06T18:54:07Z

@weinbe2 @cpviolator Change the default value of dir_ignore to -1, and any negative value will apply APE and STOUT smearing in spatial directions only. The corresponding comments are also added.

SaltyChiang added 3 commits January 8, 2024 01:16

Add dir_ignore to QudaGaugeSmearParam.

bc88999

Add HYP smearing.

4aaa49a

Update flops() and bytes() for HYP smearing.

2089093

Fix bugs in 3D HYP smearing.

SaltyChiang requested review from a team as code owners January 8, 2024 14:53

SaltyChiang added 4 commits January 10, 2024 12:55

Make compiler happy with STRICT build.

4d073f7

Set default tolerance of SU3 projection 10x epsilon.

0fc4cb4

This solves lattice#1425.

clang-format

e240272

Use GaugeField::toleranceSU3()

2839225

to get the default tol for SU(3) projection.

weinbe2 added 9 commits January 31, 2024 19:41

Merge remote-tracking branch 'clqcd/feature/gauge-smear' into feature…

c0b920a

…/gauge-smear

Moved allocating temporaries into the HYPSmear routine

1133b3f

Merge branch 'develop' of https://github.com/lattice/quda into featur…

e0d97b4

…e/gauge-smear

Code cleanup, introducing extended field coordinate E

9545d28

Partial progress towards HYP cleanup

6685b14

Cleanup/bugfixes

2640ae9

Cleaned up/corrected staple abstraction. Still generating a stack fra…

4f2a625

…me...

Cleanup before getting rid of templating over staples

0fd5776

Removed as much stack frame as possible, remaining frame is due to re…

5f15d84

…gister count getting maxed out

weinbe2 added 2 commits February 1, 2024 12:53

Updated timers

a3fae46

clang-format

d4a24e5

weinbe2 approved these changes Feb 5, 2024

View reviewed changes

maddyscientist approved these changes Feb 5, 2024

View reviewed changes

mathiaswagner approved these changes Feb 5, 2024

View reviewed changes

weinbe2 added 3 commits February 5, 2024 09:21

Clang error: unused "using"

6721776

Removed superfluous getProfile calls

a2326f8

Merge remote-tracking branch 'clqcd/feature/gauge-smear' into feature…

124bab9

…/gauge-smear

maddyscientist reviewed Feb 6, 2024

View reviewed changes

lib/gauge_hyp.cu Outdated Show resolved Hide resolved

SaltyChiang added 3 commits February 6, 2024 14:45

Remove unused param to make compiler happy.

0892e26

clang-14 + nvcc-11.8 think variables like sigma_with_mu unused.

1688931

They might not happy with implicit conversion to `int2`.

Update comments.

74f9a23

Use dir_ignore < 0 to represent the default situation.

2b5a7c2

3D for APE/STOUT and 4D for OVRIMP_STOUT/HYP.

weinbe2 merged commit d090912 into lattice:develop Feb 6, 2024
11 checks passed

SaltyChiang deleted the feature/gauge-smear branch February 7, 2024 05:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

SaltyChiang commented Jan 8, 2024

maddyscientist commented Jan 10, 2024

maddyscientist commented Jan 10, 2024

cpviolator commented Jan 10, 2024

cpviolator commented Jan 12, 2024

cpviolator commented Jan 12, 2024

SaltyChiang commented Jan 12, 2024

SaltyChiang commented Jan 12, 2024 •

edited

Loading

cpviolator commented Jan 13, 2024

maddyscientist commented Jan 16, 2024

SaltyChiang commented Jan 17, 2024

weinbe2 commented Jan 31, 2024

SaltyChiang commented Feb 1, 2024

weinbe2 commented Feb 1, 2024

weinbe2 commented Feb 1, 2024

SaltyChiang commented Feb 5, 2024 •

edited

Loading

weinbe2 commented Feb 5, 2024

weinbe2 left a comment

maddyscientist commented Feb 5, 2024

maddyscientist left a comment

mathiaswagner left a comment

weinbe2 commented Feb 5, 2024

cpviolator commented Feb 6, 2024

SaltyChiang commented Feb 6, 2024 •

edited

Loading

weinbe2 commented Feb 6, 2024

cpviolator commented Feb 6, 2024

weinbe2 commented Feb 6, 2024

SaltyChiang commented Feb 6, 2024

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

Conversation

SaltyChiang commented Jan 8, 2024

maddyscientist commented Jan 10, 2024

maddyscientist commented Jan 10, 2024

cpviolator commented Jan 10, 2024

cpviolator commented Jan 12, 2024

cpviolator commented Jan 12, 2024

SaltyChiang commented Jan 12, 2024

SaltyChiang commented Jan 12, 2024 • edited Loading

cpviolator commented Jan 13, 2024

maddyscientist commented Jan 16, 2024

SaltyChiang commented Jan 17, 2024

weinbe2 commented Jan 31, 2024

SaltyChiang commented Feb 1, 2024

weinbe2 commented Feb 1, 2024

weinbe2 commented Feb 1, 2024

SaltyChiang commented Feb 5, 2024 • edited Loading

weinbe2 commented Feb 5, 2024

weinbe2 left a comment

Choose a reason for hiding this comment

maddyscientist commented Feb 5, 2024

maddyscientist left a comment

Choose a reason for hiding this comment

mathiaswagner left a comment

Choose a reason for hiding this comment

weinbe2 commented Feb 5, 2024

cpviolator commented Feb 6, 2024

SaltyChiang commented Feb 6, 2024 • edited Loading

weinbe2 commented Feb 6, 2024

cpviolator commented Feb 6, 2024

weinbe2 commented Feb 6, 2024

SaltyChiang commented Feb 6, 2024

SaltyChiang commented Jan 12, 2024 •

edited

Loading

SaltyChiang commented Feb 5, 2024 •

edited

Loading

SaltyChiang commented Feb 6, 2024 •

edited

Loading