Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

Merged
merged 25 commits into from
Feb 6, 2024

Conversation

SaltyChiang
Copy link
Contributor

Use QudaGaugeSmearParam::dir_ignore to tell QUDA not to perform smearing in this direction.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

We might should set different default values of dir_ignore for different smearing types to make the interface compatible with before.

@SaltyChiang SaltyChiang requested review from a team as code owners January 8, 2024 14:53
@maddyscientist
Copy link
Member

@cpviolator any chance you can review this?

@maddyscientist
Copy link
Member

@Jenkins test this please

@cpviolator
Copy link
Member

@maddyscientist Sure thing!

@cpviolator
Copy link
Member

@SaltyChiang

This is excellent work! I spent some time revising the HYP algorithm and your implementation here in QUDA is very much in line with QUDA coding style. I very much like the implementation of arbitrary orthogonal dimension for smearing.

I tried to find ways to reduce code repetition by using the existing staple computation, but the differences between APE style and Stout style means there's very little overlap in the code. The only code change I would like to request for the moment is that you place the templates with name structure computeStaple(3D)Level(N) in `gauge_utils.cuh. In the future we will likely implement analytic smearing for HMC or other smearing algorithms that could make use of the general HYP templates.

Before merging with develop I'd like to do a performance test of the kernels to see if we're not missing any obvious speed-up we can take advantage of with minimal code changes. I'll post here with the results.

@cpviolator
Copy link
Member

OK, using the command:

for SMEAR in ape stout ovrimp-stout hyp wilson symanzik ; do

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 5 > log_${SMEAR}_perf.log                                                            
    wait                                                                                                                                                                                                           

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 500 > log_${SMEAR}_WC.log
    wait
done

I collected the following data:

SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 1505.78 1564.45 4.729
Stout 1860.01 1932.48 4.402
OvrImp Stout 2956.79 1850.34 13.422
Wilson Flow Step 1 1421.67 1590.68 -
Wilson Flow Step 2 1271.89 1565.40 -
Wilson Flow Step 3 1573.99 1849.16 -
Wilson Flow time - - 13.137
Symanzik Flow Step 1 2625.85 1662.57 -
Symanzik Flow Step 2 2569.70 1670.99 -
Symanzik Flow Step 3 2720.21 1745.59 -
Symanzik Flow time - - 42.275
HYP Step 1 838.20 745.07 -
HYP Step 2 702.43 681.14 -
HYP Step 3 933.52 969.89 -
HYP time - - 33.840

So I think there may be some improvements for the future, but certainly nothing holding up a PR. It may require a slightly different approach to kernel launches and memory transfer, or something as simple as revising the flop/byte count in statistics.

Either way, this is a great addition to the QUDA library!

@SaltyChiang
Copy link
Contributor Author

@cpviolator Thank you for the testing. I'll check the performance issue.

@SaltyChiang
Copy link
Contributor Author

SaltyChiang commented Jan 12, 2024

The performance regression comes from (using HYP<Arg::level=1> version as the example):

    1. cnt in computeStapleLevel(N) cannot be unrolled by the compiler. Using staple[4] instead of staple[3] as the input argument (then we can use staple[nu] instead of staple[cnt]) can significantly improve the performance of the computeStapleLevel1 part.
    1. We have to store 3 staples per dimension for HYP smearing, which makes the output part not very well optimized.

If I use staple[4] to get staples and then just save one staple (not real HYP, but for testing) in the end, I see the performance of HYP<Arg::level=1> ~2x that of stout. Maybe we could use 4 vector gauges instead of 2 tensor gauges to save these staples? (though we would waste 1/4 of the allocated memory)

Please let me know if you have any ideas!

@cpviolator
Copy link
Member

I had the same I idea and I've flattened out the loops as best as I can, but now I'm chasing a numerical error.

@SaltyChiang may I ask that you join us on slack? @maddyscientist can arrange the invite if so.

@maddyscientist
Copy link
Member

@SaltyChiang can you share the Chroma code that you are using to call this? It would be great to have an example of how to call the smearing routines in QUDA with Chroma. Thanks.

@SaltyChiang
Copy link
Contributor Author

@maddyscientist Oh, I don't use Chroma to call the smearing routines in QUDA.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

The checking follows the steps:

  • Generating hyp_smear.lime by Chroma inline measurement LINK_SMEAR with <LinkSmearingType>HYP_SMEAR</LinkSmearingType>
  • Call smear routine in my Python wrapper of QUDA
  • Read hyp_smear.lime and make diff of the two smeared field
  • The norm of the diff is about 1e-7, which should be caused by different SU(3) projection algorithm

@weinbe2
Copy link
Contributor

weinbe2 commented Jan 31, 2024

@SaltyChiang there are a few convention changes I want to put into your PR, plus reconcile the merge conflicts and perform a clang-format. I think I see how to address some of the perf issues as well. I'll create a local copy of your branch, put this in, and then you can merge it into your branch and check it---if all's good, we can then merge it into develop. Does that sound like a good plan to you?

@SaltyChiang
Copy link
Contributor Author

@weinbe2 Sure, this sounds good to me. Glad to hear some progress on the performance issue.

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 1, 2024

Using the same command as Dean in my branch---on a different GPU (A100-80GB-PCIe), I got:

SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 2351.23 2442.84 2.162
Stout 3401.47 3534.00 1.700
OvrImp Stout 5344.44 3344.51 6.556
Wilson Flow Step 1 3349.50 3747.69 -
Wilson Flow Step 2 3172.13 3915.25 -
Wilson Flow Step 3 3290.95 3872.04 -
Wilson Flow time - - 5.559
Symanzik Flow Step 1 5285.13 3346.31 -
Symanzik Flow Step 2 5156.00 3355.53 -
Symanzik Flow Step 3 5184.04 3328.04 -
Symanzik Flow time - - 20.648
HYP Step 1 1744.10 1550.31 -
HYP Step 2 1580.78 1532.87 -
HYP Step 3 2205.11 2291.02 -
HYP time - - 13.763

HYP time is now ~66% of Symanzik time as opposed to ~75%, but it's hard to say if that's due to optimization or because A100 has a larger cache... in any case, it's relatively faster.

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 1, 2024

@SaltyChiang I'm done with my updates---can you please merge my branch (lattice:feature/gauge-smear) into your local branch and give it a fresh comparison against Chroma? As far as my tests have gone (both single GPU and partitioned) I've preserved correctness, but an independent check is always good.

Assuming it passes your Chroma comparison and everything else looks good, I imagine we can get this merged in.

@SaltyChiang
Copy link
Contributor Author

SaltyChiang commented Feb 5, 2024

@weinbe2 I checked the performance again before/after merging your additions on Tesla P100.

SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 973.97 1011.91 7.85193
Stout 1343.91 1396.27 7.08628
Ovrimp-stout 1717.53 1074.82 23.5821
Wilson flow 1 1206.77 1350.23 -
Wilson flow 2 1084.00 1337.95 -
Wilson flow 3 1194.42 1405.32 -
Wilson flow time - - 15.9663
Symanzik flow 1 1601.43 1013.95 -
Symanzik flow 2 1582.12 1029.65 -
Symanzik flow 3 1619.81 1039.88 -
Symanzik flow time - - 68.0921
HYP 1 before 609.19 541.50 -
HYP 2 before 522.48 506.65 -
HYP 3 before 721.51 749.62 -
HYP time before - - 47.2621
HYP 1 after 703.21 625.07 -
HYP 2 after 559.32 542.37 -
HYP 3 after 750.97 780.22 -
HYP time after - - 38.4669

And yes, the new code returns the same result as Chroma. Again, they differ by 1e-6~1e-7, which comes from the different SU(3) projection algorithms. I plan to implement a similar projection as Chroma in quda just for checking, do you think this is acceptable to push such a thing upstream?

@cpviolator Do you have any further ideas about the performance issue?

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 5, 2024

Thanks @SaltyChiang , I'm glad it passes the correctness check on your end. I'd like to get this merged in sooner as opposed to later, further optimizations can go into a future PR.

As for the Chroma algorithm---I like the idea of checking it locally, and if you see a deviation file after that please file an issue and we can go from there.

Copy link
Contributor

@weinbe2 weinbe2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved (pending Jenkins), we'll still need a second review before a merge which @maddyscientist promised she'd get to soon.

@maddyscientist
Copy link
Member

@Jenkins ok to test

@SaltyChiang I expect once we fix #1436, this will give a boost to double-precision performance of all the smearing routines that use SU(3) projection. We'll handle this in a follow up PR though.

Copy link
Member

@maddyscientist maddyscientist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, subject to CI completing successfully.

Copy link
Member

@mathiaswagner mathiaswagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approve cmake changes

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 5, 2024

@SaltyChiang I had to put in a few quick updates to address the CI issues---can you please merge my branch back into yours again?

lib/gauge_hyp.cu Outdated Show resolved Hide resolved
@cpviolator
Copy link
Member

@SaltyChiang @weinbe2 I think these improvements are great and an excellent contribution! As we can all see from the algorithm, there are some unavoidable aspects when it comes to memory that prohibit HYP from performing as well as STOUT based algorithms or vanilla APE.

@SaltyChiang
Copy link
Contributor Author

SaltyChiang commented Feb 6, 2024

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 6, 2024

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

It might be best to preserve that behavior, yes. Thoughts @cpviolator ? You're the real-world smearing workflow expert here

@cpviolator
Copy link
Member

@weinbe2 yeah, I think that most users of non analytical smearing in QUDA are using it for spectroscopic measurement so defaulting to excluding the temporal dimension would be best in my opinion. A wiki addition on this behaviour would be good.

@weinbe2
Copy link
Contributor

weinbe2 commented Feb 6, 2024

Thanks @cpviolator --- @SaltyChiang , you should change the default to 3 per that recommendation. Creating a wiki page can be done orthogonal to this PR.

3D for APE/STOUT and 4D for OVRIMP_STOUT/HYP.
@SaltyChiang
Copy link
Contributor Author

@weinbe2 @cpviolator Change the default value of dir_ignore to -1, and any negative value will apply APE and STOUT smearing in spatial directions only. The corresponding comments are also added.

@weinbe2 weinbe2 merged commit d090912 into lattice:develop Feb 6, 2024
11 checks passed
@SaltyChiang SaltyChiang deleted the feature/gauge-smear branch February 7, 2024 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants