Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zoltan/Muelu/Stokhos +cuda w/ CUDA 11.8, GCC 11.2.0: MueLu_Zoltan2Interface_MP_Vector_16_Cuda.cpp: agent_radix_sort_onesweep.cuh(221): error: expected an identifier #11839

Closed
eugeneswalker opened this issue May 1, 2023 · 12 comments
Labels
pkg: MueLu pkg: Stokhos pkg: Zoltan2 type: bug The primary issue is a bug in Trilinos code or tests

Comments

@eugeneswalker
Copy link

Bug Report

@egboman @jhux2 @etphipp

Description

I'm using GCC 11.2.0 and CUDA 11.8.0 and having trouble building:

trilinos@develop +cuda cuda_arch=80 +amesos +amesos2 +anasazi +aztec +belos +boost \
     +epetra +epetraext +ifpack +ifpack2 +intrepid +intrepid2 +isorropia +kokkos +ml \
     +minitensor +muelu +nox +piro +phalanx +rol +rythmos +sacado +stk +shards +shylu \
     +stokhos +stratimikos +teko +tempus +tpetra +trilinoscouplings +zoltan +zoltan2 \
     +superlu-dist gotype=long_long

I'm seeing this error:

/opt/trilinos-ci-deps/view/bin/../targets/x86_64-linux/include/cub/device/dispatch/../../agent/agent_radix_sort_onesweep.cuh(221): error: expected an identifier

1 error detected in the compilation of "/build2/packages/stokhos/src/MueLu_Zoltan2Interface_MP_Vector_16_Cuda.cpp".

I created a Docker container image that has all of the dependencies for the Trilinos build I'm targeting:

  • esw123/trilinos-ci:2023-05-01

My cmake script is here: cmake.sh.txt

Steps to Reproduce

$> docker run -it esw123/trilinos-ci:2023-05-01
root@413e3ff113b7:/# git clone --depth=1 https://github.com/trilinos/Trilinos.git
root@413e3ff113b7:/# (cd Trilinos && git checkout d82ed05306b491146fc6b2ceed766100dc4a6fbd)
root@413e3ff113b7:/# ./cmake.sh Trilinos # see attached script above for cmake.sh
root@413e3ff113b7:/# ninja -j104
...
/opt/trilinos-ci-deps/view/bin/../targets/x86_64-linux/include/cub/device/dispatch/../../agent/agent_radix_sort_onesweep.cuh(221): error: expected an identifier

1 error detected in the compilation of "/build2/packages/stokhos/src/MueLu_Zoltan2Interface_MP_Vector_16_Cuda.cpp".
@eugeneswalker eugeneswalker added the type: bug The primary issue is a bug in Trilinos code or tests label May 1, 2023
@github-actions
Copy link

github-actions bot commented May 1, 2023

Automatic mention of the @trilinos/muelu team

@github-actions
Copy link

github-actions bot commented May 1, 2023

Automatic mention of the @trilinos/muelu team

@ndellingwood
Copy link
Contributor

ndellingwood commented May 1, 2023

I happened to see this issue while checking on a Trilinos PR, I'm adding some notes from my past experience with this in case helpful for investigating/resolving this issue:

This showed up before e.g. #11630 , this is a copy of my summary:

Here is the referenced cub line: https://github.com/NVIDIA/cub/blob/c1aa59a1a792ec6821f154dd47973f43f1be01be/cub/agent/agent_radix_sort_onesweep.cuh#L225 , which defines a static var EMPTY

I ran into a similar problem with cuda/11.4 builds with Amesos2 KLU2 code due to defining a macro named EMPTY, which I resolved by renaming the macro. As added info (in case useful), I did not see the error with cuda/11.2 since the corresponding version of cub did not contain the static var EMPTY in the cuda/11.2 version; I mention just in case you have encountered similar discrepancies with different cuda versions

I did not encounter any further problems with Trilinos after the Amesos2::KLU2 updates, but I see a couple other spots in Trilinos that define an EMPTY macro (various solvers in common/auxiliarySoftware/SuiteSparse/src), I will put in a PR today to rename those

If the problem in this issue is due to similar causes that I encountered, and updates to Trilinos do not resolve the problem, then similar updates may be needed for SuperLU_Dist as well which defines a macro named EMPTY in a few files, and should be reported to the developers. The EMPTY macro is defined in the following locations: https://github.com/xiaoyeli/superlu_dist/blob/5c134712a0ab22775b1c801acaf42b05965a3215/SRC/util_dist.h#L74 https://github.com/xiaoyeli/superlu_dist/blob/5c134712a0ab22775b1c801acaf42b05965a3215/SRC/colamd.c#L674 https://github.com/xiaoyeli/superlu_dist/blob/5c134712a0ab22775b1c801acaf42b05965a3215/SRC/old_colamd.c#L363

The usage of the EMPTY macro by various components of common/auxiliarySoftware/SuiteSparse/ was resolved in #11635 (simply renaming them). I did not see other EMPTY macros in Trilinos at the time, though I could have missed spot(s), or another culprit.
Also, I'm curious if other TPLs used in the Trilinos build could trigger similar issues e.g. if the headers have named an EMPTY var (the error just didn't occur when building the TPL if the TPL does not use any functionality from CUB)?

Edit: Adding link to the cub file for newer versions of cuda (my copied comment refers to cuda/11.4): https://github.com/NVIDIA/cub/blob/0905d7effcb3395d4157895e1d77bbcb252e55c8/cub/agent/agent_radix_sort_onesweep.cuh#L221

@eugeneswalker
Copy link
Author

Yes, thank you @ndellingwood! If I take superlu-dist@develop, replace all EMPTY with _EMPTY, it builds perfect!

@jhux2
Copy link
Member

jhux2 commented May 1, 2023

@ndellingwood Do you know if an issue was filed with SuperLU_dist?

[EDIT]

I just went looking, and there is an issue open already, xiaoyeli/superlu_dist#127.

@ndellingwood
Copy link
Contributor

Yes, thank you @ndellingwood! If I take superlu-dist@develop, replace all EMPTY with _EMPTY, it builds perfect!

@eugeneswalker thanks for confirming! This is very helpful to triage the underlying cause

I just went looking, and there is an issue open already, xiaoyeli/superlu_dist#127.

@jhux2 thanks for this link!

@jhux2
Copy link
Member

jhux2 commented May 2, 2023

@ndellingwood Sure thing. Btw, this is how the reporter in 127 worked around the problem. Possibly something similar could be done in Trilinos until this is resolved in SuperLU_dist.

This can cause a naming conflict when this include file is included ahead of another one using the same string "EMPTY". I just experienced that conflict with a CUDA header. I am able to avoid this issue by including the superlu header files after the CUDA ones in my code, but it would be better practice if that EMPTY string was never exposed in the public interface (I don't think it needs to)

@ndellingwood
Copy link
Contributor

Possibly something similar could be done in Trilinos until this is resolved in SuperLU_dist.

@jhux2 that would probably have to be done at the TriBITS level, right?

@jhux2
Copy link
Member

jhux2 commented May 2, 2023

@ndellingwood I was reading that as order-of-includes in Trilinos code itself, but I could be wrong.

@xiaoyeli
Copy link

xiaoyeli commented May 3, 2023

I just pushed a commit in master, renamed all EMPTY by SLU_EMPTY in superlu_dist.

@haampie
Copy link

haampie commented Oct 25, 2023

@xiaoyeli could you also prefix the other macros with SLU_? Sounds like this might come up again.

@jhux2
Copy link
Member

jhux2 commented Oct 25, 2023

could you also prefix the other macros with SLU_? Sounds like this might come up again.

@haampie Could you open an issue with github.com/xiaoyeli/superlu_dist if there's not one already? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: MueLu pkg: Stokhos pkg: Zoltan2 type: bug The primary issue is a bug in Trilinos code or tests
Projects
Status: Done
Development

No branches or pull requests

5 participants