Trilinos build: enable CUDA and Tpetra-based packages #120

klevzoff · 2020-07-30T02:53:45Z

klevzoff · 2020-07-31T00:24:32Z

CMakeLists.txt

@@ -659,6 +659,11 @@ if (ENABLE_TRILINOS)
      set( TRILINOS_Fortran_COMPILER ${CMAKE_Fortran_COMPILER} )
    endif()

+    if( ENABLE_CUDA )
+      set( TRILINOS_CXX_FLAGS "${TRILINOS_CXX_FLAGS} -ccbin ${TRILINOS_CXX_COMPILER} -arch=${CUDA_ARCH} --expt-extended-lambda --expt-relaxed-constexpr" )
+      set( TRILINOS_CXX_COMPILER ${CMAKE_CURRENT_BINARY_DIR}/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper )


Tpetra/Kokkos insist on using their nvcc wrapper (it's a script that does some command line argument preprocessing like wrapping flags with -Xcompiler, etc.). I had no luck trying to directly use nvcc as compiler.
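To illustrate the kind of argument preprocessing mentioned above, here is a toy sketch in Python. It is not the actual `nvcc_wrapper` logic (the real script handles many more cases); the function name `wrap_for_nvcc` and the lists of recognised flags are assumptions made up for this example. The only nvcc behaviour relied on is that `-Xcompiler` accepts a comma-separated list of flags to forward to the host compiler.

```python
def wrap_for_nvcc(args):
    """Toy model: keep flags nvcc understands, forward the rest via -Xcompiler."""
    takes_value = {"-ccbin", "-o"}          # flags followed by a separate value
    exact = {"-c"}                          # flags passed through unchanged
    prefixes = ("-arch=", "--expt-", "-I", "-D")
    nvcc_args, host_flags = [], []
    it = iter(args)
    for arg in it:
        if arg in takes_value:
            nvcc_args += [arg, next(it)]
        elif arg in exact or arg.startswith(prefixes) or not arg.startswith("-"):
            nvcc_args.append(arg)
        else:
            # Anything else is assumed to be a host-compiler flag.
            host_flags.append(arg)
    if host_flags:
        nvcc_args += ["-Xcompiler", ",".join(host_flags)]
    return nvcc_args


wrapped = wrap_for_nvcc(
    ["-fPIC", "-Wall", "-ccbin", "mpicxx", "--expt-extended-lambda",
     "-c", "test.cxx", "-o", "test.o"]
)
print(" ".join(wrapped))
```

This is why a build system that emits ordinary host-compiler flags can work through the wrapper but not with plain nvcc as the compiler.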

klevzoff · 2020-07-31T00:28:38Z

docker/configure_tpl_build.sh

@@ -23,7 +23,7 @@ python ${TPL_SRC_DIR}/scripts/config-build.py \
       --buildtype Release \
       --buildpath ${TPL_BUILD_DIR} \
       --installpath ${GEOSX_TPL_DIR} \
-       -DTRILINOS_BUILD_COMMAND="make -j1" \
+       -DTRILINOS_BUILD_COMMAND="make -j2" \


This seems to have no effect, as the Travis CUDA build still timed out at about the same point in the build process. @rrsettgast @TotoGaz is there a way to increase the Travis job limit from 3 hours to, say, 4?

By "no effect" do you mean that it's not enough for your problem or that it's not working?
(For information, initially this was done to reduce the memory footprint while building trilinos.)

As for the Travis job limit, I don't know, I have to check, but I'm not sure I have powerful enough credentials.

Officially there is no 3h limit I guess: https://docs.travis-ci.com/user/customizing-the-build#build-timeouts

I mean that the 3 hour limit was reached at about 98% of Trilinos build completed, both with -j1 and -j2. So I guess it's not working.

Yes, I do remember the memory footprint issue. I figured I'd give a try to using 2 processes, since the compilation time has massively increased with new packages enabled. I'm able to run with -j 12 on my 32 GB system quite easily, so I figured -j 2 could work on 8 GB (of course, linear scaling doesn't apply here, but...)
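The back-of-the-envelope reasoning above can be written out as a small Python sketch. It only divides total RAM by the number of parallel jobs; it says nothing about what a single Trilinos compilation actually needs, which is the unknown here. The helper name `memory_per_job_gb` is made up for this example.

```python
def memory_per_job_gb(total_gb, jobs):
    """Average RAM budget available to each parallel compile job."""
    return total_gb / jobs


workstation = memory_per_job_gb(32, 12)  # -j 12 on a 32 GB machine
ci = memory_per_job_gb(8, 2)             # -j 2 on an 8 GB CI instance

print(f"workstation: {workstation:.2f} GB per job")
print(f"CI instance: {ci:.2f} GB per job")
```

Under this crude budget the CI instance has more headroom per job than the workstation, which is the basis for expecting -j 2 to fit.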

Probably in

Executing cmake line: 'cmake -C /tmp/thirdPartyLibs/host-configs/environment.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/workrd/SCR/GEOSX/install/gcc8/GEOSX_TPL-490bf75 -DTRILINOS_BUILD_COMMAND=make -j2 -DNUM_PROC=2 /tmp/thirdPartyLibs/scripts/.. '

the -DTRILINOS_BUILD_COMMAND value is taken to be just make, and the -j2 is thrown away.
So the -j1 was never actually applied either (but we still had the make), which is what fooled me.

I would expect an error message about an unused -j2 parameter... can't find it.

After many unsuccessful attempts to quote/escape the command, I ended up passing -DTRILINOS_NUM_PROC=2 and constructing the command line in CMake. Remains to be seen if it actually builds though.
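A minimal Python sketch of the suspected failure mode, assuming the configure script joins its options into one unquoted command string (the helper `build_cmake_line` is hypothetical and is not taken from `config-build.py`). `shlex.split` stands in for the shell's word splitting.

```python
import shlex


def build_cmake_line(defines, source_dir):
    """Join -D options into a single string without any quoting."""
    opts = " ".join(f"-D{key}={value}" for key, value in defines.items())
    return f"cmake {opts} {source_dir}"


# A value containing a space is split into two separate arguments.
broken = shlex.split(build_cmake_line({"TRILINOS_BUILD_COMMAND": "make -j2"}, "/src"))

# A plain number survives as a single argument.
fixed = shlex.split(build_cmake_line({"TRILINOS_NUM_PROC": 2}, "/src"))

print(broken)
print(fixed)
```

In the first case the define ends at `make` and `-j2` becomes a stray argument of its own; passing only the process count sidesteps quoting entirely and leaves it to CMake to assemble the build command.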

That was partial success... The ubuntu-clang-cuda build succeeded with 14 minutes to spare. The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.

The ubuntu-clang-cuda build succeeded with 14 minutes to spare.

Easy peasy

The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.

There is still the option of running (all?) the builds with ninja.
It is better than make at running compilations in parallel, so it might use the 2 cores more efficiently. With luck it could make the cut.

At the very end we could speed up the process (by 5 minutes on centos, not more) by having all the rpm packages pre-installed. That would complicate our build just to work around an artificial limit, so it would be great to find another solution.

I can try to make Trilinos thinner by playing with build options. I think I've already cut off everything I could, but there's always a chance. Maybe disabling serial kernel instantiations will do the trick - I'm not entirely sure why we keep them enabled. Eventually it will be made lighter when we completely replace the older Epetra-based interface with the new one, and thus will be able to remove those older libraries from the build. But that was not the plan for this PR, the new interface needs a lot of testing and validation.

As for ninja, it's worth trying, but I honestly doubt it will be much faster. Most of the time is spent in individual compiler invocations that cannot be made faster (and not in the build system itself), and with just 2 cores there doesn't seem to be much room for parallel build optimization. Although admittedly I have only very basic familiarity with ninja.

Sometimes the build system (not only for trilinos) waits for all the files of a library to be compiled before linking it.
A core may sit idle waiting for a compilation to finish. ninja seems to be able to start compilations for other libraries before the linking step.
On a product with multiple libraries and long compilations we may save a little time here and there, and it may make the difference when we are close to the goal. 🤷
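The idle-core argument can be sketched with a deliberately pessimistic toy model in Python. The compile times, the two-library layout and both scheduling models are invented for illustration; this is not a measurement of make or ninja on Trilinos.

```python
import heapq


def makespan(durations, cores):
    """Greedy list scheduling: each task goes to the core that frees up first."""
    free_at = [0.0] * cores
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical compile times (minutes) per library, plus one link step each.
libs = {"libA": [4, 1], "libB": [4, 1]}
link_time = 1
cores = 2

# Barrier model: each library is fully compiled and linked before the next starts.
barrier_total = sum(makespan(objs, cores) + link_time for objs in libs.values())

# Global model: all compilations share one queue, links run afterwards.
all_objs = [d for objs in libs.values() for d in objs]
global_total = makespan(all_objs, cores) + makespan([link_time] * len(libs), cores)

print(barrier_total, global_total)
```

With these made-up numbers the barrier model takes 10 minutes and the global queue 6, because the short compilations no longer leave a core idle next to a long one. How much of that gap exists in a real build depends on how often such barriers actually occur.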

docker/gcc-cuda/Dockerfile

klevzoff · 2020-09-06T22:28:39Z

@rrsettgast I bumped up the cores for Trilinos build from 2 to 3 and the time of CUDA build went down from 3 to 2 hours. I think this is acceptable. Didn't try 4 as I'm not sure if those beefier Travis instances also come with more memory to accompany the core count.

rrsettgast · 2020-10-03T22:25:42Z

@klevzoff On lassen I see the following error for the clang-cuda build.

Probing the environment ...

-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='ON'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='ON'
-- Leaving current CMAKE_C_COMPILER=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-clang-upstream-2019.03.26/bin/mpicc since it is already set!
-- Leaving current CMAKE_CXX_COMPILER=/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper since it is already set!
-- Leaving current CMAKE_Fortran_COMPILER=/usr/tce/packages/xl/xl-beta-2019.06.20/bin/xlf_r since it is already set!
-- MPI_EXEC='/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-xl-2020.09.17/bin/mpiexec'
-- MPI_EXEC='/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-xl-2020.09.17/bin/mpiexec'
-- CMAKE_C_COMPILER_ID='Clang'
-- CMAKE_C_COMPILER_VERSION='9.0.0'
-- The CXX compiler identification is GNU 4.9.3
-- Check for working CXX compiler: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper
-- Check for working CXX compiler: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper -- broken
CMake Error at /usr/tce/packages/cmake/cmake-3.14.5/share/cmake/Modules/CMakeTestCXXCompiler.cmake:53 (message):
  The C++ compiler

    "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp
    
    Run Build Command(s):/usr/tcetmp/bin/gmake cmTC_31cd1/fast 
    gmake[3]: Entering directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    /usr/tcetmp/bin/gmake -f CMakeFiles/cmTC_31cd1.dir/build.make CMakeFiles/cmTC_31cd1.dir/build
    gmake[4]: Entering directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    Building CXX object CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o
    /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper    -fPIC -w -ccbin /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-clang-upstream-2019.03.26/bin/mpicxx --expt-extended-lambda --expt-relaxed-constexpr  -fPIE   -o CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o -c /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
    nvcc_internal_extended_lambda_implementation:16:15: error: expected parameter declarator
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
                  ^
    nvcc_internal_extended_lambda_implementation:16:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:16:14: note: to match this '('
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
                 ^
    nvcc_internal_extended_lambda_implementation:16:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
    ^
    nvcc_internal_extended_lambda_implementation:163:15: error: expected parameter declarator
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
                  ^
    nvcc_internal_extended_lambda_implementation:163:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:163:14: note: to match this '('
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
                 ^
    nvcc_internal_extended_lambda_implementation:163:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
    ^
    nvcc_internal_extended_lambda_implementation:197:15: error: expected parameter declarator
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
                  ^
    nvcc_internal_extended_lambda_implementation:197:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:197:14: note: to match this '('
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
                 ^
    nvcc_internal_extended_lambda_implementation:197:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
    ^
    nvcc_internal_extended_lambda_implementation:203:68: error: template argument for template type parameter must be a type
    struct __nv_hdl_helper_trait : public  __nv_hdl_helper_trait<Tag,  decltype(&Lambda::operator())> { };
                                                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nvcc_internal_extended_lambda_implementation:202:34: note: template parameter is declared here
    template <typename Tag, typename Lambda>
                                     ^
    nvcc_internal_extended_lambda_implementation:208:8: error: 'auto' not allowed in function return type
    static auto get(Lambda lam, CaptureArgs... args) ->  __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, Tag, R(OpFuncArgs...),  CaptureArgs...> {
           ^~~~
    nvcc_internal_extended_lambda_implementation:208:49: error: expected ';' at end of declaration list
    static auto get(Lambda lam, CaptureArgs... args) ->  __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, Tag, R(OpFuncArgs...),  CaptureArgs...> {
                                                    ^
                                                    ;
    nvcc_internal_extended_lambda_implementation:217:9: error: 'auto' not allowed in function return type
     static auto get(Lambda lam, CaptureArgs... args) -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv,Tag, R(OpFuncArgs...), CaptureArgs...> {
            ^~~~
    nvcc_internal_extended_lambda_implementation:217:50: error: expected ';' at end of declaration list
     static auto get(Lambda lam, CaptureArgs... args) -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv,Tag, R(OpFuncArgs...), CaptureArgs...> {
                                                     ^
                                                     ;
    nvcc_internal_extended_lambda_implementation:225:8: error: 'auto' not allowed in function return type
    static auto __nv_hdl_create_wrapper(Lambda lam, CaptureArgs... args) -> decltype(__nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>::template __nv_hdl_helper_trait<Tag, Lambda>::get(lam, args...))
           ^~~~
    nvcc_internal_extended_lambda_implementation:225:69: error: expected ';' at end of declaration list
    static auto __nv_hdl_create_wrapper(Lambda lam, CaptureArgs... args) -> decltype(__nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>::template __nv_hdl_helper_trait<Tag, Lambda>::get(lam, args...))
                                                                        ^
                                                                        ;
    16 errors generated.
    gmake[4]: *** [CMakeFiles/cmTC_31cd1.dir/build.make:66: CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o] Error 1
    gmake[4]: Leaving directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    gmake[3]: *** [Makefile:121: cmTC_31cd1/fast] Error 2
    gmake[3]: Leaving directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:2022 (ENABLE_LANGUAGE)
  cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:193 (TRIBITS_SETUP_ENV)
  cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
  CMakeLists.txt:90 (TRIBITS_PROJECT)


-- Configuring incomplete, errors occurred!
See also "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeOutput.log".
See also "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeError.log".
make[2]: *** [CMakeFiles/trilinos.dir/build.make:110: trilinos/src/trilinos-stamp/trilinos-configure] Error 1
make[1]: *** [CMakeFiles/Makefile2:923: CMakeFiles/trilinos.dir/all] Error 2
make: *** [Makefile:95: all] Error 2

klevzoff · 2020-10-03T22:53:58Z

-- The CXX compiler identification is GNU 4.9.3

This obviously is not right... is your mpicxx hardcoded to the correct host compiler (clang) or does it need extra configuration (like env args)? It seems to have fallen back to system default. Or maybe it did not get passed to nvcc_wrapper correctly via -ccbin flag (it does on my system). I'm not sure if that specifically is the reason for errors, but it's the first thing to check/fix.
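As a rough way to run the check suggested above, the following Python sketch classifies a compiler from the first lines of its `--version` output. The function names are hypothetical, the sample strings are made-up inputs in the usual clang and GCC banner formats (not output captured on lassen), and `report` is only defined here, not called, since the wrapper path is system-specific.

```python
import subprocess


def classify_compiler(version_text):
    """Guess the compiler family from --version output."""
    text = version_text.lower()
    if "clang" in text:
        return "clang"
    if "gcc" in text or "g++" in text or "free software foundation" in text:
        return "gnu"
    return "unknown"


def report(compiler_path):
    """Run '<compiler> --version' and return the guessed family."""
    result = subprocess.run([compiler_path, "--version"],
                            capture_output=True, text=True, check=True)
    return classify_compiler(result.stdout)


# Illustrative inputs only.
sample_clang = "clang version 9.0.0\nTarget: powerpc64le-unknown-linux-gnu"
sample_gnu = "g++ (GCC) 4.9.3\nCopyright (C) 2015 Free Software Foundation, Inc."

print(classify_compiler(sample_clang))
print(classify_compiler(sample_gnu))
```

Calling `report` on the mpicxx path passed via -ccbin would show whether the MPI wrapper itself resolves to clang or falls back to the system GCC.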

I pushed a commit that will print some flags that are passed to Trilinos' CMake... can you fetch it and just rerun cmake .. on top level and post the output?

XL64 · 2022-09-01T13:48:57Z

Started Jenkins Build on Pangea3.
Will give status when finished.

XL64 · 2022-09-02T08:27:39Z

Started Jenkins Build on Pangea3. Will give status when finished.

Didn't see the merge conflict, cannot test yet because it can't be merged with latest develop automatically.

klevzoff · 2022-09-02T17:58:09Z

@XL64 this PR is very outdated. At some point I will revive it

TotoGaz · 2022-09-02T18:23:17Z

@klevzoff We're also setting up an internal Jenkins CI/CD.
It's also a way to see if we're able to validate PRs on our systems, such that nothing gets merged if an issue pops up.

klevzoff added the enhancement and flag: ready for review labels Jul 30, 2020

klevzoff self-assigned this Jul 30, 2020

klevzoff mentioned this pull request Jul 30, 2020: Tpetra LAI GEOS-DEV/GEOS#1086 (closed, 3 tasks)

klevzoff force-pushed the feature/klevzoff/tpetra branch from 2efb469 to 791ebf0 Compare July 30, 2020 21:45

klevzoff commented Jul 31, 2020

klevzoff requested review from TotoGaz and rrsettgast July 31, 2020 00:29

TotoGaz reviewed Jul 31, 2020

TotoGaz approved these changes Jul 31, 2020

klevzoff force-pushed the feature/klevzoff/tpetra branch from 791ebf0 to 1091425 Compare July 31, 2020 02:47

klevzoff force-pushed the feature/klevzoff/tpetra branch from 6e35697 to c92adfc Compare August 7, 2020 07:30

klevzoff added the flag: ready to be merged label Aug 8, 2020

rrsettgast approved these changes Aug 20, 2020

klevzoff force-pushed the feature/klevzoff/tpetra branch from 8240d18 to d1ee89f Compare August 26, 2020 08:54

klevzoff force-pushed the feature/klevzoff/tpetra branch from 18bb5df to 29cdb92 Compare September 6, 2020 07:20

klevzoff force-pushed the feature/klevzoff/tpetra branch 2 times, most recently from eb55176 to 12898da Compare September 11, 2020 05:21

klevzoff force-pushed the feature/klevzoff/tpetra branch 2 times, most recently from a837fe7 to 2ca2db2 Compare September 21, 2020 07:50

Trilinos build: update to 13.0, enable CUDA and Tpetra-based solver packages (Belos, Amesos2, Ifpack2, MueLu)

94a840b

klevzoff force-pushed the feature/klevzoff/tpetra branch from 2ca2db2 to 94a840b Compare September 22, 2020 00:06

Add extra output for Trilinos flags

1599a07

klevzoff removed the flag: ready to be merged label Apr 19, 2021
