Trilinos build: enable CUDA and Tpetra-based packages #120

Open: wants to merge 2 commits into master

@klevzoff klevzoff commented Jul 30, 2020

Related to GEOS-DEV/GEOS#1086

@klevzoff klevzoff self-assigned this Jul 30, 2020
@klevzoff klevzoff mentioned this pull request Jul 30, 2020
3 tasks
@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch from 2efb469 to 791ebf0 Compare July 30, 2020 21:45
@@ -659,6 +659,11 @@ if (ENABLE_TRILINOS)
   set( TRILINOS_Fortran_COMPILER ${CMAKE_Fortran_COMPILER} )
   endif()

+  if( ENABLE_CUDA )
+  set( TRILINOS_CXX_FLAGS "${TRILINOS_CXX_FLAGS} -ccbin ${TRILINOS_CXX_COMPILER} -arch=${CUDA_ARCH} --expt-extended-lambda --expt-relaxed-constexpr" )
+  set( TRILINOS_CXX_COMPILER ${CMAKE_CURRENT_BINARY_DIR}/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper )
Contributor Author

@klevzoff klevzoff Jul 31, 2020

Tpetra/Kokkos insist on using their nvcc wrapper (a script that preprocesses command-line arguments, e.g. wrapping flags with -Xcompiler). I had no luck using nvcc directly as the compiler.
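
To illustrate the kind of preprocessing described above, here is a minimal sketch, not the actual nvcc_wrapper logic. `wrap_host_flags` is a hypothetical helper, and the list of host-only flags is an assumption chosen for the example. It relies only on the fact that nvcc forwards a comma-separated list given after `-Xcompiler` to the host compiler.

```shell
#!/usr/bin/env bash
# Simplified illustration only; the real nvcc_wrapper handles far more cases.
# wrap_host_flags is a hypothetical name, not part of Kokkos.
wrap_host_flags() {
  local out=() host=() arg joined
  for arg in "$@"; do
    case "$arg" in
      -fPIC|-fPIE|-Wall|-Wextra)   # assumed examples of host-compiler-only flags
        host+=("$arg") ;;
      *)                           # everything else goes to nvcc unchanged
        out+=("$arg") ;;
    esac
  done
  if [ "${#host[@]}" -gt 0 ]; then
    # Join the host flags with commas and hand them over via -Xcompiler.
    joined=$(IFS=,; printf '%s' "${host[*]}")
    out+=(-Xcompiler "$joined")
  fi
  printf '%s\n' "${out[*]}"
}

wrap_host_flags -fPIC -Wall -arch=sm_70 -c t.cxx
```

With the arguments shown, the sketch prints `-arch=sm_70 -c t.cxx -Xcompiler -fPIC,-Wall`.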

@@ -23,7 +23,7 @@ python ${TPL_SRC_DIR}/scripts/config-build.py \
   --buildtype Release \
   --buildpath ${TPL_BUILD_DIR} \
   --installpath ${GEOSX_TPL_DIR} \
-  -DTRILINOS_BUILD_COMMAND="make -j1" \
+  -DTRILINOS_BUILD_COMMAND="make -j2" \
Contributor Author

This seems to have no effect: the Travis CUDA build still timed out at about the same point in the build process. @rrsettgast @TotoGaz is there a way to increase the Travis job limit from 3 hours to, say, 4?

Contributor

By "no effect" do you mean that it's not enough for your problem or that it's not working?
(For information, this was initially done to reduce the memory footprint while building Trilinos.)

As for the Travis job limit, I don't know; I'll have to check, but I'm not sure my credentials are powerful enough.

Contributor Author

@klevzoff klevzoff Jul 31, 2020

I mean that the 3 hour limit was reached with about 98% of the Trilinos build completed, both with -j1 and -j2. So I guess it's not working.

Yes, I do remember the memory footprint issue. I figured I'd try using 2 processes, since the compilation time has massively increased with the new packages enabled. I'm able to run with -j 12 on my 32 GB system quite easily, so I figured -j 2 could work on 8 GB (of course, linear scaling doesn't apply here, but...)
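
The back-of-the-envelope reasoning above can be written out as a small sketch. `mem_per_job_mib` is a hypothetical helper, and the figures are treated as GiB for simplicity. It only computes the average budget per parallel job; whether a build fits depends on the peak usage of individual compiler invocations, which is why linear scaling is not guaranteed.

```shell
#!/usr/bin/env bash
# Average memory available per parallel compile job, in MiB (integer division).
# mem_per_job_mib is a hypothetical helper used only for this estimate.
mem_per_job_mib() {
  local total_gib=$1 jobs=$2
  echo $(( total_gib * 1024 / jobs ))
}

echo "32 GB with -j12: $(mem_per_job_mib 32 12) MiB per job"
echo "8 GB with -j2: $(mem_per_job_mib 8 2) MiB per job"
```

This gives roughly 2730 MiB per job on the 32 GB machine and 4096 MiB per job on the 8 GB one, so on average -j 2 leaves more headroom per job than the setup known to work.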

Contributor

Probably in

Executing cmake line: 'cmake -C /tmp/thirdPartyLibs/host-configs/environment.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/workrd/SCR/GEOSX/install/gcc8/GEOSX_TPL-490bf75 -DTRILINOS_BUILD_COMMAND=make -j2 -DNUM_PROC=2 /tmp/thirdPartyLibs/scripts/.. '

the -DTRILINOS_BUILD_COMMAND value is taken to be just make, and the -j2 is thrown away.
So the -j1 never took effect either (but the make part still worked), which is why I was fooled.

I would expect an error message about an unused -j2 parameter, but I can't find one.
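
One plausible mechanism, assuming the option passes through an unquoted expansion somewhere between the script and cmake (the thread does not confirm where), is ordinary shell word splitting. The sketch below uses a hypothetical helper, `count_args`, to show how the same string arrives as one or two arguments depending on quoting.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper that reports how its arguments were split.
count_args() {
  echo "$# argument(s)"
  local a
  for a in "$@"; do echo "  [$a]"; done
}

opt='-DTRILINOS_BUILD_COMMAND=make -j2'

count_args $opt      # unquoted: split at the space into two arguments
count_args "$opt"    # quoted: stays a single argument
```

In the unquoted case the receiver sees `-DTRILINOS_BUILD_COMMAND=make` and a separate `-j2`, which matches the symptom described above.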

Contributor Author

After many unsuccessful attempts to quote/escape the command, I ended up passing -DTRILINOS_NUM_PROC=2 and constructing the command line in CMake. Remains to be seen if it actually builds though.
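
The idea behind this workaround is to pass only a value that contains no spaces and to assemble the command on the receiving side. The actual change does this in CMake from TRILINOS_NUM_PROC; the shell sketch below only mirrors the idea, and `build_command` is a hypothetical name.

```shell
#!/usr/bin/env bash
# build_command is a hypothetical stand-in for the receiving side: it gets
# only a job count and assembles the full build command itself.
build_command() {
  local nproc=${1:-1}
  case "$nproc" in
    ''|*[!0-9]*)                       # reject anything that is not a plain number
      echo "invalid job count: $nproc" >&2
      return 1 ;;
  esac
  echo "make -j${nproc}"
}

build_command 2
```

Because the value that crosses the script boundary is a single token such as `2`, no quoting or escaping is needed on the way.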

Contributor Author

That was partial success... The ubuntu-clang-cuda build succeeded with 14 minutes to spare. The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.

Contributor

> The ubuntu-clang-cuda build succeeded with 14 minutes to spare.

Easy peasy

> The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.

There is still the option of running (all?) the builds with ninja.
It is better than make at running compilations in parallel, and it might use the 2 cores more efficiently. With luck it could make the cut.

As a last resort, we could speed up the process (by 5 minutes on centos, not more) by having all the rpm packages pre-installed. That would complicate our build just to get around an artificial limit, so it would be great to find another solution.

Contributor Author

@klevzoff klevzoff Jul 31, 2020

I can try to make Trilinos thinner by playing with build options. I think I've already cut off everything I could, but there's always a chance. Maybe disabling serial kernel instantiations will do the trick - I'm not entirely sure why we keep them enabled. Eventually it will be made lighter when we completely replace the older Epetra-based interface with the new one, and thus will be able to remove those older libraries from the build. But that was not the plan for this PR, the new interface needs a lot of testing and validation.

As for ninja, it's worth trying, but I honestly doubt it will be much faster. Most of the time is spent in individual compiler invocations that cannot be made faster (not in the build system itself), and with just 2 cores there doesn't seem to be much room for parallel build optimization. Admittedly, though, I have only very basic familiarity with ninja.

Contributor

@TotoGaz TotoGaz Jul 31, 2020

Sometimes the build system (not only for trilinos) waits for all the files to be compiled before linking a library.
A core may be waiting idle for a compilation to finish. ninja seems to be able to start compilations for other libraries before the linking process.
On a product with multiple libraries and long compilations we may save a little time here and there, and it may make the difference when we are close to the goal. 🤷

@klevzoff klevzoff requested review from TotoGaz and rrsettgast July 31, 2020 00:29
@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch from 8240d18 to d1ee89f Compare August 26, 2020 08:54
@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch from 18bb5df to 29cdb92 Compare September 6, 2020 07:20
Contributor Author

klevzoff commented Sep 6, 2020

@rrsettgast I bumped up the cores for Trilinos build from 2 to 3 and the time of CUDA build went down from 3 to 2 hours. I think this is acceptable. Didn't try 4 as I'm not sure if those beefier Travis instances also come with more memory to accompany the core count.

@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch 2 times, most recently from eb55176 to 12898da Compare September 11, 2020 05:21
@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch 2 times, most recently from a837fe7 to 2ca2db2 Compare September 21, 2020 07:50
@klevzoff klevzoff force-pushed the feature/klevzoff/tpetra branch from 2ca2db2 to 94a840b Compare September 22, 2020 00:06
@rrsettgast
Member

@klevzoff On lassen I see the following error for the clang-cuda build.

Probing the environment ...

-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='ON'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='ON'
-- Leaving current CMAKE_C_COMPILER=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-clang-upstream-2019.03.26/bin/mpicc since it is already set!
-- Leaving current CMAKE_CXX_COMPILER=/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper since it is already set!
-- Leaving current CMAKE_Fortran_COMPILER=/usr/tce/packages/xl/xl-beta-2019.06.20/bin/xlf_r since it is already set!
-- MPI_EXEC='/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-xl-2020.09.17/bin/mpiexec'
-- MPI_EXEC='/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-xl-2020.09.17/bin/mpiexec'
-- CMAKE_C_COMPILER_ID='Clang'
-- CMAKE_C_COMPILER_VERSION='9.0.0'
-- The CXX compiler identification is GNU 4.9.3
-- Check for working CXX compiler: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper
-- Check for working CXX compiler: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper -- broken
CMake Error at /usr/tce/packages/cmake/cmake-3.14.5/share/cmake/Modules/CMakeTestCXXCompiler.cmake:53 (message):
  The C++ compiler

    "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp
    
    Run Build Command(s):/usr/tcetmp/bin/gmake cmTC_31cd1/fast 
    gmake[3]: Entering directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    /usr/tcetmp/bin/gmake -f CMakeFiles/cmTC_31cd1.dir/build.make CMakeFiles/cmTC_31cd1.dir/build
    gmake[4]: Entering directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    Building CXX object CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o
    /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper    -fPIC -w -ccbin /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-clang-upstream-2019.03.26/bin/mpicxx --expt-extended-lambda --expt-relaxed-constexpr  -fPIE   -o CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o -c /usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
    nvcc_internal_extended_lambda_implementation:16:15: error: expected parameter declarator
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
                  ^
    nvcc_internal_extended_lambda_implementation:16:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:16:14: note: to match this '('
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
                 ^
    nvcc_internal_extended_lambda_implementation:16:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof(T) == 0, "nvcc internal error: unexpected failure in capturing array variable");
    ^
    nvcc_internal_extended_lambda_implementation:163:15: error: expected parameter declarator
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
                  ^
    nvcc_internal_extended_lambda_implementation:163:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:163:14: note: to match this '('
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
                 ^
    nvcc_internal_extended_lambda_implementation:163:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
    ^
    nvcc_internal_extended_lambda_implementation:197:15: error: expected parameter declarator
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
                  ^
    nvcc_internal_extended_lambda_implementation:197:15: error: expected ')'
    nvcc_internal_extended_lambda_implementation:197:14: note: to match this '('
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
                 ^
    nvcc_internal_extended_lambda_implementation:197:1: error: C++ requires a type specifier for all declarations
    static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!");
    ^
    nvcc_internal_extended_lambda_implementation:203:68: error: template argument for template type parameter must be a type
    struct __nv_hdl_helper_trait : public  __nv_hdl_helper_trait<Tag,  decltype(&Lambda::operator())> { };
                                                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nvcc_internal_extended_lambda_implementation:202:34: note: template parameter is declared here
    template <typename Tag, typename Lambda>
                                     ^
    nvcc_internal_extended_lambda_implementation:208:8: error: 'auto' not allowed in function return type
    static auto get(Lambda lam, CaptureArgs... args) ->  __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, Tag, R(OpFuncArgs...),  CaptureArgs...> {
           ^~~~
    nvcc_internal_extended_lambda_implementation:208:49: error: expected ';' at end of declaration list
    static auto get(Lambda lam, CaptureArgs... args) ->  __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, Tag, R(OpFuncArgs...),  CaptureArgs...> {
                                                    ^
                                                    ;
    nvcc_internal_extended_lambda_implementation:217:9: error: 'auto' not allowed in function return type
     static auto get(Lambda lam, CaptureArgs... args) -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv,Tag, R(OpFuncArgs...), CaptureArgs...> {
            ^~~~
    nvcc_internal_extended_lambda_implementation:217:50: error: expected ';' at end of declaration list
     static auto get(Lambda lam, CaptureArgs... args) -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv,Tag, R(OpFuncArgs...), CaptureArgs...> {
                                                     ^
                                                     ;
    nvcc_internal_extended_lambda_implementation:225:8: error: 'auto' not allowed in function return type
    static auto __nv_hdl_create_wrapper(Lambda lam, CaptureArgs... args) -> decltype(__nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>::template __nv_hdl_helper_trait<Tag, Lambda>::get(lam, args...))
           ^~~~
    nvcc_internal_extended_lambda_implementation:225:69: error: expected ';' at end of declaration list
    static auto __nv_hdl_create_wrapper(Lambda lam, CaptureArgs... args) -> decltype(__nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>::template __nv_hdl_helper_trait<Tag, Lambda>::get(lam, args...))
                                                                        ^
                                                                        ;
    16 errors generated.
    gmake[4]: *** [CMakeFiles/cmTC_31cd1.dir/build.make:66: CMakeFiles/cmTC_31cd1.dir/testCXXCompiler.cxx.o] Error 1
    gmake[4]: Leaving directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    gmake[3]: *** [Makefile:121: cmTC_31cd1/fast] Error 2
    gmake[3]: Leaving directory '/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeTmp'
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:2022 (ENABLE_LANGUAGE)
  cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:193 (TRIBITS_SETUP_ENV)
  cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
  CMakeLists.txt:90 (TRIBITS_PROJECT)


-- Configuring incomplete, errors occurred!
See also "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeOutput.log".
See also "/usr/WS1/geosadmn/geosx/thirdPartyLibs/build-lassen-clang@upstream-release/trilinos/src/trilinos-build/CMakeFiles/CMakeError.log".
make[2]: *** [CMakeFiles/trilinos.dir/build.make:110: trilinos/src/trilinos-stamp/trilinos-configure] Error 1
make[1]: *** [CMakeFiles/Makefile2:923: CMakeFiles/trilinos.dir/all] Error 2
make: *** [Makefile:95: all] Error 2

Contributor Author

klevzoff commented Oct 3, 2020

> -- The CXX compiler identification is GNU 4.9.3

This obviously is not right... is your mpicxx hardcoded to the correct host compiler (clang) or does it need extra configuration (like env args)? It seems to have fallen back to system default. Or maybe it did not get passed to nvcc_wrapper correctly via -ccbin flag (it does on my system). I'm not sure if that specifically is the reason for errors, but it's the first thing to check/fix.

I pushed a commit that will print some flags that are passed to Trilinos' CMake... can you fetch it and just rerun cmake .. on top level and post the output?
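
A quick way to run the host compiler check suggested above is to ask the MPI wrapper which compiler it resolves to. The sketch below assumes the wrapper forwards `--version` to the underlying compiler; `compiler_family` is a hypothetical helper that classifies the first line of that output.

```shell
#!/usr/bin/env bash
# compiler_family is a hypothetical helper: it classifies a compiler command
# by the first line of its --version output.
compiler_family() {
  local first_line
  first_line=$("$1" --version 2>/dev/null | head -n 1)
  case "$first_line" in
    *clang*)            echo clang ;;
    *gcc*|*g++*|*GCC*)  echo gnu ;;
    *)                  echo unknown ;;   # includes "command not found"
  esac
}

# Uses $CXX if set, otherwise the system default c++; substitute the mpicxx path to test it.
compiler_family "${CXX:-c++}"
```

If this reports `gnu` for the mpicxx that is passed via -ccbin, the wrapper is not picking up clang, which would be consistent with the "GNU 4.9.3" identification in the log.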

Contributor

XL64 commented Sep 1, 2022

Started Jenkins Build on Pangea3.
Will give status when finished.

Contributor

XL64 commented Sep 2, 2022

> Started Jenkins Build on Pangea3. Will give status when finished.

Didn't see the merge conflict, cannot test yet because it can't be merged with latest develop automatically.

Contributor Author

klevzoff commented Sep 2, 2022

@XL64 this PR is very outdated. At some point I will revive it

Contributor

TotoGaz commented Sep 2, 2022

@klevzoff We're also setting up an internal Jenkins CI/CD.
It's also a way to see if we're able to validate PRs on our systems, such that nothing gets merged if an issue pops up.
