Trilinos build: enable CUDA and Tpetra-based packages #120
@@ -659,6 +659,11 @@ if (ENABLE_TRILINOS)
set( TRILINOS_Fortran_COMPILER ${CMAKE_Fortran_COMPILER} )
endif()

if( ENABLE_CUDA )
set( TRILINOS_CXX_FLAGS "${TRILINOS_CXX_FLAGS} -ccbin ${TRILINOS_CXX_COMPILER} -arch=${CUDA_ARCH} --expt-extended-lambda --expt-relaxed-constexpr" )
set( TRILINOS_CXX_COMPILER ${CMAKE_CURRENT_BINARY_DIR}/trilinos/src/trilinos/packages/kokkos/bin/nvcc_wrapper )
Tpetra/Kokkos insist on using their `nvcc` wrapper (it's a script that does some command-line argument preprocessing, like wrapping flags with `-Xcompiler`, etc.). I had no luck trying to use `nvcc` directly as the compiler.
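To make "wrapping flags with `-Xcompiler`" concrete, here is a toy sketch of the idea. It is not the real `nvcc_wrapper` logic, which is considerably more elaborate; the function name `wrap_for_nvcc`, the short list of flags treated as nvcc-native, and the `sm_70` value are all assumptions made for illustration.

```shell
# Toy illustration only: NOT the actual Kokkos nvcc_wrapper script.
# Flags from a small, hand-picked list are passed to nvcc unchanged;
# everything else is assumed to be a host-compiler flag and is
# forwarded through nvcc's -Xcompiler option.
wrap_for_nvcc() {
  out=""
  for flag in "$@"; do
    case "$flag" in
      -arch=*|--expt-extended-lambda|--expt-relaxed-constexpr)
        out="$out $flag" ;;
      *)
        out="$out -Xcompiler $flag" ;;
    esac
  done
  # Drop the leading space added by the first append.
  printf '%s\n' "${out# }"
}

wrap_for_nvcc -arch=sm_70 -Wall -fopenmp --expt-extended-lambda
# prints: -arch=sm_70 -Xcompiler -Wall -Xcompiler -fopenmp --expt-extended-lambda
```

Calling `nvcc` directly would mean doing this kind of sorting by hand for every flag the build system injects, which is presumably why the wrapper is the practical choice here.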
docker/configure_tpl_build.sh
Outdated
@@ -23,7 +23,7 @@ python ${TPL_SRC_DIR}/scripts/config-build.py \
     --buildtype Release \
     --buildpath ${TPL_BUILD_DIR} \
     --installpath ${GEOSX_TPL_DIR} \
-    -DTRILINOS_BUILD_COMMAND="make -j1" \
+    -DTRILINOS_BUILD_COMMAND="make -j2" \
This seems to have no effect, as the Travis CUDA build still timed out at about the same point in the build process. @rrsettgast @TotoGaz is there a way to increase the Travis job limit from 3 hours to, say, 4?
By "no effect" do you mean that it's not enough for your problem or that it's not working?
(For information, initially this was done to reduce the memory footprint while building Trilinos.)
As for the Travis job limit, I don't know; I'll have to check, but I'm not sure I have sufficient credentials.
Officially there is no 3h limit I guess: https://docs.travis-ci.com/user/customizing-the-build#build-timeouts
I mean that the 3-hour limit was reached with about 98% of the Trilinos build completed, both with `-j1` and `-j2`. So I guess it's not working.
Yes, I do remember the memory footprint issue. I figured I'd try using 2 processes, since the compilation time has increased massively with the new packages enabled. I'm able to run with `-j 12` on my 32 GB system quite easily, so I figured `-j 2` could work on 8 GB (of course, linear scaling doesn't apply here, but...)
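The back-of-the-envelope reasoning behind that guess can be written out as below. This is a crude model that assumes memory is split evenly between parallel jobs, which (as noted) is not how compilation memory actually behaves; the helper name `mem_per_job_mb` is made up for the example.

```shell
# Memory available per parallel make job, in MB, under the crude
# assumption that total memory is shared evenly between jobs.
mem_per_job_mb() {
  total_gb="$1"
  njobs="$2"
  echo $(( total_gb * 1024 / njobs ))
}

mem_per_job_mb 32 12   # workstation from the comment: prints 2730
mem_per_job_mb 8 2     # 8 GB CI machine with -j 2:    prints 4096
```

Under this model `-j 2` on 8 GB leaves more headroom per job than `-j 12` on 32 GB, so memory alone should not rule it out. It says nothing about the peak usage of the heaviest individual translation units.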
Probably in
Executing cmake line: 'cmake -C /tmp/thirdPartyLibs/host-configs/environment.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/workrd/SCR/GEOSX/install/gcc8/GEOSX_TPL-490bf75 -DTRILINOS_BUILD_COMMAND=make -j2 -DNUM_PROC=2 /tmp/thirdPartyLibs/scripts/.. '
the `-DTRILINOS_BUILD_COMMAND` is taken to be just `make` and the `-j2` is thrown away.
Hence the `-j1` has always been broken too (but we still had the `make`), so I was fooled.
I would expect an error message about an unused `-j2` parameter... but I can't find one.
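The logged command line above no longer has quotes around `make -j2`, which is consistent with the arguments being flattened into a single string and re-split on whitespace somewhere along the way. Whether `config-build.py` does exactly this is an assumption; the sketch below only reproduces the general effect in plain shell, with `count_args` standing in for `cmake`.

```shell
# Stand-in for cmake: report how many separate arguments arrived.
count_args() {
  echo "$#"
}

# Correct forwarding: "$@" keeps every argument intact, spaces included.
forward_quoted() {
  count_args "$@"
}

# Lossy forwarding: join everything into one string, then let the shell
# split it again on whitespace. The original quoting is gone by then.
forward_flattened() {
  cmd="$*"
  # Unquoted on purpose: the word splitting is the point of the demo.
  count_args $cmd
}

forward_quoted    -DTRILINOS_BUILD_COMMAND="make -j2" -DNUM_PROC=2   # prints 2
forward_flattened -DTRILINOS_BUILD_COMMAND="make -j2" -DNUM_PROC=2   # prints 3
```

In the flattened case the receiver sees `-DTRILINOS_BUILD_COMMAND=make` and a stray `-j2` as two separate words, which matches the behaviour described above.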
After many unsuccessful attempts to quote/escape the command, I ended up passing `-DTRILINOS_NUM_PROC=2` and constructing the command line in CMake. It remains to be seen whether it actually builds, though.
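The reason this sidesteps the quoting problem is that the value no longer contains a space, so it survives even a careless re-expansion. The actual change assembles the command on the CMake side; the shell analogue below only illustrates the principle, and the helper names `build_command` and `num_proc_from_flag` are hypothetical.

```shell
# Hypothetical stand-in for the receiving side: turn a job count
# into the full build command.
build_command() {
  echo "make -j$1"
}

# Strip the option name from a -DTRILINOS_NUM_PROC=<n> token.
num_proc_from_flag() {
  echo "${1#-DTRILINOS_NUM_PROC=}"
}

flag=-DTRILINOS_NUM_PROC=2
# $flag is left unquoted on purpose: with no whitespace in the value,
# word splitting cannot break it apart.
build_command "$(num_proc_from_flag $flag)"   # prints: make -j2
```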
That was a partial success... The ubuntu-clang-cuda build succeeded with 14 minutes to spare. The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.
> The ubuntu-clang-cuda build succeeded with 14 minutes to spare.

Easy peasy

> The centos-gcc-cuda one is slower unfortunately, it managed to finish the TPL build almost entirely, but didn't get to finish and push the docker image.

There remains the option of running (all?) the builds with `ninja`.
It is better than `make` at running compilations in parallel, and it might make more efficient use of the 2 cores. With luck it could make the cut.
As a last resort we could speed up the process (by 5 minutes on centos, not more) by having all the rpm packages pre-installed. But that would complicate our build for the sake of an artificial limit; it would be great to find another solution.
I can try to make Trilinos thinner by playing with build options. I think I've already cut off everything I could, but there's always a chance. Maybe disabling serial kernel instantiations will do the trick - I'm not entirely sure why we keep them enabled. Eventually the build will get lighter when we completely replace the older Epetra-based interface with the new one and can remove those older libraries. But that was not the plan for this PR; the new interface needs a lot of testing and validation.
As for `ninja`, it's worth trying, but I honestly doubt it will be much faster. Most of the time is spent in individual compiler invocations that cannot be made faster (and not in the build system itself), and with just 2 cores there doesn't seem to be much room for parallel build optimization. Although admittedly I have only very basic familiarity with ninja.
Sometimes the build system (not only for Trilinos) waits for all the files to be compiled before linking a library.
A core may sit idle waiting for a compilation to finish. `ninja` seems to be able to start compilations for other libraries before the linking process.
On a product with multiple libraries and long compilations we may save a little time here and there, and it may make the difference when we are close to the goal. 🤷
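For anyone wanting to try this, a minimal sketch of selecting the generator is shown below. `Ninja` and `Unix Makefiles` are standard CMake generator names; the helper name `pick_generator` is made up, and how (or whether) `config-build.py` forwards a `-G` option is not covered in this thread, so the printed command is only indicative.

```shell
# Choose a CMake generator: Ninja when the ninja executable is on PATH,
# otherwise fall back to the Makefile generator.
pick_generator() {
  if command -v ninja >/dev/null 2>&1; then
    echo "Ninja"
  else
    echo "Unix Makefiles"
  fi
}

generator="$(pick_generator)"
# Show the configure command that would be used; <source-dir> is a placeholder.
echo "cmake -G \"$generator\" <source-dir>"
```

Timing both generators on the same 2-core machine would show whether the scheduling difference is worth the switch.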
@rrsettgast I bumped up the cores for the Trilinos build from 2 to 3 and the CUDA build time went down from 3 to 2 hours. I think this is acceptable. I didn't try 4, as I'm not sure whether those beefier Travis instances also come with more memory to accompany the core count.
…ackages (Belos, Amesos2, Ifpack2, MueLu)
@klevzoff On lassen I see the following error for the clang-cuda build.
This obviously is not right... is your I pushed a commit that will print some flags that are passed to Trilinos' CMake... can you fetch it and just rerun
Started Jenkins Build on Pangea3.
Didn't see the merge conflict, cannot test yet because it can't be merged with latest develop automatically.
@XL64 this PR is very outdated. At some point I will revive it
@klevzoff We're also setting up an internal Jenkins CI/CD.
Related to GEOS-DEV/GEOS#1086