Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use streak.llnl.gov as runner for GHA #2873

Merged
merged 74 commits into from
Feb 7, 2024
Merged

Conversation

rrsettgast
Copy link
Member

@rrsettgast rrsettgast commented Dec 7, 2023

This PR uses streak to execute one CUDA build and execute unit tests on the GPU. Two tests are disabled testLifoStorage and testExternalSolvers are disabled on streak due to failure of those tests on GPU with unified memory disabled. Follow up PR's will be needed to fix and reenable those tests.

@rrsettgast rrsettgast added ci: run CUDA builds Allows to triggers (costly) CUDA jobs DO NOT MERGE ! labels Dec 7, 2023
@rrsettgast rrsettgast self-assigned this Dec 7, 2023
@rrsettgast rrsettgast removed the ci: run CUDA builds Allows to triggers (costly) CUDA jobs label Dec 8, 2023
Copy link

codecov bot commented Jan 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5c88b35) 51.44% compared to head (931b4e2) 51.34%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2873      +/-   ##
===========================================
- Coverage    51.44%   51.34%   -0.11%     
===========================================
  Files          972      972              
  Lines        87067    82389    -4678     
===========================================
- Hits         44789    42300    -2489     
+ Misses       42278    40089    -2189     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rrsettgast rrsettgast force-pushed the feature/useStreakForRunner2 branch from d13ab2f to 9751403 Compare February 4, 2024 21:01
@rrsettgast rrsettgast force-pushed the feature/useStreakForRunner2 branch from 1bd45aa to f627583 Compare February 5, 2024 18:56
@rrsettgast rrsettgast requested review from TotoGaz and CusiniM February 6, 2024 02:36
@@ -98,7 +98,7 @@ class SolidBaseUpdates
arrayView1d< real64 const > const m_thermalExpansionCoefficient;

/// Flag to disable inelasticity
const bool & m_disableInelasticity;
const bool m_disableInelasticity;
Copy link
Member Author

@rrsettgast rrsettgast Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug on heterogeneous memory systems. @klevzoff This was the problem with the constitutive unit tests from the remove array1d<string>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any way to automatically check for this kind of bugs? Any way to at least get a warning whenever a reference is captured by value in a raja forAll?

@@ -192,6 +192,8 @@ void compute2DLaplaceOperator( MPI_Comm comm,

// Construct the 2D Laplace matrix
laplace2D.create( matrix.toViewConst(), matrix.numRows(), comm );

// laplace2D.write( "matrix.txt", LAIOutputFormat::NATIVE_ASCII);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep this because it will be needed when we fix this test on GPU.

@@ -232,7 +252,11 @@ fi

# Run the unit tests (excluding previously ran checks).
if [[ "${RUN_UNIT_TESTS}" = true ]]; then
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck"
if [[ ${HOSTNAME} == 'streak.llnl.gov' ]]; then
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck|testLifoStorage|testExternalSolvers"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to exclude the two failing unit tests on GPU systems so we can merge this and follow up with fix PR's.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create an issue pointing at this line though? If someone other than you fixes the tests I am pretty sure they'll forget to remove this.

update-ca-certificates
# gcloud config set core/custom_ca_certs_file cert.pem'

NPROC=8
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is hard coded right now for streak.

Comment on lines +122 to +126
if [ ${{ inputs.RUNS_ON }} == 'self-hosted' ]; then
RUNNER_CERTIFICATES_DIR=/etc/pki/ca-trust/source/anchors/
mkdir -p ${GITHUB_WORKSPACE}/certificates
cp ${RUNNER_CERTIFICATES_DIR}/*.crt* ${GITHUB_WORKSPACE}/certificates
fi
Copy link
Contributor

@TotoGaz TotoGaz Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the DOCKER_RUN_ARGS input to add -v /etc/pki/ca-trust/source/anchors/.../bundle.crt:/etc/pki/ca-trust/.../llnl.bundle.crt:ro.

That way the certificate would be at the proper location in read-only, with no copy required from within the container.

Then inside of the docker container simply update the certificates list (we can do it whatever the case : it's fast and could do not harm?). Or if we really want it, we can add an option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We certainly can...and probably should. Would you mind if we did this in a follow up PR. We are getting another workstation to run integratedTests on the CPU, and we will require a generic treatment at that time. I would rather just get this merged as is, and in separate PR's:

  1. clean up and generalize for the new workstation,
  2. fix the failing unit tests on GPU, and
  3. get integrated tests diffs worked out.

we can do all of these steps independent of each other IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm good with it. We need to be sure we actually have time to revisit this.

.github/workflows/build_and_test.yml Outdated Show resolved Hide resolved
Comment on lines +185 to +186
#rm -rf ${{ github.workspace }}/*
#docker rm -f ${CONTAINER_NAME}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you still need to clean up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then can you add a small comment about it?

Comment on lines +288 to +289
# Cleaning the build directory.
or_die ninja clean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is most likely superfluous since the build dir is in /tmp which is washed away with the --rm option of docker run. You can keep it if you want, since this may change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right.

@@ -256,8 +280,14 @@ if [[ "${RUN_INTEGRATED_TESTS}" = true ]]; then
or_die tar cfM ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar --directory ${GEOS_SRC_DIR} --transform "s/^integratedTests/${DATA_BASENAME_WE}\/repo/" integratedTests
or_die tar rfM ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar --directory ${GEOSX_BUILD_DIR} --transform "s/^integratedTests/${DATA_BASENAME_WE}\/logs/" integratedTests
or_die gzip ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar

# want to clean the integrated tests folder to avoid polluting the next build.
or_die integratedTests/geos_ats.sh -a clean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With @cssherman 's patch to build the baselines out of source, cleaning the integrated tests will surely not be mandatory anymore, since they disappear with the docker run --rm.
Also, we will be able to mount the source read-only!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was before the --rm command. I can try without.

Copy link
Contributor

@TotoGaz TotoGaz Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think @cssherman 's PR has not been merged already. Am I wrong?

Copy link
Collaborator

@CusiniM CusiniM Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean not veryclean ? I think it will still be better to clean up the integratedTests directory. One may update the version of the baselines... and the clean also removes the tests results which I think we do want to remove IMO. Unless the rm options cleans all that for us.

Copy link
Contributor

@TotoGaz TotoGaz Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option can be to mount the workspace read-only in the container, then copy it to a temporary and work in it (waiting for an out-of-source integrated tests which should solve the last item which touches the src dir).
Then everything would be cleaned by the docker run --rm.

Eventually, we'll be able to skip the workspace copy, but not right now.

@@ -232,7 +252,11 @@ fi

# Run the unit tests (excluding previously ran checks).
if [[ "${RUN_UNIT_TESTS}" = true ]]; then
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck"
if [[ ${HOSTNAME} == 'streak.llnl.gov' ]]; then
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck|testLifoStorage|testExternalSolvers"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a EXCLUDE_UNIT_TESTS_RE input in GHA, defaulting to testUncrustifyCheck|testDoxygenCheck, so we do not have to hardcode anything.
Combined to giving the number of CPUs as a CLI arguments as well (NPROC can be provided through GHA, or defaulted to the results of nproc is absent), then we'll remove the need to hard code the name of the self hosted runner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would, but I actually don't think we should be excluding unit tests once they are fixed. This is just a hack to get us to the next step (passing tests), after witch this goes away. I actually do want to make it as hard as possible to exclude specific tests 😁

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I kind of prefer having an easy way to do thing with a Cerberus at the code review, but it's OK.
What happens, is that the people maintaining the CI are not the same maintaining the tests or code. So the CI guys gets some ugly code he's not responsible for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well....that is a good enough reason.

.github/workflows/ci_tests.yml Outdated Show resolved Hide resolved
@sframba
Copy link
Contributor

sframba commented Feb 6, 2024

@rrsettgast as for the testLifoStorage, we plan on fixing it on CUDA in the not-too-distant future. It's supposed to work but it's broken right now.
Btw we had testExternalSolvers fail on Pangea3 too, with one single test failing (Hypre/SolverTestLaplace2D/0.GMRES_ILU, see this discussion with @victorapm) . We are available to do more tests if needed

@rrsettgast rrsettgast merged commit cb4e99b into develop Feb 7, 2024
42 of 43 checks passed
@rrsettgast rrsettgast deleted the feature/useStreakForRunner2 branch February 7, 2024 21:15
ouassimkh pushed a commit that referenced this pull request Feb 16, 2024
* use pip_system_certs...this time in right context

* add run args for docker image

* clean workspace from inside the docker container

* add nvidia runtime to docker args
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants