-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use streak.llnl.gov as runner for GHA #2873
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #2873 +/- ##
===========================================
- Coverage 51.44% 51.34% -0.11%
===========================================
Files 972 972
Lines 87067 82389 -4678
===========================================
- Hits 44789 42300 -2489
+ Misses 42278 40089 -2189 ☔ View full report in Codecov by Sentry. |
d13ab2f
to
9751403
Compare
1bd45aa
to
f627583
Compare
@@ -98,7 +98,7 @@ class SolidBaseUpdates | |||
arrayView1d< real64 const > const m_thermalExpansionCoefficient; | |||
|
|||
/// Flag to disable inelasticity | |||
const bool & m_disableInelasticity; | |||
const bool m_disableInelasticity; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bug on heterogeneous memory systems. @klevzoff This was the problem with the constitutive unit tests from the remove array1d<string>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any way to automatically check for this kind of bugs? Any way to at least get a warning whenever a reference is captured by value in a raja forAll
?
@@ -192,6 +192,8 @@ void compute2DLaplaceOperator( MPI_Comm comm, | |||
|
|||
// Construct the 2D Laplace matrix | |||
laplace2D.create( matrix.toViewConst(), matrix.numRows(), comm ); | |||
|
|||
// laplace2D.write( "matrix.txt", LAIOutputFormat::NATIVE_ASCII); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep this because it will be needed when we fix this test on GPU.
@@ -232,7 +252,11 @@ fi | |||
|
|||
# Run the unit tests (excluding previously ran checks). | |||
if [[ "${RUN_UNIT_TESTS}" = true ]]; then | |||
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck" | |||
if [[ ${HOSTNAME} == 'streak.llnl.gov' ]]; then | |||
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck|testLifoStorage|testExternalSolvers" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to exclude the two failing unit tests on GPU systems so we can merge this and follow up with fix PR's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create an issue pointing at this line though? If someone other than you fixes the tests I am pretty sure they'll forget to remove this.
update-ca-certificates | ||
# gcloud config set core/custom_ca_certs_file cert.pem' | ||
|
||
NPROC=8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is hard coded right now for streak.
if [ ${{ inputs.RUNS_ON }} == 'self-hosted' ]; then | ||
RUNNER_CERTIFICATES_DIR=/etc/pki/ca-trust/source/anchors/ | ||
mkdir -p ${GITHUB_WORKSPACE}/certificates | ||
cp ${RUNNER_CERTIFICATES_DIR}/*.crt* ${GITHUB_WORKSPACE}/certificates | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use the DOCKER_RUN_ARGS
input to add -v /etc/pki/ca-trust/source/anchors/.../bundle.crt:/etc/pki/ca-trust/.../llnl.bundle.crt:ro
.
That way the certificate would be at the proper location in read-only, with no copy required from within the container.
Then inside of the docker container simply update the certificates list (we can do it whatever the case : it's fast and could do not harm?). Or if we really want it, we can add an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We certainly can...and probably should. Would you mind if we did this in a follow up PR. We are getting another workstation to run integratedTests on the CPU, and we will require a generic treatment at that time. I would rather just get this merged as is, and in separate PR's:
- clean up and generalize for the new workstation,
- fix the failing unit tests on GPU, and
- get integrated tests diffs worked out.
we can do all of these steps independent of each other IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm good with it. We need to be sure we actually have time to revisit this.
#rm -rf ${{ github.workspace }}/* | ||
#docker rm -f ${CONTAINER_NAME} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you still need to clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then can you add a small comment about it?
# Cleaning the build directory. | ||
or_die ninja clean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is most likely superfluous since the build dir is in /tmp
which is washed away with the --rm
option of docker run
. You can keep it if you want, since this may change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right.
@@ -256,8 +280,14 @@ if [[ "${RUN_INTEGRATED_TESTS}" = true ]]; then | |||
or_die tar cfM ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar --directory ${GEOS_SRC_DIR} --transform "s/^integratedTests/${DATA_BASENAME_WE}\/repo/" integratedTests | |||
or_die tar rfM ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar --directory ${GEOSX_BUILD_DIR} --transform "s/^integratedTests/${DATA_BASENAME_WE}\/logs/" integratedTests | |||
or_die gzip ${DATA_EXCHANGE_DIR}/${DATA_BASENAME_WE}.tar | |||
|
|||
# want to clean the integrated tests folder to avoid polluting the next build. | |||
or_die integratedTests/geos_ats.sh -a clean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With @cssherman 's patch to build the baselines out of source, cleaning the integrated tests will surely not be mandatory anymore, since they disappear with the docker run --rm
.
Also, we will be able to mount the source read-only
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was before the --rm
command. I can try without.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think @cssherman 's PR has not been merged already. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean
not veryclean
? I think it will still be better to clean up the integratedTests directory. One may update the version of the baselines... and the clean also removes the tests results which I think we do want to remove IMO. Unless the rm options cleans all that for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An option can be to mount the workspace read-only in the container, then copy it to a temporary and work in it (waiting for an out-of-source integrated tests which should solve the last item which touches the src dir).
Then everything would be cleaned by the docker run --rm
.
Eventually, we'll be able to skip the workspace copy, but not right now.
@@ -232,7 +252,11 @@ fi | |||
|
|||
# Run the unit tests (excluding previously ran checks). | |||
if [[ "${RUN_UNIT_TESTS}" = true ]]; then | |||
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck" | |||
if [[ ${HOSTNAME} == 'streak.llnl.gov' ]]; then | |||
or_die ctest --output-on-failure -E "testUncrustifyCheck|testDoxygenCheck|testLifoStorage|testExternalSolvers" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a EXCLUDE_UNIT_TESTS_RE
input in GHA, defaulting to testUncrustifyCheck|testDoxygenCheck
, so we do not have to hardcode anything.
Combined to giving the number of CPUs as a CLI arguments as well (NPROC
can be provided through GHA, or defaulted to the results of nproc
is absent), then we'll remove the need to hard code the name of the self hosted runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would, but I actually don't think we should be excluding unit tests once they are fixed. This is just a hack to get us to the next step (passing tests), after witch this goes away. I actually do want to make it as hard as possible to exclude specific tests 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. I kind of prefer having an easy way to do thing with a Cerberus at the code review, but it's OK.
What happens, is that the people maintaining the CI are not the same maintaining the tests or code. So the CI guys gets some ugly code he's not responsible for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well....that is a good enough reason.
@rrsettgast as for the |
* use pip_system_certs...this time in right context * add run args for docker image * clean workspace from inside the docker container * add nvidia runtime to docker args
This PR uses streak to execute one CUDA build and execute unit tests on the GPU. Two tests are disabled
testLifoStorage
andtestExternalSolvers
are disabled on streak due to failure of those tests on GPU with unified memory disabled. Follow up PR's will be needed to fix and reenable those tests.