Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create and release a Docker image to run validations #57

Open
elray1 opened this issue Oct 12, 2023 · 18 comments
Open

create and release a Docker image to run validations #57

elray1 opened this issue Oct 12, 2023 · 18 comments
Assignees

Comments

@elray1
Copy link
Contributor

elray1 commented Oct 12, 2023

this would likely live in a different repo, but am filing the issue here for now

@nickreich
Copy link
Contributor

I removed the assignees from this issue, as I don't think this is being actively worked on.

@zkamvar
Copy link
Member

zkamvar commented Jan 2, 2025

I will investigate this today.

@zkamvar zkamvar self-assigned this Jan 2, 2025
@zkamvar
Copy link
Member

zkamvar commented Jan 3, 2025

Initial container in https://github.com/hubverse-org/test-docker-hubValidations. I will follow up tomorrow by experimenting with incorporating this into a fork of a hub.

@zkamvar
Copy link
Member

zkamvar commented Jan 3, 2025

I've tested it in zkamvar/variant-nowcast-hub#1, which was created by duplicating the output in reichlab/variant-nowcast-hub#246 and adding an error to the value column

Some observations:

  1. The test runs about 2 minutes faster than the original run because we did not have to install R or check for dependency updates.
  2. The workflow is shorter
  3. The container takes 15 to 20 minutes to build on GitHub actions, but once it's built, it only takes 20 to 30 seconds to provision in a GitHub workflow.
  4. I do not yet know how to version this, but I can dig into it

@zkamvar zkamvar moved this from Wishlist to In Progress in hubverse Development overview Jan 3, 2025
@annakrystalli
Copy link
Member

My feeling is that we should not be wrapping the validation code within scripts in the container. I feel using an image that contains all necessary software and packages and using it to run the rest of the workflow steps is preferable for a number of reasons:

  1. Reusability: Image can be used for other purposes, including locally for validating submissions.
  2. Modularity: It can still form the basis of a container that does run validation code if we decide this is necessary. It can even be used by hub admins as the basis for other containers.
  3. The validate_pr() function has a number of arguments that allow for customisation of validation. Hiding it away in a container script in a different repo means folks can't access that or we would need to write extra code to pass all validation arguments as command line args which feels unnecessary work.
  4. Equally if any custom validation functions require additional dependencies, they cannot be configured as described in our vignette using the current approach.

In addition before deploying any of this, I feel we need a clear system that will ensure images are up to date, ideally as automated as possible.

Give all this and how important validation is to hub functionality, I feel this issue requires a RFC and decision document before moving forward further.

@annakrystalli
Copy link
Member

annakrystalli commented Jan 6, 2025

Sorry! I think I confused what was going on when I looked at it last week but now that all the docs are in the repo, it's clear the way you've designed this can be used already as described above 🚀.

Given the step to install packages will no longer be required in the vanilla validation workflow, it would be good once we've decided to roll the above out in our validation workflows to include an example of how hub admins could amend their workflow to include additional dependencies. I imagine the vignette here in hubValidations for deploying custom functions might be a good place and perhaps the updated hubverse-actions documentation for the updated validate-submission action?

@annakrystalli
Copy link
Member

One last question, should we be installing the released hubverse packages from r-universe instead of from GitHub? Would it be useful to maintain a dev version of the image with dev versions of the packages?

@zkamvar
Copy link
Member

zkamvar commented Jan 6, 2025

should we be installing the released hubverse packages from r-universe instead of from GitHub?

No. We are using {renv} to control the versioning, which expects an archival repository. Theoretically, {renv} should fall back to GitHub if the package is from a universe repository, but I've found in rstudio/renv#2068 that this is still not the case.

Would it be useful to maintain a dev version of the image with dev versions of the packages?

I think so.

@zkamvar
Copy link
Member

zkamvar commented Jan 6, 2025

Give all this and how important validation is to hub functionality, I feel this issue requires a RFC and decision document before moving forward further.

I agree and I want to make it clear that my exercise here was to explore the space and come up with a prototype/minimal viable product. There are a lot of different layers that can be optimized and I wanted to explore them sufficiently before proposing a project.

@zkamvar
Copy link
Member

zkamvar commented Jan 6, 2025

include an example of how hub admins could amend their workflow to include additional dependencies.

That's a good point. I'll have to look at https://rstudio.github.io/renv/articles/docker.html#dynamically-provisioning-r-libraries-with-renv to make sure our implementation can be extended.

@annakrystalli
Copy link
Member

annakrystalli commented Jan 17, 2025

I know this is still WIP but I think some of the functionality is really useful and might be better as official hubValidations functionality. I'm thinking specifically the get_error_data() which would also resolve #16

It could be that we add it to the print method but only deploy through an argument (e.g. show_errors = TRUE)?

@zkamvar
Copy link
Member

zkamvar commented Jan 17, 2025

I know this is still WIP but I think some of the functionality is really useful and might be better as official hubValidations functionality. I'm thinking specifically the get_error_data() which would also resolve #16

It could be that we add it to the print method but only deploy through an argument (e.g. show_errors = TRUE)?

I can take a stabby-stab at it. I agree that the functionality should live in hubValidaitons.

@zkamvar
Copy link
Member

zkamvar commented Jan 17, 2025

Back on thread:

I'm now realizing that the renv.lock is over-complicating things a bit. The docker container could be built from a simple docker file.

I think it might be possible for the docker files to actually live inside the repository and ignored in .Rbuildignore

docker
|- dev/
|    |- Dockerfile
|- release/
|    |- Dockerfile

We could have a GitHub workflow that will build the dev container on each push and on a schedule (to capture updates from dependencies). The release container will be built when a release is created.

I've demonstrated this in my fork of hubUtils: https://github.com/zkamvar/hubUtils/
The Dockerfiles live in the places I described above and they are built with two nearly identical workflows, the difference being that the dev container is built every push and the release container is built on demand or on a release tag.

The container with the released version is https://github.com/zkamvar/hubUtils/pkgs/container/hubutils/339039092?tag=main

The released version contains a renv.lock file in /etc/renv.lock (I wasn't sure of the right place to put it), but this demonstrates that it should be possible to have a workflow that emits the renv.lock file and attaches it to a release.

and the container with the dev version is
https://github.com/zkamvar/hubUtils/pkgs/container/hubutils-dev/339039095?tag=main

@elray1
Copy link
Contributor Author

elray1 commented Jan 17, 2025

I'm now realizing that the renv.lock is over-complicating things a bit. The docker container could be built from a simple docker file.

In past investigations essentially independently replicated by @matthewcornell and me, we've landed on building R-based containers using renv in order to get the installation of specific R package versions that renv provides, and which we've found somewhat difficult to achieve without renv. @matthewcornell has written up the procedures we've used here: https://github.com/reichlab/operational-models/blob/main/flu_ar2/..%2FREADME.md#requirementstxt-and-renvlock-details. I (and I'm sure Matt) would be interested to hear if you're envisioning something else specific to deal with this?

@zkamvar
Copy link
Member

zkamvar commented Jan 17, 2025

we've landed on building R-based containers using renv in order to get the installation of specific R package versions that renv provides

I guess it really depends on what the container is for. If it's a drop-in replacement for our current GitHub workflow setup, then a simple Dockerfile that generates a container at release time is sufficient.

If, however, we would need this for future reproducibility, then the renv.lock file is an important element because we would be able to re-build the container from an old renv.lock.

I think for a container that should mirror the release and development versions of the packages , using a renv.lock might be a bit too prescriptive because we rebuild the containers every time we do a release, containing a snapshot of the R package environment at the time of release, which better reflects the testing environment and it reduces the burden of maintaining and synchronizing a renv.lock file across the dependent package containers (e.g. the renv.lock for hubUtils will serve as the base renv.lock for hubValidations).

@zkamvar
Copy link
Member

zkamvar commented Jan 17, 2025

That all being said, we might be able to have the Dockerfile even generate the renv.lock as an artifact when it's built.

@elray1
Copy link
Contributor Author

elray1 commented Jan 17, 2025

makes sense. i guess there's a plan to distribute the built images via docker hub or similar, and that's what a user would effectively interact with?

@zkamvar
Copy link
Member

zkamvar commented Jan 17, 2025

makes sense. i guess there's a plan to distribute the built images via docker hub or similar, and that's what a user would effectively interact with?

Yes. The plan is to use GitHub Workflows to build and deploy the containers to GitHub's container registry (ghcr.io). This is how https://github.com/hubverse-org/hub-dash-site-deployer and https://github.com/hubverse-org/hubPredEvalsData-docker are deployed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

No branches or pull requests

5 participants