
feat(docker): Add zstd to the dockerimage used in silo preprocessing #574

Closed · 1 commit

Conversation

@anna-parker (Contributor) commented Sep 12, 2024

resolves #

Summary

Adds zstd to the silo image. Since we use zstd file compression, it generally makes sense to have it available. I also need this for the changes proposed in loculus-project/loculus#2773 (checking that we have received all streamed data).
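The change is presumably a one-line addition to the package-install layer of the SILO Dockerfile; a minimal sketch (the surrounding `RUN` layer and package list are assumptions, not the actual Dockerfile):

```dockerfile
# Sketch only: add zstd next to the packages the SILO image already installs.
RUN apt-get update \
    && apt-get install -y --no-install-recommends zstd \
    && rm -rf /var/lib/apt/lists/*
```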


There is no change in the changelog. This PR will not produce a new releasable version.

@anna-parker anna-parker changed the title Add zstd to the dockerimage used in silo preprocessing. Add zstd to the dockerimage used in silo preprocessing Sep 12, 2024
@anna-parker anna-parker changed the title Add zstd to the dockerimage used in silo preprocessing feat(docker): Add zstd to the dockerimage used in silo preprocessing Sep 12, 2024
@Taepper (Collaborator) commented Sep 13, 2024

We install zstd as a dependency; I do not quite understand why it is required from the CLI?

@anna-parker (Contributor, Author) commented Sep 13, 2024

@Taepper I would like to use zstd in the silo_import_job.sh script. I could perhaps also call it from the SILO API somehow, if that is possible; however, having it available from the CLI here makes debugging easier.

Here is the relevant location: https://github.com/loculus-project/loculus/pull/2773/files#diff-4d103a16b3b0f2a73a5fcac5f2db3d5c401311c01652196e13f2cd060792951b

@Taepper (Collaborator) commented Sep 13, 2024

But this is a script that is not part of SILO. Whether a downstream consumer wants to run the script in the SILO image or not should not introduce this package for all consumers. I would rather introduce the apt install here.

@anna-parker (Contributor, Author)

> But this is a script not part of SILO. Whether the downstream consumer wants to run the script in the SILO image or not should not introduce this package for all consumers. I would rather introduce the apt install here

I can try this out. I thought that normally a process running in a Docker image cannot install dependencies into the image, as that would require root privileges, so this would normally need to be done in the Docker image set-up. But maybe it is possible here.

@Taepper (Collaborator) commented Sep 13, 2024

Hmm, that is true. I am currently also not quite clear on why the silo import script needs to be executed inside the SILO container, but I guess this would be a little complicated to change?

So the mental model is that the SILO container is supposed to just have all dependencies installed and be able to run SILO. Adding additional binaries to the image because a user script interactively execs into the container to run bash commands is not how I envisioned it being used.

@Taepper (Collaborator) commented Sep 13, 2024

I guess that could be solved by also publishing binaries for all releases instead of just Docker images. Then the Docker image we use to execute the commands could be just some Ubuntu base image with the SILO binary and all the tools you need installed, with the Loculus SILO import script as the entry point. This could still be facilitated by building an image on top of the SILO base image. What do you think?

@anna-parker (Contributor, Author) commented Sep 13, 2024

The silo_import_job.sh script runs `/app/siloApi --preprocessing`, so it needs to run in a container that contains the SILO preprocessing binary; currently you have the same container for SILO preprocessing and the SILO API. That is why it needs to be executed inside the SILO container.

But it is OK, I got it to work as you suggested. I wonder if this means any process running in your SILO container runs as root.

I think having SILO binaries would make sense. Adding @corneliusroemer, as he had some thoughts on this in loculus-project/loculus#2765.

@fengelniederhammer (Contributor) left a comment


I also would not introduce this here, if it's not necessary.

@corneliusroemer (Contributor) commented Sep 13, 2024

> I also would not introduce this here, if it's not necessary.

There are arguments for and against; it's not about necessity but about the pro/con tradeoff.

For:

  • Given that zstd is a supported compression format for input data, it makes sense to have it available as a CLI binary in the image, for debugging and for use in scripts that verify the downloaded data is intact. You are already shipping curl in the image.
  • You currently only provide a plain Docker image, no binaries. Yes, we can copy the binaries out of the Docker image, but that's a bit cumbersome.
  • It solves a problem for Loculus/Pathoplexus (abbreviated PP below) in a very simple way (clicking merge). The alternative is to force PP to add a whole new Dockerfile, build, and GitHub Action, and PP could then no longer just specify a tag in the values.yaml. This would take at least a few hours of dev time.

Against:

  • 300kB extra size
  • Basically none? There's already curl in the image; security-wise I don't think there's any reason to worry about zstd. If people want something minimal, they should also get rid of curl etc., but that is not the case today.

Overall, I see a clear case for adding it to unblock loculus-project/loculus#2773, and I struggle to see downsides that are worth the extra work in PP at this point (later we might want to change how we use SILO preprocessing, but not right now).

@chaoran-chen's view here would be valuable I think as he's involved in both projects and could make a call on the tradeoffs.

@fengelniederhammer (Contributor)

IMO Loculus "misuses" the SILO Docker image by running a script in it that the image is not intended for. And it apparently works by simply installing zstd on startup of the container. That's why I see this as a Loculus issue rather than a SILO issue.

But we've discussed splitting SILO and providing two images: one for the API and one for the preprocessing. This might be an aspect that's worth considering there, too.

@corneliusroemer (Contributor)

> I am currently also not quite in the clear why the silo import script needs to be executed inside the silo container, but I guess this would be a little complicated to change?

Not complicated to change, but it would somewhat unnecessarily cost dev time to reconfigure the architecture and build a separate Docker image in PP. It doesn't need to be executed inside the SILO container; it's just how things are done right now.

> So the mental model of the silo container is supposed to just have all dependency installed and be able to run silo. Adding additional binaries to the image because a user script interactively execs into the container run bash commands is not the way I envisioned it to be used

Curl was added a long time ago (71c5664) and has far greater security implications than a tiny zstd binary.

> I guess that could be solved by also publishing binaries for all releases instead of just docker images. Then the docker image we use to execute the commands can be just some ubuntu base image with the silo binary, all tools installed that you need and as an entry-point have the loculus silo import script. This could still be facilitated by building an image on top of the silo base image. What do you think?

Yeah, that would be nice; I opened #575. But it would still cost a lot of downstream dev work.

@chaoran-chen (Member) commented Sep 13, 2024

Hi! Thank you all for detailing and weighing up the pros and cons! All your points make a lot of sense, and I think we all agree that there are things to improve in how SILO is distributed. It has been on our roadmap for a long time to release binaries that people can use without Docker, but the challenge is that it is not easy (or even possible) to provide a fully statically linked binary, and we will probably not be able to solve that within the next few days. Offering different Docker images with different bases also sounds good in principle, and is common practice for many projects, but it would take quite some effort.

Installing things unrelated to SILO into the image is not ideal, and we should improve this both on the Loculus side (by moving the data download from the backend into a separate image; this could be combined with loculus-project/loculus#2765) and on the SILO side (by providing binaries and/or other images). However, I do see that this is a very simple way to solve a problem in Loculus, with very few disadvantages.

So, for now, if adding the one-liner is the best and easiest solution for Loculus, I am in favor of merging this PR. However, I see that we now have https://github.com/loculus-project/loculus/pull/2773/files#diff-f451cc1794ed7b689f4eddbe5eed0d6d7ce35e599c44a628e50af4b6c41a3499R93, so is this PR and discussion still relevant, or obsolete anyway?

@corneliusroemer (Contributor)

> IMO Loculus "misuses" the SILO Docker image by running a script in it that the image is not intended for.

Quite, yeah, but that's how it is right now, and has been for a while, since loculus-project/loculus@b41659d.

> And it apparently works by simply installing zstd on startup of the container. That's why I see this rather as a Loculus issue than a SILO issue.

That's a good workaround idea. I still think just adding zstd here would have been the easiest option, with little harm. The downsides of installing after startup are increased startup time and a requirement for internet access. But yeah, we can just do that then.

> But we've discussed splitting SILO and providing two images: one for the API and one for the preprocessing. This might be an aspect that's worth considering there, too.

Agreed that we can think about this but it's not a quick fix for the immediate need we have for the zstd binary.

@Taepper (Collaborator) commented Sep 13, 2024

> That's a good workaround idea

Sorry, I really didn't mean to block on this. After being confused by the issue at hand and then finding out that the import script runs in the SILO container, I thought that line change would be the quickest solution.

I also do not think it would have done any harm to install zstd in the Docker image. But it could be confusing in a few months, if someone thinks the install is required for some dynamic linking of the SILO binary; a comment linking to the Loculus script could clear that up.

@Taepper (Collaborator) commented Sep 13, 2024

So, if you think this or something else helps development in Loculus, just mention it and we can move forward with it, until we refactor the import script or amend the interface to start SILO.

@anna-parker (Contributor, Author)

I now just install zstd every time we start the SILO preprocessing script. Inefficient, but not terrible as a temporary solution.
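The temporary workaround amounts to an install-if-missing step at the top of the import script; a minimal sketch, assuming the container process runs as root (apt-get needs root privileges) and that the image is Debian/Ubuntu based:

```shell
# Sketch of the temporary workaround: install zstd at container startup
# if it is not already present. Assumes a Debian/Ubuntu base image and
# a root process; the actual Loculus script may differ.
if ! command -v zstd >/dev/null 2>&1; then
    apt-get update && apt-get install -y --no-install-recommends zstd
fi
```

The `command -v` guard keeps restarts cheap once the package is cached in the running container, though each fresh container still pays the download cost and needs network access, as noted above.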

I think it's good to follow up in #575 and work on splitting SILO into two images, one for the API and one for the preprocessing, which we could then use, for example, in the silo_import_job.sh script.

I will close this PR for now.

@Taepper Taepper deleted the ad_zstd branch October 25, 2024 11:45