
feat(docker): Add zstd to the dockerimage used in silo preprocessing #574

Closed · 1 commit

Conversation

@anna-parker (Contributor) commented Sep 12, 2024

resolves #

Summary

Adds zstd to the silo image. Since we use zstd file compression, it generally makes sense to have it available. I also need this for the changes proposed in loculus-project/loculus#2773 (checking that we have received all streamed data).
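The change is presumably a one-line addition to the package-install layer of the SILO Dockerfile; a minimal sketch (the surrounding `RUN` layer and package list are assumptions, not the actual Dockerfile):

```dockerfile
# Sketch only: add zstd next to the packages the SILO image already installs.
RUN apt-get update \
    && apt-get install -y --no-install-recommends zstd \
    && rm -rf /var/lib/apt/lists/*
```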


There is no change in the changelog. This PR will not produce a new releasable version.

@anna-parker anna-parker changed the title Add zstd to the dockerimage used in silo preprocessing. Add zstd to the dockerimage used in silo preprocessing Sep 12, 2024
@anna-parker anna-parker changed the title Add zstd to the dockerimage used in silo preprocessing feat(docker): Add zstd to the dockerimage used in silo preprocessing Sep 12, 2024
@Taepper (Collaborator) commented Sep 13, 2024

We install zstd as a dependency; I do not quite understand why it is required from the CLI?

@anna-parker (Contributor, Author) commented Sep 13, 2024

@Taepper I would like to use zstd in the silo_import_job.sh script. I could perhaps also call it from the SILO API somehow, if that is possible; however, having it available from the CLI here makes debugging easier.

Here is the relevant location: https://github.com/loculus-project/loculus/pull/2773/files#diff-4d103a16b3b0f2a73a5fcac5f2db3d5c401311c01652196e13f2cd060792951b

@Taepper (Collaborator) commented Sep 13, 2024

But this is a script that is not part of SILO. Whether a downstream consumer wants to run the script in the SILO image or not should not introduce this package for all consumers. I would rather introduce the apt install here.

@anna-parker (Contributor, Author)

> But this is a script not part of SILO. Whether the downstream consumer wants to run the script in the SILO image or not should not introduce this package for all consumers. I would rather introduce the apt install here

I can try this out. I thought that normally a process running in a Docker image cannot install dependencies into the image, as that would require root privileges, so this would normally need to be done in the Docker image set-up. But maybe it is possible here.

@Taepper (Collaborator) commented Sep 13, 2024

Hmm, that is true. I am currently also not quite clear on why the silo import script needs to be executed inside the SILO container, but I guess this would be a little complicated to change?

So the mental model is that the SILO container is supposed to just have all dependencies installed and be able to run SILO. Adding additional binaries to the image because a user script interactively execs into the container to run bash commands is not how I envisioned it being used.

@Taepper (Collaborator) commented Sep 13, 2024

I guess that could be solved by also publishing binaries for all releases instead of just Docker images. Then the Docker image we use to execute the commands could be just some Ubuntu base image with the SILO binary and all the tools you need installed, with the Loculus SILO import script as the entry point. This could still be facilitated by building an image on top of the SILO base image. What do you think?

@anna-parker (Contributor, Author) commented Sep 13, 2024

The silo_import_job.sh script runs `/app/siloApi --preprocessing`, so it needs to run in a container that contains the SILO preprocessing binary; currently you have the same container for SILO preprocessing and the SILO API. That is why it needs to be executed inside the SILO container.

But it is OK, I got it to work as you suggested. I wonder if this means any process running in your SILO container runs as root.

I think having SILO binaries would make sense. Adding @corneliusroemer, as he had some thoughts on this in loculus-project/loculus#2765.

@fengelniederhammer (Contributor) left a comment


I also would not introduce this here, if it's not necessary.

@corneliusroemer (Contributor) commented Sep 13, 2024

> I also would not introduce this here, if it's not necessary.

There are arguments for and against; it's not about necessity but about the pro/con tradeoff.

For:

  • Given that zstd is a supported compression format for input data, it makes sense to have it available as a CLI binary in the image, for debugging and for use in scripts that verify the downloaded data is intact. You are already shipping curl in the image.
  • You currently only provide a plain Docker image, no binaries. Yes, we can copy the binaries out of the Docker image, but that's a bit cumbersome.
  • It solves a problem for Loculus/Pathoplexus (abbreviated PP below) in a very simple way (clicking merge). The alternative is to force PP to add a whole new Dockerfile, build, and GitHub Action, and PP could then no longer just specify a tag in the values.yaml. This would take at least a few hours of dev time.

Against:

  • 300kB extra size
  • Basically none? There's already curl in the image; security-wise I don't think there's any reason to worry about zstd. If people want something minimal, they should also get rid of curl etc., but that is not the case today.

Overall, I see a clear case for adding it to unblock loculus-project/loculus#2773, and I struggle to see downsides that are worth the extra work in PP at this point (later we might want to change how we use SILO preprocessing, but not right now).

@chaoran-chen's view here would be valuable I think as he's involved in both projects and could make a call on the tradeoffs.

@fengelniederhammer (Contributor)

IMO Loculus "misuses" the SILO Docker image by running a script in it that the image is not intended for. And it apparently works by simply installing zstd on startup of the container. That's why I see this as a Loculus issue rather than a SILO issue.

But we've discussed splitting SILO and providing two images: one for the API and one for the preprocessing. This might be an aspect that's worth considering there, too.

@corneliusroemer (Contributor)

> I am currently also not quite in the clear why the silo import script needs to be executed inside the silo container, but I guess this would be a little complicated to change?

Not complicated to change, but it would somewhat unnecessarily cost dev time to reconfigure the architecture and build a separate Docker image in PP. It doesn't need to be executed inside the SILO container; it's just how things are done right now.

> So the mental model of the silo container is supposed to just have all dependency installed and be able to run silo. Adding additional binaries to the image because a user script interactively execs into the container run bash commands is not the way I envisioned it to be used

Curl was added a long time ago (71c5664) and has far greater security implications than a tiny zstd binary.

> I guess that could be solved by also publishing binaries for all releases instead of just docker images. Then the docker image we use to execute the commands can be just some ubuntu base image with the silo binary, all tools installed that you need and as an entry-point have the loculus silo import script. This could still be facilitated by building an image on top of the silo base image. What do you think?

Yeah, that would be nice; I opened #575. But it would still cost a lot of downstream dev work.

@chaoran-chen (Member) commented Sep 13, 2024

Hi! Thank you all for detailing and weighing up the pros and cons! All your points make a lot of sense, and I think we all agree that there are things to improve in how SILO is distributed. It has been on our roadmap for a long time to release binaries that people can use without Docker, but the challenge is that it is not easy (or even possible) to provide a fully statically linked binary, and we will probably not be able to solve that within the next few days. Offering different Docker images with different bases also sounds good in principle, and is common practice for many projects, but it would take quite some effort.

Installing things unrelated to SILO into the image is not ideal, and we should improve this both on the Loculus side (by moving the data download from the backend into a separate image; this could be combined with loculus-project/loculus#2765) and on the SILO side (by providing binaries and/or other images). However, I do see that this is a very simple way to solve a problem in Loculus, with very few disadvantages.

So, for now, if adding the one-liner is the best and easiest solution for Loculus, I am in favor of merging this PR. However, I see that we now have https://github.com/loculus-project/loculus/pull/2773/files#diff-f451cc1794ed7b689f4eddbe5eed0d6d7ce35e599c44a628e50af4b6c41a3499R93, so is this PR and discussion still relevant, or obsolete anyway?

@corneliusroemer (Contributor)

> IMO Loculus "misuses" the SILO Docker image by running a script in it that the image is not intended for.

Quite, yeah, but that's how it is right now, and has been for a while, since loculus-project/loculus@b41659d.

> And it apparently works by simply installing zstd on startup of the container. That's why I see this rather as a Loculus issue than a SILO issue.

That's a good workaround idea. I still think just adding zstd here would have been the easiest option, with little harm. The downsides of installing after startup are increased startup time and a requirement for internet access. But yeah, we can just do that then.

> But we've discussed splitting SILO and providing two images: one for the API and one for the preprocessing. This might be an aspect that's worth considering there, too.

Agreed that we can think about this but it's not a quick fix for the immediate need we have for the zstd binary.

@Taepper (Collaborator) commented Sep 13, 2024

> That's a good workaround idea

Sorry, I really didn't mean to block on this. After being confused by the issue at hand and then finding out that the import script runs in the SILO container, I thought that line change would be the quickest solution.

I also do not think it would have done any harm to install zstd in the Docker image. But it could be confusing in a few months, if someone thinks the install is required for some dynamic linking of the SILO binary; a comment linking to the Loculus script could clear that up.

@Taepper (Collaborator) commented Sep 13, 2024

So, if you think this or something else helps development in Loculus, just mention it and we can move forward with it, until we refactor the import script or amend the interface to start SILO.

@anna-parker (Contributor, Author)

I now just install zstd every time we start the SILO preprocessing script. Inefficient, but not terrible as a temporary solution.
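The temporary workaround amounts to an install-if-missing step at the top of the import script; a minimal sketch, assuming the container process runs as root (apt-get needs root privileges) and that the image is Debian/Ubuntu based:

```shell
# Sketch of the temporary workaround: install zstd at container startup
# if it is not already present. Assumes a Debian/Ubuntu base image and
# a root process; the actual Loculus script may differ.
if ! command -v zstd >/dev/null 2>&1; then
    apt-get update && apt-get install -y --no-install-recommends zstd
fi
```

The `command -v` guard keeps restarts cheap once the package is cached in the running container, though each fresh container still pays the download cost and needs network access, as noted above.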

I think it's good to follow up in #575 and work on splitting SILO into two images, one for the API and one for the preprocessing, which we could then use, for example, in the silo_import_job.sh script.

I will close this PR for now.

@Taepper Taepper deleted the ad_zstd branch October 25, 2024 11:45