[EPIC] Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) #3593

Open
1 of 3 tasks
underdarknl opened this issue Sep 27, 2024 · 2 comments
Assignees
Labels
backend boefjes Issues related to boefjes epic

Comments

@underdarknl
Contributor

underdarknl commented Sep 27, 2024

Updated by @Donnype with the comment from @noamblitz.


About this feature

Detailed description

Currently, most boefjes are run in the boefje container as Python code. This has a few disadvantages:

  • Sandboxing: They are not sandboxed, meaning they could potentially interact with things they are not supposed to interact with.
  • Reproducibility: Since they run as Python code inside the container, users cannot reproduce the results without running the Python code themselves or rerunning the boefje.
  • Transparency: It is hard to see what happens when a boefje runs.

Considerations:

  • Performance should not be affected too much
  • Transparency: run commands should be visible in the front-end to reach our goal
  • Developer experience should not be affected negatively too much
  • Migrating to this new container should be optional at first to ease pressure on QA

We propose an iterative approach to package all Python boefjes into a single container.

Single container

Several decisions on the first step of the implementation:

  • In boefje.json we will specify the path to the main.py, so the code for resolving boefjes does not have to be added to the new container. This can be done later if needed.
  • The "old" way of running boefjes will still be possible, to take pressure off the migration.
  • In the run command of the container, we will add an argument which points to either a boefje_id or the path to the boefje.json.
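A minimal sketch of what such a container entrypoint could look like, assuming a few things the issue does not specify: the key name `main_path` in boefje.json, the default root `/app/boefjes`, the `<root>/<boefje_id>/boefje.json` layout, and a `run(boefje_input)` function in main.py are all hypothetical.

```python
import argparse
import importlib.util
import json
from pathlib import Path


def resolve_definition(target: str, boefjes_root: Path) -> Path:
    """Accept either a path to a boefje.json or a boefje id."""
    candidate = Path(target)
    if candidate.name == "boefje.json" and candidate.is_file():
        return candidate
    # Assumption: a boefje with id X lives in <root>/X/boefje.json.
    return boefjes_root / target / "boefje.json"


def load_run_function(definition_path: Path):
    """Import the main.py named in boefje.json and return its run function."""
    definition = json.loads(definition_path.read_text())
    # Hypothetical key: the issue only says the path to main.py is specified.
    main_path = definition_path.parent / definition["main_path"]
    spec = importlib.util.spec_from_file_location("boefje_main", main_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run a single boefje")
    parser.add_argument("target", help="boefje id or path to a boefje.json")
    parser.add_argument("--root", type=Path, default=Path("/app/boefjes"))
    args = parser.parse_args(argv)
    run = load_run_function(resolve_definition(args.target, args.root))
    # How the boefje input reaches the container is out of scope here,
    # so an empty dict stands in for it.
    return run({})
```

With this shape, the container's run command only needs one positional argument, for example `main(["dns-records"])` or `main(["/app/boefjes/dns-records/boefje.json"])`.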

Feature benefit/User story

As an expert user, I want to be able to reproduce raw files of the current Python boefjes. To do this, KAT should communicate the run command of the container.


@underdarknl underdarknl added this to KAT Oct 2, 2024
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Oct 2, 2024
@underdarknl underdarknl moved this from Incoming features / Need assessment to Approved features / Need refinement in KAT Oct 2, 2024
@noamblitz
Contributor

We propose an iterative approach to package all Python boefjes into a single container.

  1. We create a single container with a Dockerfile in which we specify which boefjes should be copied. This container will be started and stopped on each boefje run.
  2. We create an HTTP API around this container so it does not have to be stopped every time.
  3. We create several runner files, like kubernetes.py and docker.py, so operators are able to choose how the boefjes will be run.
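As a rough illustration of step 3, the runners could share one small interface that produces the run command, which is also the string KAT could show to users who want to reproduce a raw file. The class names, the image name, and the pod naming scheme below are assumptions, not part of the proposal.

```python
import shlex
from abc import ABC, abstractmethod


class BoefjeRunner(ABC):
    """Hypothetical common interface for docker.py, kubernetes.py, etc."""

    @abstractmethod
    def build_command(self, image: str, boefje_id: str) -> list[str]:
        """Return the argv that starts one boefje run."""

    def display_command(self, image: str, boefje_id: str) -> str:
        """Shell-quoted command that could be shown in the front-end."""
        return shlex.join(self.build_command(image, boefje_id))


class DockerRunner(BoefjeRunner):
    def build_command(self, image: str, boefje_id: str) -> list[str]:
        # --rm removes the container once the boefje run has finished.
        return ["docker", "run", "--rm", image, boefje_id]


class KubernetesRunner(BoefjeRunner):
    def build_command(self, image: str, boefje_id: str) -> list[str]:
        # Assumption: a one-off pod per run; pod names may not contain "_".
        pod_name = "boefje-" + boefje_id.replace("_", "-")
        return [
            "kubectl", "run", pod_name,
            f"--image={image}", "--restart=Never", "--rm", "--attach",
            "--", boefje_id,
        ]


# Hypothetical registry so the deployment can pick a runner by name.
RUNNERS = {"docker": DockerRunner, "kubernetes": KubernetesRunner}
```

For example, `RUNNERS["docker"]().display_command("example/boefjes:latest", "dns-records")` gives `docker run --rm example/boefjes:latest dns-records`.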


@Donnype Donnype self-assigned this Oct 16, 2024
@Donnype Donnype moved this from Approved features / Need refinement to Backlog / To do in KAT Oct 17, 2024
@Donnype Donnype added boefjes Issues related to boefjes backend epic labels Oct 17, 2024
@Donnype Donnype changed the title Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) EPIC: Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) Oct 17, 2024
@Donnype Donnype changed the title EPIC: Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) [EPIC] Package all local Python boefjes in a container (could be a single container, targeting the specific modules by its arguments) Oct 17, 2024
@dekkers
Contributor

dekkers commented Nov 25, 2024

Containerizing boefjes

The reason to run all boefjes in a container is to run each boefje in a sandbox. In the future it will be possible to also run boefjes created by others, not only boefjes created by KAT. Running those in a sandbox decreases the risk of doing that.

Starting a container for every boefje task results in a lot of overhead, so we want to support running multiple tasks in a single container.

For the boefjes containers we need to support two ways of deploying KAT:

  • KAT has access to the container system control plane to start/stop containers. In this case KAT can automatically start new containers when necessary, but there needs to be a runner that can talk to the control plane and start them.

  • KAT has no access to the control plane and the system administrator configures all the necessary containers themselves beforehand.

This means we need to support long-running boefje containers. Such a boefje container can either pull tasks from the runner, or the runner can push tasks to the boefje container if the container exposes a service.

Pull-based design

  • Boefje is started as a container
  • Boefje makes an HTTP GET request to boefjes runner to fetch input
  • Boefje runs
  • Boefje submits output using HTTP POST request to boefjes runner
  • Boefje makes a second HTTP GET request to boefjes runner to fetch input
  • Boefje runs
  • Boefje submits output using HTTP POST request to boefjes runner

When there isn't any task available, the boefje can either wait on the boefjes runner for a new task to become available using long polling, or simply send a new request after some timeout.
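The pull-based steps above could be sketched as a simple loop. To keep the example independent of any API, the HTTP GET and POST calls are passed in as plain callables; their names, the task shape, and the `max_idle_polls` stop condition are assumptions made for illustration. This variant retries after a delay rather than using long polling.

```python
import time
from typing import Callable, Optional


def pull_loop(
    fetch_task: Callable[[], Optional[dict]],
    run_boefje: Callable[[dict], bytes],
    submit_output: Callable[[dict, bytes], None],
    idle_delay: float = 5.0,
    max_idle_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch, run, and submit tasks until the queue stays empty.

    fetch_task stands in for the HTTP GET to the boefjes runner and returns
    None when no task is available; submit_output stands in for the HTTP POST.
    With max_idle_polls=None the loop runs forever, as a long-running
    container would.
    """
    completed = 0
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        task = fetch_task()
        if task is None:
            # Nothing to do: wait and ask again after the timeout.
            idle_polls += 1
            sleep(idle_delay)
            continue
        idle_polls = 0
        output = run_boefje(task)
        submit_output(task, output)
        completed += 1
    return completed
```

Because the container initiates every request, several containers can run this same loop against one runner without the runner needing to know which containers exist.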

Push-based design

  • Specific boefje container is started
  • The container starts listening on HTTP
  • The boefjes runner sends an HTTP request to container to start a job
  • Boefje runs
  • Boefje submits output using HTTP POST request to boefjes runner
  • The boefjes runner sends a second HTTP request to container to start a job
  • Boefje runs
  • Boefje submits output using HTTP POST request to boefjes runner
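For comparison, a minimal sketch of the push-based variant using only the Python standard library. The request format (a JSON task in a POST body), the 204 response, and the `make_handler` and `serve` helpers are assumptions; the issue does not define this API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_handler(
    run_boefje: Callable[[dict], bytes],
    submit_output: Callable[[dict, bytes], None],
):
    """Build a request handler that runs one job per POST request."""

    class JobHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            task = json.loads(self.rfile.read(length) or b"{}")
            # The job runs inside the request, so a long-running boefje
            # keeps this HTTP connection open until it finishes.
            output = run_boefje(task)
            # Stands in for the HTTP POST back to the boefjes runner.
            submit_output(task, output)
            self.send_response(204)
            self.end_headers()

    return JobHandler


def serve(port: int, run_boefje, submit_output) -> None:
    """Listen for jobs pushed by the boefjes runner (blocks forever)."""
    server = HTTPServer(("0.0.0.0", port), make_handler(run_boefje, submit_output))
    server.serve_forever()
```

Note that this single-threaded server handles one job at a time, so the runner has to know whether a given container is busy before pushing to it.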

The pull-based design is how task queues are usually implemented: a process that executes tasks pulls them from the queue.

Pushing tasks introduces more complications if you want to scale to multiple boefje containers that execute tasks. How will the boefjes runner know which container to push the task to? Some boefje tasks might take a very long time to execute, while other tasks might be short. If you want to use things like autoscaling and put a load balancer in front of the boefje HTTP service, the question is how the load balancing should work with those very long-running tasks. HTTP load balancers usually balance a high number of short-duration HTTP requests, not long-running tasks.
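A small simulation can make the load-balancing concern concrete. It compares a push model that hands out tasks round-robin, without looking at how busy each container is, against a pull model where whichever container is free first takes the next task. Round-robin is only one possible balancing policy and the task durations are made up for illustration.

```python
import heapq


def makespan_round_robin(durations: list[int], workers: int) -> int:
    """Push: task i goes to worker i % workers, regardless of its load."""
    busy_until = [0] * workers
    for index, duration in enumerate(durations):
        busy_until[index % workers] += duration
    return max(busy_until)


def makespan_pull(durations: list[int], workers: int) -> int:
    """Pull: the worker that becomes free first takes the next task."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Alternating long (600 s) and short (1 s) tasks on two containers.
durations = [600, 1, 600, 1, 600, 1]
print(makespan_round_robin(durations, 2))  # 1800: one container gets every long task
print(makespan_pull(durations, 2))         # 1201: long tasks spread over both
```

In this example the round-robin push leaves one container almost idle while the other works through all three long tasks, whereas pulling spreads the long tasks across both containers.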

Projects
Status: Backlog / To do