Combined schedulers #3839

jpbruinsslot · 2024-11-13T17:15:24Z

Warning

Ready for review, not for merging

Changes

app.py: Removal of creation of organisational schedulers. We no longer gather all available organisations from the katalogus to create individual boefje, report, and normalizer schedulers.
app.py: Removal of the continuous check for added or removed organisations. We removed the monitor_organisations method, which ran in a thread to detect added or removed organisations and manage the schedulers for them.
models/: Task and Schedule get a specific organisation field. Since we no longer differentiate tasks by pushing them onto an organisation-specific queue, we need to be able to determine which organisation an item on a queue belongs to.¹
schedulers/: Since we've consolidated the organisational schedulers into 3 types of schedulers, and we assume that task creation comes from consolidated message queues, we need to update the schedulers to handle that.
- BoefjeScheduler
  - retrieve scan profile mutations from one message queue
  - iterate over organisations to check for enabled boefjes²
  - refactor of method names and exception handling
server/
- Merger of the queues/ and schedulers/ endpoints. The notions of a queue and a scheduler are used interchangeably, so for now it makes more sense to merge those endpoints to avoid confusion and simplify the code.
- Support for popping more than one Task from a scheduler. The task runner will now be able to batch pop tasks from a scheduler using filters.
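
To make the combined-queue idea concrete, here is a minimal in-memory sketch, not the scheduler's actual implementation. The `Task` fields other than `organisation`, the `CombinedQueue` class, and the `where` callable are assumptions made for illustration; the real scheduler persists tasks in the database and uses its own filter model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """Simplified stand-in for the scheduler's Task model (assumed shape)."""

    id: str
    organisation: str  # top-level field, as introduced in this PR
    priority: int
    data: dict = field(default_factory=dict)
    status: str = "queued"


class CombinedQueue:
    """One queue shared by all organisations (illustrative, in-memory only)."""

    def __init__(self) -> None:
        self._items: list[Task] = []

    def push(self, task: Task) -> None:
        self._items.append(task)

    def pop(self, limit: int = 1, where: Callable[[Task], bool] | None = None) -> list[Task]:
        """Pop up to `limit` queued tasks, lowest priority value first."""
        candidates = [t for t in self._items if t.status == "queued" and (where is None or where(t))]
        # sorted() is stable, so tasks with equal priority keep insertion order
        popped = sorted(candidates, key=lambda t: t.priority)[:limit]
        popped_ids = {t.id for t in popped}
        for task in popped:
            task.status = "dispatched"
        self._items = [t for t in self._items if t.id not in popped_ids]
        return popped


queue = CombinedQueue()
queue.push(Task(id="a", organisation="org-1", priority=2))
queue.push(Task(id="b", organisation="org-2", priority=1))
queue.push(Task(id="c", organisation="org-1", priority=1))

# Batch pop for a single organisation from the shared queue
batch = queue.pop(limit=2, where=lambda t: t.organisation == "org-1")
```

Because every task carries its organisation, a single queue can still serve per-organisation requests; the task for `org-2` stays queued in this example.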

Next steps and impact

The pop endpoint has changed from /queues to /schedulers/{id}/pop. Additionally, it returns a paginated result instead of a single Task, because the pop endpoint now supports filtering and can return multiple tasks. Services that rely on the scheduler pop endpoint need to update their interfaces. (Update services that rely on /pop endpoint of scheduler #3961)
The push endpoint has changed from /queues to /schedulers/{id}/push. Services that interface with the push endpoint (rocky) need to update their interfaces. (Update services that rely on /push endpoint of scheduler #3962)
Scan profile mutation message queue: currently a message queue is created for every organisation for scan profile mutations. This needs to be updated so that all scan profile mutations for every organisation are relayed on a single scan profile mutations message queue. (Combine organisational scan profile mutation message queues #3963)
Raw data file received message queue: the same as with scan profile mutations. (Combine raw data recieved message queues #3964)
Model definition updates: organisation fields are added to Task and Schedule. Services using these models need to update their specifications. (Model definition updates: addition of organisation field to Task and Schedule models #3965)
Batch status updates: in several places in the scheduler we can consider batch updating the status field of tasks, potentially exposing this as an endpoint. Because the task runner will be able to pop multiple tasks from a scheduler, it might be beneficial to batch update those tasks.
Deleting organisations needs to be addressed, including what the protocol should be with regard to queued tasks and schedules. (Discussion: how do we handle the deletion of organisations? #3966)
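
For services that need to adapt to the new pop endpoint, the change could look roughly like the sketch below. It uses only the standard library; the `count` field, the shape of the task dictionaries, and the scheduler id `"boefje"` are assumptions, while the `results` attribute, the `limit` query parameter, and the possibility of a null body come from the diff in this PR.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class PaginatedTasks:
    count: int  # assumed field name
    results: list[dict]  # `results` appears in the diff; the item shape is assumed


def pop_url(scheduler_id: str, limit: int = 1) -> str:
    """Build the path of the new pop endpoint."""
    return f"/schedulers/{scheduler_id}/pop?limit={limit}"


def parse_pop_response(body: bytes) -> PaginatedTasks | None:
    """Parse a pop response; a JSON null body means nothing was popped."""
    payload = json.loads(body)
    if payload is None:
        return None
    results = payload.get("results", [])
    return PaginatedTasks(count=payload.get("count", len(results)), results=results)


# Hypothetical response body, for illustration only
body = b'{"count": 2, "results": [{"id": "t1", "organisation": "org-1"}, {"id": "t2", "organisation": "org-1"}]}'
page = parse_pop_response(body)
tasks = page.results if page is not None else []
```

A caller that still handles one task at a time can request `limit=1` and take the first element of `results`, which is what the boefjes client in this PR does.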

Issue link

#3838

QA notes

The process of scanning and rescheduling tasks should remain the same from a user's point of view. Testing this would be the main focus. Additionally, since all organisational schedulers have been combined into one, particular emphasis should be placed on checking that users can only see, or interact with, tasks for the organisations they are allowed to access.

Code Checklist

All the commits in this PR are properly PGP-signed and verified.
This PR only contains functionality relevant to the issue.
I have written unit tests for the changes or fixes I made.
I have checked the documentation and made changes where necessary.
I have performed a self-review of my code and refactored it to the best of my abilities.

Tickets have been created for newly discovered issues.
For any non-trivial functionality, I have added integration and/or end-to-end tests.
I have informed others of any required .env file changes and changed the .env-dist accordingly.
I have included comments in the code to elaborate on what is not self-evident from the code itself, including references to issues and discussions online, or implicit behavior of an interface.

Checklist for code reviewers:

Copy-paste the checklist from the docs/source/templates folder into your comment.

Checklist for QA:

Copy-paste the checklist from the docs/source/templates folder into your comment.

¹ The attentive reader will know that the organisation information is present in the data specification in a Task, and can be filtered on by using a FilterRequest. However, in practice the scheduler is mainly used in the context of multiple organisations, and filtering on this id is deemed developer friendly when interfacing with the scheduler. Hence the choice of explicitly defining the organisation field at the top level. ↩
² As mentioned in the code, iterating over all organisations in order to check whether there are enabled plugins is very inefficient; please refer to issue Katalogus caching in the scheduler #3357. ↩

* main: (64 commits)
- Bug fix: KAT-alogus parameter is now organization member instead of organization code (#3895)
- Remove sigrid workflows (#3920)
- Updated packages (#3898)
- Fix mula migrations Debian package (#3919)
- Adds loggers to report flow (#3872)
- Add additional check if task already run for report scheduler (#3900)
- Create separate finding for Microsoft RDP port (#3882)
- fix: 🐛 allow boefje completion with 404 (#3893)
- Feature/improve rename bulk modal (#3885)
- Update scheduler folder structure (#3883)
- Translations update from Hosted Weblate (#3870)
- Increase max number of PostgreSQL connections (#3889)
- Fix for task id as valid UUID (#3744)
- Add `auto_calculate_deadline` attribute to Scheduler (#3869)
- Ignore specific url parameters when following location headers (#3856)
- Let mailserver inherit l1 (#3704)
- Change plugins enabling in report flow to checkboxes (#3747)
- Fix rocky katalogus tests and delete unused fixtures (#3884)
- Enable/disable scheduled reports (#3871)
- optimize locking in katalogus.py, reuse available data (#3752)
- ...

jpbruinsslot · 2025-01-15T16:08:00Z

mula/scheduler/app.py

-        with self.lock:
-            if scheduler_id not in self.schedulers:
-                return
+        # FIXME:: can be queries instead of a loop


underdarknl · 2025-01-21T15:07:15Z

boefjes/boefjes/clients/scheduler_client.py

+        response = self._session.post(f"/schedulers/{scheduler_id}/pop?limit=1")
        self._verify_response(response)

-        return TypeAdapter(list[Queue]).validate_json(response.content)
+        page = TypeAdapter(PaginatedTasksResponse | None).validate_json(response.content)


If the limit is hard coded to 1, do we really need a paginated response, especially when we then return only the first Task on line 89?

This is specifically left in to make the new changes work with the current setup, in which the task runner handles one task at a time. The endpoint now allows batches of tasks to be popped, as was requested in feature requests for the scheduler. I made @Donnype aware of these changes and he will be thinking about the necessary changes to the task runner to leverage this new functionality.

However, come to think of it, I need to check whether pagination works correctly when tasks are being popped and set to dispatched.

Makes sense, I was thinking it might have been because of the multi-pop support. But I'm guessing that would introduce either an extra param (limit?) or a new method?

Popping and changing the status (for any number of tasks) should be handled in the same database transaction, I'd say. Hopefully relying on table locking to ensure a task is not popped twice simultaneously.
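
A rough sketch of the "pop and mark as dispatched in one transaction" idea follows. It uses the standard library's sqlite3 so that it is self-contained; the table layout and the `pop_tasks` helper are assumptions, not the scheduler's code. SQLite serialises writers with `BEGIN IMMEDIATE`; on PostgreSQL, row-level locking such as `SELECT ... FOR UPDATE SKIP LOCKED` would be the more typical way to get the same guarantee.

```python
import sqlite3


def pop_tasks(conn: sqlite3.Connection, limit: int) -> list[str]:
    """Select up to `limit` queued tasks and mark them dispatched atomically."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        rows = conn.execute(
            "SELECT id FROM tasks WHERE status = 'queued' ORDER BY priority, id LIMIT ?",
            (limit,),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE tasks SET status = 'dispatched' WHERE id = ?",
            [(task_id,) for task_id in ids],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return ids


# isolation_level=None puts the connection in autocommit mode,
# so transactions are controlled explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, priority INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, 'queued')",
    [("t1", 2), ("t2", 1), ("t3", 3)],
)

first = pop_tasks(conn, limit=2)
second = pop_tasks(conn, limit=2)
```

Since the select and the status update commit together, a second pop can never return a task that the first pop already handed out.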

underdarknl · 2025-01-21T15:10:59Z

boefjes/tests/conftest.py

+                response = TypeAdapter(PaginatedTasksResponse).validate_json(self.boefje_responses.pop(0))
+                p_item = response.results[0]


paginated responses when we are only dealing with single results.

underdarknl · 2025-01-21T15:17:50Z

mula/scheduler/models/schedule.py

    schedule = Column(String, nullable=True)
-
    tasks = relationship("TaskDB", back_populates="schedule")

    deadline_at = Column(DateTime(timezone=True), nullable=True)


Should this (and created_at and modified_at) not be set to func.utcnow() to make sure we don't rely on the possibly missing UTC flag on the connection?

For deadline_at it is significant that it can be set to None. Setting a default when inserting a Schedule into the database has consequences for code that relies on this. This should remain as is.

Since this is a model definition that translates to a SQL migration, the func.now() function generates the SQL NOW(), which retrieves the current timestamp from the database server. If the database server is configured to use UTC, NOW() will return the current time in UTC. As far as I know there isn't a func.utcnow() function in sqlalchemy. We can, however, force the database server to use UTC as its default timezone in the migrations with:

SET timezone = 'UTC';

Alternatively, I think we might also be able to set the time using Python's datetime.now(timezone.utc):

modified_at = Column(
    DateTime(timezone=True),
    nullable=False,
    default=lambda: datetime.now(timezone.utc),
    onupdate=lambda: datetime.now(timezone.utc),
)

However, this will likely be evaluated in Python instead of in the database.

@ammar92 you have any particular thoughts about this?

Since this is a model definition that translates to a SQL migration, the func.now() function generates the SQL NOW(), which retrieves the current timestamp from the database server. If the database server is configured to use UTC, NOW() will return the current time in UTC. As far as I know there isn't a func.utcnow() function in sqlalchemy.

As far as I know the internally stored value for timestamp with time zone is always in UTC. So even if there were a function like UTCNOW(), it wouldn't matter for this non-naive datetime type.

However, this will likely be evaluated in Python instead of in the database.

It's indeed evaluated in Python land, but the performance impact is negligible. While it's a good alternative, using SQLAlchemy's function is generally preferred and more common for this use case.

Hm, sorry, I missed that it was an example:

https://docs.sqlalchemy.org/en/20/core/compiler.html#utc-timestamp-function
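
As a small standard-library illustration of the point about non-naive timestamps: a timezone-aware datetime identifies an instant, so converting it to another offset changes how it is displayed but not what it compares equal to. The fixed +01:00 offset below is just an example value.

```python
from datetime import datetime, timedelta, timezone

utc_now = datetime.now(timezone.utc)

# Fixed +01:00 offset, chosen only for illustration
plus_one = timezone(timedelta(hours=1))
same_instant = utc_now.astimezone(plus_one)

# Aware datetimes compare by instant, not by wall-clock fields
print(utc_now == same_instant)  # True
print(same_instant.utcoffset())  # 1:00:00

# A naive datetime carries no offset, so its meaning depends on context
naive = datetime.now()
print(naive.tzinfo is None)  # True
```

This is why the Python-side default in the snippet above uses `datetime.now(timezone.utc)` rather than a naive `datetime.now()`.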

underdarknl · 2025-01-21T15:18:33Z

mula/scheduler/models/task.py

    hash: str | None = Field(None, max_length=32)
-
    data: dict = Field(default_factory=dict)

    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))


also use func.utcnow() here.

underdarknl · 2025-01-21T15:26:40Z

mula/scheduler/schedulers/schedulers/boefje.py

-        # There should be an OOI in value
-        ooi = mutation.value
-        if ooi is None:
+        try:


I'd rather keep the try except for the ValidationError smaller and only around the actual validation code.
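
A minimal sketch of what a narrower try/except could look like. The `ValidationError` class, `parse_mutation`, `handle_mutation`, and the `primary_key` key are stand-ins invented for this example and are not the scheduler's actual code; only the "there should be an OOI in value" check mirrors the diff.

```python
from __future__ import annotations

import json
import logging

logger = logging.getLogger(__name__)


class ValidationError(ValueError):
    """Stand-in for the validation library's error type."""


def parse_mutation(body: bytes) -> dict:
    """Hypothetical parser: raises ValidationError on malformed input."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValidationError(str(exc)) from exc
    if not isinstance(payload, dict) or "value" not in payload:
        raise ValidationError("mutation has no 'value'")
    return payload


def handle_mutation(body: bytes) -> str | None:
    # Only the validation call sits inside the try block
    try:
        mutation = parse_mutation(body)
    except ValidationError:
        logger.exception("Unable to parse mutation, skipping")
        return None

    # Errors raised below are not mistaken for validation errors
    ooi = mutation["value"]
    if ooi is None:
        return None
    return ooi["primary_key"]
```

With the try block kept this small, an unexpected error in the remaining logic surfaces as its own exception instead of being logged as a parse failure.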

underdarknl · 2025-01-21T15:28:59Z

mula/scheduler/schedulers/schedulers/boefje.py

+            list(range(boefje.scan_level, 5)),  # type: ignore
+        )
+
+        # Filter OOIs based on permission


This should be done in the remote API / query. Let's add a #TODO for this.

underdarknl · 2025-01-21T15:29:35Z

mula/scheduler/schedulers/schedulers/boefje.py

-            boefje: Boefje to run.
-            ooi: OOI to run Boefje on.
-            caller: The name of the function that called this function, used for logging.
+                        # Boefje allowed to scan ooi?


This should be done in the remote API / query. Let's add a #TODO for this.
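
For reference, the in-Python permission check being discussed amounts to something like the sketch below. The `Boefje` and `OOI` dataclasses are simplified stand-ins, and the reading that an OOI may be scanned when its scan level is at least the boefje's scan level is an assumption based on `range(boefje.scan_level, 5)` in the diff. Pushing the same condition into the remote query would avoid fetching OOIs that are filtered out afterwards.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_SCAN_LEVEL = 4  # range(boefje.scan_level, 5) in the diff stops at 4


@dataclass
class Boefje:
    id: str
    scan_level: int


@dataclass
class OOI:
    primary_key: str
    scan_level: int


def allowed_scan_levels(boefje: Boefje) -> list[int]:
    """Scan levels an OOI must have for the boefje to be allowed to scan it."""
    return list(range(boefje.scan_level, MAX_SCAN_LEVEL + 1))


def filter_allowed(boefje: Boefje, oois: list[OOI]) -> list[OOI]:
    """Keep only the OOIs the boefje is permitted to scan (assumed rule)."""
    levels = set(allowed_scan_levels(boefje))
    return [ooi for ooi in oois if ooi.scan_level in levels]


boefje = Boefje(id="example-boefje", scan_level=2)
oois = [OOI("ooi-a", 1), OOI("ooi-b", 2), OOI("ooi-c", 4)]
permitted = filter_allowed(boefje, oois)
```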

underdarknl · 2025-01-21T15:31:40Z

mula/scheduler/schedulers/schedulers/report.py

-                scheduler_id=self.scheduler_id,
-            )
-            return True
+    def has_schedule_permission_to_run(self, schedule: models.Schedule) -> bool:


Can we somehow move this to the original query that pops our schedules, instead of evaluating this in python time?

underdarknl · 2025-01-21T15:32:55Z

mula/scheduler/schedulers/schedulers/report.py

+                    filters.Filter(column="deadline_at", operator="lt", value=datetime.now(timezone.utc)),
+                    filters.Filter(column="enabled", operator="eq", value=True),


Can we add the correct joins/filters here to limit this result set to only schedules that are allowed to produce a task?

These filters should do that. Other evaluations of whether a task should be executed or not depend on the scheduler and subsequent cross-references checks to other services. E.g. for a boefje scheduler, does the ooi still exist, is the plugin still enabled, are we still allowed to scan, etc. These factors can't be evaluated by a database query.

Can they be set on the update events of those changes though? E.g., disabling / removing tasks for boefjes that are being disabled seems like the correct way forward, instead of keeping them around and checking them for an enabled boefje every time they could be executed.

Which is clearly out of scope for this PR, but still worth investigating.
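
To illustrate the event-driven alternative suggested above (it is not part of this PR), a handler for a "boefje disabled" event might look like the following sketch. The `QueuedTask` shape, the `cancelled` status value, and the `on_boefje_disabled` function are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueuedTask:
    id: str
    organisation: str
    boefje_id: str
    status: str = "queued"


def on_boefje_disabled(tasks: list[QueuedTask], organisation: str, boefje_id: str) -> list[str]:
    """Cancel queued tasks for a boefje at the moment it is disabled."""
    cancelled = []
    for task in tasks:
        if task.status == "queued" and task.organisation == organisation and task.boefje_id == boefje_id:
            task.status = "cancelled"
            cancelled.append(task.id)
    return cancelled


tasks = [
    QueuedTask("t1", "org-1", "dns-records"),
    QueuedTask("t2", "org-2", "dns-records"),
    QueuedTask("t3", "org-1", "dns-records", status="dispatched"),
]
cancelled_ids = on_boefje_disabled(tasks, organisation="org-1", boefje_id="dns-records")
```

The trade-off is that the scheduler then depends on receiving such events reliably, whereas the current approach re-checks the state each time a task could be executed.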

jpbruinsslot added 13 commits October 31, 2024 09:28

Migrate schedulers to postgres

2783e5b

Implement scheduler storage

f6030fc

Restructure app

5dbf526

Update

ef00d7d

Taking out the trash

0f26d7f

Clean up on aisle six

6838bf3

Kondofy

5279369

Sweeping the floor

718df53

Dust-off

605eb78

Brush off

47085d1

Squeaky clean

d5bbf55

Wash cycle

43bd6b0

Rinse

cc0bec4

jpbruinsslot added the mula Issues related to the scheduler label Nov 13, 2024

jpbruinsslot self-assigned this Nov 13, 2024

jpbruinsslot mentioned this pull request Nov 13, 2024

Combine all schedulers for all organisations #3838

Open

jpbruinsslot added 3 commits November 18, 2024 10:28

Shiny

6d7bd0e

Polish

33a72d7

Combining organisation schedulers

da3f337

jpbruinsslot force-pushed the poc/mula/combined-schedulers branch from 57d040f to da3f337 Compare November 18, 2024 13:29

jpbruinsslot added 2 commits November 18, 2024 17:53

Revert to in-memory

05f36d4

Update

32f06ed

jpbruinsslot linked an issue Nov 21, 2024 that may be closed by this pull request

Combine all schedulers for all organisations #3838

Open

Refactor organisations

9ccecba

jpbruinsslot mentioned this pull request Nov 25, 2024

Update scheduler folder structure #3883

Merged

9 tasks

jpbruinsslot changed the title ~~POC: combined schedulers~~ Combined schedulers Dec 2, 2024

jpbruinsslot added 4 commits December 2, 2024 18:03

Update tests

9c17a32

Update

38d436f

Made schedulers work

0896efe

jpbruinsslot added 4 commits January 15, 2025 15:24

Merge branch 'main' into poc/mula/combined-schedulers

d995b68

Precommit

3f0f3eb

Fix task runner

e772838

Boefjes combined schedulers integration (#4015)

b31c366

jpbruinsslot commented Jan 15, 2025

Fix task stats

724fb39

underdarknl added this to the OpenKAT v1.19 milestone Jan 21, 2025

Fix scheduler stats

74764d7

underdarknl reviewed Jan 21, 2025

Update

72d3097

underdarknl reviewed Jan 21, 2025

jpbruinsslot and others added 12 commits January 27, 2025 17:55

Add tests

d5c0fb6

Update

9a3a6c9

Merge branch 'main' into poc/mula/combined-schedulers

b27a225

Fix robot test

b5265b6

Update report runner

f557a56

Fix tests

ce2cd07

Fix tests

aacf0df

Fix tests

9bd04f2

Fix integration tests of Bytes

740c354

Implement naive timebased ranker for boefje scheduler

67ba5c4

Precommit

9803c10

Merge branch 'main' into poc/mula/combined-schedulers

a23863a
