Refactor in-memory schedulers to postgresql table #3358

jpbruinsslot · 2024-08-13T15:08:14Z

Current situation

The running schedulers (boefje, normalizer, and report) can be accessed through the REST API. Detailed information about these schedulers is kept in memory in an application-wide dict.

nl-kat-coordination/mula/scheduler/app.py

Lines 61 to 67 in bf3267a

    
    self.schedulers: dict[
        str,
        schedulers.Scheduler
        | schedulers.BoefjeScheduler
        | schedulers.NormalizerScheduler
        | schedulers.ReportScheduler,
    ] = {}

To find out which queues are available, the /queues endpoint can be called (the task runner does this to determine which queues to pop from):

nl-kat-coordination/mula/scheduler/server/handlers/queues.py

Lines 56 to 57 in bf3267a

    
    def list(self) -> Any:
        return [models.Queue(**s.queue.dict(include_pq=False)) for s in self.schedulers.copy().values()]

This will iterate over all the available schedulers and construct queue representations. The same goes for the available schedulers:

nl-kat-coordination/mula/scheduler/server/handlers/schedulers.py

Lines 45 to 46 in bf3267a

    
    def list(self) -> Any:
        return [models.Scheduler(**s.dict()) for s in self.schedulers.values()]

Currently, we don't have any filtering possibilities for these endpoints, meaning a task runner needs to poll the scheduler for available schedulers to pop from and iterate over them.

Suggested changes

consolidate the /queues and /schedulers endpoints, since they are interchangeable
move scheduler configuration and settings to a postgres table
implement filtering of available schedulers in the REST API
change the /pop endpoint to support popping multiple tasks (batches), and add more filtering options (e.g. pop tasks for multiple organisations)
optional: leverage ETag (Entity Tag) or Last-Modified headers on the scheduler endpoint
NEW: create one BoefjeScheduler, NormalizerScheduler, and ReportScheduler for all organisations instead of individual schedulers for every organisation, with one message queue for all scan profile mutations and raw file creation (Combine all schedulers for all organisations #3838)
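As a rough illustration of the batched /pop idea, here is a minimal sketch that pops up to a given number of tasks from a single priority queue while filtering on a set of organisations. The `Task` class, the `pop_batch` function, and the in-memory `heapq` queue are hypothetical and only stand in for the scheduler's real queue implementation.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Hypothetical task model: only `priority` takes part in ordering.
    priority: int
    id: str = field(compare=False)
    organisation: str = field(compare=False)


def pop_batch(queue: list[Task], organisations: set[str], limit: int) -> list[Task]:
    """Pop up to `limit` tasks (lowest priority value first) for the given organisations."""
    popped: list[Task] = []
    skipped: list[Task] = []
    while queue and len(popped) < limit:
        task = heapq.heappop(queue)
        if task.organisation in organisations:
            popped.append(task)
        else:
            skipped.append(task)
    # Put back the tasks that did not match the filter.
    for task in skipped:
        heapq.heappush(queue, task)
    return popped


queue: list[Task] = []
for task in [
    Task(3, "a", "org-1"),
    Task(1, "b", "org-2"),
    Task(2, "c", "org-3"),
    Task(4, "d", "org-1"),
]:
    heapq.heappush(queue, task)

batch = pop_batch(queue, organisations={"org-1", "org-2"}, limit=2)
batch_ids = [t.id for t in batch]
```

With a database-backed queue the same filter and limit would more naturally be expressed in the query itself; the sketch only shows the intended behaviour of the endpoint.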

New Functionality

faster overview and querying of all available schedulers, without relying on iterating over the in-memory schedulers
REST API filtering options allow for specific retrieval of schedulers (e.g. filtering by created_at to retrieve schedulers that have been created since a specific timestamp)
faster start-up: for already defined schedulers we can reference the database to create the running schedulers
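To make the filtering idea concrete, here is a minimal sketch of a schedulers table with optional filters on type and created_at. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for postgres, and the table layout, column names, and the `list_schedulers` helper are assumptions, not the actual mula schema.

```python
import sqlite3

# In-memory SQLite stands in for the postgres table (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE schedulers (
        id TEXT PRIMARY KEY,
        type TEXT NOT NULL,
        organisation TEXT NOT NULL,
        enabled INTEGER NOT NULL DEFAULT 1,
        created_at TEXT NOT NULL  -- ISO 8601 UTC, so string comparison orders by time
    )
    """
)
conn.executemany(
    "INSERT INTO schedulers (id, type, organisation, created_at) VALUES (?, ?, ?, ?)",
    [
        ("boefje-org1", "boefje", "org1", "2024-08-01T00:00:00Z"),
        ("normalizer-org1", "normalizer", "org1", "2024-08-01T00:00:00Z"),
        ("boefje-org2", "boefje", "org2", "2024-08-10T00:00:00Z"),
    ],
)


def list_schedulers(conn, scheduler_type=None, created_after=None):
    """Return ids of enabled schedulers, optionally filtered by type and creation time."""
    query = "SELECT id FROM schedulers WHERE enabled = 1"
    params = []
    if scheduler_type is not None:
        query += " AND type = ?"
        params.append(scheduler_type)
    if created_after is not None:
        query += " AND created_at > ?"
        params.append(created_after)
    query += " ORDER BY created_at, id"
    return [row[0] for row in conn.execute(query, params)]


recent = list_schedulers(conn, created_after="2024-08-05T00:00:00Z")
boefjes = list_schedulers(conn, scheduler_type="boefje")
```

A task runner could then ask only for schedulers created since its last poll instead of iterating over the full list each time.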

Considerations

Because of the way organisations are currently referenced in OpenKAT, we are still bound at start-up to check which organisations are available in the katalogus.
Additionally, we are still bound to periodically checking the katalogus for new or removed organisations. This can be optimized by sending a signal (either REST or AMQP) to the scheduler to create a scheduler for a new organisation.
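The signal-based alternative to periodic polling could look roughly like the sketch below. The event shape (`type` and `organisation` keys), the event names, and the `SchedulerRegistry` class are all hypothetical; how the event arrives (REST call or AMQP message) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerRegistry:
    # Maps organisation id to a scheduler id (hypothetical, simplified model).
    schedulers: dict[str, str] = field(default_factory=dict)

    def handle_event(self, event: dict) -> None:
        """Create or remove a scheduler in response to an organisation event."""
        org = event["organisation"]
        if event["type"] == "organisation_created":
            # setdefault keeps handling idempotent if the event is delivered twice.
            self.schedulers.setdefault(org, f"scheduler-{org}")
        elif event["type"] == "organisation_removed":
            self.schedulers.pop(org, None)


registry = SchedulerRegistry()
registry.handle_event({"type": "organisation_created", "organisation": "org-1"})
registry.handle_event({"type": "organisation_created", "organisation": "org-1"})
registry.handle_event({"type": "organisation_created", "organisation": "org-2"})
registry.handle_event({"type": "organisation_removed", "organisation": "org-2"})
```

Keeping the handler idempotent means a periodic katalogus check can remain in place as a fallback without creating duplicate schedulers.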


jpbruinsslot · 2025-01-06T10:30:57Z

superseded by #3838

jpbruinsslot added the mula and tech-debt labels Aug 13, 2024

jpbruinsslot self-assigned this Aug 13, 2024

underdarknl added this to KAT Oct 3, 2024

github-project-automation bot moved this to Incoming features / Need assessment in KAT Oct 3, 2024

underdarknl moved this from Incoming features / Need assessment to Approved features / Need refinement in KAT Oct 3, 2024

underdarknl added the scalability label Oct 3, 2024

jpbruinsslot moved this from Approved features / Need refinement to Backlog / To do in KAT Oct 28, 2024

jpbruinsslot mentioned this issue Oct 30, 2024

Investigate scheduler queue saturation #3765

Open

jpbruinsslot moved this from Backlog / To do to In Progress in KAT Oct 31, 2024

This was referenced Nov 11, 2024

Refactor schedulers #3793

Closed

Cross-organization worker polling (performance) #3826

Closed

jpbruinsslot closed this as completed Jan 6, 2025

github-project-automation bot moved this from Blocked to Done in KAT Jan 6, 2025
