feat: job scheduler - run jobs continuously [DHIS2-16004] #15466
Conversation
Kudos, SonarCloud Quality Gate passed!
Codecov Report
@@ Coverage Diff @@
## master #15466 +/- ##
============================================
+ Coverage 66.10% 66.12% +0.01%
- Complexity 31115 31139 +24
============================================
Files 3484 3486 +2
Lines 129448 129547 +99
Branches 15095 15107 +12
============================================
+ Hits 85577 85659 +82
- Misses 36792 36807 +15
- Partials 7079 7081 +2
Flags with carried forward coverage won't be shown.
... and 21 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Nice. A test to cover this new behaviour would be good to have in the e2e suite. Checking that the allowed (continuous) job type finishes within 20 seconds?? Otherwise we'll never know (from a test) if this functionality breaks after it passes manual testing.
Yes, if multiple small jobs of the continuous type all finish within the 20 seconds, that would confirm it. I will have a look if I can create an e2e test :)
I have not forgotten about adding a test, but there are currently many other topics I am working on that block further development and that I need to prioritize, so I merged this now, as the added feature is also a precondition for that work.
Summary
Adds a feature to the scheduler that allows jobs to execute faster than one per type per loop cycle.
For fast jobs, such as small imports that take only 1-2 seconds, it is important to run more than one per 20-second loop cycle.
Otherwise jobs might be added faster, or in higher quantity, than the scheduler can execute them, since it is otherwise limited to one per type per 20 seconds. That is 3 per minute, 180 per hour, 4320 per day, which for imports might neither be enough nor fast enough.
Usage
For future job types to use this, their enum constant needs to be added to the new https://github.com/dhis2/dhis2-core/pull/15466/files#diff-3ad347fffaf9c8a6f46a781e97b1a5d9770a5fecba98121543414e69428baa48R232 method, roughly as sketched below.
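A minimal sketch of what that can look like, assuming the linked method lives on the JobType enum and is called isUsingContinuousExecution (both the name and location are assumptions here; the actual ones are in the linked diff):

```java
// Hypothetical sketch only; see the linked diff for the real enum and method.
public enum JobType {
  METADATA_IMPORT,
  TRACKER_IMPORT_JOB,
  ANALYTICS_TABLE;

  /**
   * Job types returning true may execute more than once per
   * 20-second scheduler loop cycle.
   */
  public boolean isUsingContinuousExecution() {
    // add the enum constant of a new "continuous" job type to this check
    return this == METADATA_IMPORT || this == TRACKER_IMPORT_JOB;
  }
}
```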
Implementation Notes
To allow importing faster, a job type can be hard-coded as being continuous. For job types of that nature, the loop pushes all jobs (by ID) that are potentially ready onto a per-type queue. If no queue exists yet, a new one is created for the type and a worker is spawned that works the queue until all jobs in it are processed. The main loop keeps potentially adding to the queue in every loop cycle in the background, while the worker polls and executes the jobs as fast as possible, still ensuring that only one job per type runs at a time.
The queue only remembers each job's ID and reloads the job configuration when it is time to execute it, because the job configuration may have changed in the database since the job was added to the queue.
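A rough sketch of this loop-side bookkeeping, reusing the JobType enum from the sketch above (the class name, map layout and method names are illustrative assumptions, not the actual dhis2-core code):

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;

// Illustrative sketch: one queue of job IDs per continuous job type.
class ContinuousJobQueues {

  private final Map<JobType, Queue<String>> queuesByType = new ConcurrentHashMap<>();
  private final ExecutorService workers;

  ContinuousJobQueues(ExecutorService workers) {
    this.workers = workers;
  }

  /** Called once per 20-second loop cycle for each continuous job type. */
  void enqueueReadyJobs(JobType type, List<String> readyJobIds) {
    Queue<String> queue = queuesByType.computeIfAbsent(type, t -> new ConcurrentLinkedQueue<>());
    // an empty queue is taken to mean no worker is currently working this type
    boolean spawnWorker = queue.isEmpty();
    // only the job IDs are remembered; configurations are reloaded at execution time
    queue.addAll(readyJobIds);
    if (spawnWorker && !readyJobIds.isEmpty()) {
      workers.submit(() -> workQueue(type, queue));
    }
  }

  private void workQueue(JobType type, Queue<String> queue) {
    // worker side, see the sketch further below
  }
}
```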
A worker working the queue always assumes the queue was created for it, so when all entries in the queue have been processed, the worker clears the queue before it ends, so that another worker can be spawned in the future when a new job of that job type is created.
There is one unlikely race: the loop cycle checks whether the queue is empty to decide if a worker should be spawned, and the queue may be empty only because the current worker is just processing the last entry. If a new entry has been created since, a second worker is spawned for the same queue. Two workers will then poll from the same queue to empty it, and both will try to remove it when done. This is not an issue: they will poll different jobs, and the database still limits execution to one job of a type at a time. That means a job might try to run while the other worker's job has just started; it then fails to start and is re-added to the queue in the next loop cycle as a job that is ready to run.
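The matching worker side of the same sketch, continuing the ContinuousJobQueues class above (JobConfiguration stands in for the real job configuration class; loadConfiguration, tryStart and execute are hypothetical helpers, not the actual service API):

```java
// Worker side of the sketch above; runs until the per-type queue is drained.
private void workQueue(JobType type, Queue<String> queue) {
  String jobId;
  while ((jobId = queue.poll()) != null) {
    // reload the configuration; it may have changed since the ID was queued
    JobConfiguration config = loadConfiguration(jobId);
    if (config == null) {
      continue; // the job was deleted in the meantime
    }
    // the database limits execution to one running job per type; if another
    // worker's job of this type just started, this start attempt fails and the
    // loop re-adds the job in a later cycle once it is ready again
    if (tryStart(config)) {
      execute(config);
    }
  }
  // queue drained: remove it so a future loop cycle may spawn a new worker
  queuesByType.remove(type, queue);
}

// hypothetical stand-ins for the real scheduling and database services
private JobConfiguration loadConfiguration(String jobId) { /* DB lookup by ID */ return null; }
private boolean tryStart(JobConfiguration config) { /* atomic "start" check in the DB */ return false; }
private void execute(JobConfiguration config) { /* run the job to completion */ }
```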
The goal of this implementation is that, while there is in-memory state, the design is robust against getting stuck or out of sync.
In the worst case a job might be delayed and run one loop cycle (20 seconds) later than it could have, but it will eventually be picked up and run if it is ready.
Manual Testing
Check the lastExecuted time for the created jobs (jobType=METADATA_IMPORT); they should all be started quickly one after another and not be about 20 seconds apart, e.g. /api/jobConfigurations/gist?filter=jobType:eq:METADATA_IMPORT