Feature request: job cluster synchronization to avoid running the same job twice #16
Comments
Hi @newnewcoder, sorry for the late answer. Can you detail your use case? |
In my case, I need to run the scheduler application separately on two nodes behind a load balancer, but with the same job/scheduling state. Any suggestions? |
I have a question: do you want to execute the same jobs on both nodes, or do you want to load-balance your jobs across your node instances (so if you have 2 jobs, one is executed on node 1 and the other on node 2)? |
Neither case. I use Wisp in a web project that sits behind a load balancer, and I don't want the same job to run twice. |
Ok I see! The "easy" solution is to have a dedicated node that runs the jobs. But it is actually a good idea to provide, in the future, a way to synchronize jobs across a cluster. We might provide another Wisp library for this specific purpose! |
An external service must be used to provide the cluster synchronization feature. For a first version, a database will be used for that. Other implementations connecting to other services may be provided in the future. |
I've implemented this already in my own code, but thought I'd share my strategy as it may help you decide on an implementation. In my case any one of my nodes could go unhealthy, or be removed from the cluster because of scaling down, so the node that runs the task can change between invocations.

Here's a use case: every cron period (e.g. 2am every morning) the cron wakes up the Scheduler on every node (x 10). In my implementation (which needs refining to be perfect), each node checks how long ago a node last elected itself to perform the work; if it was more than 10s ago, it writes to the database to say that it is alive and ready to run the task. Once this election process has started, the task sleeps for 5s (very generous; I use MongoDB and wasn't 100% sure how long it would take after writing to ensure the value is persisted in the database and propagated to the other nodes). My MongoStorage.setOrUpdateConfigByNameAndSyncWithSqs method is what persists this value, which is also cached in memory, AND it sends messages to the other nodes using AWS SQS (Simple Queue Service). These messages are received by the other nodes to invalidate their cache, forcing them to read the database rather than the cache when next queried (in <5s time).

After this generous sleep, if the hostname of the machine running the task is still the same as the hostname persisted in the database, then only that host runs the task; the others exit like a no-op. The code for my implementation is below, feel free to use, ignore or pilfer any good bits.
I do plan to improve the algorithm slightly...
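A minimal sketch of the election flow described above, assuming a hypothetical `ClusterStateStore` abstraction in place of the MongoDB and SQS wiring (names and the 10s/5s windows are only illustrative, not the actual implementation):

```java
import java.net.InetAddress;
import java.time.Duration;
import java.time.Instant;

// Hypothetical storage contract standing in for the MongoDB + SQS layer described above
interface ClusterStateStore {
    Instant lastElectionTime(String taskName);
    void claimTask(String taskName, String hostname, Instant when);
    String claimedBy(String taskName);
}

// Each node calls shouldRunTask() when the cron fires; only the node whose
// claim survives the propagation delay actually runs the task.
public class TaskElection {
    private static final Duration ELECTION_WINDOW = Duration.ofSeconds(10);
    private static final Duration PROPAGATION_DELAY = Duration.ofSeconds(5);

    private final ClusterStateStore store;

    public TaskElection(ClusterStateStore store) {
        this.store = store;
    }

    public boolean shouldRunTask(String taskName) throws Exception {
        String hostname = InetAddress.getLocalHost().getHostName();

        // Join the election only if no node has claimed the task recently
        Instant lastElection = store.lastElectionTime(taskName);
        if (lastElection == null || lastElection.isBefore(Instant.now().minus(ELECTION_WINDOW))) {
            store.claimTask(taskName, hostname, Instant.now());
        }

        // Generous sleep so the claim has time to reach the other nodes
        Thread.sleep(PROPAGATION_DELAY.toMillis());

        // Run the task only if this node's claim is still the one persisted
        return hostname.equals(store.claimedBy(taskName));
    }
}
```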
|
@devology-rob Wow, thank you for sharing this! On my side, I thought about making each node try to run an SQL update query; only the node whose update succeeds would then execute the job. I'll keep your implementation in mind though, it might be useful for more complex cases! |
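A minimal sketch of what such an update query could look like with plain JDBC; the `job_execution` table and its column names are made up for illustration and are not part of Wisp:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Hypothetical "first UPDATE wins" guard: the table and column names
// (job_execution, job_name, next_execution_time, executing_node) are illustrative only.
public class DatabaseJobLock {

    // Returns true only on the single node whose UPDATE actually changed a row
    public static boolean tryAcquire(Connection connection, String jobName,
                                     Instant scheduledTime, String nodeName) throws SQLException {
        String sql = "UPDATE job_execution"
            + " SET executing_node = ?, execution_started_at = ?"
            + " WHERE job_name = ? AND next_execution_time = ? AND executing_node IS NULL";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, nodeName);
            statement.setTimestamp(2, Timestamp.from(Instant.now()));
            statement.setString(3, jobName);
            statement.setTimestamp(4, Timestamp.from(scheduledTime));
            // executeUpdate() returns the number of rows changed: exactly one node
            // gets 1 (and runs the job), every other node gets 0 and skips it.
            return statement.executeUpdate() == 1;
        }
    }
}
```

The database's row-level atomicity does the arbitration here: whichever node commits the update first sees a row count of 1 and runs the job, while the others see 0 and treat the invocation as a no-op.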
This feature actually requires some work, especially because querying the database may bring in dependencies; ideally this would be implemented behind a minimal abstraction. I do not have a lot of time to work on this lately, but if someone wants to start implementing something from the v3 branch, which introduces more modularization, I would be happy to review it (if possible, it would be great to make intermediate PRs, to validate the work little by little). |
Is it possible to add a Scheduler interface that users can implement?
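For illustration, a pluggable hook along these lines could serve that purpose; this interface is only a suggestion of what such an extension point might look like, not something that exists in Wisp today:

```java
import java.time.Instant;

// Hypothetical extension point: the scheduler would call this before each job
// execution, and users would plug in their own implementation
// (SQL database, MongoDB, ZooKeeper, Redis...).
public interface JobClusterSynchronizer {

    /**
     * Returns true if the current node is allowed to run this execution of the job.
     * Exactly one node in the cluster should return true for a given
     * (jobName, scheduledTime) pair.
     */
    boolean tryAcquireExecution(String jobName, Instant scheduledTime);

    /**
     * Called once the job has finished, so the implementation can release
     * any lock or record the execution result.
     */
    void releaseExecution(String jobName, Instant scheduledTime);
}
```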