Long queue monitoring and alerting #501
Hey @danwetherald, sorry about that! Were you using a different Active Job backend before and had monitoring set up for that?
Thanks for getting back to me @rosa. No, this is a new project. We added a few jobs that generate more jobs, and after looking into why some data syncs were not happening, we found that the queue had grown extremely long because of the added data sync job, causing the recurring job queues to get exponentially longer over time. Since jobs are simply an Active Record model, my idea was to add some queue-length checks somewhere, so that, for example, a Slack message is sent when a queue becomes too long or the average job wait time exceeds a certain value. My first thought is to set up a high-priority recurring job that runs every few minutes regardless of queue length (thanks to its higher priority), checks the lengths of the queues, and notifies me if needed. Once something like this is set up, we could also play around with the Machines API from Fly to add more workers to help with large queues.
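As a rough illustration of that idea, a recurring high-priority job could count ready executions per queue and post to Slack when a threshold is crossed. This is only a sketch: the job class, the threshold, and the SLACK_WEBHOOK_URL variable are placeholders, and it reads Solid Queue's own tables directly.

require "net/http"
require "json"

# Hypothetical alerting job: runs every few minutes from a high-priority queue
# and notifies Slack when any queue's backlog exceeds a threshold.
class QueueDepthAlertJob < ApplicationJob
  queue_as :high_priority

  DEPTH_THRESHOLD = 1_000 # adjust to whatever counts as "too long" for you

  def perform
    # Jobs that are ready to run but not yet claimed, grouped by queue
    depths = SolidQueue::ReadyExecution.group(:queue_name).count
    long_queues = depths.select { |_queue, depth| depth > DEPTH_THRESHOLD }
    return if long_queues.empty?

    lines = long_queues.map { |queue, depth| "#{queue}: #{depth} jobs waiting" }
    notify_slack("Solid Queue backlog alert:\n" + lines.join("\n"))
  end

  private

  def notify_slack(text)
    uri = URI(ENV.fetch("SLACK_WEBHOOK_URL"))
    Net::HTTP.post(uri, { text: text }.to_json, "Content-Type" => "application/json")
  end
end

Scheduling it every few minutes through Solid Queue's recurring tasks configuration would keep it running even while other queues are backed up, thanks to its higher priority.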
Ahhh got it. It really depends on your setup 🤔 We use Yabeda and Prometheus, so we export some metrics like this:

module Yabeda
  module SolidQueue
    def self.install!
      Yabeda.configure do
        group :solid_queue

        gauge :jobs_failed_count, comment: "Number of failed jobs"
        gauge :jobs_unreleased_count, comment: "Number of claimed jobs that don't belong to any process"

        collect do
          if ::SolidQueue.supervisor?
            solid_queue.jobs_failed_count.set({}, ::SolidQueue::FailedExecution.count)
            solid_queue.jobs_unreleased_count.set({}, ::SolidQueue::ClaimedExecution.where(process: nil).count)
          end
        end
      end
    end
  end
end

One to check queue depth could look like this:

solid_queue.high_priority_queue_depth.set({}, ::SolidQueue::ReadyExecution.queued_as("production_high_priority").count)
Hi @rosa, that's great! I also wanted to set up monitoring and looked at yabeda-activejob, which seemed like a good candidate to add. But it turns out that, in my current setup, it reports only the Sidekiq jobs, since I have Sidekiq and SolidQueue working in parallel. It reads to me like that gem hooks into generic Active Job notifications and generates metrics based on those. Would it be possible to adjust SolidQueue so that it also exposes those metrics via the Active Job interface? Then we could use that gem, and there would be no need for SolidQueue-specific reporting, dashboards, etc. within Prometheus and Grafana.
Huh, this is very strange. Since it's subscribing to Active Job instrumentation events, it should work for both Sidekiq and Solid Queue at the same time. There must be something going on with your setup that prevents this from working, I think 😕
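For reference on how that subscription works: Active Job publishes ActiveSupport::Notifications events such as enqueue.active_job and perform.active_job for every adapter, and yabeda-activejob builds its metrics from those. A minimal illustration of that mechanism (not the gem's actual code):

# Fires for any backend, Sidekiq and Solid Queue alike
ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job   = event.payload[:job]
  Rails.logger.info "#{job.class.name} ran on #{job.queue_name} in #{event.duration.round(1)}ms"
end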
You are right, I wasn't exposing the metrics correctly. But I'm still struggling to get the full metrics. It would be great if you could guide me, so I could also extend the available documentation accordingly. My setup is separated into containers running the webserver and containers running the solid_queue jobs. With the following configuration, I see the metrics server start:

# config/initializers/solid_queue.rb
Yabeda::ActiveJob.install!

SolidQueue.on_start do
  Yabeda::Prometheus::Exporter.start_metrics_server!
end

This starts the metrics server within the supervisor and exposes the metrics. From the perspective of SolidQueue, that should be sufficient, right? The metrics server is also responding, but there is nothing available in the response. Since the supervisor coordinates all the processes, it's the longest-running process and should be able to collect and expose the metrics, right? What am I missing? Should I move this question to yabeda-activejob?
Yes, that should be it 🤔 When you say the metrics server is responding but there is nothing available in the response, do you mean there are no metrics at all? I'm not sure what's missing, TBH. Maybe moving the question there is the next thing to try 👍
Hi @rosa, thanks for getting back. I'm moving this discussion to Fullscript/yabeda-activejob#18. Maybe they can spot what I'm missing here. Thanks a lot for your support!
Hello everyone,
I ran into an incident where we had an extremely large backlog of jobs in our queue: things fell exponentially further behind and simply could not catch up.
It would have been nice to have some monitoring set up to alert me when job queues start getting extremely long.
Another fun idea is using the fly.io Machines API to automatically start more worker machines until the job queue is back to a normal length.
https://fly.io/docs/machines/api/machines-resource/
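A rough sketch of what that could look like; the app name, threshold, and machine-state handling below are placeholders, and the Machines API endpoints are used as I understand them from the docs linked above, so double-check them before relying on this:

require "net/http"
require "json"

# Hypothetical autoscaling check: if the backlog is too deep, start any stopped
# worker machines through Fly's Machines API.
FLY_APP  = "my-worker-app" # placeholder app name
BASE_URI = URI("https://api.machines.dev/v1/apps/#{FLY_APP}/machines")
HEADERS  = {
  "Authorization" => "Bearer #{ENV.fetch("FLY_API_TOKEN")}",
  "Content-Type"  => "application/json"
}

def fly_request(request)
  Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) { |http| http.request(request) }
end

backlog = SolidQueue::ReadyExecution.count
if backlog > 5_000 # placeholder threshold
  machines = JSON.parse(fly_request(Net::HTTP::Get.new(BASE_URI, HEADERS)).body)
  machines.select { |m| m["state"] == "stopped" }.each do |machine|
    start_uri = URI("#{BASE_URI}/#{machine["id"]}/start")
    fly_request(Net::HTTP::Post.new(start_uri, HEADERS))
  end
end

This could run from the same recurring monitoring job, with a matching step to stop machines again once the backlog drains.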
@rosa I wanted to see if you had a best practice for accomplishing this kind of monitoring.
Thanks!