
Long queue monitoring and alerting #501

Open
danwetherald opened this issue Jan 29, 2025 · 8 comments

Comments

@danwetherald

danwetherald commented Jan 29, 2025

Hello everyone,

I ran into an incident where we had an extremely large backlog of jobs in our queue: processing kept falling further behind and simply could not catch up.

It would have been nice to have some monitoring set up to alert me when job queues start getting very long.

Another fun idea is using the Fly.io Machines API to automatically start more worker machines until the job queue has been reduced to a normal length.

https://fly.io/docs/machines/api/machines-resource/
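
For context, a minimal sketch of what waking up an extra (already provisioned but stopped) worker machine through that API could look like; the app name, machine id, and FLY_API_TOKEN env var are placeholders, not anything Solid Queue provides:

require "net/http"
require "uri"

# Hypothetical helper: start a stopped worker machine via the Fly Machines API.
def start_worker_machine(app_name:, machine_id:, token: ENV["FLY_API_TOKEN"])
  uri = URI("https://api.machines.dev/v1/apps/#{app_name}/machines/#{machine_id}/start")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{token}"

  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
end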

@rosa I wanted to see if you had a best practice to go about accomplishing such monitoring.

Thanks!

@rosa
Member

rosa commented Jan 29, 2025

Hey @danwetherald, sorry about that! Were you using a different Active Job backend before and had monitoring set up for that?

@danwetherald
Author

Thanks for getting back to me @rosa.

No, this is a new project. We added a few jobs that generate more jobs, and after looking into why some data syncs were not happening, we found that the queue had grown extremely long because of this added data sync job, causing the recurring job queues to get longer and longer as time went on.

Since jobs are simply an Active Record model, my idea was to track queue lengths somewhere and, for example, send a Slack notification when a queue becomes too long or the average job wait time exceeds a certain value.

My first thought is to set up a high-priority job, scheduled every few minutes with a recurring job configuration, that will run regardless of queue length due to its higher priority, check the lengths of the queues, and notify me if needed.
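
Something along these lines is what I have in mind (rough sketch only; QueueDepthCheckJob, the threshold, and SlackNotifier are made-up names/values, and the job would still need an entry in config/recurring.yml):

# app/jobs/queue_depth_check_job.rb (hypothetical)
class QueueDepthCheckJob < ApplicationJob
  queue_as :monitoring # run this queue on its own worker / with high priority so it isn't stuck behind the backlog

  THRESHOLD = 200 # example value for "too long"

  def perform
    SolidQueue::ReadyExecution.group(:queue_name).count.each do |queue_name, depth|
      next if depth < THRESHOLD

      # SlackNotifier is a stand-in for whatever webhook/notifier you use
      SlackNotifier.ping("Queue #{queue_name} has #{depth} ready jobs waiting")
    end
  end
end

# config/recurring.yml (recurring task entry, same idea)
# queue_depth_check:
#   class: QueueDepthCheckJob
#   schedule: every 5 minutes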

Of course, once something like this is set up, we could play around with the Machines API from Fly to possibly add more workers to help with large queues.

@rosa
Member

rosa commented Jan 29, 2025

Ahhh got it. It really depends on your setup 🤔 We use Yabeda and Prometheus, so we export some metrics like this:

module Yabeda
  module SolidQueue
    def self.install!
      Yabeda.configure do
        group :solid_queue

        gauge :jobs_failed_count, comment: "Number of failed jobs"
        gauge :jobs_unreleased_count, comment: "Number of claimed jobs that don't belong to any process"

        collect do
          if ::SolidQueue.supervisor?
            solid_queue.jobs_failed_count.set({}, ::SolidQueue::FailedExecution.count)
            solid_queue.jobs_unreleased_count.set({}, ::SolidQueue::ClaimedExecution.where(process: nil).count)
          end
        end
      end
    end
  end
end

One to check queue depth could look like this:

solid_queue.high_priority_queue_depth.set({}, ::SolidQueue::ReadyExecution.queued_as("production_high_priority").count)
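
Folded into the snippet above, that would roughly mean registering the gauge next to the others and setting it from the same collect block (the gauge and queue names here are just examples):

gauge :high_priority_queue_depth, comment: "Number of ready jobs in the high priority queue"

# ...and inside the collect block, next to the other .set calls:
solid_queue.high_priority_queue_depth.set({}, ::SolidQueue::ReadyExecution.queued_as("production_high_priority").count)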

@nlsrchtr

Hi @rosa,

that's great! I also wanted to set up monitoring and looked at yabeda-activejob, which seemed like a good candidate to add. But it turns out that in my current setup it reports the Sidekiq jobs only - I have Sidekiq and SolidQueue working in parallel. It reads to me like that gem hooks into generic Active Job notifications and generates metrics based on those.

Would it be possible to adjust SolidQueue so that it also exposes those metrics via the ActiveJob interface? Then we could use that gem, and there would be no need for SolidQueue-specific reporting, dashboards, etc. within Prometheus and Grafana.

@rosa
Member

rosa commented Jan 31, 2025

it reports the Sidekiq jobs only - I have Sidekiq and SolidQueue working in parallel. It reads to me like that gem hooks into generic Active Job notifications and generates metrics based on those.

Huh, this is very strange. Since it's subscribing to Active Job instrumentation events, it should work for both Sidekiq and Solid Queue at the same time. There must be something going on with your setup that prevents this from working, I think 😕
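
For anyone debugging something similar, those instrumentation events are adapter-agnostic, so a throwaway subscriber like this (purely illustrative, not part of yabeda-activejob) should log enqueues from both adapters:

# e.g. in a Rails console or a temporary initializer
ActiveSupport::Notifications.subscribe("enqueue.active_job") do |event|
  payload = event.payload
  Rails.logger.info "enqueued #{payload[:job].class.name} via #{payload[:adapter].class.name}"
end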

@nlsrchtr

There must be something going on with your setup that prevents this from working, I think 😕

You are right, I wasn't exposing the metrics correctly. But I'm still struggling to get the full set of metrics. It would be great if you could guide me; I would then also extend the available documentation accordingly.

My setup is separated into containers running the webserver and containers running the solid_queue jobs. With the following configuration, I see the metric activejob_enqueued_total being reported from the webserver containers, but the metrics from the solid_queue containers are empty:

# config/initializers/solid_queue.rb
Yabeda::ActiveJob.install!

SolidQueue.on_start do
  Yabeda::Prometheus::Exporter.start_metrics_server!
end

This would start the metrics server within the supervisor and expose the metrics. From the perspective of SolidQueue, that should be sufficient, right? The metrics server is also responding, but there is nothing available in the response.

Since the supervisor coordinates all the processes, it's the longest running process and should be able to collect and expose the metrics, right?

What am I missing? Should I move this question to yabeda-activejob?

@rosa
Member

rosa commented Feb 3, 2025

This would start the metrics server within the supervisor and expose the metrics. From the perspective of SolidQueue, that should be sufficient, right?

Yes, that should be it 🤔 When you say

I see the metric activejob_enqueued_total being reported from the webserver containers, but the metrics from the solid_queue containers are empty:

do you mean there are no metrics at all from the solid_queue containers, or that activejob_enqueued_total is empty? Do you have examples of responses from your /metrics endpoints?

What am I missing? Should I move this question to yabeda-activejob?

I'm not sure what's missing, TBH. Maybe moving the question there is the next thing to try 👍

@nlsrchtr

nlsrchtr commented Feb 5, 2025

Hi @rosa,

thanks for getting back to me. I'm moving this discussion to Fullscript/yabeda-activejob#18. Maybe they will see what I'm missing here.

Thanks a lot for your support!
