Long queue monitoring and alerting #501

danwetherald · 2025-01-29T07:23:52Z

Hello everyone,

I ran into an incident where we had an extremely large backlog of jobs in our queue as things exponentially fell behind and simply could not catch up.

It would have been nice if I had some monitoring setup to alert me when job queues begin to become super long.

Another fun idea is using the fly.io machines api to automatically start more worker machines until the job queue has been reduced to a normal length.

https://fly.io/docs/machines/api/machines-resource/

@rosa I wanted to see if you had a best practice to go about accomplishing such monitoring.

Thanks!

rosa · 2025-01-29T11:05:45Z

Hey @danwetherald, sorry about that! Were you using a different Active Job backend before and had monitoring set up for that?

danwetherald · 2025-01-29T15:54:43Z

Thanks for getting back to me @rosa.

No, this is a new project, and we simply added a few jobs that generate more jobs and after taking a look to see why some data syncs were not happened, we came to find out that the queue was extremely long due to this added data sync job, causing the reoccurring job queues to exponentially get longer as time went on.

Being that jobs are simply a active record model, my idea was to add some queue lengths somewhere in order to simply send a slack for example when a queue becomes too long or that the average job wait time becomes greater than a certain value.

My first thought is to just simply setup a high priority job that will run in a queue no matter the length do to its higher priority every few minutes with a recurring job configuration to check the lengths of the queues, and notify me if needed.

Of course once something like this is setup, we could play around with the machines api from fly to possibly add more workers to help with large queues.

rosa · 2025-01-29T16:49:20Z

Ahhh got it. It really depends on your setup 🤔 We use Yabeda and Prometheus and so we export some metrics like this:

module Yabeda
  module SolidQueue
    def self.install!
      Yabeda.configure do
        group :solid_queue

        gauge :jobs_failed_count, comment: "Number of failed jobs"
        gauge :jobs_unreleased_count, comment: "Number of claimed jobs that don't belong to any process"

        collect do
          if ::SolidQueue.supervisor?
            solid_queue.jobs_failed_count.set({}, ::SolidQueue::FailedExecution.count)
            solid_queue.jobs_unreleased_count.set({}, ::SolidQueue::ClaimedExecution.where(process: nil).count)
             end)
          end
        end
      end
    end
  end
end

One to check queue depth could look like this:

solid_queue.high_priority_queue_depth.set({}, ::SolidQueue::ReadyExecution.queued_as("production_high_priority").count)

nlsrchtr · 2025-01-31T08:57:57Z

Hi @rosa,

that's great! I also wanted to setup monitoring and reached out to yabeda-activejob, which looked like a good candidate to add. But it works out, that in my current setup, it reports the Sidekiq jobs only - since I have Sidekiq and SolidQueue working in parallel. It reads to me, that that gem is trying to hook into generic notifications and generate metrics based on that.

Would it be possible to adjust SolidQueue in a way, that it also exposes those metrics via the ActiveJob interface? So we could use that gem and there is no need to have a SolidQueue specific reporting and dashboard etc. within Prometheus and Grafana?

rosa · 2025-01-31T09:09:53Z

it reports the Sidekiq jobs only - since I have Sidekiq and SolidQueue working in parallel. It reads to me, that that gem is trying to hook into generic notifications and generate metrics based on that.

Huh, this is very strange. Since it's subscribing to Active Job instrumentation events, it should work for both Sidekiq and Solid Queue at the same time. There must be something going on with your setup that prevents this from working, I think 😕

nlsrchtr · 2025-01-31T14:27:45Z

There must be something going on with your setup that prevents this from working, I think 😕

You are right, I wasn't exposing the metrics correctly. But I'm still struggeling to get the full metrics. It would be great if you could guide me, so I would also extend the available documentation accordingly.

My setup is seperated into containers running the webserver and containers running the solid_queue jobs. With the following configuration, I see the metric activejob_enqueued_total being reported from the webserver containers, but the metrics from the solid_queue containers are empty:

# config/initializers/solid_queue.rb
Yabeda::ActiveJob.install!

SolidQueue.on_start do
  Yabeda::Prometheus::Exporter.start_metrics_server!
end

This would start the metrics server within the supervisor and exposes the metrics. From the perspective of SolidQueue, that should be sufficient, right? The metrics server is also responding, but there is nothing available in the response.

Since the supervisor coordinates all the processes, it's the longest running process and should be able to collect and expose the metrics, right?

What am I missing? Should I move this question to yabeda-activejob?

rosa · 2025-02-03T10:22:28Z

This would start the metrics server within the supervisor and exposes the metrics. From the perspective of SolidQueue, that should be sufficient, right?

Yes, that should be it 🤔 When you say

I see the metric activejob_enqueued_total being reported from the webserver containers, but the metrics from the solid_queue containers are empty:

When you say this, do you mean there are no metrics at all from the solid_queue containers or that activejob_enqueued_total is empty? Do you have examples of responses from your /metrics endpoints?

What am I missing? Should I move this question to yabeda-activejob?

I'm not sure what's missing, TBH. Maybe moving the question there is the next thing to try 👍

nlsrchtr · 2025-02-05T16:47:04Z

Hi @rosa,

thanks for getting back. I'm moving this discussion to Fullscript/yabeda-activejob#18. Maybe they see what I'm missing here.

Thanks a lot for your support!

Envek · 2025-02-21T12:58:41Z

@nlsrchtr, I believe that your problem is caused by SolidQueue multi-process architecture: metrics are being collected in separate process from one that exposes them to Prometheus. Metrics are stored in memory by default and should be explicitly shared between processes.

I commented with more details at Fullscript/yabeda-activejob#18 (comment)

Envek · 2025-02-21T13:11:55Z

@rosa I believe that queue latency should be used for long queue monitoring instead of queue sizes:

If you throw a lot of jobs into the queue, you can get false positives when monitoring the queue backlog. Instead, monitor the queue latency. Queue latency is the difference between when the oldest job was pushed onto the queue versus the current time.
https://github.com/sidekiq/sidekiq/wiki/Monitoring#monitoring-queue-latency

What do you think?

I've set up latency metric collection like this (not sure whether I did it right though):

SolidQueue.on_start do
  Yabeda.configure do
    known_queues = Set.new # to reset empty queues latency to 0

    group :activejob

    gauge :queue_latency, unit: :seconds, tags: %i[queue], aggregation: :most_recent,
                          comment: "The queue latency, the difference in seconds between oldest job's enqueued time and now"

    collect do
      latencies = SolidQueue::ReadyExecution.group(:queue_name).maximum("now() - created_at")
      known_queues += latencies.keys

      # Set empty queues latency to 0 to avoid stale values
      known_queues.each do |queue|
        activejob.queue_latency.set({ queue: queue }, latencies.fetch(queue, 0).to_i)
      end
    end
  end
end

However this grouping query seems to be quite heavy one, maybe there is a better way to get queue latencies?

Also, SolidQueue data model is quite complicated and I'm not sure whether I should get data from other tables. Any advice or a link to the docs would be much appreciated.

Thank you for solid_queue!

rosa · 2025-03-31T12:22:58Z

Hey @Envek so sorry for not replying before! I somehow missed this notification completely 😳

What you have is correct, and I agree about measuring latency. The grouping itself should be very fast in MySQL thanks to this index:

t.index ["queue_name", "priority", "job_id"], name: "index_solid_queue_poll_by_queue"

but I think what would make the query slow is the maximum("now() - created_at"). It might be fast enough depending on how many jobs you have pending at a given time.

Another thing you could try, instead of doing a single query, is to do multiple queries, one per queue, and add an index on queue_name, created_at. You'd need to do a query first to find all queue names, like SolidQueue::ReadyExecution.distinct(:queue_name).pluck(:queue_name).

Envek · 2025-03-31T16:40:01Z

but I think what would make the query slow is the maximum("now() - created_at")

Maybe just minimum(:created_at) will be faster (and then make calculations with Time.current in Ruby). But yeah, index over queue_name, created_at will be quite beneficial.

Thanks!

nlsrchtr mentioned this issue Feb 5, 2025

No metrics exposed for SolidQueue with Rails 7.2 Fullscript/yabeda-activejob#18

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long queue monitoring and alerting #501

Long queue monitoring and alerting #501

danwetherald commented Jan 29, 2025 •

edited

Loading

rosa commented Jan 29, 2025

danwetherald commented Jan 29, 2025

rosa commented Jan 29, 2025

nlsrchtr commented Jan 31, 2025

rosa commented Jan 31, 2025

nlsrchtr commented Jan 31, 2025

rosa commented Feb 3, 2025

nlsrchtr commented Feb 5, 2025

Envek commented Feb 21, 2025

Envek commented Feb 21, 2025

rosa commented Mar 31, 2025

Envek commented Mar 31, 2025 •

edited

Loading

Long queue monitoring and alerting #501

Long queue monitoring and alerting #501

Comments

danwetherald commented Jan 29, 2025 • edited Loading

rosa commented Jan 29, 2025

danwetherald commented Jan 29, 2025

rosa commented Jan 29, 2025

nlsrchtr commented Jan 31, 2025

rosa commented Jan 31, 2025

nlsrchtr commented Jan 31, 2025

rosa commented Feb 3, 2025

nlsrchtr commented Feb 5, 2025

Envek commented Feb 21, 2025

Envek commented Feb 21, 2025

rosa commented Mar 31, 2025

Envek commented Mar 31, 2025 • edited Loading

danwetherald commented Jan 29, 2025 •

edited

Loading

Envek commented Mar 31, 2025 •

edited

Loading