RabbitMQ partition allocation is not uniformly distributed #1744

cailyoung · 2023-08-02T06:47:38Z

We've noticed in our installation with 32 replicas of this service (320 queues in RabbitMQ) that we routinely have single reporting.x queues with very high message counts compared to the others. Think 100k messages versus zero, or a few hundred.

I believe this is due to a combination of choices in this service:

Deterministically allocate routing key based on hashing the launch UUID and modulo against queue count - this approach will lead to non-uniform distributions across the output space
- I'm not sure whether this is a deliberate choice to ensure that a given launch's messages all end up in the same consumer's queue.
Each api replica holds an exclusive connection on its queues, meaning when there's unbalanced queues only a single consumer can work through the backlog

I think this could be resolved by either:

Altering the queue/exchange setup so that routing keys aren't used at all, and instead all api replicas compete on the same consuming queue for new messages. This might be complex to migrate into from the current state!
- This would also make autoscaling the api replica size safer, as the current implementation requires each replica to be aware of the entire queue count at startup and will create more queues based on that number - so scaling down leaves orphaned queues.
Altering the routing key allocation so that messages are genuinely randomly assigned to partitions when the producer runs. This might not be wise if there's a reason all messages from the same launch are put in the same partition.

The text was updated successfully, but these errors were encountered:

DzmitryHumianiuk · 2023-08-07T23:54:24Z

related:
#1744
reportportal/service-jobs#86
#1745

DzmitryHumianiuk · 2023-08-08T01:41:31Z

@cailyoung this is a very correct findings.
here some documentation, which may explain how it works:
https://reportportal.io/docs/dev-guides/AsynchronousReporting

and this is another very correct statement:

"This might be complex to migrate into from the current state!"

which explains why we didn't make changes yet.

First - we understand and acknowledge that the current implementation is not ideal and violates some scaling principles, but there were certain problems that we needed to address.

Problem#1: We had to resolve the issue of handling a situation where one project (or one launch) creates a massive load on the API, such as reporting 15,000 test cases with 1,000 logs each in 50 threads. In this scenario, any other project (launch) running concurrently might end up getting processed at a much slower pace.
Hence, we developed a mechanism to accommodate the processing of those launches that came in parallel with high-load launches. Which you are reffering as "Deterministically allocate routing key based on hashing the launch UUID and modulo against queue count". Which is exactly what you described.
This was a challenge for our production cluster and any other large setup, and we were looking for a quick fix.
Eventually, growth and scaling issues were easier to address by adding more hardware.

This now manifests in that autoscaling is not feasible since we need to restart RabbitMQ anyway to configure it for the new potential queue count. That means that with the addition of new API nodes, Rabbit must receive new configurations and be restarted.
This can be partly automated, but it's not as elegant as it should be.

Problem#2: In the early stages of reporting development, we were quite conservative with permissions in the reporting sequence (events coming into processing). Such as not being able to create a child before creating a parent or completing a parent before completing a child. This stemmed from the fact that we couldn't write logs to the database, for example, until we knew the test case ID.
This was much simpler from an implementation standpoint, and the conservative approach allowed us to better understand overall how reporting on test runners should occur. Gradually, we loosened these conditions and made it possible to write some objects earlier. Still, this was primarily through the addition of proxy buffering on the client side (within test runners) and proxy buffering on the queue side.

Namely, this involved adding the very re-try queues whose task was to dump events that caused an error on the API side into them. Often, these were precisely the events that tried to write information about children without having information or a created parent in the database.

Therefore, the re-try queue was where the events for children fell into, and the API would prioritize returning to this queue every 10 messages processed from the main queues. This way, we do not lose the recording of objects, which might be due to reasons such as network delivery delay, where even a 1ms difference matters.

Furthermore, these re-try queues later allowed us to address various specifics of JS frameworks, which often send the completion of parents before they finish all descendants or sometimes even before their start.

Considering all these specifics, we are planning significant changes for the version code-named RPv6. It will encompass a great deal, but primarily changes regarding queues and how data is stored.

Flat structure.
We will rid ourselves of the nesting of suite-classes-test cases, transforming it all into a record at one level, where nesting will be determined only by the attributes of the object. This will enable us to write any incoming objects to the database without waiting for their parents, as there are no parents anymore. This completely removes the need for re-try queues and simplifies the logic of queue handling.
Also, we will record objects and bind them based on UUIDs generated on the client-side (test runners), doubly freeing us from the need to await the appearance of any objects in the database.
Moreover, we will not have to worry about ensuring that the events of one launch are approximately ordered in one of the queues, to avoid having to reroute events to the re-try queue, spending time on a database query and transferring the event to another queue for each event.

Thus, this opens the possibility for one large queue, from which any number of API nodes can pull events for processing and store them in the database, free from the constraints of maintaining structure.

We still should have events across different queues so that incoming events from less busy launches still have the opportunity to be processed before the completion of processing heavyweights from other projects. But now, it is no longer necessary to retain a particular launch's affiliation to specific queues. It is permissible to allow the API to pull messages for processing from any available queues. This will enable the use of multi-node APIs and multiple Rabbit nodes.

cailyoung · 2023-08-08T05:07:57Z

Thankyou for the considered response and I look forward to seeing these improvements in this space!

This was referenced Aug 7, 2023

RabbitMQ queues are not declared as replicated #1745

Open

RabbitMQ unavailability halts processing of log_message_saving without triggering liveness probe failure reportportal/service-jobs#86

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RabbitMQ partition allocation is not uniformly distributed #1744

RabbitMQ partition allocation is not uniformly distributed #1744

cailyoung commented Aug 2, 2023 •

edited

Loading

DzmitryHumianiuk commented Aug 7, 2023

DzmitryHumianiuk commented Aug 8, 2023

cailyoung commented Aug 8, 2023

RabbitMQ partition allocation is not uniformly distributed #1744

RabbitMQ partition allocation is not uniformly distributed #1744

Comments

cailyoung commented Aug 2, 2023 • edited Loading

DzmitryHumianiuk commented Aug 7, 2023

DzmitryHumianiuk commented Aug 8, 2023

cailyoung commented Aug 8, 2023

cailyoung commented Aug 2, 2023 •

edited

Loading