
zebra: Temporarily block the execution of the rib_process function while the thread t_dplane is waiting to be scheduled. #15065

Open
zice312963205 wants to merge 1 commit into master

Conversation

zice312963205
Contributor

This allows t_dplane to prioritize freeing the cached zebra_dplane_ctx structures. It addresses the issue where a large number of zebra_dplane_ctx nodes are cached on rib_dplane_q when a surge of routes is inserted in a short period of time, consuming a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems.
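
In rough terms, the idea is to gate meta_queue_process on whether the dplane-results event is still outstanding. A simplified sketch of that gating (illustrative only, not the exact diff in this commit):

    /* Sketch: while the dplane-results event is still pending on the main
     * pthread, ask the work queue to hold off so rib_process_dplane_results
     * runs first and frees the zebra_dplane_ctx structures cached on
     * rib_dplane_q. */
    static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
    {
            if (t_dplane != NULL)
                    return WQ_QUEUE_BLOCKED;

            /* ... existing meta-queue processing ... */
            return WQ_SUCCESS;
    }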

@zice312963205
Contributor Author

Related issue: #15016

…ile the thread t_dplane is waiting to be scheduled.

This allows t_dplane to prioritize freeing the cached zebra_dplane_ctx structures.
It addresses the issue where a large number of zebra_dplane_ctx nodes are cached on rib_dplane_q when a surge of routes is inserted in a short period of time, consuming a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems.

Signed-off-by: hanyu.zly <[email protected]>
Contributor

@mjstapp mjstapp left a comment

It's good to be thinking about how to do rate-matching or backpressure within the system, but I don't think this is really the solution. I think this needs more discussion.

@@ -4974,6 +4979,7 @@ static void rib_process_dplane_results(struct event *thread)
}

} while (1);
+	t_dplane = NULL;
Contributor

No, we can't just poke at these event pointers like this.

Member

To clarify: just setting the pointer to NULL doesn't actually stop the event from running; FRR has simply lost all control over it. This is especially bad because if the event in question has pointers stored in it and those pointers are later freed, we'll have use-after-frees, since the rest of FRR assumes that when the event pointer is NULL the event is not running. We've had too many crashes to allow this to get into the system again.
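
As a reference point, a minimal sketch of the supported cancellation pattern, assuming the current lib/frrevent.h naming (worth double-checking against the tree): cancellation has to go through the event library so it runs under the library's lock and clears the caller's pointer itself.

    #include "frrevent.h"

    static struct event *t_dplane;

    static void stop_dplane_results(void)
    {
            /* Wrong: t_dplane = NULL;  -- the event is still scheduled and
             * will still run; we have merely lost our handle to it. */

            /* Supported: let the event library cancel it under its lock and
             * clear the pointer for us. */
            EVENT_OFF(t_dplane);
    }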

@donaldsharp
Member

Yeah, this is a decent idea, but the execution is all wrong. Not doing any work while the dplane has initiated a callback with the results is not the way I want to proceed with this.

The problem is that the dplane_fpm_nl module is holding onto the contexts when the underlying process that is being communicated to is not responding quickly enough. The underlying dplane should signal, via a new dplane context command call, that it is backed up, and signal when it is no longer backed up as well. Then the master pthread can decide whether to initiate more work or not. Additionally, the dplane can decide at what context count it should trigger the "please hold off" request and at what context count it should tell the upper level to start working again.
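
As a rough illustration of that signalling (every name below is hypothetical, sketched only to show the shape of the mechanism, not existing FRR API):

    /* Hypothetical backpressure signal; the watermarks are illustrative and
     * would be chosen by the dplane/provider. */
    enum dplane_backpressure {
            DPLANE_PROVIDER_OK,        /* provider is keeping up */
            DPLANE_PROVIDER_CONGESTED, /* ask the master pthread to hold off */
    };

    #define CTX_HIGH_WATERMARK 4096
    #define CTX_LOW_WATERMARK  1024

    /* Provider-side: raise the signal above the high watermark and clear it
     * only once the backlog drains below the low watermark. */
    static enum dplane_backpressure
    provider_check_queue(unsigned int ctx_count, enum dplane_backpressure cur)
    {
            if (cur == DPLANE_PROVIDER_OK && ctx_count > CTX_HIGH_WATERMARK)
                    return DPLANE_PROVIDER_CONGESTED;
            if (cur == DPLANE_PROVIDER_CONGESTED && ctx_count < CTX_LOW_WATERMARK)
                    return DPLANE_PROVIDER_OK;
            return cur;
    }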

Member

@donaldsharp donaldsharp left a comment

I've outlined a different approach that would be acceptable. In its current state this approach is dead in the water. There might also need to be work to make the workqueue smarter about when to come back and do more work here, instead of busy-looping.

@mjstapp
Contributor

mjstapp commented Jan 2, 2024

Is it correct that the real source of this specific problem is fpm - that the other end of the fpm session is just ... slow to read? If that's true, then that really needs to get fixed first (IMO).

@zice312963205
Contributor Author

> Yeah, this is a decent idea, but the execution is all wrong. Not doing any work while the dplane has initiated a callback with the results is not the way I want to proceed with this.
>
> The problem is that the dplane_fpm_nl module is holding onto the contexts when the underlying process that is being communicated to is not responding quickly enough. The underlying dplane should signal, via a new dplane context command call, that it is backed up, and signal when it is no longer backed up as well. Then the master pthread can decide whether to initiate more work or not. Additionally, the dplane can decide at what context count it should trigger the "please hold off" request and at what context count it should tell the upper level to start working again.

I understand your perspective, but I think there might be some differences in our understanding of this section of the code.
One of the core reasons I think it can be modified is that the event that runs meta_queue_process and the event pointed to by t_dplane both run on the zebra main pthread.
This means there is no concurrent scheduling between the two; to put it more absolutely, they essentially take turns being scheduled.
In rib_process_dplane_results, there is no mechanism to pause or terminate processing by limiting the number of iterations, so every time the event runs it processes all of the result contexts before exiting.
Therefore, I believe that having meta_queue_process wait for rib_process_dplane_results to complete will not introduce blocking. We could also add a mechanism that actively calls work_queue_schedule to restart meta_queue_process after rib_process_dplane_results has finished, instead of waiting for the default retry time. The reason I haven't made such a change is that the existing blocking rule, i.e. the queue_len > queue_limit check, does not use work_queue_schedule to actively restart meta_queue_process either.
This is something to consider regarding whether work_queue_schedule is necessary.
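
As a rough illustration of that active kick (a sketch only; it assumes work_queue_schedule() is callable from zebra and that the RIB work queue is reachable as zrouter.ribq, both of which would need to be verified):

    /* Sketch: once the dplane results have been drained and their contexts
     * freed, re-arm the RIB work queue immediately (delay 0) instead of
     * waiting for its retry timer. */
    static void rib_process_dplane_results(struct event *thread)
    {
            /* ... existing loop that drains rib_dplane_q and processes each
             * zebra_dplane_ctx ... */

            work_queue_schedule(zrouter.ribq, 0);
    }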
Additionally, regarding the processing-time issue with dplane_fpm_nl, I've also tracked it, and it turns out it is not due to a slow communication response from the underlying process.
The difference between dplane_kernel and dplane_fpm is that dplane_kernel, within its dp_fp callback (kernel_update_multi), encapsulates all netlink messages synchronously and writes them directly into the kernel. dplane_fpm's dp_fp (fpm_nl_process), on the other hand, transfers the contexts to fnc->ctxqueue and then schedules an asynchronous event (fpm_process_queue), waiting for that event to be scheduled before writing data to the fpm. This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue.
In my practical tests, fpmsyncd reads data from the socket very quickly and does not cause any blocking in the execution of the fpm_nl_enqueue function.

@zice312963205
Contributor Author

> Is it correct that the real source of this specific problem is fpm - that the other end of the fpm session is just ... slow to read? If that's true, then that really needs to get fixed first (IMO).

Actually, the reason isn't that the FPM reads data slowly. Please refer to my previous reply, thank you.

@mjstapp
Contributor

mjstapp commented Jan 3, 2024

So there are (at least) a couple of things going on here. There's the practical, low-level issue of how to communicate between pthreads - whether it is viable or workable to use an event* to try to do that. But the real issue is what to do when there's a rate mismatch within zebra: which components can detect the mismatch, which components should react to it, and what that reaction should be.

It's not viable to play with the event* outside the library lock, full stop. That value is set by the dplane pthread, under a lock, and is cleared/NULLed by the event library in the context of the main pthread, under a lock. Every time you look at that value you introduce a race, and any time you touch it you introduce a race and risk leaking memory (a small amount, it's true).

About the fpm plugin - it does look inefficient to me, just on casual inspection, but I didn't really follow this statement:

> This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue.

The actual netlink code is making a system call to send messages to the kernel, then waiting for a reply with status for the requests that were in the netlink batch; that's not a cheap operation. The fpm code only has to send on a socket, and if that's slower than the kernel operation, that's something we should be able to fix. The fpm code takes a lot of locks instead of dequeuing batches. The fpm code doesn't just get on with its work, taking a batch of contexts and sending them out - it schedules an event to itself to look at a queue, then it puts bytes into a stream buffer, and then it schedules itself again to actually write the stream buffer to the socket. It looks like that could be cleaned up, and that might improve the throughput in that plugin.
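
As a rough illustration of that "just get on with its work" pattern (a hypothetical sketch, not the current dplane_fpm_nl code; the helper and field names are assumptions that would need to be checked against the plugin):

    /* Sketch: drain the queued contexts into a local batch in one short
     * critical section, then encode and send the whole batch without
     * scheduling intermediate events. */
    static void fpm_drain_and_send(struct fpm_nl_ctx *fnc)
    {
            struct dplane_ctx_list_head batch;
            struct zebra_dplane_ctx *ctx;

            dplane_ctx_q_init(&batch);

            /* One pass under the lock: move everything queued so far. */
            frr_with_mutex (&fnc->ctxqueue_mutex) {
                    while ((ctx = dplane_ctx_dequeue(&fnc->ctxqueue)) != NULL)
                            dplane_ctx_enqueue_tail(&batch, ctx);
            }

            /* Encode and write the batch in one go, outside the lock. */
            while ((ctx = dplane_ctx_dequeue(&batch)) != NULL)
                    fpm_nl_enqueue(fnc, ctx);
    }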

The higher-level question seems to be: if one dplane plugin is running behind, running slowly, what should happen? Right now, the dplane only really checks its own top-level queue, the initial queue that receives contexts from the main pthread. It could check the providers' queues and try using those as a proxy for the state of the overall system, but that won't be able to detect queue buildup inside a provider's own data structures. Or the providers could have a way to signal to the dplane framework that they are building up queue, and the dplane framework could use that signal to slow down the rate at which it accepts new incoming work. You could make the case that the overall system shouldn't overrun the slowest component, but neither of those mechanisms exists right now.

Where do you think the bottleneck is? Do you think it's in the less-than-perfect fpm plugin? Or do you think it's in the zebra main pthread, trying to work through a big impulse of changes?

@donaldsharp
Member

> I understand your perspective, but I think there might be some differences in our understanding of this section of the code. One of the core reasons I think it can be modified is that the event that runs meta_queue_process and the event pointed to by t_dplane both run on the zebra main pthread. This means there is no concurrent scheduling between the two; to put it more absolutely, they essentially take turns being scheduled. In rib_process_dplane_results, there is no mechanism to pause or terminate processing by limiting the number of iterations, so every time the event runs it processes all of the result contexts before exiting. Therefore, I believe that having meta_queue_process wait for rib_process_dplane_results to complete will not introduce blocking.

This makes no sense to me. The very fact that this solution was raised as fixing the problem clearly implies that blocking the processing of new routes while the master zebra pthread has contexts to handle slows the master pthread down enough to allow the dplane pthread to clear its queues. This also implicitly implies that the contexts are sitting in the rib_dplane_q, which is a bit odd. What queue exactly are the contexts sitting in?

> We could also add a mechanism that actively calls work_queue_schedule to restart meta_queue_process after rib_process_dplane_results has finished, instead of waiting for the default retry time. The reason I haven't made such a change is that the existing blocking rule, i.e. the queue_len > queue_limit check, does not use work_queue_schedule to actively restart meta_queue_process either. This is something to consider regarding whether work_queue_schedule is necessary.

This sure sounds like it's needed as a separate thing to fix in its own right.

> Additionally, regarding the processing-time issue with dplane_fpm_nl, I've also tracked it, and it turns out it is not due to a slow communication response from the underlying process. The difference between dplane_kernel and dplane_fpm is that dplane_kernel, within its dp_fp callback (kernel_update_multi), encapsulates all netlink messages synchronously and writes them directly into the kernel. dplane_fpm's dp_fp (fpm_nl_process), on the other hand, transfers the contexts to fnc->ctxqueue and then schedules an asynchronous event (fpm_process_queue), waiting for that event to be scheduled before writing data to the fpm.

This also sounds like a clearer path of fixing that needs to be done, and frankly sounds like a higher-value problem that should be addressed first.

> This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue. In my practical tests, fpmsyncd reads data from the socket very quickly and does not cause any blocking in the execution of the fpm_nl_enqueue function.

This doesn't make a whole bunch of sense to me. In my local testing I cannot make the context queue grow when using dplane_fpm_nl. I am also not using fpmsyncd as a listener. I do know that I can make it grow arbitrarily when I intentionally introduce a delay in responding. Perhaps we don't fully understand where the problem is at this point in time.

github-actions bot commented Jul 2, 2024

This PR is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this PR closed.
