
zebra: Temporarily block the execution of the rib_process function while the thread t_dplane is waiting to be scheduled. #15065

Open
zice312963205 wants to merge 1 commit into master

Conversation

zice312963205
Contributor

This allows t_dplane to prioritize freeing the cached zebra_dplane_ctx structures. It addresses the issue where a large number of zebra_dplane_ctx nodes are cached on rib_dplane_q when a surge of routes is inserted in a short period of time, consuming a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems.
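
In rough terms, the idea is to gate meta_queue_process on whether the dplane-results event is still outstanding. A simplified sketch of that gating (illustrative only, not the exact diff in this commit):

    /* Sketch: while the dplane-results event is still pending on the main
     * pthread, ask the work queue to hold off so rib_process_dplane_results
     * runs first and frees the zebra_dplane_ctx structures cached on
     * rib_dplane_q. */
    static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
    {
            if (t_dplane != NULL)
                    return WQ_QUEUE_BLOCKED;

            /* ... existing meta-queue processing ... */
            return WQ_SUCCESS;
    }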

@zice312963205
Contributor Author

Related issue: #15016

…ile the thread t_dplane is waiting to be scheduled.

This allows t_dplane to prioritize freeing the cached zebra_dplane_ctx structures.
It addresses the issue where a large number of zebra_dplane_ctx nodes are cached on rib_dplane_q when a surge of routes is inserted in a short period of time, consuming a significant amount of temporary memory and potentially causing system memory overload and kernel OOM (Out of Memory) problems.

Signed-off-by: hanyu.zly <[email protected]>
Contributor

@mjstapp mjstapp left a comment

It's good to be thinking about how to do rate-matching or backpressure within the system, but I don't think this is really the solution. I think this needs more discussion.

@@ -4974,6 +4979,7 @@ static void rib_process_dplane_results(struct event *thread)
}

} while (1);
+	t_dplane = NULL;
Contributor

No, we can't just poke at these event pointers like this.

Member

To clarify: just setting the pointer to NULL doesn't actually stop the event from running; FRR has simply lost all control over it. This is especially bad because if the event in question has pointers stored in it and those pointers are later freed, we'll have use-after-frees, since the rest of FRR assumes that when the event pointer is NULL the event is not running. We've had too many crashes to allow this to get into the system again.
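
As a reference point, a minimal sketch of the supported cancellation pattern, assuming the current lib/frrevent.h naming (worth double-checking against the tree): cancellation has to go through the event library so it runs under the library's lock and clears the caller's pointer itself.

    #include "frrevent.h"

    static struct event *t_dplane;

    static void stop_dplane_results(void)
    {
            /* Wrong: t_dplane = NULL;  -- the event is still scheduled and
             * will still run; we have merely lost our handle to it. */

            /* Supported: let the event library cancel it under its lock and
             * clear the pointer for us. */
            EVENT_OFF(t_dplane);
    }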

@donaldsharp
Member

Yeah, this is a decent idea, but the execution is all wrong. Not doing any work while the dplane has initiated a callback with the results is not the way I want to proceed with this.

The problem is that the dplane_fpm_nl module is holding onto the contexts when the underlying process that is being communicated to is not responding quickly enough. The underlying dplane should signal, via a new dplane context command call, that it is backed up, and signal when it is no longer backed up as well. Then the master pthread can decide whether to initiate more work or not. Additionally, the dplane can decide at what context count it should trigger the "please hold off" request and at what context count it should tell the upper level to start working again.
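
As a rough illustration of that signalling (every name below is hypothetical, sketched only to show the shape of the mechanism, not existing FRR API):

    /* Hypothetical backpressure signal; the watermarks are illustrative and
     * would be chosen by the dplane/provider. */
    enum dplane_backpressure {
            DPLANE_PROVIDER_OK,        /* provider is keeping up */
            DPLANE_PROVIDER_CONGESTED, /* ask the master pthread to hold off */
    };

    #define CTX_HIGH_WATERMARK 4096
    #define CTX_LOW_WATERMARK  1024

    /* Provider-side: raise the signal above the high watermark and clear it
     * only once the backlog drains below the low watermark. */
    static enum dplane_backpressure
    provider_check_queue(unsigned int ctx_count, enum dplane_backpressure cur)
    {
            if (cur == DPLANE_PROVIDER_OK && ctx_count > CTX_HIGH_WATERMARK)
                    return DPLANE_PROVIDER_CONGESTED;
            if (cur == DPLANE_PROVIDER_CONGESTED && ctx_count < CTX_LOW_WATERMARK)
                    return DPLANE_PROVIDER_OK;
            return cur;
    }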

Member

@donaldsharp donaldsharp left a comment

I've outlined a different approach that would be acceptable. In its current state this approach is dead in the water. There might also need to be work to make the workqueue smarter about when to come back and do more work here, instead of busy-looping.

@mjstapp
Contributor

mjstapp commented Jan 2, 2024

Is it correct that the real source of this specific problem is fpm - that the other end of the fpm session is just ... slow to read? If that's true, then that really needs to get fixed first (IMO).

@zice312963205
Contributor Author

> Yeah, this is a decent idea, but the execution is all wrong. Not doing any work while the dplane has initiated a callback with the results is not the way I want to proceed with this.
>
> The problem is that the dplane_fpm_nl module is holding onto the contexts when the underlying process that is being communicated to is not responding quickly enough. The underlying dplane should signal, via a new dplane context command call, that it is backed up, and signal when it is no longer backed up as well. Then the master pthread can decide whether to initiate more work or not. Additionally, the dplane can decide at what context count it should trigger the "please hold off" request and at what context count it should tell the upper level to start working again.

I understand your perspective, but I think there might be some differences in our understanding of this section of the code.
One of the core reasons I think it can be modified is that the event that runs meta_queue_process and the event pointed to by t_dplane both run on the zebra main pthread.
This means there is no concurrent scheduling between the two; to put it more absolutely, they essentially take turns being scheduled.
In rib_process_dplane_results, there is no mechanism to pause or terminate processing by limiting the number of iterations, so every time the event runs it processes all of the result contexts before exiting.
Therefore, I believe that having meta_queue_process wait for rib_process_dplane_results to complete will not introduce blocking. We could also add a mechanism that actively calls work_queue_schedule to restart meta_queue_process after rib_process_dplane_results has finished, instead of waiting for the default retry time. The reason I haven't made such a change is that the existing blocking rule, i.e. the queue_len > queue_limit check, does not use work_queue_schedule to actively restart meta_queue_process either.
This is something to consider regarding whether work_queue_schedule is necessary.
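
As a rough illustration of that active kick (a sketch only; it assumes work_queue_schedule() is callable from zebra and that the RIB work queue is reachable as zrouter.ribq, both of which would need to be verified):

    /* Sketch: once the dplane results have been drained and their contexts
     * freed, re-arm the RIB work queue immediately (delay 0) instead of
     * waiting for its retry timer. */
    static void rib_process_dplane_results(struct event *thread)
    {
            /* ... existing loop that drains rib_dplane_q and processes each
             * zebra_dplane_ctx ... */

            work_queue_schedule(zrouter.ribq, 0);
    }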
Additionally, regarding the processing-time issue with dplane_fpm_nl, I've also tracked it, and it turns out it is not due to a slow communication response from the underlying process.
The difference between dplane_kernel and dplane_fpm is that dplane_kernel, within its dp_fp callback (kernel_update_multi), encapsulates all netlink messages synchronously and writes them directly into the kernel. dplane_fpm's dp_fp (fpm_nl_process), on the other hand, transfers the contexts to fnc->ctxqueue and then schedules an asynchronous event (fpm_process_queue), waiting for that event to be scheduled before writing data to the fpm. This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue.
In my practical tests, fpmsyncd reads data from the socket very quickly and does not cause any blocking in the execution of the fpm_nl_enqueue function.

@zice312963205
Contributor Author

> Is it correct that the real source of this specific problem is fpm - that the other end of the fpm session is just ... slow to read? If that's true, then that really needs to get fixed first (IMO).

Actually, the reason isn't that the FPM reads data slowly. Please refer to my previous reply, thank you.

@mjstapp
Contributor

mjstapp commented Jan 3, 2024

So there are (at least) a couple of things going on here. There's the practical, low-level issue of how to communicate between pthreads - whether it is viable or workable to use an event* to try to do that. But the real issue is what to do when there's a rate mismatch within zebra: which components can detect the mismatch, which components should react to it, and what that reaction should be.

It's not viable to play with the event* outside the library lock, full stop. That value is set by the dplane pthread, under a lock, and is cleared/NULLed by the event library in the context of the main pthread, under a lock. Every time you look at that value you introduce a race, and any time you touch it you introduce a race and risk leaking memory (a small amount, it's true).

About the fpm plugin - it does look inefficient to me, just on casual inspection, but I didn't really follow this statement:

> This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue.

The actual netlink code is making a system call to send messages to the kernel, then waiting for a reply with status for the requests that were in the netlink batch; that's not a cheap operation. The fpm code only has to send on a socket, and if that's slower than the kernel operation, that's something we should be able to fix. The fpm code takes a lot of locks instead of dequeuing batches. The fpm code doesn't just get on with its work, taking a batch of contexts and sending them out - it schedules an event to itself to look at a queue, then it puts bytes into a stream buffer, and then it schedules itself again to actually write the stream buffer to the socket. It looks like that could be cleaned up, and that might improve the throughput in that plugin.
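
As a rough illustration of that "just get on with its work" pattern (a hypothetical sketch, not the current dplane_fpm_nl code; the helper and field names are assumptions that would need to be checked against the plugin):

    /* Sketch: drain the queued contexts into a local batch in one short
     * critical section, then encode and send the whole batch without
     * scheduling intermediate events. */
    static void fpm_drain_and_send(struct fpm_nl_ctx *fnc)
    {
            struct dplane_ctx_list_head batch;
            struct zebra_dplane_ctx *ctx;

            dplane_ctx_q_init(&batch);

            /* One pass under the lock: move everything queued so far. */
            frr_with_mutex (&fnc->ctxqueue_mutex) {
                    while ((ctx = dplane_ctx_dequeue(&fnc->ctxqueue)) != NULL)
                            dplane_ctx_enqueue_tail(&batch, ctx);
            }

            /* Encode and write the batch in one go, outside the lock. */
            while ((ctx = dplane_ctx_dequeue(&batch)) != NULL)
                    fpm_nl_enqueue(fnc, ctx);
    }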

The higher-level question seems to be: if one dplane plugin is running behind, running slowly, what should happen? Right now, the dplane only really checks its own top-level queue, the initial queue that receives contexts from the main pthread. It could check the providers' queues and try using those as a proxy for the state of the overall system, but that won't be able to detect queue buildup inside a provider's own data structures. Or the providers could have a way to signal to the dplane framework that they are building up queue, and the dplane framework could use that signal to slow down the rate at which it accepts new incoming work. You could make the case that the overall system shouldn't overrun the slowest component, but neither of those mechanisms exists right now.

Where do you think the bottleneck is? Do you think it's in the less-than-perfect fpm plugin? Or do you think it's in the zebra main pthread, trying to work through a big impulse of changes?

@donaldsharp
Member

> I understand your perspective, but I think there might be some differences in our understanding of this section of the code. One of the core reasons I think it can be modified is that the event that runs meta_queue_process and the event pointed to by t_dplane both run on the zebra main pthread. This means there is no concurrent scheduling between the two; to put it more absolutely, they essentially take turns being scheduled. In rib_process_dplane_results, there is no mechanism to pause or terminate processing by limiting the number of iterations, so every time the event runs it processes all of the result contexts before exiting. Therefore, I believe that having meta_queue_process wait for rib_process_dplane_results to complete will not introduce blocking.

This makes no sense to me. The very fact that this solution was raised as fixing the problem clearly implies that blocking the processing of new routes while the master zebra pthread has contexts to handle slows the master pthread down enough to allow the dplane pthread to clear its queues. This also implicitly implies that the contexts are sitting in the rib_dplane_q, which is a bit odd. What queue exactly are the contexts sitting in?

> We could also add a mechanism that actively calls work_queue_schedule to restart meta_queue_process after rib_process_dplane_results has finished, instead of waiting for the default retry time. The reason I haven't made such a change is that the existing blocking rule, i.e. the queue_len > queue_limit check, does not use work_queue_schedule to actively restart meta_queue_process either. This is something to consider regarding whether work_queue_schedule is necessary.

This sure sounds like it's needed as a separate thing to fix in its own right.

> Additionally, regarding the processing-time issue with dplane_fpm_nl, I've also tracked it, and it turns out it is not due to a slow communication response from the underlying process. The difference between dplane_kernel and dplane_fpm is that dplane_kernel, within its dp_fp callback (kernel_update_multi), encapsulates all netlink messages synchronously and writes them directly into the kernel. dplane_fpm's dp_fp (fpm_nl_process), on the other hand, transfers the contexts to fnc->ctxqueue and then schedules an asynchronous event (fpm_process_queue), waiting for that event to be scheduled before writing data to the fpm.

This also sounds like a clearer path of fixing that needs to be done, and frankly sounds like a higher-value problem that should be addressed first.

> This process is much slower than dispatching directly to the kernel, which in turn reduces the overall efficiency of the dplane queue. In my practical tests, fpmsyncd reads data from the socket very quickly and does not cause any blocking in the execution of the fpm_nl_enqueue function.

This doesn't make a whole bunch of sense to me. In my local testing I cannot make the context queue grow when using dplane_fpm_nl. I am also not using fpmsyncd as a listener. I do know that I can make it grow arbitrarily when I intentionally introduce a delay in responding. Perhaps we don't fully understand where the problem is at this point in time.

github-actions bot commented Jul 2, 2024

This PR is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this PR closed.
