dplane failed to limit the maximum length of the queue #15016

Closed · 1 of 2 tasks
zice312963205 opened this issue Dec 14, 2023 · 6 comments
Labels: autoclose, triage (Needs further investigation)

Comments

zice312963205 (Contributor) commented Dec 14, 2023


Describe the bug

FRR version: 8.2.2
Kernel version: Linux 5.10

When I learn 2 million routes from BGP neighbors in a short period of time, dplane consumes a large amount of cache.
[screenshots of zebra/dplane memory usage omitted]

It seems there is an issue where, if a large number of routes are learned in a short period of time, the dplane will occupy a substantial amount of memory, which might lead to an out-of-memory (OOM) situation.

  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?


zice312963205 added the triage (Needs further investigation) label on Dec 14, 2023
donaldsharp (Member) commented
Can we see the CLI arguments you are using on the zebra command line?

donaldsharp (Member) commented
I'd like to see a show thread cpu as well

zice312963205 (Contributor, Author) commented
I'd like to see a show thread cpu as well

[screenshot: partial output of show thread cpu]

donaldsharp (Member) commented
Why didn't you include the entirety of the show thread cpu output?

In any event, I was able to recreate something similar in my home setup. I am not sure if this is what you are reporting, but it probably is. Can you give this a try: #15025, and see if it cleans the problem up?

zice312963205 (Contributor, Author) commented

I tracked the code flow of the dplane contexts (ctx) and found the cause: after a ctx has been processed by the provider, it is parked on rib_dplane_q to wait for the results handler, and zdplane_info.dg_routes_queued is decremented at that point. This defeats the attempt in meta_queue_process to limit the number of ctx objects in flight at any one time (200 by default), because the counter no longer accounts for the contexts still sitting on rib_dplane_q. Since rib_process_dplane_results runs in zebra's main thread, it is scheduled relatively slowly, so when a large number of routes are injected in a short time, many temporary contexts accumulate on rib_dplane_q and are not freed in time, which leads to this problem.
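For reference, the admission check I am referring to is roughly the following (paraphrased from meta_queue_process in zebra/zebra_rib.c; the helper names are from zebra_dplane.h as I recall them, the body is heavily simplified, and details differ between versions):

```c
/* Simplified sketch of the existing admission check in
 * meta_queue_process() (zebra/zebra_rib.c); not the exact code. */
static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
{
	uint32_t queue_limit = dplane_get_in_queue_limit(); /* 200 by default */
	uint32_t queue_len = dplane_get_in_queue_len();     /* backed by
							     * zdplane_info.dg_routes_queued,
							     * as I understand it */

	/* Stop feeding the dataplane if its inbound queue looks full. */
	if (queue_len > queue_limit)
		return WQ_QUEUE_BLOCKED;

	/* ... otherwise drain the meta queue and enqueue route updates
	 * toward the dataplane as usual ... */
	return WQ_SUCCESS;
}
```

Because dg_routes_queued is decremented once the provider finishes with a ctx, this check no longer sees the contexts that are still parked on rib_dplane_q waiting for the main thread.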

I have an idea for a modification: check the length of rib_dplane_q in meta_queue_process, and if there are already many cached nodes, return WQ_QUEUE_BLOCKED to temporarily delay rib_process. A sketch of what I mean is below.
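Something along these lines is what I have in mind. The helper rib_dplane_q_length() is hypothetical (rib_dplane_q is a static list in zebra_rib.c, so a counter or list-count accessor would have to be added), and the threshold simply reuses the existing limit for illustration:

```c
/* Sketch of the proposed back-pressure check; rib_dplane_q_length() is a
 * hypothetical helper returning the number of contexts parked on
 * rib_dplane_q awaiting rib_process_dplane_results(). Not actual FRR code. */
static wq_item_status meta_queue_process(struct work_queue *dummy, void *data)
{
	uint32_t queue_limit = dplane_get_in_queue_limit();

	/* Existing check: contexts queued toward the dataplane. */
	if (dplane_get_in_queue_len() > queue_limit)
		return WQ_QUEUE_BLOCKED;

	/* Proposed additional check: contexts already handled by the
	 * providers but still waiting for the main thread to process the
	 * results.  If that backlog is large, pause rib_process for now. */
	if (rib_dplane_q_length() > queue_limit)
		return WQ_QUEUE_BLOCKED;

	/* ... otherwise process the meta queue as before ... */
	return WQ_SUCCESS;
}
```

The effect would be that route processing is throttled not only by the number of contexts queued toward the dataplane, but also by the number of finished contexts the main thread has not yet consumed, so memory stays bounded while routes are injected quickly.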


This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.
