Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zebra final shutdown finally #14811

Merged
merged 13 commits into from
Nov 28, 2023

Conversation

donaldsharp
Copy link
Member

See individual commits but more examples of memory leaks on shutdown that we need to clean up.

@donaldsharp
Copy link
Member Author

ci:rerun

@donaldsharp donaldsharp force-pushed the zebra_final_shutdown_finally branch from b0e588d to 49bbbd7 Compare November 20, 2023 20:41
Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - I'd like to look into some of the zebra changes, there's already quite a bit of mpls cleanup (for example) in other places, may make sense just to enhance that.

On shutdown go through and ensure that any contexts the
dplane provider holds are freed.

Signed-off-by: Donald Sharp <[email protected]>
The MTYPE_BGP memory type was being over used as
both the handler for the bgp instance itself as
well as memory associated with name strings.
Let's separate out the two.

Signed-off-by: Donald Sharp <[email protected]>
When bgp is shutting down, it calls bgp_fsm_change_status
on everything including a self peer, which goes through
and cleans the tables of the self peer data structures
as if it's a real peer.  Add a bit of code to just
not do the work at all.  This allows unlocks to flow
a bit further and for the self peer to be deleted
on shutdown.

Signed-off-by: Donald Sharp <[email protected]>
a) Rename rib_init to zebra_rib_init() to better follow how
things are named

b) on shutdown cycle through the rib_dplane_q and free
up any contexts sitting in it.

Signed-off-by: Donald Sharp <[email protected]>
@donaldsharp donaldsharp force-pushed the zebra_final_shutdown_finally branch from 49bbbd7 to 71f7ecb Compare November 21, 2023 17:41
@donaldsharp
Copy link
Member Author

ci:rerun

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@riw777
Copy link
Member

riw777 commented Nov 28, 2023

Hmm - I'd like to look into some of the zebra changes, there's already quite a bit of mpls cleanup (for example) in other places, may make sense just to enhance that.

I guess it depends on whether we want to get these in now, or wait on the other enhancement pr's?

@choppsv1
Copy link
Contributor

choppsv1 commented Nov 28, 2023

Hmm - I'd like to look into some of the zebra changes, there's already quite a bit of mpls cleanup (for example) in other places, may make sense just to enhance that.

I guess it depends on whether we want to get these in now, or wait on the other enhancement pr's?

I'd like to see this merged now. At least to get rid of all the memleaks

@choppsv1 choppsv1 merged commit bb6fe6b into FRRouting:master Nov 28, 2023
78 checks passed
@rwgbsd
Copy link
Contributor

rwgbsd commented Dec 30, 2023

It appears that when this was merged with master it blew up the ASAN reports we had spent so much time getting clean. I am unclear at this time as to why things procedded forward after the master branch began failing what where suppose to be fatal ASAN tests. You can see the merge to master and the leaks in the ASAN1 report at https://ci1.netdef.org/browse/FRR-FRR-7233

@choppsv1
Copy link
Contributor

choppsv1 commented Dec 30, 2023

It appears that when this was merged with master it blew up the ASAN reports we had spent so much time getting clean. I am unclear at this time as to why things procedded forward after the master branch began failing what where suppose to be fatal ASAN tests. You can see the merge to master and the leaks in the ASAN1 report at https://ci1.netdef.org/browse/FRR-FRR-7233

I had said I would look into this, but I thought this was a different PR. :) I'll let @donaldsharp address this first.

Why wasn't this failure flagged on the pre-merge CI test though?

@rwgbsd
Copy link
Contributor

rwgbsd commented Dec 30, 2023

Why wasn't this failure flagged on the pre-merge CI test though?

It appears as if some communications between Martin and myself was not clear and the only test that became fatal was post-merge into master. Though it is still unclear how even this failure went unaddressed for 30 days. Sadly with the leaks caused by this PR in the tree we can not attempt to turn on a pre-merge fatal state as then ALL PR runs would fail.

@choppsv1
Copy link
Contributor

... Sadly with the leaks caused by this PR in the tree we can not attempt to turn on a pre-merge fatal state as then ALL PR runs would fail.

This PR was all about fixing leaks, so it's odd/ironic that it caused them. In any case presumably we can get them fixed soon and enable failure in CI going forward.

@rwgbsd
Copy link
Contributor

rwgbsd commented Dec 30, 2023

My gut feeling from reading through some of the ASAN reports is that this "fix" mearly freed top level structures that had pointers inside them to other data. Most of the ASAN reports are for a massive slew of "indirect" leeks. I tried to create a revert PR from this one, but that fails, probably due to intervening changes. IMHO, we should first get these changes all backed out and get back to the clean ASAN state ASAP, fix the CI to cause pre-PR runs to be fatal for ASAN, then go back after what this purposed to fix. If my "git" foo was only better...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants