Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: fix flushing fifo when no space in stream #13625

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion bgpd/bgp_updgrp_packet.c
Original file line number Diff line number Diff line change
Expand Up @@ -761,7 +761,7 @@ struct bpacket *subgroup_update_packet(struct update_subgroup *subgrp)
/* Flush the FIFO update queue */
while (adv)
adv = bgp_advertise_clean_subgroup(
subgrp, adj);
subgrp, adv->adj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes no sense to me. adv->adj is the same value as adj( see line 696 ). What path can we get to this point where adj is not adv->adj? Furthermore why would we ever have a null adj value? That seems like that is the actual problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, @louis-6wind is right (despite not being sure how adj comes as NULL), but we should reuse a new value of adv instead of the previous loop. Regarding adj as NULL, is this happening on UPDATE or WITHDRAW of prefixes? How can it be verified? In other words, @louis-6wind how did you get this crash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All explanations are already in the commit log!

This makes no sense to me. adv->adj is the same value as adj( see line 696 ). What path can we get to this point where adj is not adv->adj?

Because it is inside a while loop line 762.

Furthermore why would we ever have a null adj value? That seems like that is the actual problem.
NULL is set here

adj->adv = NULL;

It was a customer crash. Not sure how to reproduce it. I have forced this code to be tested :

diff --git a/bgpd/bgp_updgrp_packet.c b/bgpd/bgp_updgrp_packet.c
index b7a2ffe27b..be850bac20 100644
--- a/bgpd/bgp_updgrp_packet.c
+++ b/bgpd/bgp_updgrp_packet.c
@@ -716,9 +716,10 @@ struct bpacket *subgroup_update_packet(struct update_subgroup *subgrp)
 			+ bgp_packet_mpattr_prefix_size(afi, safi, dest_p);
 
 		/* When remaining space can't include NLRI and it's length.  */
-		if (space_remaining < space_needed)
+		if (space_remaining < space_needed && 0)
 			break;
 
+		s->endp = 0;
 		/* If packet is empty, set attribute. */
 		if (stream_empty(s)) {
 			struct peer *from = NULL;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to understand how this pointer is becoming NULL. That is the real problem, imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to understand how this pointer is becoming NULL. That is the real problem, imo.

Screenshot-2023-05-30-22:05:47

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this makes no sense since adv is removed from the fifo. How is it null and still on the fifo?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to further clarify, the fact that we have a adv that is null that we want to use implies that we have the same adj multiple times on the baa.(is this legal?) This seems like a fundamental mistake that should be fixed elsewhere instead of here as that we have violated some other principal. Again I ask for a better understanding of how we got into this mess.

return NULL;
}

Expand Down