-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bgpd: fix flushing fifo when no space in stream #13625
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@Mergifyio backport stable/8.5 stable/8.4 |
🟠 Waiting for conditions to match
|
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDTest incomplete. See below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: IncompleteAddresssanitizer topotests part 6: Incomplete(check logs for details)Topotests debian 10 amd64 part 4: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO4DEB10AMD64-11798/test Topology Tests failed for Topotests debian 10 amd64 part 4 Topotests Ubuntu 18.04 arm8 part 9: Failed (click for details)Topotests Ubuntu 18.04 arm8 part 9: Unknown Log URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-11798/artifact/TOPO9U18ARM8/TopotestDetails/ Topotests Ubuntu 18.04 arm8 part 9: No useful log foundTopotests debian 10 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9DEB10AMD64-11798/test Topology Tests failed for Topotests debian 10 amd64 part 9 Successful on other platforms/tests
|
ci:rerun |
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9U18AMD64-11803/test Topology Tests failed for Topotests Ubuntu 18.04 amd64 part 9 Successful on other platforms/tests
|
@@ -761,7 +761,7 @@ struct bpacket *subgroup_update_packet(struct update_subgroup *subgrp) | |||
/* Flush the FIFO update queue */ | |||
while (adv) | |||
adv = bgp_advertise_clean_subgroup( | |||
subgrp, adj); | |||
subgrp, adv->adj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes no sense to me. adv->adj is the same value as adj( see line 696 ). What path can we get to this point where adj is not adv->adj? Furthermore why would we ever have a null adj value? That seems like that is the actual problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, @louis-6wind is right (despite not being sure how adj comes as NULL), but we should reuse a new value of adv
instead of the previous loop. Regarding adj
as NULL, is this happening on UPDATE or WITHDRAW of prefixes? How can it be verified? In other words, @louis-6wind how did you get this crash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All explanations are already in the commit log!
This makes no sense to me. adv->adj is the same value as adj( see line 696 ). What path can we get to this point where adj is not adv->adj?
Because it is inside a while loop line 762.
Furthermore why would we ever have a null adj value? That seems like that is the actual problem.
NULL is set here
Line 442 in 49429dc
adj->adv = NULL; |
It was a customer crash. Not sure how to reproduce it. I have forced this code to be tested :
diff --git a/bgpd/bgp_updgrp_packet.c b/bgpd/bgp_updgrp_packet.c
index b7a2ffe27b..be850bac20 100644
--- a/bgpd/bgp_updgrp_packet.c
+++ b/bgpd/bgp_updgrp_packet.c
@@ -716,9 +716,10 @@ struct bpacket *subgroup_update_packet(struct update_subgroup *subgrp)
+ bgp_packet_mpattr_prefix_size(afi, safi, dest_p);
/* When remaining space can't include NLRI and it's length. */
- if (space_remaining < space_needed)
+ if (space_remaining < space_needed && 0)
break;
+ s->endp = 0;
/* If packet is empty, set attribute. */
if (stream_empty(s)) {
struct peer *from = NULL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really like to understand how this pointer is becoming NULL. That is the real problem, imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this makes no sense since adv is removed from the fifo. How is it null and still on the fifo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to further clarify, the fact that we have a adv that is null that we want to use implies that we have the same adj multiple times on the baa.(is this legal?) This seems like a fundamental mistake that should be fixed elsewhere instead of here as that we have violated some other principal. Again I ask for a better understanding of how we got into this mess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@Mergifyio backport dev/9.0 |
🟠 Waiting for conditions to match
|
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9U18AMD64-11803/test Topology Tests failed for Topotests Ubuntu 18.04 amd64 part 9 Successful on other platforms/tests
|
@louis-6wind could you rebase and push? |
49429dc
to
d1899bd
Compare
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9U18AMD64-12491/test Topology Tests failed for Topotests Ubuntu 18.04 amd64 part 9 Successful on other platforms/tests
|
1 similar comment
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9U18AMD64-12491/test Topology Tests failed for Topotests Ubuntu 18.04 amd64 part 9 Successful on other platforms/tests
|
ci:rerun |
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 i386 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9U18I386-12534/test Topology Tests failed for Topotests Ubuntu 18.04 i386 part 9 Successful on other platforms/tests
|
ci:rerun |
Continuous Integration Result: FAILEDContinuous Integration Result: FAILEDSee below for issues. This is a comment from an automated CI system. Get source / Pull Request: SuccessfulBuilding Stage: SuccessfulBasic Tests: FailedTopotests Ubuntu 18.04 amd64 part 6: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO6U18AMD64-12552/test Topology Tests failed for Topotests Ubuntu 18.04 amd64 part 6 Topotests Ubuntu 18.04 arm8 part 7: Failed (click for details)Topotests Ubuntu 18.04 arm8 part 7: Unknown Log URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-12552/artifact/TOPO7U18ARM8/TopotestDetails/ Topotests Ubuntu 18.04 arm8 part 7: No useful log foundTopotests debian 10 amd64 part 9: Failed (click for details)Topology Test Results are at https://ci1.netdef.org/browse/FRR-PULLREQ2-TOPO9DEB10AMD64-12552/test Topology Tests failed for Topotests debian 10 amd64 part 9 Successful on other platforms/tests
|
A better approach seems to be moving the route to the withdraw queue when a larger attribute is discovered. Otherwise, if the route was previously sent with a smaller attribute, it cannot be properly updated to other neighbors, and a better way to handle it may be to withdraw the route from the neighbors.(?) |
The following crash happen when trying to send BGP Update from a subgroup when the stream is out of spaces: > ==594613==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000030 (pc 0x56486652f11a bp 0x7ffefaef6140 sp 0x7ffefaef6110 T0) > ==594613==The signal is caused by a READ memory access. > ==594613==Hint: address points to the zero page. > #0 0x56486652f11a in bgp_advertise_clean_subgroup bgpd/bgp_updgrp_adv.c:449 > #1 0x564866537019 in subgroup_update_packet bgpd/bgp_updgrp_packet.c:778 > #2 0x56486646d24c in bgp_generate_updgrp_packets bgpd/bgp_packet.c:439 > #3 0x7fc00dcb14b0 in thread_call lib/thread.c:1825 > FRRouting#4 0x7fc00dbcbf92 in frr_run lib/libfrr.c:1155 > FRRouting#5 0x56486634108e in main bgpd/bgp_main.c:570 > FRRouting#6 0x7fc00d70bd09 in __libc_start_main ../csu/libc-start.c:308 > FRRouting#7 0x56486633d8a9 in _start (/usr/lib/frr/bgpd+0x2a58a9) Crash at the following line in frame 0 because the 'adv' pointer is NULL: > baa = adv->baa; subgroup_update_packet() calls bgp_advertise_clean_subgroup() in a loop and provides adj. In the latter function, adv is get from adj->adv then adj->adv is set to NULL. The next call re-use the same adj and the crash occurs. Update adj before each call. Fixes: 3f9c736 ("BGP: Add dynamic update group support") Signed-off-by: Louis Scalbert <[email protected]>
d1899bd
to
92385ee
Compare
Let's fix this first. Feel free to do a rework afterward. |
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-14335/ This is a comment from an automated CI system. |
The following crash happen when trying to send BGP Update from a subgroup when the stream is out of spaces:
Crash at the following line in frame 0 because the 'adv' pointer is NULL:
subgroup_update_packet() calls bgp_advertise_clean_subgroup() in a loop and provides adj. In the latter function, adv is get from adj->adv then adj->adv is set to NULL. The next call re-use the same adj and the crash occurs.
Update adj before each call.
Fixes: 3f9c736 ("BGP: Add dynamic update group support")