Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pimd: Fix for data packet loss when FHR is LHR and RP #14227

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

routingrocks
Copy link
Contributor

Topology:
A single router is acting as the First Hop Router (FHR), Last Hop Router (LHR), and RP.

RC and Issue:
When an upstream S,G is in join state, it sends a register message to the RP. If the RP has the receiver, it sends a register stop message and switches to the shortest path. When the register stop message is processed, it removes pimreg, moves to prune, and starts the reg stop timer.

When the reg stop timer expires, PIM changes S,G state to Join Pending and sends out a NULL register message to RP. RP receives it and fails to send Reg stop because SPT is not set at that point.

The problem is when the register stop timer pops and state is in Join Pending. According to https://www.rfc-editor.org/rfc/rfc4601#section-4.4.1, we need to put back the pimreg reg tunnel into the S,G mroute. This causes data to be sent to the control plane and subsequently interrupts the line rate.

Fix:
If the router is FHR and RP to the group,
ignore SPT status and send out a register stop message back to the DR (in this context, the same router).

Ticket: #
Signed-off-by: Donald Sharp [email protected]
Signed-off-by: Rajesh Varatharaj [email protected]

@mwinter-osr
Copy link
Member

CI:rerun rerunning with CI webhook after disabling the CI Checks App

@NetDEF-CI
Copy link
Collaborator

Continuous Integration Result: FAILED

Test incomplete. See below for issues.
CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-13620/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Get source / Pull Request: Successful

Building Stage: Successful

Basic Tests: Incomplete

Addresssanitizer topotests part 4: Incomplete (check logs for details)
Successful on other platforms/tests
  • Addresssanitizer topotests part 3
  • Topotests Ubuntu 18.04 amd64 part 7
  • Topotests debian 10 amd64 part 1
  • Topotests Ubuntu 18.04 i386 part 5
  • Topotests Ubuntu 18.04 i386 part 0
  • Topotests Ubuntu 18.04 amd64 part 4
  • CentOS 7 rpm pkg check
  • Topotests Ubuntu 18.04 i386 part 9
  • Addresssanitizer topotests part 2
  • Topotests Ubuntu 18.04 amd64 part 0
  • Addresssanitizer topotests part 1
  • Topotests Ubuntu 18.04 i386 part 4
  • Topotests debian 10 amd64 part 5
  • Topotests Ubuntu 18.04 amd64 part 1
  • Topotests debian 10 amd64 part 7
  • Debian 9 deb pkg check
  • Addresssanitizer topotests part 8
  • Topotests Ubuntu 18.04 amd64 part 6
  • Topotests Ubuntu 18.04 i386 part 2
  • Topotests Ubuntu 18.04 arm8 part 8
  • Ubuntu 18.04 deb pkg check
  • Topotests debian 10 amd64 part 6
  • Topotests Ubuntu 18.04 arm8 part 6
  • Topotests Ubuntu 18.04 arm8 part 1
  • Topotests Ubuntu 18.04 i386 part 7
  • Addresssanitizer topotests part 6
  • Addresssanitizer topotests part 5
  • Topotests Ubuntu 18.04 arm8 part 3
  • Topotests Ubuntu 18.04 i386 part 3
  • Topotests Ubuntu 18.04 amd64 part 2
  • Topotests Ubuntu 18.04 i386 part 8
  • Topotests debian 10 amd64 part 3
  • Topotests debian 10 amd64 part 8
  • Addresssanitizer topotests part 0
  • Topotests Ubuntu 18.04 arm8 part 4
  • Topotests debian 10 amd64 part 4
  • Topotests debian 10 amd64 part 9
  • Topotests Ubuntu 18.04 arm8 part 9
  • Topotests Ubuntu 18.04 amd64 part 3
  • Topotests Ubuntu 18.04 arm8 part 2
  • Static analyzer (clang)
  • Topotests debian 10 amd64 part 0
  • Addresssanitizer topotests part 9
  • Topotests debian 10 amd64 part 2
  • Topotests Ubuntu 18.04 i386 part 6
  • Topotests Ubuntu 18.04 arm8 part 0
  • Topotests Ubuntu 18.04 arm8 part 7
  • Topotests Ubuntu 18.04 amd64 part 8
  • Topotests Ubuntu 18.04 arm8 part 5
  • Topotests Ubuntu 18.04 i386 part 1
  • Topotests Ubuntu 18.04 amd64 part 9
  • Addresssanitizer topotests part 7
  • Ubuntu 20.04 deb pkg check
  • Debian 10 deb pkg check
  • Topotests Ubuntu 18.04 amd64 part 5

Copy link
Contributor

@mobash-rasool mobash-rasool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is how the flow should be:
(S,G) - OIL list should be having pimreg + Interface-connected-to-LHR (due to inheritance) initially
The kernel stats must be getting incremented now since the traffic is coming over (S,G) and hence upstream->sptbit should have been set and pimreg removed and that way the register-stop would have been sent too due to spt bit. And this flap of adding and removing of pimreg would not be occurring.
So we need to check why upstream->sptbit could not be set, is there some problem with kernel stats or we are unable to reach the point where upstream->sptbit is set in pimd.

|| ((SwitchToSptDesiredOnRp(pim, &sg))
&& pim_upstream_inherited_olist(pim, upstream) == 0)) {
if ((upstream->sptbit == PIM_UPSTREAM_SPTBIT_TRUE) ||
(PIM_UPSTREAM_FLAG_TEST_FHR(upstream->flags) && i_am_rp) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed over slack, please check why upstream->sptbit is not getting set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got occupied with different work internally. I will check and update here later.

@RodrigoMNardi
Copy link
Contributor

ci:rerun

Copy link

This PR is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this PR closed.

@routingrocks
Copy link
Contributor Author

ci:rerun

@Jafaral
Copy link
Member

Jafaral commented Aug 23, 2024

Can you rebase instead ?

@routingrocks
Copy link
Contributor Author

Failure is not related to the change, re-running CI
E AssertionError: Testcase test_verify_default_originate_with_2way_ecmp_p2 : After shuting down the interface Convergence is expected to be Failed assert True is not Tru

Topology:
A single router is acting as the First Hop Router (FHR), Last Hop Router (LHR), and RP.

RC and Issue:
When an upstream S,G is in join state, it sends a register message to the RP.
If the RP has the receiver, it sends a register stop message and switches to the shortest path.
When the register stop message is processed, it removes pimreg, moves to prune,
and starts the reg stop timer.

When the reg stop timer expires, PIM changes S,G state to Join Pending and sends out a NULL
register message to RP. RP receives it and fails to send Reg stop because SPT is not set at that point.

The problem is when the register stop timer pops and state is in Join Pending.
According to https://www.rfc-editor.org/rfc/rfc4601#section-4.4.1,
we need to put back the pimreg reg tunnel into the S,G mroute.
This causes data to be sent to the control plane and subsequently interrupts the line rate.

Fix:
If the router is FHR and RP to the group,
ignore SPT status and send out a register stop message back to the DR (in this context, the same router).

Ticket: #3506780

Signed-off-by: Donald Sharp <[email protected]>
Signed-off-by: Rajesh Varatharaj <[email protected]>
@routingrocks
Copy link
Contributor Author

ci:rerun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants