Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nhrp maintain used state #14739

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

pguibert6WIND
Copy link
Member

The problem is due to the 'U'sed flag of dynamic cache entries (shortcut), which is set to false, despite its use.
Consequently, the shortcut associated goes in expiring state.

Fix this by ensuring that the 'U'sed flag is properly set to true, as it is always maintained by NHRP. Only NHRP can decide whether the cache entry is valid or not.

2023/11/06 11:25:16 NHRP: [KHACV-6YE92] Send Resolution-Request(1) 11.255.255.2 -> 11.255.255.4
2023/11/06 11:25:16 NHRP: [WSA6E-5GM0H] PACKET: Send 10.125.0.2 -> 10.125.0.4
2023/11/06 11:25:16 NHRP: [K0534-5VD2M] PACKET: Recv 10.125.0.4 -> 10.125.0.2
2023/11/06 11:25:16 NHRP: [KHACV-6YE92] Recv Resolution-Reply(2) 11.255.255.4 -> 11.255.255.2
2023/11/06 11:25:16 NHRP: [KHACV-6YE92] !LOCAL Resolution-Reply(2) 11.255.255.4 -> 11.255.255.2
2023/11/06 11:25:16 NHRP: [KZFC5-JYQF6] Shortcut: 11.255.255.4/32 is at proto 11.255.255.4 dst_proto 11.255.255.4 NBMA 10.125.0.4 cie-holdtime 1200
2023/11/06 11:25:16 NHRP: [J0Y1F-TGYD3] Shortcut: cache found, update binding
2023/11/06 11:25:16 NHRP: [VZQX2-A5SP9] cache: same type 4, updating expiry and changing nbma addr from 10.125.0.4 to 10.125.0.4
2023/11/06 11:25:16 NHRP: [MWP37-NGEEE] Shortcut: calling update_binding
2023/11/06 11:25:16 NHRP: [TP7X1-TKH88] Shortcut: Resolution reply handled

@@ -746,7 +746,7 @@ static void show_ip_nhrp_cache(struct nhrp_cache *c, void *pctx)
nhrp_cache_type_str[c->cur.type]);
json_object_string_add(json, "protocol", buf[0]);
json_object_string_add(json, "nbma", buf[1]);
json_object_string_add(json, "claimed_nbma", buf[2]);
json_object_string_add(json, "claimedNbma", buf[2]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please go through an appropriate deprecation cycle with this change. Even if it is wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I proposed to keep the old value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should mark this as a deprecation anyway (claimed_nbma) because it's not compliant with our JSON requirements for the key names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pguibert6WIND can you add deprecation for claimed_nbma? (CONFDATE)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from 9378d4a to 0cae44f Compare November 6, 2023 17:13
@github-actions github-actions bot added size/M and removed size/L labels Nov 6, 2023
@pguibert6WIND
Copy link
Member Author

The problem is due to the 'U'sed flag of dynamic cache entries (shortcut), which is set to false, despite its use. Consequently, the shortcut associated goes in expiring state.

Fix this by ensuring that the 'U'sed flag is properly set to true, as it is always maintained by NHRP. Only NHRP can decide whether the cache entry is valid or not.

2023/11/06 11:25:16 NHRP: [KHACV-6YE92] Send Resolution-Request(1) 11.255.255.2 -> 11.255.255.4
2023/11/06 11:25:16 NHRP: [WSA6E-5GM0H] PACKET: Send 10.125.0.2 -> 10.125.0.4
2023/11/06 11:25:16 NHRP: [K0534-5VD2M] PACKET: Recv 10.125.0.4 -> 10.125.0.2
2023/11/06 11:25:16 NHRP: [KHACV-6YE92] Recv Resolution-Reply(2) 11.255.255.4 -> 11.255.255.2
2023/11/06 11:25:16 NHRP: [KHACV-6YE92] !LOCAL Resolution-Reply(2) 11.255.255.4 -> 11.255.255.2
2023/11/06 11:25:16 NHRP: [KZFC5-JYQF6] Shortcut: 11.255.255.4/32 is at proto 11.255.255.4 dst_proto 11.255.255.4 NBMA 10.125.0.4 cie-holdtime 1200
2023/11/06 11:25:16 NHRP: [J0Y1F-TGYD3] Shortcut: cache found, update binding
2023/11/06 11:25:16 NHRP: [VZQX2-A5SP9] cache: same type 4, updating expiry and changing nbma addr from 10.125.0.4 to 10.125.0.4
2023/11/06 11:25:16 NHRP: [MWP37-NGEEE] Shortcut: calling update_binding
2023/11/06 11:25:16 NHRP: [TP7X1-TKH88] Shortcut: Resolution reply handled

The main fix is simpler, as it modified the netlink filter so that we are now able to receive NEWNEIGH notifications.

@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from 945b495 to 6f71f4e Compare November 7, 2023 20:16
@@ -9,7 +9,6 @@
"protocol":"10.255.255.2",
"nbma":"10.2.1.2",
"claimed_nbma":"10.2.1.2",
"used":false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this value unconditionally, any reason to remove it from here?

Copy link
Member Author

@pguibert6WIND pguibert6WIND Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'used' attribute is dynamic and relies on the state of the
system neighbor entry attached to the protocol address. Periodically,
the value is set to false, when receiving a NUD_PROBE notification
from netlink. Then, NHRP re-programs the link entry and re-sets the
'used ' value to true.

Because that value may be intermittently set to false, it is better to
remove it from the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch 2 times, most recently from 851f17c to 9c34333 Compare November 28, 2023 15:52
@pguibert6WIND
Copy link
Member Author

ci:rerun

@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch 2 times, most recently from 2b593e5 to 084b59b Compare December 4, 2023 12:58
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch 2 times, most recently from 723b1a7 to f066b48 Compare December 15, 2023 08:31
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from f066b48 to 9613b07 Compare March 11, 2024 16:29
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from 9613b07 to c30a7a1 Compare March 21, 2024 17:07
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from c30a7a1 to 8f24211 Compare April 2, 2024 07:11
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch 2 times, most recently from 6f386ab to 1c9720c Compare April 15, 2024 14:25
Some json attributes are not compliant with caml format.

> # show ip nhrp cache json
> [..]
> {
>  "interface":"gre1",
>  "type":"dynamic",
>  "protocol":"11.255.255.4",
>  "nbma":"10.125.0.4",
>  "claimed_nbma":"10.125.0.4",
> [..]

Fix this by duping the 'claimed_nbma' attribute with
the 'claimedNbma' name.

Fixes: 85365e5 ("nhrpd: Add Claimed NBMA field in sh ip nhrp cache output")
Signed-off-by: Philippe Guibert <[email protected]>
When creating a shortcut, there is no information on the life cycle
of the entry. Add some more fields in the json part of the command.

> # show ip nhrp shortcut
> [..]
>   "table":[
>     {
>       "type":"dynamic",
>       "prefix":"11.255.255.4\/32",
>       "via":"11.255.255.4",
>       "identity":"south",
>       "holdingTimeSecs":1200,          <-- added
>       "routeInstalled":true,           <-- added
>       "expiring":false,                <-- added
>       "remainingTimeSecs":312          <-- added
>     }

The remaining time can be understood with the expiring flag.
If true, the timer stands for the remaining time before
shortcut is removed. Otherwise, the timer stands for the time
before a renewal attempt may happen (initially set to the 2
thirds of the 'holdingTimeSecs' value.

Signed-off-by: Philippe Guibert <[email protected]>
There is no information about the active timers in an nhrp
cache entry. Dump the cache timer and/or the authentication
timer when available. Dump the '*routeInstalled' boolean.

> # show ip nhrp cache json
> [..]
>  {
>  "interface":"gre1",
>  "type":"nhs",
> [..]
>  "used":true,
>  "routeInstalled":true,		<-- added
>  "nhrpRouteInstalled":true,		<-- added
>  "timeout":true,
>  "timeoutRemainingTimeSecs":6644,	<-- added
>  "auth":false,
>  [..]

Signed-off-by: Philippe Guibert <[email protected]>
pguibert6WIND and others added 4 commits September 3, 2024 14:16
The 'used' field of NHRP cache entries goes to false every
time it receives a netlink STALE message. Consequently, the
cache entry appears as unused until the reception of a PROBE
message triggers the neighbor entry re-configuration that
changes the entry to REACHABLE.

> '# show ip nhrp cache' command before receiving STALE message.
> Iface    Type     Protocol                 NBMA                     Claimed NBMA             Flags  Identity
> gre1     nhs      11.255.255.1             10.125.0.1               10.125.0.1                UT     west
> # ip monitor neigh dev gre1
> 10.125.0.1 lladdr de:ed:02:05:3c:b5 STALE
>
> '# show ip nhrp cache' command after sending STALE message.
> Iface    Type     Protocol                 NBMA                     Claimed NBMA             Flags  Identity
> gre1     nhs      11.255.255.1             10.125.0.1               10.125.0.1                T     west

Fix this by ignoring the STALE state, and do not change the cache
entry 'used' flag.

Signed-off-by: Philippe Guibert <[email protected]>
When creating an NHRP shortcut entry between two spokes, the cache
entry created has very often a field 'used' set to false, despite
a continuous traffic towards the 11.255.255.1 IP address.

> north-vm# show ip nhrp cache
> Iface    Type     Protocol                 NBMA                     Claimed NBMA             Flags  Identity
> gre1     local    11.255.255.2             10.125.0.2               10.125.0.2                      -
> gre1     nhs      11.255.255.1             10.125.0.1               10.125.0.1               T     west     <----

Actually, that flag reflects the protocol address reachability. A neighbor
entry is maintained in the kernel. As an ARP probe operates only with MAC
address, NHRP will take the place of ARP, and will refresh the NBMA link IP
address, every time the neighbor entry enters in PROBE state. The neighbor
entry state goes in REACHABLE state, and the expectation is that the REACHABLE
state is notified to the NHRP daemon, which does not happen.

The below dump indicates 3 netlink messages over the gre1 interface:

> # trace from 'ip monitor neigh'
> miss 11.255.255.1 dev gre1 lladdr 10.125.0.1 PROBE proto 191
> 11.255.255.1 dev gre1 lladdr 10.125.0.1 PROBE proto 191
> 11.255.255.1 dev gre1 lladdr 10.125.0.1 REACHABLE proto 191

The NHRP/ZEBRA traces indicate the first 2 netlink messages received, followed
by the update of the neighbor entry by the NHRP daemon. But there is no
reception or confirmation that the new neighbor state moved to REACHABLE.

> # trace from nhrp
> 2023/11/06 09:46:21 ZEBRA: [V8KNF-8EXH8] netlink_recv_msg: << netlink message dump [recv]
> 2023/11/06 09:46:21 ZEBRA: [JAS4D-NCWGP] nlmsghdr [len=80 type=(30) GETNEIGH flags=(0x0001) {REQUEST} seq=0 pid=0]
> 2023/11/06 09:46:21 ZEBRA: [S4WS4-PS3KF] netlink_parse_info: netlink-listen (NS 0) type RTM_GETNEIGH(30), len=80, seq=0, pid=0
> 2023/11/06 09:46:21 ZEBRA: [V8KNF-8EXH8] netlink_recv_msg: << netlink message dump [recv]
> 2023/11/06 09:46:21 ZEBRA: [JAS4D-NCWGP] nlmsghdr [len=80 type=(28) NEWNEIGH flags=(0x0000) {} seq=0 pid=0]
> 2023/11/06 09:46:21 ZEBRA: [T4YQJ-83R8H]   ndm [family=2 (AF_INET) ifindex=10 state=0x0010 {PROBE} flags=0x0000 {} type=1 (UNICAST)]
> 2023/11/06 09:46:21 ZEBRA: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(1) DST]
> 2023/11/06 09:46:21 ZEBRA: [M8QV4-KY9C0]       11.255.255.1
> 2023/11/06 09:46:21 ZEBRA: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(2) LLADDR]
> 2023/11/06 09:46:21 ZEBRA: [V74GD-NYS6Y]       0A:7D:00:01
> 2023/11/06 09:46:21 ZEBRA: [KFBSR-XYJV1]     rta [len=8 (payload=4) type=(4) PROBES]
> 2023/11/06 09:46:21 ZEBRA: [KFBSR-XYJV1]     rta [len=20 (payload=16) type=(3) CACHEINFO]
> 2023/11/06 09:46:21 ZEBRA: [KFBSR-XYJV1]     rta [len=5 (payload=1) type=(12) UNKNOWN]
> 2023/11/06 09:46:21 ZEBRA: [S4WS4-PS3KF] netlink_parse_info: netlink-listen (NS 0) type RTM_NEWNEIGH(28), len=76, seq=0, pid=0
> 2023/11/06 09:46:21 ZEBRA: [TDS34-MNEJW]     Neighbor Entry received is not on a VLAN or a BRIDGE, ignoring
> 2023/11/06 09:46:21 NHRP: [QQ0NK-1H449] Netlink: who-has 11.255.255.1 dev gre1 lladdr 10.125.0.1 nud 0x10 cache used 0 type 5
> 2023/11/06 09:46:21 NHRP: [QVXNM-NVHEQ] Netlink: update binding for 11.255.255.1 dev gre1 from c (unspec) peer.vc.nbma 10.125.0.1 to lladdr 10.125.0.1
> 2023/11/06 09:46:21 NHRP: [QQ0NK-1H449] Netlink: new-neigh 11.255.255.1 dev gre1 lladdr 10.125.0.1 nud 0x10 cache used 1 type 5
> 2023/11/06 09:46:21 ZEBRA: [NH6N7-54CD1] Tx RTM_NEWNEIGH family ipv4 IF gre1(10) Neigh 11.255.255.1 Link  10.125.0.1 flags 0x0 state 0x2 ext_flags 0x0
> 2023/11/06 09:46:21 ZEBRA: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=52, msg cnt=1

The NHRP daemon relies on zebra netlink layer. Read and write operations are
done on two different sockets. As a filter is attached to the read socket to
prevent from reading notifications from the write socket, the NEWNEIGH
operations from NHRP are ignored.

Fix this by adding an exception in the netlink filter to autorise NEWNEIGH
notifications. Consequently, the REACHABLE state is notified to NHRP.

> north-vm# show ip nhrp cache
> Iface    Type     Protocol                 NBMA                     Claimed NBMA             Flags  Identity
> gre1     local    11.255.255.2             10.125.0.2               10.125.0.2                      -
> gre1     nhs      11.255.255.1             10.125.0.1               10.125.0.1               UT     west     <----

Link: https://flylib.com/books/3/475/1/html/2/images/0131777203/graphics/15fig06.gif
Fixes: b3b7510 ("nhrpd: link layer registration to notificationas")
Signed-off-by: Philippe Guibert <[email protected]>
Expect expect nhrp used state true.

Signed-off-by: Louis Scalbert <[email protected]>
Use the check_ping library function.

Signed-off-by: Louis Scalbert <[email protected]>
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from 1c9720c to a8c06a5 Compare September 3, 2024 12:48
@frrbot frrbot bot added the tests Topotests, make check, etc label Sep 3, 2024
@github-actions github-actions bot added size/L and removed size/M labels Sep 3, 2024
The test expected the NHRP shortcut on nhc1 to expire after 10s in the
absence of traffic. Previous patches no longer cause shortcuts to expire
unless the remote network is down or the kernel has removed neighbor
entries to the remote network.

Test that the shortcut only expires in these cases.

Note that the shortcut mechanism is based on NHRP type 8 (NHRP traffic
indication) packets that is not part of RFC2332 but of
draft-detienne-dmvpn-01. The tested behavior conforms with:

> an NHRP cache entry is created for TunS1 => PubS1.  The entry
> SHOULD NOT remain in the cache for more than the specified Hold
> Time (from the NHRP Resolution Request).  This NHRP cache entry
> may be 'refreshed' for another hold time period prior to expiry by
> receipt of another matching NHRP Resolution Request or by sending
> an NHRP Resolution Request and receiving an NHRP Resolution Reply.

Signed-off-by: Louis Scalbert <[email protected]>
@pguibert6WIND pguibert6WIND force-pushed the nhrp_maintain_used_state branch from a8c06a5 to c06c32e Compare September 3, 2024 12:51
@louis-6wind
Copy link
Contributor

ci:rerun

1 similar comment
@RodrigoMNardi
Copy link
Contributor

ci:rerun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants