# Known Issues

The following is a list of current limitations of the Fabric, which we are
working hard to address:

* [Deleting a VPC and creating a new one right away can cause the agent to fail](#deleting-a-vpc-and-creating-a-new-one-right-away-can-cause-the-agent-to-fail)
* [VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch](#vpc-local-peering-can-cause-the-agent-to-fail-if-subinterfaces-are-not-supported-on-the-switch)
* [External peering over a connection originating from an MCLAG switch can fail](#external-peering-over-a-connection-originating-from-an-mclag-switch-can-fail)
* [MCLAG leaf with no surviving spine connection will blackhole traffic](#mclag-leaf-with-no-surviving-spine-connection-will-blackhole-traffic)

### Deleting a VPC and creating a new one right away can cause the agent to fail

The issue is due to limitations in SONiC's gNMI interface. In this particular case,
deleting and creating a VPC back-to-back (e.g. using a script or the Go API)
can lead to the reuse of the deleted VPC's VNI before the deletion has taken effect.
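A back-to-back sequence like the following can trigger the race (the VPC name and manifest file are placeholders):

```shell
# The delete returns as soon as the Kubernetes object is gone...
kubectl delete vpc vpc-02

# ...so an immediate re-create may try to reuse the VNI while the switch-side
# deletion is still in flight.
kubectl apply -f vpc-02.yaml
```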
| 16 | + |
| 17 | +#### Diagnosing this issue |
| 18 | + |
| 19 | +The applied generation of the affected agent reported by kubectl will not |
| 20 | +converge to the last desired generation. Additionally, the agent logs on the switch |
| 21 | +(accessible at `/var/log/agent.log`) will contain an error similar to the following one: |
| 22 | + |
| 23 | +``` |
| 24 | +time=2025-03-23T12:26:19.649Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = VNI is already used in VRF VrfVvpc-02" |
| 25 | +``` |
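One way to check convergence is to compare the agent's desired generation against the applied one via kubectl. A sketch, assuming the agent is exposed as an `agent` resource and the applied generation lives in its status (both assumptions; adjust to your deployment):

```shell
# List the fabric agents (resource name is an assumption).
kubectl get agents

# Compare desired vs. applied generation for one agent; the status field
# path is assumed and may differ in your release.
kubectl get agent <switch-name> \
  -o jsonpath='{.metadata.generation}{" "}{.status.lastAppliedGen}{"\n"}'
```

If the two numbers never converge, inspect `/var/log/agent.log` on the switch for the error above.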
| 26 | + |
| 27 | +#### Known workarounds |
| 28 | + |
| 29 | +Deleting the pending VPCs will allow the agent to reconverge. After that, the |
| 30 | +desired VPCs can be safely created. |
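A minimal sketch of the workaround, assuming the stuck objects are known by name (names and manifest file are placeholders):

```shell
# Delete the pending VPCs so the agent can reconverge.
kubectl delete vpc vpc-01 vpc-02

# Block until the objects are actually gone (or the timeout expires)
# before recreating anything.
kubectl wait --for=delete vpc/vpc-01 vpc/vpc-02 --timeout=120s

# Now the desired VPCs can be safely re-applied.
kubectl apply -f vpcs.yaml
```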
| 31 | + |
| 32 | +### VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch |
| 33 | + |
| 34 | +As explained in the [Architecture page](../architecture/fabric.md#vpc-peering), to workaround |
| 35 | +limitations in SONiC, local VPCPeering is implemented over a pair of loopback interfaces. |
| 36 | +This workaround requires subinterface support on the switch where the VPCPeering is being |
| 37 | +instantiated. If the affected switch does not meet this requirement, the agent will fail |
| 38 | +to apply the desired configuration. |
| 39 | + |
| 40 | +#### Diagnosing this issue |
| 41 | + |
| 42 | +The applied generation of the affected agent reported by kubectl will not |
| 43 | +converge to the last desired generation. Additionally, the agent logs on the switch |
| 44 | +(accessible at `/var/log/agent.log`) will contain an error similar to the following one: |
| 45 | + |
| 46 | +``` |
| 47 | +time=2025-02-04T13:37:58.675Z level=DEBUG msg=Action idx=90 weight=33 summary="Create Subinterface Base 101" command=update path="/interfaces/interface[name=Ethernet16]/subinterfaces/subinterface[index=101]" |
| 48 | +time=2025-02-04T13:37:58.796Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = SubInterfaces are not supported" |
| 49 | +``` |
| 50 | + |
| 51 | +#### Known workarounds |
| 52 | + |
| 53 | +Configure remote VPCPeering wherever local peering would target a switch not supporting |
| 54 | +subinterfaces. You can double-check whether your switch model supports them by looking at |
| 55 | +the [Switch Profiles Catalog](../reference/profiles.md) entry for it. |
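The profile may also be queryable from the cluster. A sketch, assuming the catalog is exposed as a `switchprofile` resource with a feature flag for subinterfaces (both assumptions; the linked catalog page is authoritative):

```shell
# List available switch profiles (resource name is an assumption).
kubectl get switchprofiles

# Look for a subinterface feature flag in the profile for your model.
kubectl get switchprofile <model> -o yaml | grep -i subinterface
```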
| 56 | + |
| 57 | +### External peering over a connection originating from an MCLAG switch can fail |
| 58 | + |
| 59 | +When importing routes via [External Peering](../user-guide/external.md) over a connection |
| 60 | +originating from an MCLAG leaf switch, traffic from the peered VPC towards that |
| 61 | +prefix can be blackholed. This is due to a routing mismatch between the two MCLAG leaves, |
| 62 | +where only one switch learns the imported route. Packets hitting the "wrong" leaf will |
| 63 | +be dropped with a Destination Unreachable error. |
| 64 | + |
| 65 | +#### Diagnosing this issue |
| 66 | + |
| 67 | +No connectivity from the workload server(s) in the VPC towards the prefix routed via the external. |
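To confirm the routing mismatch, you can check whether each MCLAG leaf has learned the imported route from the FRR shell on the switch (VRF name and prefix below are placeholders):

```shell
# Run on both MCLAG leaves; with this issue, only one of them shows the route.
vtysh -c "show ip route vrf VrfVmy-vpc 203.0.113.0/24"
```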
| 68 | + |
| 69 | +#### Known workarounds |
| 70 | + |
| 71 | +Connect your externals to non-MCLAG switches instead. |
| 72 | + |
| 73 | +### MCLAG leaf with no surviving spine connection will blackhole traffic |
| 74 | + |
| 75 | +When a leaf switch in an MCLAG pair loses its last uplink to the spine, the BGP |
| 76 | +session to the spine goes down, causing the leaf to stop advertising and receiving |
| 77 | +EVPN routes. This leads to blackholing of traffic for endpoints connected to the |
| 78 | +isolated leaf, as the rest of the fabric no longer has reachability information for |
| 79 | +those endpoints, even though the MCLAG peering session is up. |
| 80 | + |
| 81 | +#### Diagnosing this issue |
| 82 | + |
| 83 | +Traffic destined for endpoints connected to the leaf is blackholed. All BGP sessions |
| 84 | +from the affected leaf towards the spines are down. |
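On the affected leaf, the session state can be confirmed from FRR's shell (output format varies by SONiC release):

```shell
# vtysh is FRR's CLI; sessions towards the spines will show a state such as
# Active or Connect instead of an Established prefix count.
vtysh -c "show bgp summary"
```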
| 85 | + |
| 86 | +#### Known workarounds |
| 87 | + |
| 88 | +None. |