# Known Issues

The following is a list of current limitations of the Fabric, which we are
working hard to address:

* [Deleting a VPC and creating a new one right away can cause the agent to fail](#deleting-a-vpc-and-creating-a-new-one-right-away-can-cause-the-agent-to-fail)
* [VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch](#vpc-local-peering-can-cause-the-agent-to-fail-if-subinterfaces-are-not-supported-on-the-switch)
* [External peering over a connection originating from an MCLAG switch can fail](#external-peering-over-a-connection-originating-from-an-mclag-switch-can-fail)
* [MCLAG leaf with no surviving spine connection will blackhole traffic](#mclag-leaf-with-no-surviving-spine-connection-will-blackhole-traffic)

### Deleting a VPC and creating a new one right away can cause the agent to fail

The issue is due to limitations in SONiC's gNMI interface. Deleting a VPC and creating
a new one back-to-back (e.g. via a script or the Go API) can cause the new VPC to
reuse the deleted VPC's VNI before the deletion has taken effect.

#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

><code>time=2025-03-23T12:26:19.649Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = VNI is already used in VRF VrfVvpc-02"</code>
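
To confirm the lack of convergence, compare the desired and applied generations of the affected Agent object. A minimal check, assuming the agents live in the `fab` namespace and `leaf-01` is the affected switch (both names are illustrative):

```bash
# List all agents; an applied generation lagging behind the current
# one indicates an agent stuck applying its config
kubectl get agents -n fab

# Inspect a single agent in detail and look for the last applied
# generation falling behind metadata.generation
kubectl get agent leaf-01 -n fab -o yaml | less
```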

#### Known workarounds

Deleting the pending VPCs will allow the agent to reconverge. After that, the
desired VPCs can be safely created.
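
To avoid this race in scripts, let the deletion fully settle before creating the replacement. A minimal sketch, assuming a VPC named `vpc-1` (the name, file, and timeout are illustrative):

```bash
# Delete the VPC and block until the API object is gone
kubectl delete vpc vpc-1 --wait
kubectl wait --for=delete vpc/vpc-1 --timeout=120s

# Only then create the replacement VPC
kubectl apply -f new-vpc.yaml
```

Note that this only confirms removal of the Kubernetes object; since the failure happens while the agent applies the change on the switch, adding a short delay (or checking agent convergence as shown above) before reusing the VNI is the safer option.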

### VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch

As explained in the [Architecture page](../architecture/fabric.md#vpc-peering), to work around
limitations in SONiC, local VPCPeering is implemented over a pair of loopback interfaces.
This workaround requires subinterface support on the switch where the VPCPeering is being
instantiated. If the affected switch does not meet this requirement, the agent will fail
to apply the desired configuration.

#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

><code>time=2025-02-04T13:37:58.796Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = SubInterfaces are not supported"</code>

#### Known workarounds

Configure remote VPCPeering instead of local peering in any case where the target switch
does not support subinterfaces. You can double-check whether your switch model supports them
by looking at the [Switch Profiles Catalog](../reference/profiles.md) entry for it.
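
You can also inspect the SwitchProfile object directly from the control node. A minimal sketch, assuming the profile advertises its capabilities in a features section of its spec (the profile name and field layout here are illustrative):

```bash
# Dump the profile and look for a subinterface capability flag
kubectl get switchprofile dell-s5248f-on -o yaml | grep -i -A20 'features'
```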

### External peering over a connection originating from an MCLAG switch can fail

When importing routes via [External Peering](../user-guide/external.md) over a connection
originating from an MCLAG leaf switch, traffic from the peered VPC towards the imported
prefix can be blackholed. This is due to a routing mismatch between the two MCLAG leaves,
where only one switch learns the imported route. Packets hitting the "wrong" leaf will
be dropped with a Destination Unreachable error.

#### Diagnosing this issue

There is no connectivity from the workload server(s) in the VPC towards the prefix routed
via the external.
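
From a workload server, the failure typically shows up as timeouts or ICMP Destination Unreachable errors, possibly only for some flows depending on which MCLAG leaf they hash to. A quick check, assuming the external prefix is `203.0.113.0/24` (an example address) and using the VRF naming visible in the agent logs above:

```bash
# From a server in the VPC: pings either time out or report
# "Destination Net Unreachable" when they land on the "wrong" leaf
ping -c 5 203.0.113.1
traceroute 203.0.113.1

# On each MCLAG leaf, compare the routes in the VPC VRF (VRF name is
# illustrative); only one leaf knowing the prefix confirms the mismatch
vtysh -c 'show ip route vrf VrfVvpc-01 203.0.113.0/24'
```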

#### Known workarounds

Connect your externals to non-MCLAG switches instead.

### MCLAG leaf with no surviving spine connection will blackhole traffic

When a leaf switch in an MCLAG pair loses all of its uplink connections to the spines and the
related BGP sessions go down, it will stop advertising and receiving
EVPN routes. This leads to blackholing of traffic for endpoints connected to the
isolated leaf, as the rest of the fabric no longer has reachability information for
those endpoints, even though the MCLAG peering session is up.

#### Diagnosing this issue

Traffic destined for endpoints connected to the leaf is blackholed. All BGP sessions
from the affected leaf towards the spines are down.
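
You can verify the session state from the affected switch itself. A minimal check using the FRR shell shipped with SONiC (peer names and counts will vary per fabric):

```bash
# On a healthy leaf all spine-facing sessions are Established; on an
# isolated MCLAG leaf they sit in Idle, Connect, or Active instead
vtysh -c 'show bgp summary'
```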

#### Known workarounds

None.