# Known Issues

The following is a list of current limitations of the Fabric, which we are
working hard to address.

### Deleting a VPC and creating a new one right away can cause the agent to fail

This issue is due to limitations in SONiC's gNMI interface, where operations still have to
be serialized in a certain order. In this particular case, deleting a VPC and creating a new
one back-to-back can lead to the deleted VPC's VNI being reused before the deletion has
taken effect on the switch.
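
For illustration only, the race can be triggered by a back-to-back sequence like the one
sketched below, where the VPC names and manifest file are placeholders: the new VPC may end
up being assigned the VNI that the switch has not yet finished cleaning up.

```
# Illustrative trigger (names and manifest are placeholders): the create races
# the cleanup of the deleted VPC's VNI on the switch
kubectl delete vpc vpc-old
kubectl apply -f vpc-new.yaml
```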

#### How to diagnose this issue

The applied generation on the affected agent, as reported by kubectl, will not
converge to the latest desired generation (see the sketch after the log excerpt
below for one way to check this). Additionally, the agent logs on the switch
(accessible in `/var/log/agent.log`) will contain an error similar to the following one:

```
time=2025-03-23T12:26:19.649Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = VNI is already used in VRF VrfVvpc-02"
```
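
One way to check the generations, sketched below, is to compare the desired and applied
generation reported on the switch's Agent object from the control node; the object name
`switch-1` and the `status.lastAppliedGen` field path are assumptions and may differ in
your installation.

```
# Sketch: compare desired vs. applied generation for the affected switch
# (object name and status field path are assumptions)
kubectl get agent switch-1 -o wide

kubectl get agent switch-1 \
  -o jsonpath='desired={.metadata.generation} applied={.status.lastAppliedGen}{"\n"}'
```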

#### Known workarounds

Deleting the pending VPCs will allow the agent to reconverge. After that, the
desired VPCs can be safely created.
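
For example, assuming the stuck VPC is called `vpc-2` and its manifest lives in
`vpc-2.yaml` (both placeholder names), the recovery could look like this:

```
# Remove the VPC(s) stuck pending so the agent can reconverge
kubectl delete vpc vpc-2

# Wait for the agent's applied generation to catch up with the desired one,
# then re-create the VPC from its manifest
kubectl apply -f vpc-2.yaml
```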

### VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch

As explained in the [Architecture page](../architecture/fabric.md#vpc-peering), to work around
limitations in SONiC, local VPCPeering is implemented over a pair of loopback interfaces.
This workaround requires subinterface support on the switch where the VPCPeering is being
instantiated. If the affected switch does not meet this requirement, the agent will fail
to apply the desired configuration.

#### How to diagnose this issue

The applied generation on the affected agent, as reported by kubectl, will not
converge to the latest desired generation. Additionally, the agent logs on the switch
(accessible in `/var/log/agent.log`) will contain an error similar to the following one:

```
time=2025-02-04T13:37:58.675Z level=DEBUG msg=Action idx=90 weight=33 summary="Create Subinterface Base 101" command=update path="/interfaces/interface[name=Ethernet16]/subinterfaces/subinterface[index=101]"
time=2025-02-04T13:37:58.796Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = SubInterfaces are not supported"
```

#### Known workarounds

Configure remote VPCPeering wherever local peering would otherwise land on a switch that does
not support subinterfaces. You can double-check whether your switch model supports them by
looking at the [Switch Profiles Catalog](../reference/profiles.md) entry for it.
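
As a rough illustration, a remote peering is declared by pointing the VPCPeering at a switch
group that supports subinterfaces; the VPC names, the `border` switch group, the API version,
and the field layout below are all assumptions and may differ between releases, so consult
the user guide for the exact schema.

```
# Sketch of a remote VPCPeering (names, switch group, API version, and field
# layout are assumptions)
kubectl apply -f - <<EOF
apiVersion: vpc.githedgehog.com/v1beta1
kind: VPCPeering
metadata:
  name: vpc-1--vpc-2
spec:
  remote: border
  permit:
    - vpc-1: {}
      vpc-2: {}
EOF
```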

### External Peering over a connection originating from an MCLAG switch can fail

When importing routes via [External Peering](../user-guide/external.md) over a connection
originating from an MCLAG leaf switch, traffic from the peered VPC towards the imported
prefix can be blackholed. This is due to a routing mismatch between the two MCLAG leaves,
where only one switch learns the appropriate route to the imported prefix. If the originating
traffic hits the "wrong" leaf, it is dropped with a Destination Unreachable error.

#### How to diagnose this issue

There is no connectivity from the workload server(s) in the peered VPC towards the prefix
routed via the external.
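
For example, from a workload server in the peered VPC, probes towards the imported prefix
fail, typically with a Destination Unreachable reported by the MCLAG leaf that is missing
the route; the address below is a placeholder inside that prefix.

```
# From a workload server in the peered VPC (203.0.113.10 is a placeholder
# address inside the prefix imported via the external)
ping -c 3 203.0.113.10
traceroute 203.0.113.10
```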

#### Known workarounds

Connect your externals to non-MCLAG switches instead.

### MCLAG leaf with no surviving spine connection will blackhole traffic

When a leaf switch in an MCLAG pair loses its last uplink to the spines, its BGP sessions
towards the spines go down, causing the leaf to stop advertising and receiving EVPN routes.
This leads to blackholing of traffic for endpoints connected to the isolated leaf, as the rest
of the fabric no longer has reachability information for those endpoints, even though the
MCLAG peering session is up.

#### How to diagnose this issue

Traffic destined for endpoints connected to the leaf is blackholed. All BGP sessions from the
affected leaf towards the spines are down.
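
Assuming CLI access to the affected leaf, the underlay BGP state can also be checked directly
on the switch, for example:

```
# On the isolated MCLAG leaf: all sessions towards the spines show as down
show ip bgp summary

# The same information is available from FRR
vtysh -c "show bgp summary"
```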

#### Known workarounds

None.

### Cannot set PortSpeed for PortChannels

TBD

#### How to diagnose this issue

TBD

#### Known workarounds

TBD