
Commit 53ef16e

docs: add known issues
Signed-off-by: Emanuele Di Pascale <[email protected]>
1 parent 8b86187

File tree: 2 files changed (+84, -0 lines)


docs/.pages (+1)

@@ -9,6 +9,7 @@ nav:
   - Reference: reference
   - Architecture: architecture
   - Troubleshooting: troubleshooting
+  - Known Issues: known-issues
   - FAQ: faq
   - ...
   - release-notes

docs/known-issues/known-issues.md (new file, +83)
# Known Issues

The following is a list of current limitations of the Fabric, which we are
working hard to address:

* [Deleting a VPC and creating a new one right away can cause the agent to fail](#deleting-a-vpc-and-creating-a-new-one-right-away-can-cause-the-agent-to-fail)
* [VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch](#vpc-local-peering-can-cause-the-agent-to-fail-if-subinterfaces-are-not-supported-on-the-switch)
* [External peering over a connection originating from an MCLAG switch can fail](#external-peering-over-a-connection-originating-from-an-mclag-switch-can-fail)
* [MCLAG leaf with no surviving spine connection will blackhole traffic](#mclag-leaf-with-no-surviving-spine-connection-will-blackhole-traffic)
### Deleting a VPC and creating a new one right away can cause the agent to fail

The issue is due to limitations in SONiC's gNMI interface. In this particular case,
deleting and creating a VPC back-to-back (e.g. using a script or the golang API)
can lead to the new VPC reusing the deleted VPC's VNI before the deletion has taken effect.
#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

><code>time=2025-03-23T12:26:19.649Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = VNI is already used in VRF VrfVvpc-02"</code>
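
For scripted checks, the following is a minimal sketch that compares each agent's desired generation (`metadata.generation`) with the generation it reports as applied, using the Kubernetes dynamic client. The `agent.githedgehog.com/v1beta1` group/version, the `agents` resource name, the `default` namespace and the `status.lastAppliedGen` field name are assumptions for illustration and may need to be adjusted to match your installation.

```go
// Convergence check sketch: for every Agent object, compare the desired
// generation (metadata.generation) with the generation the agent reports as
// applied. The GVR, namespace and status field name are assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed group/version/resource for the Agent CRD; adjust if needed.
	agentGVR := schema.GroupVersionResource{Group: "agent.githedgehog.com", Version: "v1beta1", Resource: "agents"}

	list, err := client.Resource(agentGVR).Namespace("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		desired := item.GetGeneration()
		// "status.lastAppliedGen" is an assumed field name for the applied generation.
		applied, found, _ := unstructured.NestedInt64(item.Object, "status", "lastAppliedGen")
		if !found || applied < desired {
			fmt.Printf("agent %s has not converged: desired=%d applied=%d\n", item.GetName(), desired, applied)
		}
	}
}
```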
#### Known workarounds

Deleting the pending VPCs will allow the agent to reconverge. After that, the
desired VPCs can be safely created.
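
When the delete/create sequence is driven from a script or the golang API, the race can be avoided by waiting for the old VPC object to be fully removed before creating its replacement. The sketch below illustrates the idea with the Kubernetes dynamic client; the `vpc.githedgehog.com/v1beta1` group/version, the `vpcs` resource name, the `default` namespace and the object name are assumptions for illustration.

```go
// Sketch: delete a VPC and poll until it is fully gone before creating a
// replacement, so the old VNI is not reused while the deletion is pending.
// The GVR, namespace and object name are assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed group/version/resource for the VPC CRD; adjust if needed.
	vpcGVR := schema.GroupVersionResource{Group: "vpc.githedgehog.com", Version: "v1beta1", Resource: "vpcs"}
	vpcs := client.Resource(vpcGVR).Namespace("default")

	ctx := context.Background()
	name := "vpc-01" // hypothetical VPC name

	// Delete the old VPC ...
	if err := vpcs.Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		panic(err)
	}

	// ... and poll until the object is actually gone before creating the
	// replacement, instead of issuing the two requests back-to-back.
	for {
		_, err := vpcs.Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			break
		}
		if err != nil {
			panic(err)
		}
		time.Sleep(2 * time.Second)
	}
	fmt.Println("old VPC removed; create the replacement now")
}
```

Note that the object disappearing from the API only means the deletion has been accepted by Kubernetes; giving the affected agents time to reconverge (see the generation check above) before reusing the same VNI is the safer option.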
### VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch

As explained in the [Architecture page](../architecture/fabric.md#vpc-peering), to work around
limitations in SONiC, local VPCPeering is implemented over a pair of loopback interfaces.
This workaround requires subinterface support on the switch where the VPCPeering is being
instantiated. If the affected switch does not meet this requirement, the agent will fail
to apply the desired configuration.
#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

><code>time=2025-02-04T13:37:58.796Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = SubInterfaces are not supported"</code>
#### Known workarounds

Configure remote VPCPeering instead of local peering in any case where the target switch
does not support subinterfaces. You can double-check whether your switch model supports them
by looking at the [Switch Profiles Catalog](../reference/profiles.md) entry for it.
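
If your installation also exposes the profiles as SwitchProfile objects in the cluster, a scripted check along the lines of the sketch below may be convenient. The `wiring.githedgehog.com/v1beta1` group/version, the `switchprofiles` resource name and the `spec.features.subinterfaces` field path are assumptions; treat the Switch Profiles Catalog as the authoritative source.

```go
// Hypothetical check of subinterface support via SwitchProfile objects.
// The GVR and the spec.features.subinterfaces field path are assumptions;
// the Switch Profiles Catalog in the docs is the authoritative reference.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed group/version/resource for switch profiles; adjust to your installation.
	profileGVR := schema.GroupVersionResource{Group: "wiring.githedgehog.com", Version: "v1beta1", Resource: "switchprofiles"}

	list, err := client.Resource(profileGVR).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		// "spec.features.subinterfaces" is an assumed field path.
		supported, found, _ := unstructured.NestedBool(item.Object, "spec", "features", "subinterfaces")
		fmt.Printf("profile %s: subinterfaces supported=%v (field found=%v)\n", item.GetName(), supported, found)
	}
}
```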
### External peering over a connection originating from an MCLAG switch can fail

When importing routes via [External Peering](../user-guide/external.md) over a connection
originating from an MCLAG leaf switch, traffic from the peered VPC towards that
prefix can be blackholed. This is due to a routing mismatch between the two MCLAG leaves,
where only one switch learns the imported route. Packets hitting the "wrong" leaf will
be dropped with a Destination Unreachable error.
#### Diagnosing this issue

There is no connectivity from the workload server(s) in the VPC towards the prefix routed via the external.

#### Known workarounds

Connect your externals to non-MCLAG switches instead.
### MCLAG leaf with no surviving spine connection will blackhole traffic

When a leaf switch in an MCLAG pair loses all of its uplink connections to the spines and the
related BGP sessions go down, it will stop advertising and receiving
EVPN routes. This leads to blackholing of traffic for endpoints connected to the
isolated leaf, as the rest of the fabric no longer has reachability information for
those endpoints, even though the MCLAG peering session is up.
#### Diagnosing this issue

Traffic destined for endpoints connected to the leaf is blackholed. All BGP sessions
from the affected leaf towards the spines are down.

#### Known workarounds

None.
