Commit ba21114

docs: add known issues
Signed-off-by: Emanuele Di Pascale <[email protected]>
1 parent 39cf21f commit ba21114

2 files changed: +89 -0 lines changed

docs/.pages

+1
@@ -9,6 +9,7 @@ nav:
   - Reference: reference
   - Architecture: architecture
   - Troubleshooting: troubleshooting
+  - Known Issues: known-issues
   - FAQ: faq
   - ...
   - release-notes

docs/known-issues/known-issues.md

+88
@@ -0,0 +1,88 @@
# Known Issues

The following is a list of current limitations of the Fabric, which we are
working hard to address:

* [Deleting a VPC and creating a new one right away can cause the agent to fail](#deleting-a-vpc-and-creating-a-new-one-right-away-can-cause-the-agent-to-fail)
* [VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch](#vpc-local-peering-can-cause-the-agent-to-fail-if-subinterfaces-are-not-supported-on-the-switch)
* [External peering over a connection originating from an MCLAG switch can fail](#external-peering-over-a-connection-originating-from-an-mclag-switch-can-fail)
* [MCLAG leaf with no surviving spine connection will blackhole traffic](#mclag-leaf-with-no-surviving-spine-connection-will-blackhole-traffic)

### Deleting a VPC and creating a new one right away can cause the agent to fail

The issue is due to limitations in SONiC's gNMI interface. In this particular case,
deleting a VPC and creating a new one back-to-back (e.g. using a script or the golang API)
can lead to the deleted VPC's VNI being reused before the deletion has taken effect.

#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

```
time=2025-03-23T12:26:19.649Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = VNI is already used in VRF VrfVvpc-02"
```
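
For example, one quick way to compare the desired and applied generations is via kubectl. This is only a sketch: the Agent name (`leaf-01`), the `fab` namespace, and the `status.lastAppliedGen` field are assumptions that may differ in your installation.

```
# placeholder names: "leaf-01", the "fab" namespace and the
# status.lastAppliedGen field may differ in your release
kubectl get agent leaf-01 -n fab \
  -o jsonpath='desired={.metadata.generation} applied={.status.lastAppliedGen}{"\n"}'
```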

#### Known workarounds

Deleting the pending VPCs will allow the agent to reconverge. After that, the
desired VPCs can be safely created.
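
When scripting the recovery, it also helps to make sure the deletion has fully propagated before creating the replacement. A minimal sketch, assuming the VPC CRD is exposed to kubectl as `vpc` and using placeholder names (`vpc-02`, `new-vpc.yaml`):

```
# delete the stuck VPC and wait until it is gone from the API
kubectl delete vpc vpc-02 --wait

# confirm the affected agents have converged (see the generation check above)
# before creating the replacement VPC, so the freed VNI is not reused too early
kubectl apply -f new-vpc.yaml
```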

### VPC local peering can cause the agent to fail if subinterfaces are not supported on the switch

As explained in the [Architecture page](../architecture/fabric.md#vpc-peering), to work around
limitations in SONiC, local VPCPeering is implemented over a pair of loopback interfaces.
This workaround requires subinterface support on the switch where the VPCPeering is being
instantiated. If the affected switch does not meet this requirement, the agent will fail
to apply the desired configuration.

#### Diagnosing this issue

The applied generation of the affected agent reported by kubectl will not
converge to the last desired generation. Additionally, the agent logs on the switch
(accessible at `/var/log/agent.log`) will contain an error similar to the following one:

```
time=2025-02-04T13:37:58.675Z level=DEBUG msg=Action idx=90 weight=33 summary="Create Subinterface Base 101" command=update path="/interfaces/interface[name=Ethernet16]/subinterfaces/subinterface[index=101]"
time=2025-02-04T13:37:58.796Z level=ERROR msg=Failed err="failed to run agent: failed to process agent config from k8s: failed to process agent config loaded from k8s: failed to apply actions: GNMI set request failed: gnmi set request failed: rpc error: code = InvalidArgument desc = SubInterfaces are not supported"
```
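
The generation check shown for the previous issue applies here as well; in addition, grepping the agent log on the switch for the failing action quickly confirms this specific case (the log path is the one mentioned above):

```
grep -i subinterface /var/log/agent.log | tail -n 5
```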

#### Known workarounds

Configure remote VPCPeering wherever local peering would target a switch not supporting
subinterfaces. You can double-check whether your switch model supports them by looking at
the [Switch Profiles Catalog](../reference/profiles.md) entry for it.
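
If the profiles are also available in your cluster, you can inspect them directly with kubectl. This is a sketch only: the `switchprofile` resource name, the example profile `celestica-ds3000`, and the exact field layout are assumptions, so treat the catalog page above as the authoritative source.

```
# dump the profile and look for the subinterface feature flag
kubectl get switchprofile celestica-ds3000 -o yaml | grep -i -A 1 subinterface
```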

### External peering over a connection originating from an MCLAG switch can fail

When importing routes via [External Peering](../user-guide/external.md) over a connection
originating from an MCLAG leaf switch, traffic from the peered VPC towards the imported
prefix can be blackholed. This is due to a routing mismatch between the two MCLAG leaves,
where only one switch learns the imported route. Packets hitting the "wrong" leaf will
be dropped with a Destination Unreachable error.

#### Diagnosing this issue

There is no connectivity from the workload server(s) in the VPC towards the prefix routed via the external.
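
A quick way to confirm the symptom from one of the workload servers (203.0.113.10 is just a placeholder for an address inside the externally routed prefix):

```
# repeated probes typically fail, with traceroute stopping at the MCLAG leaf
# that did not learn the imported route
ping -c 3 203.0.113.10
traceroute 203.0.113.10
```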

#### Known workarounds

Connect your externals to non-MCLAG switches instead.

### MCLAG leaf with no surviving spine connection will blackhole traffic

When a leaf switch in an MCLAG pair loses its last uplink towards the spines, its BGP
sessions to the spines go down, causing the leaf to stop advertising and receiving
EVPN routes. This leads to blackholing of traffic for endpoints connected to the
isolated leaf, as the rest of the fabric no longer has reachability information for
those endpoints, even though the MCLAG peering session is up.

#### Diagnosing this issue

Traffic destined for endpoints connected to the leaf is blackholed. All BGP sessions
from the affected leaf towards the spines are down.
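
One way to confirm the second symptom is to check the BGP state on the affected leaf, for example from the routing CLI on the switch: both the underlay sessions towards the spines and the EVPN overlay sessions should show as down. This is a sketch; exact command names and how you reach the switch CLI depend on the SONiC build in use.

```
show ip bgp summary
show bgp l2vpn evpn summary
```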

#### Known workarounds

None.
