NOTE: If you are having problems, first consult our known issues doc.
This guide covers troubleshooting problems between host and container networking, primarily when using the Silk CNI. Some concepts apply directly to other CNIs; others might require slight adaptations.

---
## Discovering All CF Networking Logs

All cf-networking components log lines are prefixed with `cfnetworking` (no hyphen) followed by the component name. To find all CF Networking logs, run:

```
grep -r cfnetworking /var/vcap/sys/log/*
```
The log lines for the following components will be returned:

- `silk-daemon` (from silk-release)
- `silk-controller` (from silk-release)
- `vxlan-policy-agent` (from silk-release)
- `policy-server`
- `policy-server-internal`
- `netmon` (from silk-release)
- `silk-cni` (from silk-release)
- `garden` (from garden-runc-release)
The log lines for the following components will not be returned:

- `garden-external-networker`: Garden will only print errors if creating a container fails.
- `cni-wrapper-plugin` (from silk-release): Only stdout and stderr are printed in garden logs. However, the call to the underlying silk-cni will log to `/var/vcap/sys/log/silk-cni/silk-cni.stdout.log`. It defaults to only logging error messages, so debug logging may need to be enabled.
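To focus on a single component, you can filter the combined grep output further. A minimal sketch, using `silk-daemon` as an example component name from the list above:

```
grep -r cfnetworking /var/vcap/sys/log/* | grep silk-daemon
```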
Most components log at the `info` level by default. In many cases, the log level can be adjusted at runtime by making a request to the debug server of the component running on the VM.

For example, to enable debug logs for the policy-server, ssh onto the VM and make this request to its debug server:

```
curl -X POST -d 'DEBUG' localhost:31821/log-level
```

To switch back to info logging, make this request:

```
curl -X POST -d 'INFO' localhost:31821/log-level
```
This procedure can be used on the following jobs using the default (or overridden) debug port:
| Job | Default Debug Port | Property to Override |
|---|---|---|
| policy-server | 31821 | `debug_port` |
| policy-server-internal | 31945 | `debug_port` |
| policy-server-asg-syncer | - | `log_level` (job must be restarted for changes to take effect) |
| silk-daemon | 22233 | `debug_port` |
| silk-controller | 46455 | `debug_port` |
| vxlan-policy-agent | 8721 | `debug_server_port` |
| netmon | - | `log_level` (job must be restarted for changes to take effect) |
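For instance, following the same pattern as the policy-server example above, debug logging on the vxlan-policy-agent can be toggled through its default debug port (adjust the port if `debug_server_port` has been overridden):

```
curl -X POST -d 'DEBUG' localhost:8721/log-level
```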
To enable debug logging for the `silk-cni`, create the `/var/vcap/jobs/silk-cni/config/enable_debug` file on the diego-cell VM. Subsequent `silk-cni` calls will then log with debugging.
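A minimal sketch of creating that flag file (run as root on the diego-cell):

```
touch /var/vcap/jobs/silk-cni/config/enable_debug
```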
NOTE: Be cautious when enabling debugging on the networking components. There will be a substantial increase in disk usage due to the volume of logs being written.
Problems applying egress iptables rules for silk-based containers will show up in the vxlan-policy-agent logs at `/var/vcap/sys/log/vxlan-policy-agent/vxlan-policy-agent.stdout.log`.

If iptables rule application is successful, but the rules are incorrect, also check the `policy-server-internal` logs on the VM(s) hosting that server at `/var/vcap/sys/log/policy-server/policy-server-internal.stdout.log`, as well as the `policy-server-asg-syncer` logs in `/var/vcap/sys/log/policy-server-asg-syncer/policy-server-asg-syncer.stdout.log`.
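A quick way to scan those logs for failures (a sketch; the exact error strings vary by component and version):

```
grep -i error /var/vcap/sys/log/vxlan-policy-agent/vxlan-policy-agent.stdout.log
grep -i error /var/vcap/sys/log/policy-server-asg-syncer/policy-server-asg-syncer.stdout.log
```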
If container create is failing, check the garden logs, located on the cell VMs at `/var/vcap/sys/log/garden/garden.stdout.log`. Garden logs stdout and stderr from calls to the CNI plugin, so you can find any errors related to the CNI ADD/DEL there.

Search for `external-network` or `CNI`, and look for messages related to setting up the container. There will also likely be results for failures to tear down the container - ignore those. Garden will attempt to destroy any failed resources it might have created, so if the create failed, this destroy will also likely fail. Focus on the initial create. An unsuccessful create will say things like `exit status 1` in the `stderr` field of the log message.
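A sketch of that search, combining the two keywords above with a filter for failed creates:

```
grep -E 'external-network|CNI' /var/vcap/sys/log/garden/garden.stdout.log | grep 'exit status'
```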
This is likely an issue with the ASGs assigned to the container. It could indicate one of the following problems with Dynamic ASGs:

- ASG rules not being what is necessary for the traffic to succeed.
  - Review the iptables rules on the container, and compare them with the egress traffic that is failing. Add to or adjust the ruleset in the ASG definitions, restart the app (unless dynamic ASGs are enabled), and try again.
- Not having the correct ASGs associated with the app's space.
  - Review the global ASGs for the container type (running or staging), as well as the ASGs applied to the container's space (again, either for running or staging). Create or apply the necessary ASGs for the failing traffic.
- Not getting the correct iptables rules applied.
  - Investigate the logs for `policy-server-asg-syncer` and `vxlan-policy-agent` for issues encountered syncing rule data or applying iptables rules.
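To review the egress rules currently applied on the cell, you can dump the filter table and look at the per-container chains. A sketch, assuming the silk-release convention that egress chains contain `netout` in their names (verify against your release version):

```
iptables -S | grep -i netout
```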
This is likely an issue upstream from the cell, related to `cf ssh`. There is a small chance that this is related to container networking. See debugging cf ssh for troubleshooting steps.
This case is extremely rare, but might be seen when running on an untested/beta stemcell. There is definitely a problem with the networking for the application container(s). Debug host-side networking and container-side networking to determine what the issue is, and open an issue against the container's CNI (e.g. silk-release).
1. Run `cf space-ssh-allowed` and `cf ssh-enabled <app>` to ensure that app ssh is enabled. If not, use the `cf` CLI to enable SSH for the app.
2. `bosh ssh` into the diego-cell hosting the container.
3. Use `cfdot actual-lrps` and `jq` to find the `instance_id` of the container you are trying to SSH into.
4. Use `iptables -S -t nat | grep <first-octet-of-container-instance-id>` to identify the port-forwarding rules that translate the ssh traffic between the host's `--dport` (>61000) and `--to-destination 10.255.x.x:61002` or `2222`. Note the value of the `--dport` flag.
5. Run a tcpdump to determine if the cell receives any traffic on the port determined above (see the sketch after this list). If not, the issue is likely upstream of cf-networking. If there is traffic received, there is likely an issue with the container.
6. For issues upstream, review diego's `ssh_proxy` job logs, and the configuration of the load balancer used for app SSH.
7. If step 5 determined that SSH traffic was reaching the host but not going through, debug host-side networking and container-side networking.
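A sketch of the capture in step 5, using the `--dport` value found in step 4:

```
tcpdump -ni any 'tcp port <dport-from-step-4>'
```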
## Debugging Host-Side Networking when using silk-release
1. `bosh ssh` into the diego-cell hosting the container.
2. Use `cfdot actual-lrps` and `jq` to find the `instance_guid` and `instance_address` of the container you are trying to SSH into.
3. Use `ip addr show | grep -B3 <container ip>` to find the interface name, MAC address, and namespace id of the host-side interface. Validate that the MAC address begins with `aa:aa:<hex-encoded-container-ip>`. Validate that the interface name matches `s-<zero-padded-container-ip>`. For example:

   ```
   $ ip addr show | grep -B2 10.255.211.40
   64376: s-010255211040@if64375: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default
       link/ether aa:aa:0a:ff:d3:28 brd ff:ff:ff:ff:ff:ff link-netnsid 0
       inet 169.254.0.1 peer 10.255.211.40/32 scope link s-010255211040
   ```

   The namespace id is `0`, obtained from `link-netnsid 0`. `s-010255211040` is the interface name, and `aa:aa:0a:ff:d3:28` is the MAC address (`0a` is hex for `10`, `ff` is hex for `255`, `d3` is hex for `211`, and `28` is hex for `40`; a helper for computing these values appears after this list). If the interface name does not match the IP, or the MAC does not match `aa:aa:<hex-encoded-ip-addr>`, something is wrong with the way the overlay bridge was set up by the `silk-cni` binary. Review `silk-cni` logs for any errors, or enable debugging on `silk-cni` for more information.
4. Use `arp -a | grep <container ip>` to validate that the ARP table has an entry pointing the container IP through the interface name obtained above, using a MAC addr of `ee:ee:<hex-encoded-container-ip>`. If this entry is incorrect or missing, there was an issue in `silk-cni` setting up the overlay bridge. Review `silk-cni` logs for any errors, or enable debugging on `silk-cni` for more information.
5. The IP address of the `s-<zero-padded-container-ip>` interface should be 169.254.0.1, and should ALWAYS match the default gateway defined inside the container (see debugging container-side networking).
6. If everything else looks good, validate that the namespace ID for the container processes matches the namespace ID for the host's `s-<zero-padded-container-ip>` interface:
   1. Run `ps -awxfu | less` to get a full host process-tree. Search the output for the container's `instance_guid` to find the parent `gdn` process. Scan down the `gdn` process's tree to find child processes for `diego-sshd`, `envoy`, and the app process. Note the process IDs of these three processes (second column of the output).
   2. Validate that all three processes share the same networking namespace inode reference by running `ls -l /proc/<pid>/ns/net`. It should show up as a link to `net:[<namespace inode>]`.
   3. Confirm that the namespace inode matches the namespace id obtained from the `s-<zero-padded-container-ip>` interface above: `lsns -l -t net | egrep 'NETNSID|<namespace inode>'`. The NETNSID column should reflect the namespace ID of the interface.
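As a convenience, here is a small bash sketch (not part of the release tooling) that computes the expected interface name and MAC for a given container IP, matching the encoding described in step 3:

```bash
#!/bin/bash
# Compute the expected host-side interface name and MAC for a container IP.
ip=10.255.211.40                               # example container IP from the doc
set -- ${ip//./ }                              # split the four octets into $1..$4
printf 's-%03d%03d%03d%03d\n' "$1" "$2" "$3" "$4"         # -> s-010255211040
printf 'aa:aa:%02x:%02x:%02x:%02x\n' "$1" "$2" "$3" "$4"  # -> aa:aa:0a:ff:d3:28
# The container-side interface uses the same hex octets with an ee:ee: prefix.
```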
## Debugging Container-Side Networking when using silk-release
1. `bosh ssh` into the diego-cell hosting the container.
2. Run `ps -awxfu | less` to get a full host process-tree. Search the output for the container's `instance_guid` to find the parent `gdn` process. Scan down the `gdn` process's tree to find child processes for `diego-sshd`, `envoy`, and the app process. Note the process IDs of these three processes (second column of the output).
3. Validate that all three processes share the same networking namespace inode reference by running `ls -l /proc/<pid>/ns/net`. It should show up as a link to `net:[<namespace inode>]`.
4. Enter a bash shell as root in the container namespaces with `nsenter -t <app-pid> -a bash`.
5. Use `ip addr show` to validate that there is an interface named `c-<zero-padded-container-ip>`, with a MAC address of `ee:ee:<hex-encoded-ip-addr>`.
6. Use `arp -a` to validate that an entry exists for 169.254.0.1 (or the `s-<zero-padded-container-ip>` IP addr found when debugging host-side networking), and that the entry points to the same `aa:aa:<hex-encoded-ip-addr>` MAC address of that interface.
7. Use `netstat -rn` to ensure the default gateway of the container points to the IP addr listed in the `arp -a` output. (A non-interactive sketch of steps 5-7 follows this list.)
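If an interactive shell is inconvenient, the same checks can be run from the host by entering only the network namespace. A sketch, where `<app-pid>` comes from step 2 and the `ip` subcommands stand in for `arp`/`netstat` (which may not be installed in the container image):

```
nsenter -t <app-pid> -n ip addr show    # expect c-<zero-padded-ip> with an ee:ee:... MAC
nsenter -t <app-pid> -n ip neigh show   # expect 169.254.0.1 resolving to the aa:aa:... MAC
nsenter -t <app-pid> -n ip route show   # expect "default via 169.254.0.1"
```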