NOTE: If you are having problems, first consult our known issues doc.
If you are seeing latency issues with container-to-container networking, please see our debugging guide.
- Discovering All CF Networking Logs:
  All cf-networking component log lines are prefixed with `cfnetworking` (no hyphen) followed by the component name. To find all CF Networking logs, run `grep -r cfnetworking /var/vcap/sys/log/*` (see below for narrowing the search to a single component).
  The log lines for the following components will be returned:
  - `silk-daemon` (from silk-release)
  - `silk-controller` (from silk-release)
  - `vxlan-policy-agent` (from silk-release)
  - `policy-server`
  - `policy-server-internal`
  - `netmon` (from silk-release)
  The log lines for the following components will not be returned:
  - `iptables` (from silk-release): we have limited room to add an identifier, and these logs are high-volume.
  - `garden-external-networker`: Garden will only print errors if creating a container fails.
  - `cni-wrapper-plugin` (from silk-release): only stdout and stderr are printed in the garden logs.
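  To narrow the search to a single component, match on the prefix plus the component name. A minimal sketch (the pattern is deliberately loose about the separator between the prefix and the component name):

  # Show only vxlan-policy-agent log lines from all job logs on this VM
  grep -r 'cfnetworking.*vxlan-policy-agent' /var/vcap/sys/log/*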
- Container Create is Failing:
  If container create is failing, check the garden logs, located on the cell VMs at `/var/vcap/sys/log/garden/garden.stdout.log`. Garden logs stdout and stderr from calls to the CNI plugin, so any errors related to CNI ADD/DEL calls can be found there. An unsuccessful create will report something like `exit status 1` in the `stderr` field of the log message.
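  To surface these failures quickly, you can grep the garden log for that error text. A sketch; the exact error message varies by failure mode:

  # Show failed CNI calls; failures include "exit status 1" in the stderr field
  grep 'exit status 1' /var/vcap/sys/log/garden/garden.stdout.log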
- Problems Creating Policies:
  Problems creating policies are usually related to issues on the policy server VM(s). Check the logs at `/var/vcap/sys/log/policy-server/policy-server.stdout.log`.
  If a policy is successfully created, you will see a log line with the message `created-policies`, along with other relevant data. If a policy is successfully deleted, you will see a log line with the message `deleted-policies`, along with other relevant data.
  The policy server logs at the `info` level by default. The log level can be adjusted at runtime by making a request to the debug server running on the VM.
  To enable debug logging, ssh to the VM and make this request to the debug server:
  curl -X POST -d 'DEBUG' localhost:31821/log-level
  To switch back to info logging, make this request:
  curl -X POST -d 'INFO' localhost:31821/log-level
  For the policy server, the debug server listens on port 31821 by default; this can be overridden with the manifest property `cf_networking.policy_server.debug_port`.
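  To confirm that policy creates and deletes are actually reaching the policy server, you can grep its log for those messages. A sketch, using the log path above:

  # Show recent policy create/delete events recorded by the policy server
  grep -E 'created-policies|deleted-policies' /var/vcap/sys/log/policy-server/policy-server.stdout.log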
CF networking components emit metrics which can be consumed from the firehose, e.g. with the Datadog firehose nozzle. Relevant metrics have these prefixes:
- `policy_server`
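One quick way to spot-check these metrics from a workstation is the firehose plugin for the cf CLI, if it is installed. This is only a sketch; flag names may differ between plugin versions:

# Stream ValueMetric events from the firehose and keep only policy_server metrics
cf nozzle --filter ValueMetric | grep policy_server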
This section describes how to recover from a deploy of CF Networking with Silk in which the configured overlay network conflicts with the entire CF subnet. In this example we set `network` on the `silk-controller` job to the same subnet as CF and BOSH (10.0.0.0/16). When we deploy, the first diego cell fails to come up:
17:31:56 | Updating instance diego-cell: diego-cell/4abb639b-33a9-4d8d-8a95-21c3863c7b0c (0) (canary) (00:03:25)
L Error: Timed out sending 'get_state' to baa8bc66-df64-4c2f-80d0-c090559ac28d after 45 seconds
17:35:22 | Error: Timed out sending 'get_state' to baa8bc66-df64-4c2f-80d0-c090559ac28d after 45 seconds
`bosh vms` shows:
$ bosh vms
Using environment 'https://104.196.19.37:25555' as client 'admin'
Task 44. Done
Deployment 'cf'
Instance Process State AZ IPs VM CID VM Type
api/45e92d63-905b-4659-8400-3df97b97850b running z1 10.0.16.10 vm-bd55a96c-e4b4-4a33-53d0-7297e63d5bed m3.large
api/f381c70a-58b7-4972-96d0-5a717122ce1e running z2 10.0.32.8 vm-3579ec81-6cad-4e44-42f7-170545c11dc2 m3.large
blobstore/a4c314e9-1b79-46bc-b572-c57e1be563b3 running z1 10.0.16.9 vm-6a787749-9f7b-4fb7-7565-2f0b66ed1794 m3.large
cc-bridge/1999c3c2-7b4c-4ac0-8b43-189561edc7d6 running z1 10.0.16.16 vm-ee138dd7-e01c-473b-402e-2a5896746e1f m3.medium
cc-bridge/fb02bc5b-5b75-4e3b-8e1c-451b4baaafe6 running z2 10.0.32.13 vm-5ddb0e3d-6c44-4bd0-5232-04c5d28e9151 m3.medium
cc-clock/d2a40e2f-b93f-4821-8d9e-b5e7f2fafec5 running z1 10.0.16.15 vm-a94c4a98-ece2-4ad5-68f5-aa53f4c988c5 m3.large
cc-worker/ed3655bf-0d6a-4ea7-bd21-060d7c7146b7 running z1 10.0.16.11 vm-f5f3f577-80fe-43b9-554f-7ea55f954c82 m3.medium
cc-worker/feb22fc9-a701-4dff-9c44-e920a03d8520 running z2 10.0.32.9 vm-8465f4e3-d3c5-453a-7b88-0392e3efd160 m3.medium
consul/00d7d425-6291-4f8a-8313-7deae2fbd3c9 running z2 10.0.32.4 vm-d61b76e9-45ae-432b-504d-f3290f7b09e6 m3.medium
consul/906080b6-28d2-411f-9a96-88d1556f0d82 running z3 10.0.48.4 vm-0ba6eed9-20fa-4375-53ac-e6911fd60b49 m3.medium
consul/d2c6612c-cad6-412d-bac7-ac35c08895c5 running z1 10.0.16.4 vm-a03136b0-e769-468a-6848-2ab8347782d1 m3.medium
diego-bbs/800ab3a6-529a-476d-ab07-95ee5847ad8f running z1 10.0.16.7 vm-c96eec32-b644-4c42-7e46-90a054eb6bbb m3.large
diego-bbs/c99ce96a-1e4e-4e33-b972-3add40a831db running z2 10.0.32.6 vm-29d36c97-d372-4b7c-4cfd-ff58e4046107 m3.large
diego-brain/2a81d157-3e65-4335-b661-0393f867d252 running z2 10.0.32.11 vm-fd18327e-5044-4056-4d23-c5c8a833c172 m3.medium
diego-brain/9811e507-4b3e-490a-839e-147f3eaa9089 running z1 10.0.16.13 vm-04482d80-97cc-4eb8-44e2-0756f45529e1 m3.medium
diego-cell/460d0ead-a1b4-40c5-9f88-70327b536aa8 running z2 10.0.32.12 vm-de76fbd8-cd3c-44ab-6ccc-bc0755909a91 r3.xlarge
diego-cell/4abb639b-33a9-4d8d-8a95-21c3863c7b0c unresponsive agent z1 10.0.16.14 vm-323a3988-5671-4b0d-4939-622d1886185f r3.xlarge
doppler/0108ae24-b3ca-4d14-9cd6-c5640ffc359c running z1 10.0.16.17 vm-30645eea-47cd-428c-52a7-116ce26d46e9 m3.medium
doppler/05d36944-d4ee-4c40-9eba-f8abce03d942 running z2 10.0.32.14 vm-97fc113f-d0b1-4bf4-6b2c-b99a82e88c74 m3.medium
etcd/2bef16f1-ce6e-4049-a3b9-728966135072 running z1 10.0.16.5 vm-0b59d1db-bf0b-4a22-43ad-d4e04edad572 m3.medium
etcd/ca1a776d-405c-4687-800b-ea707d015abb running z3 10.0.48.5 vm-b671dcea-6235-4217-6f50-8172a414fd98 m3.medium
etcd/f7171ac1-d760-4e79-986f-52409f22008a running z2 10.0.32.5 vm-268b024e-00ee-4fe0-5dd7-5cd36a975661 m3.medium
log-api/3e8f0f29-1dfe-4620-ba26-6d51db8254ed running z1 10.0.16.18 vm-29a20458-fc53-426f-47ff-293394fbf416 t2.small
log-api/4bc34462-5ae8-4cc1-82d2-bc74ef2866f5 running z2 10.0.32.15 vm-560e68ef-0629-4457-41b4-fbc76b2d09e6 t2.small
mysql/e448ba11-5f14-46f2-876c-e0e926fef2d5 running z1 10.0.16.6 vm-3be727a8-da1c-427a-6701-16ee19be6264 m3.large
nats/5fd2bb98-f384-4bbf-86b9-b5e888543b1a running z2 10.0.47.191 vm-5d271164-c9db-44d8-6dd2-a014ff075eaa c3.large
nats/b977aa7e-ced3-4e68-8dd9-bd5bb5640dec running z1 10.0.31.191 vm-86297c9b-19b5-45db-5dbc-dd3ed5638d42 c3.large
router/40f14eb0-47f2-4b1e-831d-6f1c9620c5ef running z3 10.0.48.6 vm-8590c4d2-2e89-4d72-7841-6d191d36a31a m3.medium
router/7a932987-5a50-4a2e-a221-c9ad1d601e6f running z2 10.0.32.10 vm-91df6232-7572-479f-51d8-5d0b24f8cdc3 m3.medium
router/a947e6e0-d814-434f-a66b-2b610e9eae93 running z1 10.0.16.12 vm-afeed179-d644-441b-7f49-add45d73b905 m3.medium
uaa/15d93dba-7b04-4d25-a244-7d1a93712e47 running z1 10.0.16.8 vm-715d445e-117d-4fd3-710a-7e92b028ec1b m3.medium
uaa/4c334724-7362-47b6-917e-3502566a8fad running z2 10.0.32.7 vm-e8cb1c9a-02c8-4b2b-7c9c-21261cb5e5a0 m3.medium
Trying to roll back fails since the deployment lock is still being held.
These commands get the deployment unstuck, so the operator can roll back to a previous version or correct the configuration:
bosh update-resurrection off
bosh ignore diego-cell/4abb639b-33a9-4d8d-8a95-21c3863c7b0c
bosh delete-vm vm-323a3988-5671-4b0d-4939-622d1886185f
Once the deploy is complete, run:
bosh update-resurrection on
bosh unignore diego-cell/4abb639b-33a9-4d8d-8a95-21c3863c7b0c
The VXLAN tunnel endpoint can be inspected using the `ip` utility from the `iproute2` package.
From the Diego cell, install a recent version of `iproute2` and its dependency `libmnl`:
curl -o /tmp/iproute2.deb -L http://mirrors.kernel.org/ubuntu/pool/main/i/iproute2/iproute2_4.3.0-1ubuntu3_amd64.deb
curl -o /tmp/libmnl0.deb -L http://mirrors.kernel.org/ubuntu/pool/main/libm/libmnl/libmnl0_1.0.3-5_amd64.deb
dpkg -i /tmp/*.deb
Then you can see details of the VTEP device by running `ip -d link list silk-vtep`, which should resemble:
3: silk-vtep: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether ee:ee:0a:ff:17:00 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 10.0.32.15 dev eth0 srcport 0 0 dstport 4789 nolearning ageing 300 gbp
Note `srcport 0 0 dstport 4789`. The `dstport PORT` value is the external UDP port used by the VTEP for all encapsulated VXLAN packets. See the `vtep_port` property on the `silk-daemon` job.
For more details, see `man ip-link`.
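If you only need the configured VXLAN port, you can pull it straight out of the `ip -d link` output. A small sketch using the device name shown above:

# Print just the dstport setting of the silk VTEP device
ip -d link show silk-vtep | grep -o 'dstport [0-9]*'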
If you've installed `cf-networking-release` and `silk-release` and find that you don't have connectivity between app containers, there are some basic troubleshooting steps to follow:
First, verify basic infrastructure network connectivity between diego cells. You can see the infrastructure IP of each cell by running `bosh vms`. Then `bosh ssh diego-cell/0` and `bosh ssh diego-cell/1`, and `ping` the infrastructure (underlay) IP of one cell from the other cell.
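For example, assuming a deployment named `cf` and the underlay IPs reported by `bosh vms` above (the instance name and address here are placeholders):

# From a workstation with the BOSH CLI targeting your director
bosh -d cf ssh diego-cell/0
# ...then, inside that cell, ping the other cell's infrastructure (underlay) IP
ping -c 3 10.0.32.12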
If that part works, the next step is to test overlay network connectivity between the diego cells themselves. Each diego cell has its own overlay IP address. You can discover it by running `ip addr show silk-vtep` from the cell. It will show something like:
294: silk-vtep: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN group default
link/ether ee:ee:0a:ff:22:00 brd ff:ff:ff:ff:ff:ff
inet 10.255.34.0/16 scope link silk-vtep
valid_lft forever preferred_lft forever
Then ping that overlay IP address from a different cell:
ping 10.255.34.0
This step should always succeed.
If it doesn't, use `tcpdump` on each side of the connection to inspect packets.
First, use a recent version of the `ip` utility to discover your VTEP port, or confirm that your `vtep_port` is set to `4789`.
Then, from each diego cell, open a terminal and run:
tcpdump -n port 4789
In a separate terminal, re-run the `ping $CELL_OVERLAY_ADDRESS`. The `tcpdump` output on each cell should show a single request/reply packet pair:
02:19:44.965688 IP 10.0.16.17.43146 > 10.0.32.15.4789: VXLAN, flags [I] (0x08), vni 1
IP 10.255.34.0 > 10.255.23.0: ICMP echo request, id 56571, seq 1, length 64
02:19:44.966850 IP 10.0.32.15.41667 > 10.0.16.17.4789: VXLAN, flags [I] (0x08), vni 1
IP 10.255.23.0 > 10.255.34.0: ICMP echo reply, id 56571, seq 1, length 64
Note that each packet shows up as two lines in tcpdump:
- the first line is the underlay network packet, destined for the VXLAN listener on UDP port 4789
- the second line shows the encapsulated, overlay ICMP packet
If `tcpdump` shows outgoing packets from the sending cell, but no incoming packets on the receiving cell, your infrastructure may be blocking access to the VXLAN port `4789`. Consider changing the `vtep_port` value in your BOSH manifest and re-deploying.
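To narrow the capture to VXLAN traffic exchanged with one specific peer, you can add a host filter. A sketch, using an underlay IP taken from the example capture above:

# Capture only VXLAN traffic to or from the other cell's underlay IP
tcpdump -n udp port 4789 and host 10.0.16.17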
To determine a failure on a c2c packet, `bosh ssh` onto a cell suspected of hosting an app that is not receiving the expected packets.
As above, set up a packet capture:
tcpdump -n -XX port 4789
where `4789` is the `vtep_port` setting on the `silk-daemon` job (or use a recent version of the `ip` utility to discover your VTEP port).
Here's a packet capture from an HTTP request over the overlay network:
03:39:07.934028 IP 10.0.32.15.41669 > 10.0.16.17.4789: VXLAN, flags [I] (0x88), vni 1
IP 10.255.82.67.58054 > 10.255.144.63.8080: Flags [P.], seq 1:83, ack 1, win 215, options [nop,nop,TS val 9490690 ecr 9562410], length 82: HTTP: GET / HTTP/1.1
0x0000: 4201 0a00 0001 4201 0a00 200f 0800 4500 B.....B.......E.
0x0010: 00b8 69c1 0000 4011 cc54 0a00 200f 0a00 [email protected]......
0x0020: 1011 a2c5 12b5 00a4 0000 8800 0003 0000 ................
0x0030: 0100 eeee 0aff 9000 eeee 0aff 5200 0800 ............R...
0x0040: 4500 0086 ae2d 4000 3f06 94c4 0aff 5243 E....-@.?.....RC
0x0050: 0aff 903f e2c6 1f90 84e6 7f2b 872d 297f ...?.......+.-).
0x0060: 8018 00d7 1be5 0000 0101 080a 0090 d102 ................
0x0070: 0091 e92a 4745 5420 2f20 4854 5450 2f31 ...*GET./.HTTP/1
0x0080: 2e31 0d0a 5573 6572 2d41 6765 6e74 3a20 .1..User-Agent:.
0x0090: 6375 726c 2f37 2e33 352e 300d 0a48 6f73 curl/7.35.0..Hos
0x00a0: 743a 2031 302e 3235 352e 3134 342e 3633 t:.10.255.144.63
0x00b0: 3a38 3038 300d 0a41 6363 6570 743a 202a :8080..Accept:.*
0x00c0: 2f2a 0d0a 0d0a /*....
Note:
- `vni 1` corresponds to the 3 bytes starting at `0x002E`
- The VXLAN port `4789` in decimal is `12b5` in hex. You can see that at address `0x0024`
- In this example, the sending application was tagged with VXLAN Group Based Policy (GBP) ID `0x0003`. This is at address `0x002C`
- The overlay destination IP address of `10.255.144.63` is `0aff 903f`. You can see that at address `0x0050`
- The overlay destination port `8080` in decimal is `1f90` in hex. You can see that at address `0x0056` (see the quick hex/decimal conversion sketch below)
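If you want to double-check these hex/decimal conversions while reading a capture, printf does the job:

# Decimal -> hex: the VXLAN port 4789 is 0x12b5
printf '%x\n' 4789
# Hex -> decimal: the overlay destination port 0x1f90 is 8080
printf '%d\n' 0x1f90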
To see which application is assigned the GBP tag `0x0003`, query the Network Policy Server:
cf curl /networking/v1/external/tags | jq .
{
"tags": [
{
"id": "f0782c6c-a2fe-4cad-a206-3970c57bb532",
"tag": "0001"
},
{
"id": "0f2e3339-0d19-4e87-b381-030c3619bd6b",
"tag": "0002"
},
{
"id": "1ee630fd-528a-4d80-97cd-dea170be759d",
"tag": "0003"
}
]
}
The `id` associated with tag `0003` is the source application guid, the result of `cf app app-name --guid`.
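To go straight from a tag to the owning app guid, you can filter the same endpoint with `jq`. A sketch, using the tag from the capture above:

# Print the app guid that owns GBP tag 0003
cf curl /networking/v1/external/tags | jq -r '.tags[] | select(.tag == "0003") | .id'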
You can also inspect `/var/vcap/data/container-metadata/store.json` on a diego cell to see the IP address and CF app metadata for every container:
{
"53069156-e8cf-4ca5-5a0a-838e" : {
"handle" : "53069156-e8cf-4ca5-5a0a-838e",
"ip" : "10.255.82.67",
"metadata" : {
"app_id" : "1ee630fd-528a-4d80-97cd-dea170be759d",
"space_id" : "49dae64f-8231-4294-9882-1490c120f839",
"org_id" : "cfe7a279-ad29-48d4-b188-813e85d2b621",
"policy_group_id" : "1ee630fd-528a-4d80-97cd-dea170be759d"
}
}
}
Note that:
- the `ip` in the above metadata matches the overlay source IP from the `tcpdump` output
- the `app_id` in the above metadata matches tag `0003` in the `tags` result
- the `tag` of `0003` can be read off the `tcpdump` output
To capture non-c2c packets (destination is an external address), run the following:
tcpdump -v -XX -i any
The packets you are interested in are those whose source IP is in the container network.
To filter by destination address, you may add `-n dst host <destination-ip>` to the `tcpdump` command. For example:
tcpdump -v -XX -i any -n dst host 96.126.115.72
Once you have a packet and want to find information about the application, find the assigned container ip in the packet header. For the example below, the ip is 10.255.29.3.
18:49:04.749158 IP (tos 0x0, ttl 64, id 25116, offset 0, flags [DF], proto TCP (6), length 114)
10.255.29.3.34638 > lax17s05-in-f14.1e100.net.http: Flags [P.], cksum 0xda6f (incorrect -> 0x5709), seq 1:75, ack 1, win 28200, length 74: HTTP, length: 74
GET / HTTP/1.1
User-Agent: curl/7.35.0
Host: google.com
Accept: */*
0x0000: 0a58 0aff 1d01 0a58 0aff 1d03 0800 4500 .X.....X......E.
0x0010: 0072 621c 4000 4006 fe5e 0aff 1d03 d83a .rb.@.@..^.....:
0x0020: d9ce 874e 0050 3a0f 5211 3aa9 9402 5018 ...N.P:.R.:...P.
0x0030: 6e28 da6f 0000 4745 5420 2f20 4854 5450 n(.o..GET./.HTTP
0x0040: 2f31 2e31 0d0a 5573 6572 2d41 6765 6e74 /1.1..User-Agent
0x0050: 3a20 6375 726c 2f37 2e33 352e 300d 0a48 :.curl/7.35.0..H
0x0060: 6f73 743a 2067 6f6f 676c 652e 636f 6d0d ost:.google.com.
0x0070: 0a41 6363 6570 743a 202a 2f2a 0d0a 0d0a .Accept:.*/*....
On the same cell where the packet was captured, run:
less /var/vcap/data/container-metadata/store.json | json_pp
and find the entry with that IP. If no entry exists, check the `store.json` on other cells.
The associated `app_id` is the application guid.
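If `jq` happens to be available on the cell (it may not be on a stock stemcell, so treat this as a sketch), you can look the entry up by IP directly:

# Print the app_id for the container that owns 10.255.29.3
jq -r 'to_entries[] | select(.value.ip == "10.255.29.3") | .value.metadata.app_id' /var/vcap/data/container-metadata/store.json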
Example `store.json` output:
{
"9bce657c-b92f-422b-60e0-227a66ad8b48" : {
"metadata" : {
"space_id" : "601577f3-7c2d-4d98-8029-0bd03b6a0682",
"app_id" : "aa4117a2-5e34-4648-9d42-8260380267cc",
"policy_group_id" : "aa4117a2-5e34-4648-9d42-8260380267cc",
"org_id" : "ff585363-1164-49b2-bbf3-55dd0cb06597"
},
"ip" : "10.255.29.4",
"handle" : "9bce657c-b92f-422b-60e0-227a66ad8b48"
},
"cf760bdc-ebf9-414e-4a88-29dc8820643e" : {
"ip" : "10.255.29.3",
"metadata" : {
"policy_group_id" : "f028b20b-7203-4743-ab96-da2bf05fae45",
"space_id" : "601577f3-7c2d-4d98-8029-0bd03b6a0682",
"app_id" : "f028b20b-7203-4743-ab96-da2bf05fae45",
"org_id" : "ff585363-1164-49b2-bbf3-55dd0cb06597"
},
"handle" : "cf760bdc-ebf9-414e-4a88-29dc8820643e"
}
}