Add Retina CLI command to create ephemeral container with network debugging tools #910

Open
wedaly opened this issue Oct 29, 2024 · 1 comment
Labels
area/captures · scope/L (Change is Large) · type/enhancement (New feature or request)

Comments

@wedaly
Member

wedaly commented Oct 29, 2024

Is your feature request related to a problem? Please describe.

I want to be able to do ad hoc, exploratory debugging of node and pod networking.
 
Describe the solution you'd like

kubectl retina sh pods/<pod>
kubectl retina sh nodes/<node>
  • Start an ephemeral container in the specified node or pod.
  • Use an image based on Azure Linux with built-in networking tools (ping, curl, nslookup, tcpdump, conntrack, iproute2, iptables, etc.). If the cluster is using Retina, then it should already have firewall configuration to download this new image from the same source as the other Retina images.
  • Run an interactive shell (bash).
  • Assign the appropriate permissions to ensure the tools work (NET_ADMIN and NET_RAW).
  • (Optionally) mount the host filesystem to inspect CNI logs, check which iptables executable is installed on the host, etc. Make this opt-in (--mount-host-filesystem flag) since most network troubleshooting doesn't need it.
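
To make the proposal above concrete, here is a minimal sketch of the first steps for the pod case: parsing the `pods/<pod>` / `nodes/<node>` argument and building an entry for the pod's `spec.ephemeralContainers` list with the two capabilities. This is only an illustration under assumptions, not the Retina implementation: the function names, the container name, and the image reference are hypothetical (the issue does not name a registry path), and the node case and the host filesystem mount are not covered.

```python
# Hypothetical image reference; the issue does not specify a registry path or tag.
SHELL_IMAGE = "example.registry.io/retina-shell:latest"


def parse_target(target: str) -> tuple[str, str]:
    """Split 'pods/<pod>' or 'nodes/<node>' into (kind, name)."""
    kind, sep, name = target.partition("/")
    if not sep or kind not in ("pods", "nodes") or not name:
        raise ValueError(f"expected pods/<pod> or nodes/<node>, got {target!r}")
    return kind, name


def ephemeral_container(name: str = "retina-shell") -> dict:
    """Build one entry for a pod's spec.ephemeralContainers list.

    Field names follow the Kubernetes container schema; the values
    (container name, image) are assumptions for illustration.
    """
    return {
        "name": name,
        "image": SHELL_IMAGE,
        "command": ["bash"],  # interactive shell, as proposed above
        "stdin": True,
        "tty": True,
        "securityContext": {
            # Capabilities the issue says the networking tools need.
            "capabilities": {"add": ["NET_ADMIN", "NET_RAW"]},
        },
    }
```

For example, `parse_target("pods/my-pod")` returns `("pods", "my-pod")`, and the dict from `ephemeral_container()` could be serialized as JSON and appended to the target pod's ephemeral container list.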

Future work could expand this to support Windows (image with PowerShell, use host process containers for node access), but to start I care most about Linux support.

Describe alternatives you've considered

  • Node-shell gives full access to the node using "nsenter" on PID 1.
    • Overprivileged: Starts privileged pods on customer nodes (CAP_SYS_ADMIN and access to /proc filesystem).
    • Alpine Image: Defaults to an Alpine image from DockerHub that is often blocked by customer firewalls or subject to DockerHub rate limits. Can override this using an env var, but still requires an image with nsenter installed (and Azure Linux 2.0+ does not have it installed).
    • No pod debugging: Cannot easily debug connectivity from within a pod network namespace. Users can find the netns for a pod, then use ip netns exec, but this is painful and error-prone.
  • Kubectl exec allows us to execute a process within a pod's namespace.
    • Pod image missing tools: Pod images may lack networking tools or even a shell (more common now with distroless images).
    • Host networking: If there's no host networking pod, we can't test connectivity from node IP.
    • Pod crashing: If the target pod is crashing, we cannot exec into it.
  • Kubectl debug creates ephemeral containers to get access to either a node or a pod. It provides "profiles" with different levels of permission.
    • Missing tools: Requires an image with network debugging tools installed. A popular image is netshoot, but there isn't currently an equivalent image in MCR. People often fall back to running Azure Linux and manually installing whatever packages they need, which is slow and tedious.
    • Confusing permissions: the "general" profile provides access to the node filesystem, but not the NET_ADMIN capability required for iptables/nftables, leading to misleading permissions errors like Permission denied (you must be root) even for the root user. kubectl with k8s 1.27 and later has a "netadmin" profile that grants the NET_ADMIN capability, but not access to the host filesystem.
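
For comparison, the kubectl debug alternative above can be sketched as a small helper that assembles the command line with the "netadmin" profile. This is an illustration only: the helper name and the image reference are hypothetical, and it only builds the argument list rather than running kubectl.

```python
import shlex


def kubectl_debug_argv(target: str, image: str, profile: str = "netadmin") -> list[str]:
    """Assemble a `kubectl debug` invocation for a pod or node target.

    `target` uses kubectl's TYPE/NAME form, e.g. "pod/my-pod" or "node/my-node".
    The image must already contain the networking tools; kubectl does not add them.
    """
    return [
        "kubectl", "debug", "-it", target,
        f"--image={image}",
        f"--profile={profile}",
    ]


# Hypothetical node name and image reference, for illustration only.
argv = kubectl_debug_argv("node/my-node", "example.registry.io/netshoot:latest")
command_line = shlex.join(argv)
```

The user still has to pick the image and the profile on every invocation, which is the friction the proposed `kubectl retina sh` command is meant to remove.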

Additional context

Retina has a unique opportunity to overcome the limitations of the other tools listed above. If people are already using Retina for monitoring/observability, then it would be very convenient to also use it for ad hoc network debugging.

This would also pair well with Retina's existing packet capture command. I often find myself starting a packet capture, then using kubectl exec / kubectl debug to reproduce an issue. It would be really nice to do this without needing a separate tool.

@wedaly
Member Author

wedaly commented Nov 1, 2024

Here's a feature branch with an initial implementation: main...wedaly:retina:retina-shell-feature-branch

I'll open separate PRs for each of the three commits in order (creating the image, publishing with GH workflows, and adding the CLI command).

github-merge-queue bot pushed a commit that referenced this issue Nov 4, 2024
# Description

Build a new image retina-shell for adhoc network debugging on Linux
nodes/pods.

## Related Issue

#910

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

Tested building with the following commands:
```
IMAGE_REGISTRY=widalytest.azurecr.io BUILDX_ACTION=--push PLATFORM=linux/amd64 make retina-shell-image
IMAGE_REGISTRY=widalytest.azurecr.io BUILDX_ACTION=--push PLATFORM=linux/arm64 make retina-shell-image
IMAGE_REGISTRY=widalytest.azurecr.io BUILDX_ACTION=--push make manifest-shell-image
```

Then ran it locally:
<img width="779" alt="image"
src="https://github.com/user-attachments/assets/7a6b0163-aa90-48b1-815a-99e64a042a25">


## Additional Notes

There are two issues with the AzLinux 3 base image that should be fixed
in the upcoming AzLinux3 release. See comments in the Dockerfile for
details.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Will Daly <[email protected]>

github-merge-queue bot pushed a commit that referenced this issue Nov 6, 2024
# Description

Update GitHub workflows to publish retina-shell image.

## Related Issue

#910

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

<img width="550" alt="image"
src="https://github.com/user-attachments/assets/cddac7c8-c4f6-4716-a3e0-9a1075c25e13">

## Additional Notes

N/A


Signed-off-by: Will Daly <[email protected]>

github-merge-queue bot pushed a commit that referenced this issue Nov 7, 2024
# Description

* bpftool/bpftrace don't work from inside the container, so remove them
from the image.
* nc is useful for testing TCP connectivity, so add it.
* jq is useful for parsing IMDS output, so add it too.
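
A quick way to sanity-check a tool list like this from inside the container is to look each binary up on `PATH`. The sketch below is an assumption-laden illustration, not part of this change: the helper name is made up, and the tool list is taken from the issue description plus the additions above (`ip` is the binary shipped by iproute2).

```python
import shutil

# Tools named in the issue, plus nc and jq from this change.
EXPECTED_TOOLS = [
    "ping", "curl", "nslookup", "tcpdump", "conntrack",
    "ip", "iptables", "nc", "jq",
]


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Running `missing_tools(EXPECTED_TOOLS)` inside the image should return an empty list if everything was installed as intended.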

## Related Issue

#910

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

<img width="962" alt="image"
src="https://github.com/user-attachments/assets/9864282f-e31b-43b5-8168-a7988c698b86">
<img width="477" alt="image"
src="https://github.com/user-attachments/assets/75fcb8e1-18e6-482a-bd85-07143a093cd8">


## Additional Notes

N/A


Signed-off-by: Will Daly <[email protected]>

github-merge-queue bot pushed a commit that referenced this issue Nov 8, 2024
# Description

Build retina-shell image in .pipelines/cg-pipeline.yaml, which is used
to publish to acnpublic.azurecr.io/containernetworking/retina-shell

## Related Issue

#910 

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

N/A

## Additional Notes

Untested, but follows the same pattern as other release jobs.


Signed-off-by: Will Daly <[email protected]>

github-merge-queue bot pushed a commit that referenced this issue Nov 18, 2024
# Description

Add ethtool to retina-shell image. Useful for querying/modifying network
driver settings.

## Related Issue

#910 

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

<img width="1112" alt="image"
src="https://github.com/user-attachments/assets/c9c5f640-e32d-4db3-98b0-5a14fe817d4c">

## Additional Notes

N/A


Signed-off-by: Will Daly <[email protected]>