This repository has been archived by the owner on Mar 7, 2024. It is now read-only.

Deploy Kubernetes on Jetstream with Kubespray 2.21.0 #46

Merged: 7 commits, Jul 26, 2023


zonca
Contributor

@zonca zonca commented Mar 2, 2023

Still finishing testing

@jlf599
Collaborator

jlf599 commented Mar 2, 2023

You'll want to change

openstack recordset list tg-xxxxxxxxx.projects.jetstream-cloud.org.

to

openstack recordset list xxxxxxxxx.projects.jetstream-cloud.org.

@zonca
Contributor Author

zonca commented Mar 2, 2023

@jlf599 sorry, the link up there was to the old version, I removed it, better if you check the diff in the pull request

@jlf599 jlf599 marked this pull request as ready for review March 2, 2023 02:10
@jlf599
Collaborator

jlf599 commented Mar 2, 2023

I inadvertently marked ready for review -- sorry. Is it ready or not?

@zonca
Contributor Author

zonca commented Mar 2, 2023

No, I'm still testing

@jlf599 jlf599 self-assigned this Mar 2, 2023
@zonca zonca marked this pull request as draft March 2, 2023 03:02
@zonca
Contributor Author

zonca commented Mar 2, 2023

ok, @jlf599, I completed testing. Everything seems to work fine.

do you have someone that could test the tutorial and provide feedback?

I have already asked @julienchastang

@zonca zonca marked this pull request as ready for review March 2, 2023 22:53
@zonca
Contributor Author

zonca commented Mar 8, 2023

@robertej09 @julienchastang would you have time to test the recipe in the next couple of weeks?
Otherwise we can merge it immediately and you can provide feedback later on.

@julienchastang

OK thank you for doing this. Yes, I will try to make some time to evaluate this soon. Thanks again!

@zonca
Contributor Author

zonca commented Mar 16, 2023

ok, solved the issue with Designate by disassociating and reassociating the floating IP.
Also, I now use the existing "auto allocated network".

I am debugging some networking issues for which I have an open ticket; however, I do not think that they are related to this tutorial.
So it is ready for review and merging.
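
The disassociate/reassociate workaround mentioned above could be scripted roughly as follows. This is only a sketch: the function name and the server name in the usage note are hypothetical, and it assumes the standard `openstack server remove floating ip` and `openstack server add floating ip` subcommands are available in the installed client.

```shell
# Sketch: detach a floating IP from a server and attach it again.
# Both arguments are supplied by the caller; nothing here is Jetstream-specific.
reassociate_floating_ip() {
    local server="$1" ip="$2"
    if [ -z "$server" ] || [ -z "$ip" ]; then
        echo "usage: reassociate_floating_ip SERVER FLOATING_IP" >&2
        return 2
    fi
    # Stop if the detach fails, so we never try to attach an IP still in use
    openstack server remove floating ip "$server" "$ip" || return 1
    openstack server add floating ip "$server" "$ip"
}
```

It would be called as `reassociate_floating_ip <server-name> "$IP"`, with the server name taken from `openstack server list`.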

@julienchastang

OK will review soon. I need to clear a couple of things off of my plate and then will get to this.

@jlf599
Collaborator

jlf599 commented Mar 16, 2023

I do think at some point, converting the instance creation methodology is going to be a good idea. Basically, if you create NO networking infrastructure and just create an instance, it should use the Openstack "Just give me a network" protocol and do all of the right things.

If you need to get the public IP, you can use one of these methods to have the instance phone home with it:

wget http://169.254.169.254/latest/meta-data/public-ipv4 -qO -
wget http://ipinfo.io/ip -qO -
curl http://169.254.169.254/latest/meta-data/public-ipv4
curl http://ipinfo.io/ip

Aaron Wells on the JS2 team has code examples for Terraform for setting up. I believe some are linked from the docs site but you can also consult with him directly via Slack if you'd like.
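
The lookups listed above can be wrapped in a small helper that tries the metadata endpoint first and falls back to ipinfo.io. This is a minimal sketch; the function name and the 5-second timeout are assumptions, not part of the tutorial.

```shell
# Sketch: print this instance's public IPv4 address.
# The metadata endpoint only answers from inside the instance itself.
get_public_ip() {
    local url ip
    for url in \
        "http://169.254.169.254/latest/meta-data/public-ipv4" \
        "http://ipinfo.io/ip"; do
        # -s: silent, -f: treat HTTP errors as failures, --max-time: do not hang
        if ip="$(curl -sf --max-time 5 "$url")" && [ -n "$ip" ]; then
            printf '%s\n' "$ip"
            return 0
        fi
    done
    echo "could not determine public IP" >&2
    return 1
}
```

For example, `IP="$(get_public_ip)"` stores the address for later commands and returns a non-zero status if neither endpoint responds.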

@zonca
Contributor Author

zonca commented Mar 16, 2023

I do think at some point, converting the instance creation methodology is going to be a good idea. Basically, if you create NO networking infrastructure and just create an instance, it should use the Openstack "Just give me a network" protocol and do all of the right things.

the problem is that the Kubespray Terraform recipe is quite complex, so I am trying to modify it as little as possible to keep other issues from popping up.

If you need to get the public IP, you can use one of these methods to have the instance phone home with it:

wget http://169.254.169.254/latest/meta-data/public-ipv4 -qO -
wget http://ipinfo.io/ip -qO -
curl http://169.254.169.254/latest/meta-data/public-ipv4
curl http://ipinfo.io/ip

Aaron Wells on the JS2 team has code examples for Terraform for setting up. I believe some are linked from the docs site but you can also consult with him directly via Slack if you'd like.

I think the current workaround is suitable, see in the tutorial where I release and add back the $IP

@julienchastang

I went through the new workflow:

  • Terraform prompts me about installing some experimental feature to complete the workflow (no biggie)
  • The VM is now on the auto_allocated_network
  • The usual workflow ran to completion (terraform / ansible / helm, etc.)
  • The Hub appears to be up and running judging by kubectl get pods -A output.
  • DNS is not working, therefore the Hub URL is unreachable. I did see these instructions. Still trying to figure out what is going on here.

(ping @robertej09)
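
The pod check mentioned in the list above can be made less manual by listing only the pods that are not yet healthy. This is a sketch under the assumption that `kubectl get pods -A --no-headers` prints the usual NAMESPACE, NAME, READY, STATUS columns; the function name is made up for this example.

```shell
# Sketch: print "namespace/pod STATUS" for every pod that is neither
# Running nor Completed. Empty output means nothing looks stuck.
pods_not_ready() {
    kubectl get pods -A --no-headers |
        awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Note that a Running status does not guarantee the Hub is reachable: as the last bullet shows, DNS can still be broken while every pod is up.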

@julienchastang

julienchastang commented Mar 23, 2023

OK, I have DNS working now. Just had to:

# Create new DNS record
openstack recordset create \
  --record <floating-ip-of-instance> \
  --type A \
  <project-ID>.projects.jetstream-cloud.org. \
  <your-desired-hostname>.<project-ID>.projects.jetstream-cloud.org.

Thanks Ana (@robertej09) for reminding me that this was in our docs all along.
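
The record creation above can be parameterized so the zone and fully qualified name are built from the project ID and hostname. This is a sketch only; the function names are hypothetical, and the zone layout is taken from the command in this comment.

```shell
# Sketch: build the project's DNS zone name, including the trailing dot.
zone_for_project() {
    printf '%s.projects.jetstream-cloud.org.\n' "$1"
}

# Sketch: create an A record <hostname>.<zone> pointing at a floating IP.
create_a_record() {
    local project_id="$1" hostname="$2" floating_ip="$3"
    local zone
    zone="$(zone_for_project "$project_id")"
    openstack recordset create \
        --record "$floating_ip" \
        --type A \
        "$zone" \
        "${hostname}.${zone}"
}
```

Usage would look like `create_a_record <project-ID> <your-desired-hostname> <floating-ip-of-instance>`.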

@julienchastang

Also, I noticed that install_jhub.sh now has a bunch of diagnostic niceties. For Letsencrypt I did not have to run deploymentPatch.sh. I am not sure if that is intentional or not.

@zonca
Contributor Author

zonca commented Apr 3, 2023

one issue with this deployment is that Terraform creates a dedicated subnet.
Once the subnet is available, people on the same allocation that create a VM without specifying networking might get assigned to the subnet created by Terraform.
So when we run terraform destroy, the subnet cannot be deleted.

I think this is not a big issue, I have been affected because I am in the Jetstream support allocation and there is a lot of creation/deletion of VMs.

However, I would like to try once more to see if I can modify the Terraform recipes of Kubespray to not create any networking as suggested by @jlf599. Because this should also fix our other issue with Designate.

For reference I will use these Terraform recipes: https://github.com/wellsaar/terraform-js2/blob/main/ubuntu_nginx_mariadb/Ubuntu22.tf

@zonca
Contributor Author

zonca commented May 3, 2023

I tried hard to make the Terraform recipe use the auto_allocated_network, but I couldn't get it to work, see zonca/jetstream_kubespray#23.

Moreover, even if I get it to work, there are just too many changes that will be difficult to re-apply to every update of Kubespray.

So I think the recipe should continue to create a k8s-internal-network. However, I will add some explanation of this in the tutorial.

I also plan to rerun the tutorial a couple of times and make sure all the steps are fine.
I'll notify again once this is ready to merge.

@jlf599
Collaborator

jlf599 commented May 3, 2023

It might require a rework, but IIRC, Aaron Wells (JS2 staff) has terraform that utilizes auto_allocated_network

The key difference is when you go to set things up, if you don't create any networking at all and just create instances with the auto_allocated_network and then create a floating ip and attach it, that's all you do. No need to create router, net, subnet, or port.
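
The flow described above (no router, network, subnet, or port; just an instance plus a floating IP) might look like this with the OpenStack CLI. It is a sketch under several assumptions: the function name is hypothetical, the image, flavor, and key names are placeholders, the external network is assumed to be called `public`, and the installed client is assumed to support `server create --network`.

```shell
# Sketch: launch one instance on the auto-allocated network, attach a new
# floating IP, and print that IP.
launch_on_auto_network() {
    local name="$1" image="$2" flavor="$3" key="$4" ext_net="${5:-public}"
    local ip
    openstack server create \
        --image "$image" --flavor "$flavor" --key-name "$key" \
        --network auto_allocated_network \
        --wait "$name" >/dev/null || return 1
    # Allocate a floating IP from the external network and keep only its address
    ip="$(openstack floating ip create "$ext_net" \
        -f value -c floating_ip_address)" || return 1
    openstack server add floating ip "$name" "$ip" || return 1
    printf '%s\n' "$ip"
}
```

This is not the Kubespray Terraform recipe; it only illustrates how few resources the approach needs for a single instance.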

@jlf599
Collaborator

jlf599 commented May 3, 2023

the other issue here is that if people on that allocation are using the auto_allocated_ subnet, you might break things for them. I'd highly recommend engaging Mike and Aaron via Slack on this.

@zonca
Contributor Author

zonca commented May 3, 2023

@jlf599 I think the issue is marginal, I guess not many people happen to deploy Kubernetes using Kubespray and also launch instances on "auto_allocated_network" in the same allocation. And as long as we make people aware of the problem in the documentation, they can work around it.

@julienchastang @robertej09 do you have a preference between these two options? (1) Switch to auto_allocated_network: we create fewer networking resources and do not risk breaking other people in the same allocation who use auto_allocated_network themselves, at the cost of a larger maintenance burden due to large changes to the Terraform recipe. (2) Leave the recipe as it is now, creating a dedicated k8s-internal-network.

@julienchastang

Thank you all for your hard work on this matter. While I don't have a strong preference, I do emphasize the importance of clean resource management for the numerous JupyterHub clusters we run. Ensuring that resources are properly torn down without any dangling or orphaned resources is just as important to avoid tedious manual cleanup.

@ana-v-espinoza

Hey all,

I'll admit that I had to read through this a few times to ensure I was understanding what I was looking at. I'm sure attempting to multi-task while doing so didn't help!

As Julien said, thanks all for your hard work. I personally don't have a strong preference either way, but I can see the merit in both options. While keeping the Terraform workflow minimally modified reduces the maintenance workload and is proven to work, it creates/uses more network resources, which not every allocation may have access to. There's also this:

one issue with this deployment is that Terraform creates a dedicated subnet.
Once the subnet is available, people on the same allocation that create a VM without specifying networking might get assigned to the subnet created by Terraform.

and this:

the other issue here is that if people on that allocation are using the auto_allocated_ subnet, you might break things for them.

@zonca, does this happen exclusively when creating VMs through Terraform, or also via the OpenStack CLI or the web portals? Depending on whether or not other Jetstream2 users even use Terraform for their infrastructure creation (outside of this Kubespray workflow), as Andrea says, this may not be a large problem, especially if it's a well-documented emergent "feature."

Just my 2 cents,

Ana

@jlf599
Collaborator

jlf599 commented May 15, 2023

@zonca, does this happen exclusively when creating VMs through Terraform, or also via the OpenStack CLI or the web portals? Depending on whether or not other Jetstream2 users even use Terraform for their infrastructure creation (outside of this Kubespray workflow), as Andrea says, this may not be a large problem, especially if it's a well-documented emergent "feature."


This can affect users that are working in ANY of the JS2 interfaces. It's hard to spot if you're using Exosphere exclusively, though. It's easier to track down via Horizon or the CLI.

I think ultimately we should move to the OpenStack "Just give me a network" style, which basically means you don't create or specify any network bits and OpenStack puts you on the auto_allocated_network and auto_allocated_subnet.

Basically, there's no security gain in isolated subnets or networks. It's not like physical networking where you are physically attaching to devices and being isolated that way. All of this is really handled via iptables and routing rules. So if the desire for doing this is security, it's really not gaining anything. If the desire for this is just to avoid making larger changes, I understand that, though in the long run refactoring makes the whole process simpler for users to troubleshoot.

I would say in the long run, we should work together to see if we can make it work the new way...though that day may not be today (or tomorrow).

@julienchastang

Another complication is that the recordset entry needs to be manually deleted upon cluster teardown. Otherwise, the next time you try to attach a new recordset to the auto_allocated_network, you'll be stymied. It would be nice if the recordset / publicly accessible URL entries were automatically handled as before (i.e., it just worked, for the most part).
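
Until that is automated, the manual cleanup could be reduced to one helper run as part of teardown. This is a sketch only: the function name is hypothetical, and it assumes `openstack recordset delete` accepts the zone and recordset by name; if it does not in your client version, look up the IDs with `openstack recordset list` first.

```shell
# Sketch: delete the A record for <hostname> in the project's DNS zone.
delete_a_record() {
    local project_id="$1" hostname="$2"
    local zone="${project_id}.projects.jetstream-cloud.org."
    openstack recordset delete "$zone" "${hostname}.${zone}"
}
```

Calling `delete_a_record <project-ID> <your-desired-hostname>` before destroying the cluster should leave the name free for the next deployment.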

@jlf599
Collaborator

jlf599 commented May 23, 2023 via email

@zonca
Contributor Author

zonca commented May 23, 2023

given the feedback I'll try again to make kubespray work with the auto allocated network. It is going to be a big upfront amount of work, hopefully successful, but it should simplify the infrastructure a lot.

@jlf599
Collaborator

jlf599 commented May 23, 2023 via email

@zonca
Contributor Author

zonca commented May 23, 2023

I tried already, the instances have a network, but they cannot connect to each other.

@zonca
Contributor Author

zonca commented May 23, 2023

anyway I'll try again then ask for help

@zonca
Contributor Author

zonca commented May 30, 2023

I have some good progress going on! I'll update this soon.

@zonca
Contributor Author

zonca commented Jul 9, 2023

thanks @jlf599 @zacharygraber, the tutorial is now working,
there are now a lot more changes to the Terraform recipe, see zonca/jetstream_kubespray#21, however I am mostly removing resources.

I think the PR can be merged.

Then @julienchastang @robertej09 should do more extensive testing. I deployed a simple JupyterHub and everything seemed to be working, but something more subtle might still be broken.

@julienchastang

OK, sounds good. I've been on vacation, but will try to make time for this soon.

@zonca
Contributor Author

zonca commented Jul 24, 2023

@jlf599 @zacharygraber I think this PR can be merged

@jlf599 jlf599 merged commit 8bb95c0 into jetstream-cloud:main Jul 26, 2023
@julienchastang

@robertej09 and I have been working on this over the last few days and I believe things look good. Just have to remember to replace router_id as described in cluster.tfvars. Easy enough, but I wasn't sure if this was mentioned in the blog/docs. Anyway, it looks good from here. Thanks for doing this!
