This repository has been archived by the owner on Dec 13, 2022. It is now read-only.
HA failover causes cinder-volume to stop responding #942
This may actually be related to the RabbitMQ connection not being severed when the RabbitMQ VIP fails over.
OK, I've tracked this down to an issue with cinder-scheduler on the controller nodes: it does not reconnect correctly when the VIP fails over. Since cinder-scheduler isn't all that useful when the cinder-volume node is down, I propose that we change the ha-controller* roles to include only cinder-setup (for controller1) and cinder-api (for both nodes). The cinder volume storage nodes then get cinder-scheduler and cinder-volume. If the volume nodes are offline, it doesn't make much sense to have cinder-schedulers available that cannot schedule volumes to volume servers. More to come tomorrow.
claco added a commit to claco/openstack-ha that referenced this issue on May 28, 2014:
Removed VXPVNC from monitoring. RHEL does not support and being that RHEL is a supported platform we need to make sure that the offering is consistent on both RHEL and Ubuntu. Issue rcbops/chef-cookbooks#942 (cherry picked from commit 8b8e203)
claco added a commit to claco/chef-cookbooks that referenced this issue on Jun 13, 2014:
HA failover causes cinder-volume to stop responding because the scheduler does not reconnect properly after the vip failover. Since the scheduler is worthless w/o the volume service anyways, just put it right there where the volume is and off of the ha controller 1/2 nodes. Issue rcbops#942
claco added a commit to claco/chef-cookbooks that referenced this issue on Jun 13, 2014:
HA failover causes cinder-volume to stop responding because the scheduler does not reconnect properly after the vip failover. Since the scheduler is worthless w/o the volume service anyways, just put it right there where the volume is and off of the ha controller 1/2 nodes. Issue rcbops#942 (cherry picked from commit 74bb5b7)
When failover occurs, cinder-volume stops consuming messages from the cinder-volume queue and does not resume until the cinder-volume service is restarted.
During this time, cinder-volume.log shows that the service has re-established its MySQL and RabbitMQ connections and is sending service updates, which also appear in the output of cinder service-list.
Jason discovered that cinder uses a direct consumer queue that is created when the cinder-volume service starts (see Direct Consumer at http://docs.openstack.org/developer/cinder/devref/rpc.html ) and is removed when the failover occurs.
E.g., cinder-volume_fanout_37c73e1379414cb7a0461aab85c69288
I traced the creation of this queue back through the following chain, from the point where the queue is created up to service startup:
- https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L267 (where the queue is created)
- https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L694
- https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L740 (only called with fanout=True on service startup)
- https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/service.py#L58 (service startup)
So it looks like the direct consumer queue is dropped when the connection to RabbitMQ drops during failover and is never recreated. As a result, no messages are processed until cinder-volume is restarted and a new direct consumer fanout queue is created.
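To make the suspected failure mode concrete, here is a minimal toy model in Python. It does not use kombu or RabbitMQ, and it is not cinder's code: `ToyBroker`, `ToyConsumer`, `run_scenario`, and the queue name `cinder-volume_fanout_example` are all hypothetical names made up for illustration. It assumes, as described above, that the queue disappears on failover and is only declared once at service startup.

```python
class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not RabbitMQ)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of pending messages

    def declare_queue(self, name):
        # Create the queue if it does not exist yet.
        self.queues.setdefault(name, [])

    def failover(self):
        # Assumption from the report: the per-service queue is gone after failover.
        self.queues.clear()

    def publish(self, name, message):
        # A message sent to a queue that no longer exists is lost.
        if name not in self.queues:
            return False
        self.queues[name].append(message)
        return True


class ToyConsumer:
    """Stand-in for a service that declares its queue once at startup."""

    def __init__(self, broker, queue_name, redeclare_on_reconnect):
        self.broker = broker
        self.queue_name = queue_name
        self.redeclare_on_reconnect = redeclare_on_reconnect
        self.received = []
        broker.declare_queue(queue_name)  # happens only at "service startup"

    def reconnect(self):
        # The connection comes back either way; the queue only comes back
        # if the consumer explicitly re-declares it.
        if self.redeclare_on_reconnect:
            self.broker.declare_queue(self.queue_name)

    def drain(self):
        pending = self.broker.queues.get(self.queue_name, [])
        self.received.extend(pending)
        pending.clear()


def run_scenario(redeclare_on_reconnect):
    broker = ToyBroker()
    consumer = ToyConsumer(
        broker, "cinder-volume_fanout_example", redeclare_on_reconnect
    )
    broker.publish(consumer.queue_name, "before failover")
    consumer.drain()
    broker.failover()
    consumer.reconnect()
    broker.publish(consumer.queue_name, "after failover")
    consumer.drain()
    return consumer.received


if __name__ == "__main__":
    print("without re-declare:", run_scenario(False))
    print("with re-declare:   ", run_scenario(True))
```

In this sketch the consumer that does not re-declare its queue reconnects fine but never sees the second message, which matches the symptom of a service that looks healthy yet processes nothing until it is restarted.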
Cookbooks: v4.2.2
Cinder packages: 1:2013.2.2-0ubuntu1~cloud0