HA failover causes cinder-volume to stop responding #942

JCallicoat · 2014-04-30T02:29:14Z

When an HA failover occurs, cinder-volume stops consuming messages from the cinder-volume queue and does not resume until the cinder-volume service is restarted.

During this time, cinder-volume.log shows that the service has re-established its MySQL and RabbitMQ connections and is sending service updates, which are also visible in cinder service-list.

Jason discovered that cinder is using a direct consumer queue that is created when the cinder-volume service is started (see Direct Consumer at http://docs.openstack.org/developer/cinder/devref/rpc.html ), and is removed when the failover occurs.

E.g., cinder-volume_fanout_37c73e1379414cb7a0461aab85c69288

I traced the creation of this queue to https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L267 , which is reached via https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L694 , which in turn is reached via https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/impl_kombu.py#L740 . That last call is only made with fanout=True on service startup: https://github.com/openstack/cinder/blob/stable/havana/cinder/openstack/common/rpc/service.py#L58

So it looks like the direct consumer queue is dropped when the connection to rabbit drops during failover, and then that queue is never recreated, so no messages are processed until cinder-volume is restarted and a new direct consumer fanout queue is created.
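To make the suspected failure mode concrete, here is a minimal toy model of it in plain Python. It does not use kombu or RabbitMQ: ToyBroker, ToyConsumer, simulate_failover, and the queue name are all hypothetical stand-ins. It assumes only what is described above, namely that the queue disappears when the connection drops and comes back only if the consumer re-declares it.

```python
class ToyBroker:
    """In-memory stand-in for a message broker (not RabbitMQ or kombu)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of pending messages

    def declare(self, name):
        # Create the queue if it does not exist yet.
        self.queues.setdefault(name, [])

    def publish(self, name, message):
        # A message sent to a queue that no longer exists is lost.
        if name not in self.queues:
            return False
        self.queues[name].append(message)
        return True

    def drop_connection(self, owned_queues):
        # Assumption: queues tied to the connection vanish when it drops.
        for name in owned_queues:
            self.queues.pop(name, None)


class ToyConsumer:
    """Consumer that declares its queues at startup, like a service start."""

    def __init__(self, broker, queue_names, redeclare_on_reconnect):
        self.broker = broker
        self.queue_names = list(queue_names)
        self.redeclare_on_reconnect = redeclare_on_reconnect
        self.received = []
        for name in self.queue_names:
            broker.declare(name)

    def reconnect(self):
        # The connection comes back; queues return only if re-declared.
        if self.redeclare_on_reconnect:
            for name in self.queue_names:
                self.broker.declare(name)

    def drain(self):
        # Collect and clear whatever is pending on each known queue.
        for name in self.queue_names:
            pending = self.broker.queues.get(name, [])
            self.received.extend(pending)
            del pending[:]


def simulate_failover(redeclare_on_reconnect):
    broker = ToyBroker()
    queue = "cinder-volume_fanout_example"  # hypothetical queue name
    consumer = ToyConsumer(broker, [queue], redeclare_on_reconnect)
    broker.drop_connection(consumer.queue_names)  # failover happens
    consumer.reconnect()
    broker.publish(queue, "create_volume")
    consumer.drain()
    return consumer.received
```

In this model, simulate_failover(False) returns an empty list because the queue is never recreated, while simulate_failover(True) returns ["create_volume"]. Whether the real service behaves like the first case would have to be confirmed against the impl_kombu.py reconnect path.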

Cookbooks: v4.2.2
Cinder packages: 1:2013.2.2-0ubuntu1~cloud0

breu · 2014-04-30T03:25:59Z

this may actually be related to the rabbitmq connection not getting severed on failover of the rabbitmq VIP

breu · 2014-04-30T05:53:21Z

OK, I've tracked this down to an issue with cinder-scheduler on the controller nodes, where it does not reconnect correctly when the VIP fails over. Since cinder-scheduler isn't all that useful when the cinder-volume node is down, I propose that we change the ha-controller* roles to only include cinder-setup (for controller1) and cinder-api for both nodes. The cinder volume storage nodes then get cinder-scheduler and cinder-volume. If the volume nodes are offline, it doesn't make much sense to have cinder-schedulers available that cannot schedule volumes to volume servers.
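The proposed placement can be summarized as a small mapping, sketched here in Python purely for illustration. The service names come from the proposal above; the node labels (in particular "volume-node") and the helper function are hypothetical and are not taken from the cookbooks.

```python
# Proposed service placement. Node labels are illustrative assumptions,
# not actual Chef role names from the cookbooks.
PROPOSED_LAYOUT = {
    "ha-controller1": ["cinder-setup", "cinder-api"],
    "ha-controller2": ["cinder-api"],
    "volume-node": ["cinder-scheduler", "cinder-volume"],
}


def scheduler_colocated_with_volume(layout):
    """Return True if every node running cinder-scheduler also runs cinder-volume."""
    hosts = [services for services in layout.values()
             if "cinder-scheduler" in services]
    return bool(hosts) and all("cinder-volume" in services for services in hosts)
```

Calling scheduler_colocated_with_volume(PROPOSED_LAYOUT) checks the one property the proposal relies on: a scheduler never runs on a node that lacks the volume service.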

more to come tomorrow

Removed VXPVNC from monitoring. RHEL does not support and being that RHEL is a supported platform we need to make sure that the offering is consistent on both RHEL and Ubuntu. Issue rcbops/chef-cookbooks#942 (cherry picked from commit 8b8e203)

HA failover causes cinder-volume to stop responding because the scheduler does not reconnect properly after the vip failover. Since the scheduler is worthless w/o the volume service anyways, just put it right there where the volume is and off of the ha controller 1/2 nodes. Issue rcbops#942

HA failover causes cinder-volume to stop responding because the scheduler does not reconnect properly after the vip failover. Since the scheduler is worthless w/o the volume service anyways, just put it right there where the volume is and off of the ha controller 1/2 nodes. Issue rcbops#942 (cherry picked from commit 74bb5b7)

claco added this to the v4.2.3 milestone May 23, 2014

cloudnull mentioned this issue May 25, 2014

Fix for issue 915 rcbops-cookbooks/openstack-ha#80

Merged

claco mentioned this issue May 28, 2014

Fix for issue 942 rcbops-cookbooks/openstack-ha#81

Merged

claco mentioned this issue Jun 13, 2014

Move cinder scheduler to where volume is located #960

Merged

claco mentioned this issue Jun 13, 2014

Move cinder scheduler to where volume is located #961

Merged

ihavegerms mentioned this issue Jan 13, 2015

Probes not picking up on neutron ovs agent issues on compute rcbops/openstack-probes#10

Open
