Device path resolution times out for aws and ali #310
Comments
Why is the cpi passing down the |
What's the actual problem here? Is there a long timeout that your log snippet doesn't reveal? Based on the timestamps:
it looks like the timeout is a millisecond, so perhaps there are more log lines that show the real length of the timeout? Also, we use the Bosh Agent in tests that deploy on many IaaSes (including AWS). Those tests aren't failing, so it doesn't seem like the thing described in this issue is causing deployment failures on AWS. Are you running into deployment failures? |
Hi @klakin-pivotal, hi @rkoster. With our PR we can improve stability by cutting the prefix from the glob search text. The ID itself stays untouched; we only remove the prefix from the ID used in the search. The glob will still find the correct disk. AWS, for example, removes the "-" from the ID in its disk symlinks. You can see in the logs above that they name the disk |
Regarding:
To my understanding, this has to happen in the CPI |
@rkoster @klakin-pivotal @nouseforaname Do you agree with us? Can we bring in the change? Would be very important for us to get this in soon... |
I still don't fully understand where this problem is coming from. Who changed what, and why is this suddenly an urgent issue to fix? And what makes SAP's environment different from the rest of the community, so that you are affected more than others (or is everybody affected, but people do not notice)? The proposed solution feels hacky and brittle, as it is based on a naming convention that is not documented. How will we know when it fails in the future? None of our tests capture the current issue, so merging the proposed band-aid solution feels really temporary to me. |
Hi Ruben, the issue should not be related to the SAP environment. To reproduce the issue on both AWS and ALI, the following steps can be performed:
|
Why would I want to create a disk out of band and attach it to bosh using bosh attach-disk? |
That was just to simplify the test case. It does not matter where the disk comes from (e.g. whether it was orphaned before, created from a snapshot, etc.). It was and is a big issue for us because, for some reason, Ali started to return the wrong device path in some cases. This led to a scenario where both ways of resolving failed (resulting in the mentioned error). Meanwhile they have adapted their CPI, and we will probably just create a new CPI release to (hopefully) fix disk mounting by device path. However, this issue will still be present afterwards: for AWS and Ali, resolution via /by-id/ will not work, and the agent error "Failed to get device real path by disk ID: 'vol-'. Error: 'Timed out getting real device path for ''', timeout: 'true'" will be logged every time a disk gets attached. Therefore we still believe the change is necessary, but (hopefully) not as urgent anymore ;). |
I will try to summarise what I know. Context: Impact: Why we see this: Fixes: This issue is about resolution by ID done by the agent, which fails currently on Ali and AWS as described. The Ali CPI fix will unblock the situation but we still could make the resolution by ID work on AWS and Ali. If the suggested fix is not optimal what could be a better way to address this? |
By the way, we observe the same type of failure with the OpenStack CPI: stemcell 1.95 was OK, stemcell 1.108 is not. |
Do you have any technical details on why the resolution by-id and by-path doesn't work? Looking at the bosh agent changes from stemcell 1.95 to 1.108, only this commit looks suspicious. |
After discussing this during the FI WG the option we see to fix this is described here. |
It seems odd to me for the Agent to fix up device IDs coming from the CPI (which gets its data from the IaaS, which I think should be the source of truth), but maybe that's because I don't have enough information about the expectations the Agent has of disk IDs coming from the CPI. |
The current understanding is that these prefixes are not something the CPI passes in, so the prefixes get added somewhere in the interplay between the IaaS and the OS. Given that the prefixes are IaaS specific, the best place to configure them is the stemcell builder, which controls the static per-IaaS agent config. This PR adds the ability to configure a prefix to strip from the IDs. |
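To make the idea of a configurable, IaaS-specific prefix more concrete, here is a minimal sketch in Go. It is not the code from the PR: the function name `stripVolumePrefix`, its parameters, and the example regex `^vol-` are all assumptions made for illustration, as is the sample disk ID.

```go
package main

import (
	"fmt"
	"regexp"
)

// stripVolumePrefix removes the first match of stripRegex from diskID.
// An empty stripRegex means no stripping is configured.
// The function and parameter names are hypothetical, not the agent's API.
func stripVolumePrefix(diskID, stripRegex string) (string, error) {
	if stripRegex == "" {
		return diskID, nil
	}
	re, err := regexp.Compile(stripRegex)
	if err != nil {
		return "", fmt.Errorf("compiling strip regex %q: %w", stripRegex, err)
	}
	loc := re.FindStringIndex(diskID)
	if loc == nil {
		// The ID does not carry the configured prefix; leave it alone.
		return diskID, nil
	}
	return diskID[:loc[0]] + diskID[loc[1]:], nil
}

func main() {
	// "^vol-" is an assumed per-IaaS value; the disk ID is made up.
	id, err := stripVolumePrefix("vol-0123456789abcdef0", "^vol-")
	if err != nil {
		panic(err)
	}
	fmt.Println("/dev/disk/by-id/*" + id)
}
```

Anchoring the regex with `^` keeps IDs that merely contain the prefix text somewhere in the middle from being altered, which is one way to limit the brittleness raised earlier in the thread.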
Actually I'm having almost the same situation in Openstack. Error: Action Failed get_task: Task ce868444-20c7-4913-6089-03e916de82f5 result: Adjusting persistent disk partitioning: Getting real device path: Resolving mapped device path: Timed out getting real device path for /dev/sdc
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a' like it should be detected:
DEBUG - Adjusting size for persistent disk {ID:80f10b24-4105-43a0-a34f-9b69555afc20 DeviceID: VolumeID:/dev/sdc Lun: HostDeviceID: Path:/dev/sdc ISCSISettings:{InitiatorName: Username: Target: Password:} FileSystemType: MountOptions:[] Partitioner:}
My only conclusion is that the resolution of the disk changed, as was suggested here: #310 (comment). According to this change: https://github.com/cloudfoundry/bosh-agent/pull/308/files, diskID := diskSettings.ID[0:20] yields the first 20 characters of the ID, which is what should be looked up in the /dev/disk/by-id directory: *80f10b24-4105-43a0-a. This was replaced by: diskID := diskSettings, which uses the full ID 80f10b24-4105-43a0-a34f-9b69555afc20, and that does not exist in the directory. Question to @rkoster or anybody : Thanks in advance. |
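The difference between the two lookups described in this comment can be sketched as follows. This is an illustration only: `globFor` and `matchesSymlink` are hypothetical helpers, and the symlink name `virtio-80f10b24-4105-43a0-a` is an assumed example of an entry in /dev/disk/by-id that carries only the first 20 characters of the ID; the thread does not show the actual directory listing.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// globFor builds a by-id glob pattern from at most maxLen leading
// characters of the disk ID. maxLen <= 0 means "use the full ID".
func globFor(diskID string, maxLen int) string {
	if maxLen > 0 && len(diskID) > maxLen {
		diskID = diskID[:maxLen]
	}
	return "*" + diskID
}

// matchesSymlink reports whether the pattern for diskID matches the
// given symlink name (a bare file name, without the directory).
func matchesSymlink(diskID string, maxLen int, symlink string) bool {
	ok, err := filepath.Match(globFor(diskID, maxLen), symlink)
	return err == nil && ok
}

func main() {
	const diskID = "80f10b24-4105-43a0-a34f-9b69555afc20"
	// Assumed symlink name, for illustration only.
	const symlink = "virtio-80f10b24-4105-43a0-a"

	for _, maxLen := range []int{20, 0} {
		fmt.Printf("pattern %q matches %q: %v\n",
			globFor(diskID, maxLen), symlink, matchesSymlink(diskID, maxLen, symlink))
	}
}
```

Under that assumption, the truncated pattern finds the symlink while the full-ID pattern does not, which would explain why the lookup keeps retrying until it times out.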
The above mentioned PR has been merged. Now we need a set of IaaS specific PRs to bosh-linux-stemcell-builder to define a regex to strip the volume prefix. |
Hi @rkoster, |
There is also a discussion in the bosh-jammy Slack channel about this. I also think that the prefix stripping won't help here. Is there any OpenStack documentation or specification that describes this behavior? |
@turtschan and @andinod, we're not seeing this behavior in our OpenStack tests that we run, but OpenStack is not exactly uniform in its behavior. What versions of OpenStack are you seeing this behavior with? |
Hi @jpalermo, I'm dealing with several versions of OpenStack, and all of them give the same results. The most recent version tested is OpenStack Train. But let me clarify something: at the beginning it was not possible for me to upgrade the stemcell version, because the bosh agent failed to detect the persistent disk for the reasons I explained before. So I started to investigate whether I could work around this or whether there was an issue in my config, because I got feedback from others that they didn't see the same results as me, so I guessed this could be a different issue, in the bosh configuration or the OpenStack config. After studying the code and comparing it with the output logs of the failed deployment's VM, I realized that it follows more or less this sequence for persistent disks: In this last step it also failed. The disk sent by the bosh director according to its detection was /dev/sdc, and the agent tried the possible alternatives, xvdc, vdc, etc. Of course, in this case the disk is not present under any of those names but as /dev/vdb, and the device the director should have sent would be /dev/sdb or /dev/vdb. Because of this, I then checked the logic that passes the detected disk from the CPI to the Director. The disk reported by OpenStack to the CPI was /dev/vdb, as expected, and the Director didn't modify anything, so I guessed that something happened in the CPI, and I found it. After tracing the error, I found that we had an option configured in our director's CPI that was not present in the others. There is a line of code that modifies the disk detection when that option is present; it is defined in this function: first_device_name_letter(server). We had the option config_drive set to 'disk' in the CPI config; for some reason it was present, and that made the difference.
This was an error from the beginning, present in the CPI config, but the deployments never got that far, because the bosh agent was previously smart enough to detect the persistent disk properly. After we unset the option, everything started to work: the CPI sent the right value to the Director, the Director passed it to the agent, and the disk was finally detected. All of this is to let you know that for many people deployments still work with the new stemcells, probably because they don't have that option set in their CPI. That doesn't mean the agent's disk detection problem is gone: I can still see it in the logs, and it definitely belongs to the agent. In fact, since that change we were unable to move to a new stemcell version. Why? Because we were probably using a version in which the agent's disk detection worked, whereas with the newer versions the disk was not detected and the deployment failed for us. I hope this helps others as well, |
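The fallback step described in this comment, where the agent receives /dev/sdc and tries alternative names such as vdc and xvdc, can be sketched roughly like this. The helper `candidatePaths` and the exact list and order of prefixes are assumptions for illustration, not the agent's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// candidatePaths lists the device paths that would be probed for a
// path such as /dev/sdc. The prefix list is an assumption based on
// the names mentioned in the comment above (sd, vd, xvd).
func candidatePaths(devicePath string) []string {
	const sdPrefix = "/dev/sd"
	if !strings.HasPrefix(devicePath, sdPrefix) {
		// Not an sd-style path: nothing to rewrite.
		return []string{devicePath}
	}
	suffix := strings.TrimPrefix(devicePath, sdPrefix)
	return []string{
		sdPrefix + suffix,
		"/dev/vd" + suffix,
		"/dev/xvd" + suffix,
	}
}

func main() {
	// The Director sent /dev/sdc, but the disk actually appeared as /dev/vdb.
	fmt.Println(candidatePaths("/dev/sdc"))
}
```

Only the prefix is rewritten, never the trailing letter, so a path of /dev/sdc can never resolve to /dev/vdb. That is consistent with the report that the lookup only succeeded once the CPI stopped shifting the device letter.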
Thanks for all the details, that was a super helpful writeup. I confirmed what you found, that our test environment is falling back to the It feels like we want to fix this, but I'm not sure how yet. I don't think we can use the new |
Given the |
We looked at fixing this in the Agent by using or modifying the This means to fix it this way we'd need a different Openstack stemcell for Openstack environments that truncate the disk id vs ones that do not (which we suspect newer Openstack environments don't). We also looked at doing this work in the CPI via a configuration that would allow the Operator to configure if the disk id gets trimmed or not. The problem with this is that the Id is the core disk Id, not part of the disk hints. So if the CPI trims it, Bosh is then going to use it later for the Since the environments aren't currently broken, we don't think we need a complicated fix for Jammy stemcells. We created an Issue in the Wishlist for the 2024 stemcell where we revisit the logic happening in the Agent's |
We encountered a similar problem to this comment with OpenStack's Queens. Tracing the error also matched. I am also unable to upgrade Stemcell because of the |
We found that the resolution of the device path by ID times out on ali and aws. This is due to a mismatch between the disk ID and the symlink in /dev/disk/by-id.
Please see below the agent logs for aws:
As can be seen, the disk ID contains the "-" character, whereas the symlink does not (see the last log entry).
The issue for ali is slightly different.
First, the agent logs:
and here are the contents of the /dev/disk/by-id folder:
We propose the following change: #309
This PR removes everything from the DiskID before and including the "-" character, if it exists.
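A minimal sketch of the stripping rule described above, assuming "before and including the '-' character" refers to the first "-" in the ID. The helper name `stripThroughDash` and the sample IDs are made up for illustration; this is not the code from #309.

```go
package main

import (
	"fmt"
	"strings"
)

// stripThroughDash drops everything up to and including the first "-"
// in diskID. IDs without a "-" are returned unchanged.
// Hypothetical helper; cutting at the FIRST dash is an assumption.
func stripThroughDash(diskID string) string {
	if _, rest, found := strings.Cut(diskID, "-"); found {
		return rest
	}
	return diskID
}

func main() {
	// Sample IDs are invented; only the "vol-" prefix appears in the thread.
	for _, id := range []string{
		"vol-0123456789abcdef0",
		"nodash",
		"80f10b24-4105-43a0-a34f-9b69555afc20",
	} {
		fmt.Printf("%s -> /dev/disk/by-id/*%s\n", id, stripThroughDash(id))
	}
}
```

The last sample shows a side effect worth keeping in mind: a UUID-style ID would also lose its first segment under this rule, so the approach depends on each IaaS using a single-dash prefix convention.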