
[Question]: Inconsistent slot swap duration using AzureAppServiceManage@0 for Web Apps on a dedicated Linux App Service plan #19273

Open
Ruud2000 opened this issue Nov 14, 2023 · 35 comments

Comments

@Ruud2000

Ruud2000 commented Nov 14, 2023

Task name

AzureAppServiceManage@0

Task version

0.228.1

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operating system

Windows 2019 datacenter-core-g2

Question

Recently we introduced a staging deployment slot for our Web Apps. So each Web App now has a staging and production slot. All Web Apps run on a dedicated Linux App Service plan (P1v3). Average CPU percentage is between 10 and 20, and average memory percentage around 80.

We now deploy our software to the staging slot and use the Azure DevOps task AzureAppServiceManage@0 to swap the staging and production slots. The duration of a swap is not consistent between deployments. Most of the time the duration is between 1m 30s and 2m 30s, but we also have occurrences where a swap takes more than 12 minutes. Especially when multiple Web Apps have a slow swap, the pipeline takes a very long time, risking hitting the 60-minute timeout.
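
For context, a minimal sketch of the swap step as we use it; the service connection, app and resource group names are placeholders and the inputs may need adjusting for your setup:

- task: AzureAppServiceManage@0
  displayName: Swap staging into production
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder service connection
    Action: 'Swap Slots'
    WebAppName: 'my-web-app'                     # placeholder Web App name
    ResourceGroupName: 'my-resource-group'       # placeholder resource group
    SourceSlot: 'staging'                        # swapped with the production slot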

The diagnostics show the swap starts by invoking:

[POST]https://management.azure.com/subscriptions/[redacted]/resourceGroups/Workload.WestEurope/providers/Microsoft.Web/sites/[redacted]/slots/staging/slotsswap?api-version=2016-08-01

Then we see the following call being invoked every 15 seconds, returning an HTTP 202 response:

[GET]https://management.azure.com/subscriptions/[redacted]/resourceGroups/Workload.WestEurope/providers/Microsoft.Web/sites/[redacted]/slots/staging/operationresults/08021b2d-33b1-4e10-bddb-8ac2b7ebd2cd?api-version=2016-08-01

Eventually, after slightly more than 12 minutes, this same call returns an HTTP 200 response and the swap is complete.

When we execute a swap in the Azure Portal we never seem to hit a slow swap. Looking at the developer tools in the browser while executing a swap in the portal shows that a more recent version of the swap API is used: the portal uses slotsswap?api-version=2018-11-01, while AzureAppServiceManage@0 uses api-version=2016-08-01. Could this perhaps explain the inconsistent durations?
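
For comparison, the swap can also be invoked directly against the management API with the newer api-version the portal uses, for example via az rest (just a sketch; the subscription, resource group and site names are placeholders):

az rest --method post \
  --uri "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<site-name>/slots/staging/slotsswap?api-version=2018-11-01" \
  --body '{"targetSlot": "production"}'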

I found this question from November 2021 which is almost identical to our situation: https://learn.microsoft.com/en-us/answers/questions/612601/optimizing-cd-pipeline-when-swapping-multiple-web
Unfortunately I have not yet been able to verify whether swap times are more consistent when using the AzureCLI@2 task, as suggested in the answer to that question, because our self-hosted build agent currently has no Azure CLI installed.

@ivanBereznev

Experiencing the same issue. It takes 4-5 minutes to swap slots for a single web app. I tried both AzureAppServiceManage@0 and AzureCLI@2, and although the latter seems to be slightly faster, all the results are still in the same ballpark.

@211211

211211 commented Mar 27, 2024

Still facing the same issue in March 2024 with AzureAppServiceManage@0.
Switched to AzureCLI@2 and it works fine.

My command:
az webapp deployment slot swap -g {{your_rs_group}} -n {{app_name}} --slot {{source_slot}} --target-slot {{target_slot}}
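
Wrapped in pipeline YAML that looks roughly like this (a sketch; the service connection name and variables are placeholders):

- task: AzureCLI@2
  displayName: Swap staging into production
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az webapp deployment slot swap \
        --resource-group "$(resourceGroup)" \
        --name "$(appName)" \
        --slot staging \
        --target-slot production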

@devdeer-alex

devdeer-alex commented Apr 19, 2024

I think this is related to Azure slots itself. The task just waits until the slot is swapped, and this sometimes takes a ridiculously long time. It's all over the usual discussions, like on Stack Overflow.

I currently randomly get the following output after 20+ minutes:

Starting: Swap Slot api-dd-alerting
==============================================================================
Task         : Azure App Service manage
Description  : Start, stop, restart, slot swap, slot delete, install site extensions or enable continuous monitoring for an Azure App Service
Version      : 0.238.1
Author       : Microsoft Corporation
Help         : https://docs.microsoft.com/azure/devops/pipelines/tasks/deploy/azure-app-service-manage
==============================================================================
Warming-up slots
Swapping App Service '***' slots - 'deploy' and 'production'
Successfully updated deployment History at https://***-deploy.scm.azurewebsites.net/api/deployments/35831713538731739
Successfully updated deployment History at https://***.scm.azurewebsites.net/api/deployments/35831713538731739
##[error]Error: Failed to swap App Service '***' slots - 'deploy' and 'production'. Error: ExpectationFailed - Cannot swap site slots for site '***' because the 'deploy' slot did not respond to http ping. (CODE: 417)
Finishing: Swap Slot api-dd-alerting

@P-DHrestak

Still facing the same issue in March 2024 with AzureAppServiceManage@0. Switched to AzureCLI@2 and it works fine.

My command: az webapp deployment slot swap -g {{your_rs_group}} -n {{app_name}} --slot {{source_slot}} --target-slot {{target_slot}}

Tried this solution, but using the AzureCLI@2 task takes just as long (20+ minutes) as the AzureAppServiceManage one. The activity log has no useful data in it.

@tobias-johansson-nltg

Still facing the same issue in March 2024 with AzureAppServiceManage@0. Switched to AzureCLI@2 and it works fine.
My command: az webapp deployment slot swap -g {{your_rs_group}} -n {{app_name}} --slot {{source_slot}} --target-slot {{target_slot}}

Tried this solution, but using the AzureCLI@2 task takes just as long (20+ minutes) as the AzureAppServiceManage one. The activity log has no useful data in it.

We also tried this, but with the same result as using AzureAppServiceManage@0. Is there no way of getting more information about what it is actually doing? Our deploy pipeline contains quite a few steps, including creating and deleting a database copy, and the two swaps are the steps that take by far the most time :)

@omer-glazer

Same here.
Our deployment swap takes up to 11 minutes, with no visible reason in the activity or output logs.

@chrisflem

Same here.
I have noticed that if I access the slot in a browser, the swap completes shortly after. Is there a bug in the code calling the slot, since it works when I do it manually?

@goodmanmd

We ran into this during a deploy last night. Both slots were accessible via browser and yet the task timed out after 23 minutes (!) with this error:

Error: Failed to swap App Service 'xxx' slots - 'staging' and 'production'. Error: ExpectationFailed - Cannot swap site slots for site 'xxx' because the 'staging' slot did not respond to http ping. (CODE: 417)

Is the task actually looking at HTTP rather than HTTPS? If so, that could explain what's going on. Our site redirects HTTP => HTTPS and therefore would not return a 2xx response code for any HTTP request, if that's what the script is looking for to determine success. Even if it's not using HTTP, in our app all requests to / redirect to an auth screen, so the same issue could still apply.

FWIW, we fell back to manually swapping slots for our apps via the portal, and those completed successfully within 30-60 seconds.
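
As a quick sanity check, you can see what status code the slot's root path returns over plain HTTP versus HTTPS, for example (hypothetical host name):

curl -s -o /dev/null -w "%{http_code}\n" http://my-web-app-staging.azurewebsites.net/
curl -s -o /dev/null -w "%{http_code}\n" https://my-web-app-staging.azurewebsites.net/

A 301/302 on the plain-HTTP call would be consistent with the redirect theory above.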

@DennisJensen95

We ran into this during a deploy last night. Both slots were accessible via browser and yet the task timed out after 23 minutes (!) with this error:

Error: Failed to swap App Service 'xxx' slots - 'staging' and 'production'. Error: ExpectationFailed - Cannot swap site slots for site 'xxx' because the 'staging' slot did not respond to http ping. (CODE: 417)

Is the task actually looking at HTTP rather than HTTPS? If so, that could explain what's going on. Our site redirects HTTP => HTTPS and therefore would not return a 2xx response code for any HTTP request, if that's what the script is looking for to determine success. Even if it's not using HTTP, in our app all requests to / redirect to an auth screen, so the same issue could still apply.

FWIW, we fell back to manually swapping slots for our apps via the portal, and those completed successfully within 30-60 seconds.

Besides the varying deployment times, which we also experience, we are also seeing the same stochastic timeout as you, @goodmanmd; if you rerun it, it succeeds. There are no indications of why this happens. We are using AzureCLI@2 for the swap operation. How are you doing the swap, @goodmanmd?

@goodmanmd

goodmanmd commented May 16, 2024

@DennisJensen95 for this particular app our deploys are infrequent - perhaps once or twice a year. In this case we fell back to swapping the slots via the Azure Portal, as it was only 3 applications with 2 swaps each (staging, production, last-known-good).

Edit: Re-reading the question, I think you may be asking what method we're using for the automated swap in our pipeline -- we are currently using AzureAppServiceManage@0.

@ash-skelton

It would be great if Microsoft acknowledged this. We are seeing the exact same thing (using AzureAppServiceManage@0). It's happening sporadically across a few of our apps, but it has definitely been getting worse.

@pumacln

pumacln commented Jun 3, 2024

@DennisJensen95
@goodmanmd

I am having the same issue.
We use Azure PowerShell via Octopus Deploy to Start / Stop / Swap slots.

# Start the Staging Slot
Start-AzWebAppSlot -ResourceGroupName "#{ResourceGroup}" -Name "#{Website}" -Slot "Staging"
# Swap the staging slot into production
Switch-AzWebAppSlot -ResourceGroupName "#{ResourceGroup}" -Name "#{Website}" -SourceSlotName "Staging" -DestinationSlotName "Production"
# Stop the Staging Slot
Stop-AzWebAppSlot -ResourceGroupName "#{ResourceGroup}" -Name "#{Website}" -Slot "Staging"

The behavior is the same: sometimes the swap operation will just time out. Re-running works 99% of the time.

Where is @microsoft or @Azure support?

@Saturate

Saturate commented Jun 4, 2024

We also see random swaps taking 20+ minutes for a Node.js application. Sometimes they time out; rerunning often works.

@rvvincelli

We're having this too, but the bad thing is that sometimes, when the swap fails, the staging slot is left corrupted: the env vars from the prod slot get poured into the staging slot. Such a swap should be a transaction. We contacted the Azure team on this one, but they were unable to fix or even acknowledge the issue.

@StephenWBertrand

We are seeing the swaps take a long time, but they also basically lock up the production slot: requests start taking forever or just get dropped. Normally CPU is under 10% all the time; during a deployment we jump to 50%, which is still plenty of headroom, but then something just sort of hangs for a bit. Sometimes the swap is successful but the site seems down for a few minutes, and sometimes the swap doesn't work and the old version pops back up after a few minutes of the site appearing down.

Kind of goes against the whole no-downtime idea of using slots :)

@goleafs

goleafs commented Aug 22, 2024

Very similar issues have started for us. Deployments/swaps used to succeed fine with no interruption, although sometimes slower than others.

We host 2 web apps on one App Service plan. Now when one tries to swap staging to production, not only does it fail, it ends up taking down the other site due to the shared App Service plan and throttled resources. At that point everything is dead until the instances can be restarted.

As stated, everything had been working flawlessly; this seems due to some internal MS change.

Would the Azure CLI help here? I don't see why. We could remove the swap from the pipeline and try it manually, but what kind of automation is that?

@michalkrzych

We had a similar issue with slot swapping and spoke to Microsoft about it. We were told to try this:

Please add the settings below to the app settings on the staging slot:

  1. WEBSITE_SWAP_WARMUP_PING_PATH=/ : This setting warms up the staging slot in order to complete the swap operation. I kindly suggest adding this to the staging slot first and checking whether it resolves the issue. If not, then along with this setting add the one below as well:
  2. WEBSITE_OVERRIDE_STICKY_DIAGNOSTICS_SETTINGS=0

Please refer to the articles below for more information about slot swapping.
A Subtle Gotcha with Azure Deployment Slots and ASP.NET Core | You’ve Been Haacked
Set up staging environments - Azure App Service | Microsoft Learn

In our case, the first option has magically fixed the issue with slot swapping - well, we've only been able to observe this for the last 2 or 3 days, but we haven't seen any errors yet.
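
For anyone who wants to apply the suggested settings from the command line rather than the portal, something like this should work (resource group and app name are placeholders; --slot-settings marks them as sticky to the staging slot):

az webapp config appsettings set \
  --resource-group <resource-group> \
  --name <app-name> \
  --slot staging \
  --slot-settings WEBSITE_SWAP_WARMUP_PING_PATH=/ WEBSITE_OVERRIDE_STICKY_DIAGNOSTICS_SETTINGS=0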

@FrancescoBonizzi

Same problem here.
@michalkrzych, just a question: you said that you edited this value, WEBSITE_SWAP_WARMUP_PING_PATH, but its default is already /. I don't understand how you changed it!

Thanks

@michalkrzych

Same problem here. @michalkrzych, just a question. You said that you edited this value: WEBSITE_SWAP_WARMUP_PING_PATH, but the default is /. I didn't understand how you changed it!

Thanks

Apologies for the confusion. I have only added this setting.

BTW, these settings haven't made an impact on our slot-swapping issue. It's still happening and we are still troubleshooting with Microsoft. The next thing to check is the app startup in the slot; apparently when using VNets, Key Vaults and managed identities, some of the settings aren't copied from the parent slot, so they have to be added manually to ensure the app can start up without errors in order for the slot swap to succeed.

@rvvincelli

rvvincelli commented Sep 19, 2024

Same problem here. @michalkrzych, just a question. You said that you edited this value: WEBSITE_SWAP_WARMUP_PING_PATH, but the default is /. I didn't understand how you changed it!
Thanks

Apologies for the confusion. I have only added this setting.

BTW. these settings haven't made an impact on our slot swapping issue. It's still happening and we are still troubleshooting with Microsoft. Next thing to check is the app start up in the slot, apparently when using vnets, kvs, managed identities, some of the settings aren't being copied from the parent slot so have to be added manually to ensure the app can start up without errors in order for the slot swapping to succeed.

Hi @michalkrzych! Honestly, we gave up on this after a lot of inconclusive debugging with the Azure/Mindtree teams and resorted to preemptively patching the staging slot. So basically, every time the slot swap fails, we run:

FIX-STAGING-SLOT:
    if: ${{ failure() && ((github.event_name == 'push' || github.event_name == 'workflow_dispatch') && github.ref == 'refs/heads/master-php8') }}
    needs: [SWAP-STAGING-TO-PRODUCTION]
    runs-on: ubuntu-latest
[...]

Notice the failure() condition together with needs, so that the GitHub Actions job only runs if the swap fails. And the command (repeated for each env var that is slot-specific):

az webapp config appsettings set --resource-group ${{ vars.RESOURCE_GROUP }} --name ${{ vars.WEBAPP_NAME }} --slot ${{ vars.SLOT_NAME }} --slot-settings APP_DEBUG="${{ vars.APP_DEBUG }}"
echo "Setting environment variable: APP_ENV"

The issue is: a slot swap is not a transaction. In our case it sometimes just hangs and fails because of internal issues we still have to address (even with those ping/health env vars etc.), but no matter what, a swap should be a transaction and it is not. The team kind of avoided acknowledging this, but it is evident: the swap fails and the staging slot is left with the prod env vars.

In particular, what happens is that the staging slot gets corrupted because its env vars get overwritten with the env var values from the prod slot, rendering the staging slot unusable (especially if you have Key Vault-backed env vars on segregated key vaults) and effectively breaking the whole blue-green swap. Depending on your scheme (e.g. canary with % traffic), it can break prod too.

@devdeer-alex

@Ruud2000 Just wondering which SKU your App Service plan is running on? We've just moved from the deprecated S1 to P0v3 (50 bucks more per month 😒). This has solved the timing issues so far.

@jcrichlake

Any update on this? Our team is having this issue as well. At the very least could the CLI be updated to not hang for 10+ minutes?

@jcrichlake

@Ruud2000 Just wondering which SKU your App Service Plan is running on? We've just moved from deprecated S1 to P0v3 (50 bucks more per month 😒). This solved timing issues so far.

We've been on a premium SKU but are still having this issue 😞

@Ruud2000
Author

Ruud2000 commented Oct 3, 2024

@Ruud2000 Just wondering which SKU your App Service Plan is running on? We've just moved from deprecated S1 to P0v3 (50 bucks more per month 😒). This solved timing issues so far.

We run on P1v3

@KryptoBeard

We are also running into this issue. 10+ minutes for a simple app service swap...

@ampandres

We are experiencing the same inconsistently slow slot swaps during the operation, plus high CPU and memory (80-90 percent). We are using P1v3.

@rvvincelli

rvvincelli commented Nov 11, 2024

Another thing that helped us here: make sure you do not perform any az config operations right before launching the swap. Some of them (e.g. updating the SCM whitelist) result in a soft restart of the instances; we noticed great improvements after we added some sleeps in between.
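
As a rough sketch of the ordering we ended up with (placeholder names; the two-minute sleep is what we settled on):

# apply configuration changes first
az webapp config appsettings set --resource-group <resource-group> --name <app-name> --settings SOME_SETTING=value
# give the instances time to finish the soft restart triggered by the config change
sleep 120
# only then launch the swap
az webapp deployment slot swap --resource-group <resource-group> --name <app-name> --slot staging --target-slot production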

@zdenek-jelinek

zdenek-jelinek commented Nov 12, 2024

I'm observing the exact same behavior as in the post above - my pipeline applies infrastructure-as-code changes that always contain differences (due to App Service changing the App Insights connection string, another story...) and this leads to the swap getting stuck approx. 50% of the time.

Restarting the App Service swap source slot manually in the Azure Portal helps. So I added a stop + start of the source slot to the pipeline and it gets stuck much less frequently.
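
For reference, that stop + start step looks roughly like this with the Azure CLI (a sketch; resource names are placeholders):

az webapp stop --resource-group <resource-group> --name <app-name> --slot staging
az webapp start --resource-group <resource-group> --name <app-name> --slot staging
az webapp deployment slot swap --resource-group <resource-group> --name <app-name> --slot staging --target-slot production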

I'm writing this because some people in this thread mentioned that changing configuration properties helped, while others state it did not. Changing the configuration causes a restart, which may result in the swap working again, so it may appear that a configuration change helped when it probably did not. I already have all of the properties mentioned in this thread set up.

Also, I think this is not an issue with the pipeline task but with Linux App Service itself. The same behavior happens if I use the Azure CLI directly.

I want to try some more things in the deployment pipeline, like waiting for a bit or polling the URL, and see what happens. I will report back if I find anything that helps.

I'm getting this on both P0v3 and P1v3.

@rvvincelli Could you share your sleep durations, please? Have you tried different values? Did you observe issues?

@rvvincelli

Hi @zdenek-jelinek !

We sleep for two minutes after each az config change. After interleaving all these sleeps the rollout duration got about 10% slower, but we almost never run into issues anymore.

I think two issues got intertwined in this thread:

  • one is the slot swap failing altogether because of unreadiness issues (possibly addressed by avoiding az config changes right before the swap, etc.)
  • the other is that if there is no warm-up path (or it doesn't return a 2xx), the incoming slot doesn't get swapped in for good

Finally, no matter the scenario, it shouldn't be the case that the slot is left inconsistent (i.e. env vars get mixed), but sometimes that happens too.

@v-gayatrij
Contributor

@Ruud2000, thanks for reporting this. For further investigation, could you please share complete debug logs by setting the variable system.debug = true?
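
For anyone providing logs, the debug flag can be set at the pipeline level, for example:

variables:
  system.debug: 'true'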

@Ruud2000
Author

Ruud2000 commented Dec 8, 2024

@Ruud2000 , Thanks for reporting this. For further investigation, could you please share complete debug logs by setting variable system.debug = true

Unfortunately I cannot, since I'm no longer working at the client where we faced this issue. But since more people are experiencing the same issue, hopefully someone will be able to provide debug logs.

@zdenek-jelinek

zdenek-jelinek commented Dec 30, 2024

I stopped being able to reproduce this issue around Nov 19th, coinciding with App Service maintenance in West Europe, where my instance is located.

I have removed the manual restart steps that helped mitigate the issue and still have not managed to reproduce it for several days.

@FrancescoBonizzi

FrancescoBonizzi commented Jan 22, 2025

@zdenek-jelinek Are you saying that it now swaps fast?

@zdenek-jelinek

zdenek-jelinek commented Jan 22, 2025

@FrancescoBonizzi I am consistently seeing swaps take 1:30 - 2 min for P0v3 and P1v3 Linux App Services right after deploying a new artifact into the source slot, whereas previously (i.e. before Nov 2024) they got stuck until the pipeline timed out more often than not.

@FrancescoBonizzi

Thanks @zdenek-jelinek. I was trying this before 2024 and had to roll back everything; now it seems time to try again.
