Guacamole connections start to fail #3641

Closed
Danny-Cooke-CK opened this issue Jul 29, 2023 · 19 comments
Labels
bug Something isn't working

Comments

@Danny-Cooke-CK
Contributor

In an existing workspace we have Linux and Windows machines that could all be connected to without problems. Since yesterday morning, however, things have changed:
Connections to existing Windows machines: successful
Connections to existing Linux machines: successful
Connections to new Windows machines: successful
Connections to new Linux machines: FAILED

We found errors in the Guacamole logs:

Unable to refresh session: error refreshing tokens: unable to redeem refresh token: failed to get token: oauth2: cannot fetch token: 400 Bad Request
"error":"invalid_grant","error_description":"AADSTS65001: The user or administrator has not consented to use the application with ID '*********************************' named 'tre-ws-f4eb'. Send an interactive authorization request for this user and resource
2023-07-28T16:13:14.120090970Z 16:13:14.119 [http-nio-8080-exec-10] DEBUG c.a.identity.EnvironmentCredential - Azure Identity => ERROR in EnvironmentCredential: Missing required environment variable AZURE_CLIENT_ID
2023-07-28T16:13:14.126404402Z 16:13:14.123 [http-nio-8080-exec-10] DEBUG c.a.i.ManagedIdentityCredential - Azure Identity => Found the following environment variables: MSI_ENDPOINT, MSI_SECRET
2023-07-28T16:13:14.130749393Z 16:13:14.127 [http-nio-8080-exec-10] DEBUG c.a.i.SharedTokenCacheCredential - Azure Identity => Found the following environment variables: MSI_ENDPOINT, MSI_SECRET
2023-07-28T16:13:14.143955970Z 16:13:14.141 [http-nio-8080-exec-10] DEBUG i.o.j.i.a.v.s.c.a.c.t.o.OpenTelemetryTracer - Could not extract key 'trace-context' of type 'interface io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context' from context.
2023-07-28T16:13:14.143978270Z 16:13:14.141 [http-nio-8080-exec-10] DEBUG i.o.j.i.a.v.s.c.a.c.t.o.OpenTelemetryTracer - Could not extract key 'parent-span' of type 'interface io.opentelemetry.javaagent.shaded.io.opentelemetry.api.trace.Span' from context.
2023-07-28T16:13:14.143985371Z 16:13:14.141 [http-nio-8080-exec-10] DEBUG i.o.j.i.a.v.s.c.a.c.t.o.OpenTelemetryTracer - Could not extract key 'trace-context' of type 'interface io.opentelemetry.javaagent.shaded.io.opentelemetry.context.Context' from context.
2023-07-28T16:13:14.143990371Z 16:13:14.141 [http-nio-8080-exec-10] DEBUG i.o.j.i.a.v.s.c.a.c.t.o.OpenTelemetryTracer - Could not extract key 'parent-span' of type 'interface io.opentelemetry.javaagent.shaded.io.opentelemetry.api.trace.Span' from context.
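
For anyone triaging a similar excerpt, here is a minimal sketch (not part of Azure TRE or Guacamole) that pulls out the AADSTS error codes and sets aside the DEBUG-level lines, which in the excerpt above make up most of the output. The helper name `summarise` and the assumed log layout (`] LEVEL logger - message`) are illustrative assumptions based only on the lines shown here.

```python
import re

# Azure AD error codes look like "AADSTS65001".
AADSTS_RE = re.compile(r"AADSTS\d+")
# Assumed layout, taken from the excerpt: "[thread-name] LEVEL logger - message".
LEVEL_RE = re.compile(r"\]\s+(TRACE|DEBUG|INFO|WARN|ERROR)\s")


def summarise(lines):
    """Return (sorted AADSTS codes, lines that are not DEBUG/TRACE)."""
    codes = set()
    notable = []
    for line in lines:
        codes.update(AADSTS_RE.findall(line))
        level = LEVEL_RE.search(line)
        # Lines without a recognisable level are kept, since they may be
        # continuation lines of an error message.
        if level is None or level.group(1) not in ("DEBUG", "TRACE"):
            notable.append(line)
    return sorted(codes), notable
```

Run against the excerpt above, this would keep the two OAuth error lines and drop the `EnvironmentCredential` and `OpenTelemetryTracer` messages, which are logged at DEBUG level.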

It's strange that this would only fail with new Linux machines but work fine with new Windows machines and with all existing machines.
We've tried creating a new Guacamole server, but the problem continues there.

Pipelines have been rerun all the way through to reinstate anything missing, such as AZURE_CLIENT_ID, but with no success.

(Famous last words, but) nothing has changed in this environment, so why something has gone missing is a mystery.

@Danny-Cooke-CK Danny-Cooke-CK added the bug Something isn't working label Jul 29, 2023
@marrobi
Member

marrobi commented Jul 31, 2023

The error:

The user or administrator has not consented to use the application with ID '*********************************' named 'tre-ws-f4eb'. Send an interactive authorization request for this user and resource.

This suggests that the user has not accessed the TRE workspace portal, or that the application registration has been incorrectly provisioned.

Can you confirm that the user in question can access the workspace portal, and that no consent pop-ups appear or are being blocked?

If that seems OK, please post a screenshot of the API permissions for the tre-ws-f4eb application registration.

@Danny-Cooke-CK
Contributor Author

Danny-Cooke-CK commented Jul 31, 2023

Hi Marcus. It's all users, including me as the admin.
I have spun up Linux and Windows machines fine and can still connect to the ones I've already created.
I can connect to new Windows machines too, but not to new Linux machines.

We have also rerun the build pipeline to see if anything has changed, but it had no effect on the symptoms.

image

@marrobi
Member

marrobi commented Jul 31, 2023

That's the users in the Enterprise App. Can you get the API permissions from the app registration? Thanks.

@Danny-Cooke-CK
Contributor Author

Sorry, here you go:
image

@marrobi
Member

marrobi commented Jul 31, 2023

Can you confirm you see the error `The user or administrator has not consented to use the application with ID '*********************************' named 'tre-ws-f4eb'.` each time?

There is no reason it would work for some VMs and not others. Are you 100% sure that is the case? Can you provide the logs from before that error?

@Danny-Cooke-CK
Contributor Author

OK, I will connect to a working VM in the workspace and then to a failing one, then capture the logs. Bear with me.

@Danny-Cooke-CK
Contributor Author

Logs sent over privately

@marrobi
Member

marrobi commented Aug 1, 2023

Can you check the cloud init logs on the failing VM? The logs show:

RDP server closed/refused connection: Server refused connection (wrong security type?)

This makes me think RDP hasn't been configured correctly. I recommend using a prebaked image with everything configured, to remove the risk of these transient issues.

@Danny-Cooke-CK
Contributor Author

Danny-Cooke-CK commented Aug 1, 2023

These VMs were built using the out-of-the-box images.

It looks like there are failures around Nexus in the logs:
Could not connect to nexus-tre.uksouth.cloudapp.azure.com:443 (10.67.160.24). - connect (111: Connection refused)

I've checked, and Nexus is running on that IP too.

@Danny-Cooke-CK
Contributor Author

So from a working VM I can browse to the Nexus server and telnet to port 80.

image

From a failing server, I have logged in via Bastion and can connect with telnet on port 80 too.

image

@Danny-Cooke-CK
Contributor Author

From looking at the logs of both servers, it seems to go wrong here:

Cloud-init v. 22.2-0ubuntu1~18.04.2 running 'modules:config' at Fri, 28 Jul 2023 14:02:26 +0000. Up 570.95 seconds.
2023-07-28 14:02:26,223 - modules.py[WARNING]: Could not find module named cc_emit_upstart (searched ['cc_emit_upstart', 'cloudinit.config.cc_emit_upstart'])
curl: (7) Failed to connect to nexus-tre.uksouth.cloudapp.azure.com port 443: Connection refused
gpg: no valid OpenPGP data found.
Err:1 https://nexus-tre.uksouth.cloudapp.azure.com/repository/ubuntu bionic InRelease
Could not connect to nexus-tre.uksouth.cloudapp.azure.com:443 (10.67.160.24). - connect (111: Connection refused)

I wonder if this is an issue with the Nexus SSL certs?
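
Note that the telnet checks above were against port 80, while the failing requests in the cloud-init log target port 443. A minimal sketch for comparing the two ports from a VM, assuming Python 3 is available there; `probe_port` is a hypothetical helper, not something shipped with Azure TRE:

```python
import socket


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a plain TCP connection and describe the outcome.

    Returns "open", "refused", or "unreachable: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on this port.
        return "refused"
    except OSError as exc:
        # Covers timeouts, DNS failures, and routing problems.
        return f"unreachable: {exc}"
```

Calling `probe_port("nexus-tre.uksouth.cloudapp.azure.com", 80)` and then the same with `443` from the failing VM would show whether only the HTTPS port is refusing connections. A "refused" result is a TCP-level failure that happens before any certificate is exchanged, so it would point at the HTTPS listener on the Nexus side rather than at certificate validation on the client.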

@marrobi
Member

marrobi commented Aug 1, 2023

Looks like it could be an SSL cert issue. Strange that previous Linux VMs worked, though.

What do you get at https://nexus-tre.uksouth.cloudapp.azure.com ? Is the certificate valid?

@Danny-Cooke-CK
Contributor Author

No, the certificate is not valid. However, it's been like this since day one of this environment.

image

@marrobi
Member

marrobi commented Aug 1, 2023

I doubt it would ever have worked if it had always been like this.

Can you confirm that the certs service is installed, and check that the certificate name it uses matches the certificate name specified in the Nexus deployment? Again, it is worth checking the Nexus cloud-init logs.

Full details can be found here - https://microsoft.github.io/AzureTRE/v0.12.0/tre-templates/shared-services/nexus/

@Danny-Cooke-CK
Contributor Author

We currently have two environments, both showing the same cert message, but only this one has started failing. I am still deploying successfully into the other environment this morning.
I will look deeper into the Nexus service and get back to you.

@Danny-Cooke-CK
Contributor Author

image

@Danny-Cooke-CK
Contributor Author

Looks like the certificate expired.
image
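
For catching this earlier next time, a rough sketch of an expiry check using only the Python standard library. The function names and the idea of running it on a schedule are assumptions for illustration, not part of the TRE certs service:

```python
import socket
import ssl
import time
from typing import Optional


def days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def check_certificate(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Do a verified TLS handshake and report the certificate status."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # Raised for expired, self-signed, or mismatched certificates.
        return f"verification failed: {exc.verify_message}"
    except OSError as exc:
        # TCP-level problems such as "connection refused" or timeouts.
        return f"connection failed: {exc}"
    return f"valid, {days_left(not_after):.1f} days left"
```

`check_certificate("nexus-tre.uksouth.cloudapp.azure.com")` would report the remaining days while the certificate is still valid, and the verification error once it is not. A certificate that is not trusted by the machine running the check would also be reported as a verification failure.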

@marrobi
Member

marrobi commented Aug 1, 2023

@marrobi
Member

marrobi commented Aug 3, 2023

So the issue with the renew action failing has been fixed in #3647 - merged to main and will be in the next release.

And the nexus issues are being tracked here - #3642

I'm going to close this so we can keep conversation consolidated.

@marrobi marrobi closed this as completed Aug 3, 2023