This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues related to long route exposure and route flapping #4695

Closed
ljelinkova opened this issue Jan 9, 2019 · 69 comments

Comments

@ljelinkova
Collaborator

ljelinkova commented Jan 9, 2019

Related HK issue - https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2520
When I create a new booster in OSIO and open it in Che, Che starts properly but no projects are loaded and no error message is displayed to the user (the wait timeout is 10 minutes).

[Screenshot: Che]

E2E tests encountered this several times, so far only on clusters 1a and 2.

I can see these errors in the browser log:

[08:39:30.879] SEVERE _app-0.js 17981 WebSocket connection to 'wss://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/wsagent?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw&clientId=992733577' failed: Error during WebSocket handshake: Unexpected response code: 503
[08:39:30.880] WARNING _app-0.js 17908:125 "WARNING (org.eclipse.che.ide.websocket.impl.BasicWebSocketEndpoint): Error occurred for endpoint wss://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/wsagent?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw&clientId=992733577"
[08:39:30.975] SEVERE _app-0.js 16432 WebSocket connection to 'wss://route2yt5yxcj-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/pty?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw' failed: Error during WebSocket handshake: Unexpected response code: 503
[08:39:30.976] SEVERE https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type - Failed to load resource: the server responded with a status of 503 (Service Unavailable)
[08:39:30.977] SEVERE https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type - Failed to load resource: the server responded with a status of 503 (Service Unavailable)
[08:39:30.977] SEVERE https://che.openshift.io/osio-ci-e2e-002/e2e-0109-0837-5795-eoclc - Access to XMLHttpRequest at 'https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type' from origin 'https://che.openshift.io' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[08:39:30.977] SEVERE _app-0.js 17908:125 "ERROR (org.eclipse.che.ide.projecttype.ProjectTypeRegistryImpl): Can't load project types: org.eclipse.che.ide.commons.exception.ServerDisconnectedException"
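
To tell whether a 503 like this comes from the OpenShift router (the route has no ready endpoints) or from the workspace agent itself, the agent route can be probed directly. A minimal bash sketch: the hostname is taken from the log above and the machine token is deliberately left out, so even a 401/403 would show that the route is forwarding traffic to the agent.

# Probe the workspace agent route; a 503 here points at the route/router,
# while a 401/403/200 means the route is reaching the agent.
ROUTE="route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com"
curl -sk -o /dev/null -w 'HTTP %{http_code}\n' "https://${ROUTE}/api/"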

@ljelinkova
Collaborator Author

It may be related to #4511

@ibuziuk
Copy link
Collaborator

ibuziuk commented Jan 10, 2019

@ljelinkova could you provide the identity_id of the e2e accounts against which this problem is reproducible?

@ibuziuk ibuziuk self-assigned this Jan 10, 2019
@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

Also, please provide links to the jobs where the problem was caught.

@ljelinkova
Collaborator Author

This is the last failed job: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/2143/console
User: osio-ci-e2e-002

I am not sure what you mean by identity_id - is it something you get from

curl -X GET "https://auth.openshift.io/api/users?filter[username]=osio-ci-e2e-002" |jq .
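
If identity_id is the id of the user record returned by that call, it can be pulled out with jq. This is only a guess at the payload shape (JSON:API style, with the id under data[0].id), so verify against the actual response:

# Hypothetical extraction of the identity id; the .data[0].id path is an assumption.
curl -s "https://auth.openshift.io/api/users?filter[username]=osio-ci-e2e-002" | jq -r '.data[0].id'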

@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

@jmelis @aditya-konarde @pbergene are there any known infrastructure issues on the starter clusters? Che currently fails to start for me on cluster 2 due to networking issues. Are we positive that https://redhat.service-now.com/help?id=rh_ticket&table=incident&sys_id=b173eceedb36a7c01e48cae3b9961912&view=ess has been fully resolved and does not affect other clusters? As I understand it, it is currently not possible to obtain any cluster-specific health metrics?

@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

@rhopp @Katka92 @ScrewTSW could you please confirm that this problem persists only on particular clusters and is not reproducible on others? cc: @ppitonak @ldimaggi

@Katka92

Katka92 commented Jan 10, 2019

@ibuziuk I have run the tests many times and the only failure they discovered is the bayesian one (known issue). I created a new workspace from OSIO from a space that I created some time ago with the vertx-http-booster from my repo and it works fine.
I'll investigate further tomorrow by testing manually directly from OSIO - I'll try resetting the environment and trying again.

@Katka92

Katka92 commented Jan 11, 2019

Today I tried resetting the environment and creating a new space + new workspaces, and everything works fine. I've run the tests with that account several times and haven't reproduced this issue. The user is provisioned on cluster 2. @rhopp @ScrewTSW could you give it a try please?

@rhopp
Collaborator

rhopp commented Jan 11, 2019

I'm not observing any trouble starting workspaces with my accounts on 2, 2a and 1a.

@ibuziuk
Collaborator

ibuziuk commented Jan 11, 2019

@Katka92 @rhopp thanks for the verification. Indeed, it looks like a temporary infra-related issue.
@ljelinkova I see that the job has stabilized, so please close the issue if the problem is no longer reproducible.

I have raised concerns about OSO starter cluster health monitoring in the mailing thread and will probably raise it again at the next IC. The fact that we do not control the clusters' state affects us dramatically in many cases.

@ljelinkova
Collaborator Author

I've found one failed job on cluster 2

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1392/console

User osio-ci-e2e-005

I agree that this is an infrastructure problem but I'd like to keep this issue open in case anybody else encounters it.

@ibuziuk
Collaborator

ibuziuk commented Jan 11, 2019

@ljelinkova please remove the P1 label in that case.

@ljelinkova
Collaborator Author

Done.

@pbergene
Collaborator

Note that a tenant update was running on cluster 2 in this timeframe; it might be good for @MatousJobanek to know.

@ibuziuk ibuziuk removed their assignment Jan 11, 2019
@ljelinkova
Collaborator Author

I am putting the P1 label back since the errors are quite frequent.

@ljelinkova ljelinkova added the priority/P1 Critical label Jan 14, 2019
@MatousJobanek

Note that a tenant update was running on cluster 2 in this timeframe

That's true; however, the update was triggered only for the jenkins namespaces, so it shouldn't influence Che in any way.

@ibuziuk ibuziuk changed the title Che fails to load project Che fails to start on some clusters (1a / 2) Jan 14, 2019
@jmelis
Contributor

jmelis commented Jan 14, 2019

Since Jan 12, 15 jobs have run on starter-2, out of which 12 have failed.

@garagatyi

Starting Che workspaces on prod-preview against the 2a cluster. The workspace starts, but the route is not available, so here is what is shown:
[Screenshot: error page shown because the workspace route is not available]

@ibuziuk
Collaborator

ibuziuk commented Feb 5, 2019

Just tested, and nothing has changed since my last check on the 2a cluster [1].
@pbergene any updates on that?

[1] #4695 (comment)

@ibuziuk ibuziuk changed the title Che fails to start on some clusters (1a / 2 / 2a) Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues Feb 5, 2019
@ibuziuk ibuziuk changed the title Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues related to long route exposure and route flapping Feb 5, 2019
@ScrewTSW
Collaborator

This error has surfaced again.
redhat-developer/rh-che#1248

@ljelinkova
Collaborator Author

@ibuziuk This issue has been open for over a month and it is still occurring. Have you considered (maybe temporarily) accepting the fact that the route takes a long time to create and modifying Che to wait longer?

@ibuziuk
Collaborator

ibuziuk commented Feb 21, 2019

@ljelinkova the e2e tests should not be affected by the routes as long as the timeout is 10 minutes. My opinion is that the infrastructure is not good enough for users but is good enough for the tests to pass, and I said as much at the IC to @ppitonak @rhopp. Basically, I do not understand why the e2e tests are failing on Che if I cannot reproduce this issue manually.

@ljelinkova
Collaborator Author

ljelinkova commented Feb 21, 2019

@ibuziuk The tests wait until the project is shown, but it never is. I guess it would be possible to reload the page after 5 minutes and it would probably work - but that is not something a user would do.

The tests always fail while detecting the project type, so I think it would be worth trying to fix that part of the code to wait for / retry the API call.

[06:45:19.135]   req_id:1000000459.1471   [RESPONSE]   503   Service Unavailable   https://route0r63fdqz-osio-ci-e2e-002-preview-che.b542.starter-us-east-2a.openshiftapps.com/api/project-type

As for why you were not able to reproduce it - the tests run 30x per day on production at different times, so they are more likely to catch an infra glitch than manual tests are. Besides, I believe @rhopp was able to reproduce it manually and @ppitonak saw the issue manually as well.
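
To illustrate the wait / retry idea from outside the IDE, here is a rough bash sketch that polls the project-type endpoint until it stops answering 503 or a 10-minute budget runs out. ROUTE and TOKEN are placeholders standing in for the workspace route and machine token visible in the logs above, not real values:

ROUTE="route0r63fdqz-osio-ci-e2e-002-preview-che.b542.starter-us-east-2a.openshiftapps.com"  # route from the 503 log line above
TOKEN="<machine token>"                                                                      # placeholder, not a real token
deadline=$((SECONDS + 600))   # same 10-minute timeout the tests use
status=503
while [ "$status" = "503" ] && [ "$SECONDS" -lt "$deadline" ]; do
  status=$(curl -sk -o /dev/null -w '%{http_code}' "https://${ROUTE}/api/project-type?token=${TOKEN}")
  [ "$status" = "503" ] && sleep 10
done
echo "last response from /api/project-type: HTTP ${status}"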

@ppitonak
Collaborator

@ibuziuk I'm able to reproduce it manually every time

  1. open https://prod-preview.openshift.io
  2. log in
  3. create a new space
  4. create a new codebase
  5. select mission REST API Level 0 and runtime Spring Boot
  6. select whatever pipeline you like (I recommend the shortest one)
  7. on the last screen of the Launcher, click "Return to your Dashboard"
  8. open Codebases page
  9. click Create workspace
  10. click Open
  11. wait for workspace to start

[Screenshot: che_workspace]

@ibuziuk
Collaborator

ibuziuk commented Feb 21, 2019

@ppitonak against which cluster(s) are you able to reproduce it?

@ppitonak
Collaborator

@ibuziuk see the steps in my last comment... I reproduced it 2 minutes ago on prod-preview, i.e. us-east-2a.

@Katka92

Katka92 commented Feb 21, 2019

Hi guys, there is another way it can be reproduced:

I was able to reproduce it on prod-preview 4/5 times just now.
I am able to reproduce it on 2a on almost every try. It happens on cluster 2 too - today my PR failed on it [1], but I was not able to reproduce it manually there. There is a screenshot showing that the project is not imported [2].

One more thing I would like to say is that it is not affecting only the e2e tests - our periodic tests on the 2a cluster are failing too. You can see the log here [3] and screenshots here [4].

[1] https://ci.centos.org/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/1169/console
[2] https://ci.centos.org/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/1169/console
[3] https://ci.centos.org/job/devtools-rh-che-periodic-prod-preview-2a/194/console
[4] http://artifacts.ci.centos.org/devtools/rhche/devtools-rh-che-periodic-prod-preview-2a/194/screenshots/

@ibuziuk
Collaborator

ibuziuk commented Feb 22, 2019

@Katka92 thanks for the details. So far I was not able to reproduce it manually against us-east-2, but I was able to repro it against 2a.
@ppitonak @ljelinkova until the infrastructure is fixed, could we add a workaround with a page refresh on the e2e side?

@ppitonak
Collaborator

@ibuziuk after a discussion with @ljelinkova we decided not to implement a workaround in the e2e tests. This is a production service and we cannot require our users to refresh the page.

There are two possible ways to deal with this issue:

  1. (preferred) you fix the issue in Che
  2. we completely delete the Che test from the e2e smoketest if you don't care about it

@rhopp could this affect CRW customers if their infra had similar problems?

@ibuziuk
Collaborator

ibuziuk commented Feb 25, 2019

@ppitonak this issue only occurs when the infrastructure is not in a proper state (e.g. a route issue). We have no capacity to fix this for Che 6 and add some retry logic, since Che 6 is going to be deprecated in the short run (the Che 7 beta is released this week). Taking into account that this problem is not reproducible on clusters which are in a proper state, adding a workaround with a page refresh in the e2e tests sounds like a reasonable trade-off until the infra is fixed [1].

[1] https://redhat.service-now.com/help?id=rh_ticket&table=incident&sys_id=03c4c19cdbe36780981c8a7239961936&view=ess

@ibuziuk
Collaborator

ibuziuk commented Feb 25, 2019

  1. we completely delete the Che test from the e2e smoketest if you don't care about it

@ppitonak and please do not say that I do not care, since I do. The situation with the infrastructure affects pretty much every single service, and we are trying to bring it up and resolve it with ops + find some SLA-based solution. Until this is resolved, I hope we can be more flexible with the decisions regarding support of the existing codebase. In the current situation I do believe the page refresh could be treated as a known issue.

@ibuziuk
Collaborator

ibuziuk commented Mar 4, 2019

@ppitonak @ljelinkova FYI the following workaround has been applied on the rh-che functional test side - redhat-developer/rh-che#1273
Please consider adding it to the e2e side until the infra is fixed.

@Katka92

Katka92 commented Apr 23, 2019

I can see this happening again. E.g. https://ci.centos.org/job/devtools-rh-che-periodic-prod-2/508/console on 20th April.

@Katka92

Katka92 commented Apr 29, 2019

@ibuziuk
Collaborator

ibuziuk commented Aug 1, 2019

Closing, since there has been no route flapping / long route exposure for quite a long period of time:

[Screenshot: monitoring graph showing no route flapping / long exposure]
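
For anyone hitting this later: one way to watch for route flapping / slow exposure from the cluster side is to look at the Admitted condition of the routes in the workspace namespace. A sketch under two assumptions - that you have oc access and that the workspace routes live in the user's *-che namespace:

# List every route in the (assumed) <username>-che namespace with its Admitted
# status; routes toggling between True and False, or staying False for a long
# time, match the behaviour described in this issue.
oc get routes -n osio-ci-e2e-002-che \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.ingress[*].conditions[?(@.type=="Admitted")].status}{"\n"}{end}'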

@ibuziuk ibuziuk closed this as completed Aug 1, 2019