This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues related to long route exposure and route flapping #4695

Closed
ljelinkova opened this issue Jan 9, 2019 · 69 comments

Comments

@ljelinkova
Collaborator

ljelinkova commented Jan 9, 2019

Related HK issue - https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2520
When I create a new booster in OSIO and open it in Che, Che starts properly but no projects are loaded and no error message is displayed to the user (the wait timeout is 10 minutes).

[Screenshot: Che]

E2E tests encountered this several times, so far only on clusters 1a and 2.

I can see these errors in the browser log:

[08:39:30.879] SEVERE _app-0.js 17981 WebSocket connection to 'wss://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/wsagent?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw&clientId=992733577' failed: Error during WebSocket handshake: Unexpected response code: 503
[08:39:30.880] WARNING _app-0.js 17908:125 "WARNING (org.eclipse.che.ide.websocket.impl.BasicWebSocketEndpoint): Error occurred for endpoint wss://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/wsagent?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw&clientId=992733577"
[08:39:30.975] SEVERE _app-0.js 16432 WebSocket connection to 'wss://route2yt5yxcj-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/pty?token=eyJhbGciOiJSUzI1NiIsImtpbmQiOiJtYWNoaW5lX3Rva2VuIiwia2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyJ9.eyJ3c2lkIjoid29ya3NwYWNlZTNza2c1ZXVyNm1sMHgxMyIsInVpZCI6ImI1MzczNGFlLTNiNTMtNDU5Yi1iNjQwLWU1MzQwYjMzZjIwOSIsImF1ZCI6IndvcmtzcGFjZWUzc2tnNWV1cjZtbDB4MTMiLCJuYmYiOi0xLCJ1bmFtZSI6Im9zaW8tY2ktZTJlLTAwMiIsImlzcyI6IndzbWFzdGVyIiwiZXhwIjoxNTc4NTU5MTQxLCJpYXQiOjE1NDcwMjMxNDEsImp0aSI6ImJmNWEyNmI2LTA4ZWMtNDEyYS04MjE0LTNlNTgxZDc1N2Q2MyJ9.kYjpti7wP-2-htZBryuQQqETyYRFhN77pLHi4I2mSd6n04dhsengKNXPS3xqntLcYwfQQDHSWwtvzWDPb5zWglC4S6ppmOYIqCcVTuwvnEN1qJldPmtMgvfTFc5neg8f765zzrkCGvgsi5wZGgKbueCMelZAZqrMHMVkwAunKXq2kLjeEd2QVCiznXtB7tOHxD2yHjQzoUNGBflXt1OO9Ve0vNsXElMHkshkGxDaQDV5ouJRCQPAKjRHALRMCATDDLSpomHdwTybV8gvibTGBA8xQB1dp8qb92PL_izfiKreOpP9EDXMN_vkq7976jCk2zY9hoi26La7V0a8UmqNsw' failed: Error during WebSocket handshake: Unexpected response code: 503
[08:39:30.976] SEVERE https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type - Failed to load resource: the server responded with a status of 503 (Service Unavailable)
[08:39:30.977] SEVERE https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type - Failed to load resource: the server responded with a status of 503 (Service Unavailable)
[08:39:30.977] SEVERE https://che.openshift.io/osio-ci-e2e-002/e2e-0109-0837-5795-eoclc - Access to XMLHttpRequest at 'https://route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com/api/project-type' from origin 'https://che.openshift.io' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
[08:39:30.977] SEVERE _app-0.js 17908:125 "ERROR (org.eclipse.che.ide.projecttype.ProjectTypeRegistryImpl): Can't load project types: org.eclipse.che.ide.commons.exception.ServerDisconnectedException"
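
To tell whether a 503 like this comes from the OpenShift router (the route has no ready endpoints) or from the workspace agent itself, the agent route can be probed directly. A minimal bash sketch: the hostname is taken from the log above and the machine token is deliberately left out, so even a 401/403 would show that the route is forwarding traffic to the agent.

# Probe the workspace agent route; a 503 here points at the route/router,
# while a 401/403/200 means the route is reaching the agent.
ROUTE="route458cf5up-osio-ci-e2e-002-che.9a6d.starter-us-east-1a.openshiftapps.com"
curl -sk -o /dev/null -w 'HTTP %{http_code}\n' "https://${ROUTE}/api/"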

@ljelinkova
Collaborator Author

It may be related to #4511

@ibuziuk
Copy link
Collaborator

ibuziuk commented Jan 10, 2019

@ljelinkova could you provide the identity_id of the e2e accounts against which this problem is reproducible?

@ibuziuk ibuziuk self-assigned this Jan 10, 2019
@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

Also, please provide links to the jobs where the problem was caught.

@ljelinkova
Collaborator Author

This is the last failed job: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/2143/console
User: osio-ci-e2e-002

I am not sure what you mean by identity_id - is it something you get from

curl -X GET "https://auth.openshift.io/api/users?filter[username]=osio-ci-e2e-002" |jq .
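
If identity_id is the id of the user record returned by that call, it can be pulled out with jq. This is only a guess at the payload shape (JSON:API style, with the id under data[0].id), so verify against the actual response:

# Hypothetical extraction of the identity id; the .data[0].id path is an assumption.
curl -s "https://auth.openshift.io/api/users?filter[username]=osio-ci-e2e-002" | jq -r '.data[0].id'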

@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

@jmelis @aditya-konarde @pbergene are there any known infrastructure issues on the starter clusters? Che currently fails to start for me on cluster 2 due to networking issues. Are we positive that https://redhat.service-now.com/help?id=rh_ticket&table=incident&sys_id=b173eceedb36a7c01e48cae3b9961912&view=ess has been fully resolved and does not affect other clusters? As I understand it, it is currently not possible to obtain any cluster-specific health metrics?

@ibuziuk
Collaborator

ibuziuk commented Jan 10, 2019

@rhopp @Katka92 @ScrewTSW could you please confirm that this problem persists only on particular clusters and is not reproducible on others? cc: @ppitonak @ldimaggi

@Katka92

Katka92 commented Jan 10, 2019

@ibuziuk I have run the tests many times and the only failure they discovered is the bayesian one (known issue). I created a new workspace from OSIO from a space that I created some time ago with the vertx-http-booster from my repo and it works fine.
I'll investigate further tomorrow by testing manually directly from OSIO - I'll try resetting the environment and trying again.

@Katka92

Katka92 commented Jan 11, 2019

Today I tried resetting the environment and creating a new space + new workspaces, and everything works fine. I've run the tests with that account several times and haven't reproduced this issue. The user is provisioned on cluster 2. @rhopp @ScrewTSW could you give it a try please?

@rhopp
Collaborator

rhopp commented Jan 11, 2019

I'm not observing any trouble starting workspaces with my accounts on 2, 2a and 1a.

@ibuziuk
Collaborator

ibuziuk commented Jan 11, 2019

@Katka92 @rhopp thanks for the verification. Indeed, it looks like a temporary infra-related issue.
@ljelinkova I see that the job has stabilized, so please close the issue if the problem is no longer reproducible.

I have raised concerns about OSO starter cluster health monitoring in the mailing thread and will probably raise it again at the next IC. The fact that we do not control the clusters' state affects us dramatically in many cases.

@ljelinkova
Collaborator Author

I've found one failed job on cluster 2

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1392/console

User osio-ci-e2e-005

I agree that this is an infrastructure problem but I'd like to keep this issue open in case anybody else encounters it.

@ibuziuk
Collaborator

ibuziuk commented Jan 11, 2019

@ljelinkova please remove the P1 label in that case.

@ljelinkova
Collaborator Author

Done.

@pbergene
Collaborator

Note that a tenant update was running on cluster 2 in this timeframe; it might be good for @MatousJobanek to know.

@ibuziuk ibuziuk removed their assignment Jan 11, 2019
@ljelinkova
Collaborator Author

I am putting the P1 label back since the errors are quite frequent.

@ljelinkova ljelinkova added the priority/P1 Critical label Jan 14, 2019
@MatousJobanek

Note that a tenant update was running on cluster 2 in this timeframe

That's true; however, the update was triggered only for the jenkins namespaces, so it shouldn't influence Che in any way.

@ibuziuk ibuziuk changed the title Che fails to load project Che fails to start on some clusters (1a / 2) Jan 14, 2019
@jmelis
Contributor

jmelis commented Jan 14, 2019

Since Jan 12, 15 jobs have run on starter-2, out of which 12 have failed.

@garagatyi

Starting Che workspaces on prod-preview against the 2a cluster. The workspace starts, but the route is not available, so here is what is shown:
[Screenshot: error page shown because the workspace route is not available]

@ibuziuk
Collaborator

ibuziuk commented Feb 5, 2019

Just tested, and nothing has changed since my last check on the 2a cluster [1].
@pbergene any updates on that?

[1] #4695 (comment)

@ibuziuk ibuziuk changed the title Che fails to start on some clusters (1a / 2 / 2a) Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues Feb 5, 2019
@ibuziuk ibuziuk changed the title Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues related to long route exposure and route flapping Feb 5, 2019
@ScrewTSW
Collaborator

This error has surfaced again.
redhat-developer/rh-che#1248

@ljelinkova
Collaborator Author

@ibuziuk This issue has been open for over a month and it is still occurring. Have you considered (maybe temporarily) accepting the fact that the route takes a long time to create and modifying Che to wait longer?

@ibuziuk
Collaborator

ibuziuk commented Feb 21, 2019

@ljelinkova the e2e tests should not be affected by the routes as long as the timeout is 10 minutes. My opinion is that the infrastructure is not good enough for users but is good enough for the tests to pass, and I said as much at the IC to @ppitonak @rhopp. Basically, I do not understand why the e2e tests are failing on Che if I cannot reproduce this issue manually.

@ljelinkova
Collaborator Author

ljelinkova commented Feb 21, 2019

@ibuziuk The tests wait until the project is shown, but it never is. I guess it would be possible to reload the page after 5 minutes and it would probably work - but that is not something a user would do.

The tests always fail while detecting the project type, so I think it would be worth trying to fix that part of the code to wait for / retry the API call.

[06:45:19.135]   req_id:1000000459.1471   [RESPONSE]   503   Service Unavailable   https://route0r63fdqz-osio-ci-e2e-002-preview-che.b542.starter-us-east-2a.openshiftapps.com/api/project-type

As for why you were not able to reproduce it - the tests run 30x per day on production at different times, so they are more likely to catch an infra glitch than manual tests are. Besides, I believe @rhopp was able to reproduce it manually and @ppitonak saw the issue manually as well.
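
To illustrate the wait / retry idea from outside the IDE, here is a rough bash sketch that polls the project-type endpoint until it stops answering 503 or a 10-minute budget runs out. ROUTE and TOKEN are placeholders standing in for the workspace route and machine token visible in the logs above, not real values:

ROUTE="route0r63fdqz-osio-ci-e2e-002-preview-che.b542.starter-us-east-2a.openshiftapps.com"  # route from the 503 log line above
TOKEN="<machine token>"                                                                      # placeholder, not a real token
deadline=$((SECONDS + 600))   # same 10-minute timeout the tests use
status=503
while [ "$status" = "503" ] && [ "$SECONDS" -lt "$deadline" ]; do
  status=$(curl -sk -o /dev/null -w '%{http_code}' "https://${ROUTE}/api/project-type?token=${TOKEN}")
  [ "$status" = "503" ] && sleep 10
done
echo "last response from /api/project-type: HTTP ${status}"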

@ppitonak
Collaborator

@ibuziuk I'm able to reproduce it manually every time

  1. open https://prod-preview.openshift.io
  2. log in
  3. create a new space
  4. create a new codebase
  5. select mission REST API Level 0 and runtime Spring Boot
  6. select whatever pipeline you like (I recommend the shortest one)
  7. on the last screen of the Launcher, click "Return to your Dashboard"
  8. open Codebases page
  9. click Create workspace
  10. click Open
  11. wait for workspace to start

[Screenshot: che_workspace]

@ibuziuk
Collaborator

ibuziuk commented Feb 21, 2019

@ppitonak against which cluster(s) are you able to reproduce it?

@ppitonak
Collaborator

@ibuziuk see the steps in my last comment... I reproduced it 2 minutes ago on prod-preview, i.e. us-east-2a.

@Katka92

Katka92 commented Feb 21, 2019

Hi guys, there is another way it can be reproduced:

I was able to reproduce it on prod-preview 4/5 times just now.
I am able to reproduce it on 2a on almost every try. It happens on cluster 2 too - today my PR failed on it [1], but I was not able to reproduce it manually there. There is a screenshot showing that the project is not imported [2].

One more thing I would like to say is that it is not affecting only the e2e tests - our periodic tests on the 2a cluster are failing too. You can see the log here [3] and screenshots here [4].

[1] https://ci.centos.org/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/1169/console
[2] https://ci.centos.org/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/1169/console
[3] https://ci.centos.org/job/devtools-rh-che-periodic-prod-preview-2a/194/console
[4] http://artifacts.ci.centos.org/devtools/rhche/devtools-rh-che-periodic-prod-preview-2a/194/screenshots/

@ibuziuk
Collaborator

ibuziuk commented Feb 22, 2019

@Katka92 thanks for the details. So far I was not able to reproduce it manually against us-east-2, but I was able to repro it against 2a.
@ppitonak @ljelinkova until the infrastructure is fixed, could we add a workaround with a page refresh on the e2e side?

@ppitonak
Collaborator

@ibuziuk after a discussion with @ljelinkova we decided not to implement a workaround in the e2e tests. This is a production service and we cannot require our users to refresh the page.

There are two possible ways to deal with this issue:

  1. (preferred) you fix the issue in Che
  2. we completely delete the Che test from the e2e smoketest if you don't care about it

@rhopp could this affect CRW customers if their infra had similar problems?

@ibuziuk
Collaborator

ibuziuk commented Feb 25, 2019

@ppitonak this issue only occurs when the infrastructure is not in a proper state (e.g. a route issue). We have no capacity to fix this for Che 6 and add some retry logic, since Che 6 is going to be deprecated in the short run (the Che 7 beta is released this week). Taking into account that this problem is not reproducible on clusters which are in a proper state, adding a workaround with a page refresh in the e2e tests sounds like a reasonable trade-off until the infra is fixed [1].

[1] https://redhat.service-now.com/help?id=rh_ticket&table=incident&sys_id=03c4c19cdbe36780981c8a7239961936&view=ess

@ibuziuk
Collaborator

ibuziuk commented Feb 25, 2019

  1. we completely delete the Che test from the e2e smoketest if you don't care about it

@ppitonak and please do not say that I do not care, since I do. The situation with the infrastructure affects pretty much every single service, and we are trying to bring it up and resolve it with ops + find some SLA-based solution. Until this is resolved, I hope we can be more flexible with the decisions regarding support of the existing codebase. In the current situation I do believe the page refresh could be treated as a known issue.

@ibuziuk
Collaborator

ibuziuk commented Mar 4, 2019

@ppitonak @ljelinkova FYI the following workaround has been applied on the rh-che functional test side - redhat-developer/rh-che#1273
Please consider adding it to the e2e side until the infra is fixed.

@Katka92

Katka92 commented Apr 23, 2019

I can see this happening again. E.g. https://ci.centos.org/job/devtools-rh-che-periodic-prod-2/508/console on 20th April.

@Katka92

Katka92 commented Apr 29, 2019

@ibuziuk
Collaborator

ibuziuk commented Aug 1, 2019

Closing, since there has been no route flapping / long route exposure for quite a long period of time:

[Screenshot: monitoring graph showing no route flapping / long exposure]
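
For anyone hitting this later: one way to watch for route flapping / slow exposure from the cluster side is to look at the Admitted condition of the routes in the workspace namespace. A sketch under two assumptions - that you have oc access and that the workspace routes live in the user's *-che namespace:

# List every route in the (assumed) <username>-che namespace with its Admitted
# status; routes toggling between True and False, or staying False for a long
# time, match the behaviour described in this issue.
oc get routes -n osio-ci-e2e-002-che \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.ingress[*].conditions[?(@.type=="Admitted")].status}{"\n"}{end}'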

@ibuziuk ibuziuk closed this as completed Aug 1, 2019