Che fails to start on some clusters (1a / 2 / 2a) due to infrastructure issues related to long route exposure and route flapping #4695
Comments
It may be related to #4511
@ljelinkova could you provide
also, please provide links to jobs where the problem was caught |
This is the last failed job: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/2143/console. I am not sure what you mean by
@jmelis @aditya-konarde @pbergene are there any known infrastructure issues on the starter clusters? Che currently fails to start for me on the 2 cluster due to networking issues. Are we positive that https://redhat.service-now.com/help?id=rh_ticket&table=incident&sys_id=b173eceedb36a7c01e48cae3b9961912&view=ess has been fully resolved and does not affect other clusters? As I understand it, it is currently not possible to obtain any cluster-specific health metrics?
@ibuziuk I have run the tests many times and the only failure they discovered is bayesian (a known issue). I created a new workspace from OSIO, from a space I created some time ago with the vertx-http-booster from my repo, and it works fine.
I'm not observing any trouble starting workspaces with my accounts on 2, 2a and 1a
@Katka92 @rhopp thanks for the verification. Indeed, it looks like a temporary infra-related issue. I have raised concerns about oso starter cluster health monitoring in the mailing thread and will probably raise it again on the next IC. The fact that we do not control the cluster(s) state affects us dramatically in many cases.
I've found one failed job on cluster 2: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1392/console (user osio-ci-e2e-005). I agree that this is an infrastructure problem, but I'd like to keep this issue open in case anybody else encounters it.
@ljelinkova please remove the P1 label in this case
Done.
Note: a tenant update was running on 2 in this timeframe; might be good for @MatousJobanek to know.
We have had more issues since yesterday morning, both on the 1a and 2 clusters. Especially worrying is cluster 2, since the tests failed 4 of the last 6 runs on this Che problem. 1a 2
I am putting the P1 label back since the errors are quite frequent.
that's true, however, the update was triggered only for
Just tested and nothing has changed since my last comment on the 2a cluster [1]. [1] #4695 (comment)
This error has surfaced again.
@ibuziuk This issue has been open for over a month and it is still occurring. Have you considered (maybe temporarily) accepting the fact that the route takes long to create and modifying Che to wait for a longer time?
@ljelinkova e2e tests should not be affected by routes as long as the timeout is 10 minutes. My opinion is that the infrastructure is not good enough for users but good enough for tests to pass, and I said as much on the IC to @ppitonak @rhopp. Basically, I do not understand why the e2e tests are failing on Che if I cannot reproduce this issue manually.
@ibuziuk The tests wait until the project is shown, but it is never shown. I guess it would be possible to reload the page after 5 minutes and it would work - but that is not something a user would do. The tests always fail while detecting the project type, so I think it would be worth trying to fix that part of the code to wait for / retry the API call.
As for why you were not able to reproduce it - the tests run 30x per day on production at different times, so they are more likely to catch an infra glitch than any manual testing. Besides, I guess @rhopp was able to reproduce it manually and @ppitonak also saw the issue manually.
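To illustrate the wait/retry idea above, here is a minimal sketch in Java, assuming a hypothetical `ProjectTypeClient` with a `detectProjectType()` call; the real Che API and class names differ, so this is only an outline of the approach, not the actual implementation:

```java
import java.util.concurrent.TimeUnit;

public class ProjectTypeRetry {

    /** Hypothetical client interface; the real Che API differs. */
    interface ProjectTypeClient {
        String detectProjectType(String projectPath) throws Exception;
    }

    /**
     * Polls the (hypothetical) project-type detection call until it succeeds
     * or the overall timeout elapses, instead of failing on the first error.
     */
    public static String detectWithRetry(ProjectTypeClient client, String projectPath,
                                         long timeoutSeconds, long pollIntervalSeconds)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
        Exception lastError = null;
        while (System.nanoTime() < deadline) {
            try {
                return client.detectProjectType(projectPath);
            } catch (Exception e) {
                lastError = e;                                // remember the failure and retry
                TimeUnit.SECONDS.sleep(pollIntervalSeconds);  // back off before the next attempt
            }
        }
        throw new IllegalStateException(
                "Project type not detected within " + timeoutSeconds + "s", lastError);
    }
}
```

The same polling pattern could wrap whichever call currently fails once when the route flaps, so a slow route exposure shows up as a delay rather than a missing project.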
@ibuziuk I'm able to reproduce it manually every time
@ppitonak against which cluster(s) are you able to reproduce it?
@ibuziuk see the steps in my last comment... I reproduced it 2 minutes ago on prod-preview, i.e. us-east-2a
Hi guys, there is another way it can be reproduced:
I was able to reproduce it on prod-preview 4 out of 5 times just now. One more thing I would like to say is that it is not affecting only the e2e tests - our periodic tests on the 2a cluster are failing too. You can see the log here [3] and screenshots here [4]. [1] https://ci.centos.org/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/1169/console
@Katka92 thanks for the details, so far I have not been able to reproduce it manually against
@ibuziuk after a discussion with @ljelinkova we decided not to implement a workaround in the e2e tests. This is a production service and we cannot require our users to refresh the page. There are two possible ways to deal with this issue:
@rhopp could this affect CRW customers if their infra had similar problems?
@ppitonak this issue only occurs when the infrastructure is not in a proper state (e.g. the route issue). We have no capacity to fix this for Che 6 and add some retry logic, since Che 6 is going to be deprecated in the short run (the Che 7 beta is released this week). Taking into account that this problem is not reproducible on clusters which are in a proper state, adding a workaround in the e2e tests with a page refresh sounds like a reasonable trade-off until the infra is fixed [1].
@ppitonak and please do not say that I do not care, because I do. The situation with the infrastructure affects pretty much every single service, and we are trying to raise it and resolve it with ops + find some SLA-based solution. Until this is resolved, I hope we can be more flexible with decisions regarding support of the existing codebase. In the current situation I do believe the page refresh can be treated as a known issue.
@ppitonak @ljelinkova FYI the following workaround has been applied on the rh-che functional test side - redhat-developer/rh-che#1273
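For reference, the workaround amounts to refreshing the page when the project does not show up in time. A rough Selenium 4-style sketch of that idea follows; the class name, locator and timeouts are placeholders, not the actual rh-che test code:

```java
import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.TimeoutException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class ProjectVisibleWorkaround {

    /**
     * Waits for the project to appear in the IDE; if it does not show up before
     * the timeout, refreshes the page once and waits again. The XPath locator is
     * a placeholder, not the real rh-che selector.
     */
    public static void waitForProjectWithRefresh(WebDriver driver, String projectName) {
        By projectNode = By.xpath(
                "//div[contains(@class,'project') and text()='" + projectName + "']");
        try {
            new WebDriverWait(driver, Duration.ofMinutes(5))
                    .until(ExpectedConditions.visibilityOfElementLocated(projectNode));
        } catch (TimeoutException firstAttempt) {
            driver.navigate().refresh();   // workaround: reload the IDE page once
            new WebDriverWait(driver, Duration.ofMinutes(5))
                    .until(ExpectedConditions.visibilityOfElementLocated(projectNode));
        }
    }
}
```

A single refresh keeps the test close to what a patient user might eventually do, while the test still fails if the project never appears.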
I can see this happening again. E.g. https://ci.centos.org/job/devtools-rh-che-periodic-prod-2/508/console on 20th April. |
Seen again https://ci.centos.org/view/Devtools/job/devtools-rh-che-periodic-prod-1b/333/console on 26th April. |
Related HK issue - https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2520
When I create a new booster in OSIO and open it in Che, Che starts properly but no projects are loaded and no error message is displayed to the user (the wait timeout is 10 minutes).
E2E tests encountered this several times, so far only on clusters 1a and 2.
I can see these errors in the browser log: