[HOLD for payment 2024-10-22] Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm #48824
Comments
sending to the experts! https://expensify.slack.com/archives/C035J5C9FAP/p1726004032791809
Investigation in progress!
Working on getting the logs. It's not related to the linked PR, but keeping this open as a daily for that investigation
@dangrous Whoops! This issue is 2 days overdue. Let's get this updated quick!
margelo team is on it I believe, in that same slack thread. @kirillzyusko let me know if you want me to assign you here!
@dangrous yeah, feel free to assign me on this!
@dangrous, @kirillzyusko Eep! 4 days overdue now. Issues have feelings too...
It failed because of a timeout issue (we hit the limit of 5400s, i.e. 1.5h). I think we merged PR #47777, which increases it to 7200s (2h). Do you think we can close the issue?
It looks from the screengrab like it crashed though, right? And that's what caused the timeout, since the app never reopened? We should see if we can figure out what that crash was...
@dangrous yeah, you are right, but from my observation:
In fact, in our e2e tests we allow a test to crash 3 times during its 60 runs, and we rely on that. The problem is that when a test crashes, we wait 5 minutes before force-quitting it (we have a 5-minute timeout per test). So if we get 2 random failures in any test, that adds 10 minutes of overhead for 1 test suite. We have 5 test suites, so the retry mechanism can potentially add ~50 minutes to our test run 🤷♂️ And I think that's why we hit the limit in this particular run. One thing I've been thinking of to optimize this is reducing the timeout interval (from 5 minutes to 2.5 minutes). But I think we need to ask @hannojg why such a relatively big timeout was chosen for the e2e tests?
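To make the overhead arithmetic above concrete, here is a minimal sketch of the pattern being described: a per-iteration force-quit timeout combined with a small crash budget. The constant names and the `runWithForceQuit` helper are hypothetical illustrations, not the actual Expensify e2e runner code; the numbers (5-minute timeout, 3 allowed crashes, 5 suites) come from the comment above.

```typescript
// Hypothetical constants for illustration; the real values live in the Expensify e2e runner config.
const FORCE_QUIT_TIMEOUT_MS = 5 * 60 * 1000; // a crashed/hung iteration is force-quit after 5 minutes
const MAX_CRASHES_PER_TEST = 3;              // a test may crash up to 3 times across its 60 runs
const TEST_SUITES = 5;

// Run one test iteration, but give up after FORCE_QUIT_TIMEOUT_MS so a crash
// can never stall the whole suite indefinitely.
async function runWithForceQuit(iteration: () => Promise<void>): Promise<boolean> {
    let timer: NodeJS.Timeout | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('force quit: iteration timed out')), FORCE_QUIT_TIMEOUT_MS);
    });
    try {
        await Promise.race([iteration(), timeout]);
        return true;
    } catch {
        return false; // crashed or timed out; counts against MAX_CRASHES_PER_TEST
    } finally {
        clearTimeout(timer);
    }
}

// Worst case from the comment above: 2 wasted retries per suite, each costing the
// full force-quit timeout, across 5 suites => 2 * 5 min * 5 = ~50 minutes of overhead.
const worstCaseOverheadMinutes = (2 * FORCE_QUIT_TIMEOUT_MS * TEST_SUITES) / 60_000;
console.log(`worst-case retry overhead: ~${worstCaseOverheadMinutes} minutes`);
```

Halving the force-quit timeout to 2.5 minutes, as discussed below, would cap that worst case at roughly 25 minutes.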
oh okay that makes sense - yeah I feel like we could even go shorter than 2.5 mins - if something is hanging for more than, say, 1 minute, then something is wrong enough that we should look at it. But curious what @hannojg thinks. Or if he's still OOO I think we can close this in the meantime
Agree, we can definitely make this timeout interval shorter!
Great! @kirillzyusko do you want to put up a PR to drop that timeout, maybe start with 2.5 mins and see how it goes? Probably could go even shorter but maybe that's a good starting point
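For reference, a rough sketch of what such a change might amount to, assuming the force-quit interval is a single config constant; the key name and file shape here are assumptions, not necessarily the real Expensify e2e config.

```typescript
// Hypothetical e2e config excerpt; the key name and location are assumptions.
const e2eConfig = {
    // Was 5 * 60 * 1000 (5 minutes); dropping to 2.5 minutes halves the
    // worst-case overhead added by crashed runs before they are force-quit.
    INTERACTION_TIMEOUT: 2.5 * 60 * 1000,
};

export default e2eConfig;
```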
Kiryl is OOO, and will be back next week to pick this one up!
@dangrous, @kirillzyusko Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!
@dangrous, @kirillzyusko 6 days overdue. This is scarier than being forced to listen to Vogon poetry!
@kirillzyusko let us know when you're back and can knock out the timeout adjustment!
The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.48-2 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue: If no regressions arise, payment will be issued on 2024-10-22. 🎊 For reference, here are some details about the assignees on this issue:
I'm going to close this, since no payment is required, and we'll see eventually if it helped out the failing jobs. Should be fine!
🚨 Failure Summary 🚨:
warning: The following actions use a deprecated Node.js version and will be forced to run on node20: realm/aws-devicefarm@7b9a912. For more info: https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/
failure: Process completed with exit code 9.
failure: Process completed with exit code 1.
failure: Test run timed out. Consider increasing the 'timeout' (currently 5400s) parameter.
failure: Process completed with exit code 9.
🛠️ A recent merge appears to have caused a failure in the job named e2ePerformanceTests / Run E2E tests in AWS device farm.
This issue has been automatically created and labeled with Workflow Failure for investigation. 👀 Please look into the following:
🐛 We appreciate your help in squashing this bug!
Issue Owner
Current Issue Owner: @kirillzyusko