[HOLD for payment 2024-10-22] Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm #48824

github-actions · 2024-09-09T19:56:52Z

🚨 Failure Summary 🚨:

📋 Job Name: e2ePerformanceTests / Run E2E tests in AWS device farm
🔧 Failure in Workflow: Process new code merged to main
🔗 Triggered by PR: PR Link
👤 PR Author: @rushatgabhane
🤝 Merged by: @dangrous
🐛 Error Message:
warning: The following actions use a deprecated Node.js version and will be forced to run on node20: realm/aws-devicefarm@7b9a912. For more info: https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/
failure: Process completed with exit code 9.
failure: Process completed with exit code 1.
failure: Test run timed out. Consider increasing the 'timeout' (currently 5400s) parameter.
failure: Process completed with exit code 9.

⚠️ Action Required ⚠️:

🛠️ A recent merge appears to have caused a failure in the job named e2ePerformanceTests / Run E2E tests in AWS device farm.
This issue has been automatically created and labeled with Workflow Failure for investigation.

👀 Please look into the following:

Why the PR caused the job to fail?
Address any underlying issues.

🐛 We appreciate your help in squashing this bug!

Issue Owner

Current Issue Owner: @kirillzyusko


dangrous · 2024-09-10T21:34:07Z

sending to the experts! https://expensify.slack.com/archives/C035J5C9FAP/p1726004032791809

dangrous · 2024-09-12T00:37:50Z

Investigation in progress!

dangrous · 2024-09-13T21:56:08Z

Working on getting the logs. It's not related to the linked PR, but keeping this open as a daily for that investigation

melvin-bot · 2024-09-17T18:18:55Z

@dangrous Whoops! This issue is 2 days overdue. Let's get this updated quick!

dangrous · 2024-09-17T22:26:51Z

margelo team is on it I believe, in that same slack thread. @kirillzyusko let me know if you want me to assign you here!

kirillzyusko · 2024-09-19T09:51:13Z

@dangrous yeah, feel free to assign me on this!

melvin-bot · 2024-09-24T18:12:01Z

@dangrous, @kirillzyusko Eep! 4 days overdue now. Issues have feelings too...

kirillzyusko · 2024-09-24T18:42:59Z

It failed because of a timeout: we hit the limit of 5400s (1.5h).

I think we merged PR #47777, which increases it to 7200s (2h). Do you think we can close the issue?

dangrous · 2024-09-24T20:51:46Z

It looks from the screengrab that it crashed though, right? And that's what caused the timeout since the app never reopened? We should see if we can figure out what that crash was....

kirillzyusko · 2024-09-30T10:16:34Z

@dangrous yeah, you are right, but from my observation:

these crashes are happening only in e2e tests (most likely they come from the fact that we use the flashlight tool);
I've tried to read a stacktrace, but it was C++ code with its weird stacktraces, so I didn't get any useful insights into what is causing these crashes.

In fact, in our e2e tests we allow a test to crash 3 times during its 60 runs, and we rely on this. The problem is that when a test crashes, we wait 5 minutes before force-quitting it (we have a 5-minute timeout per test). If we get 2 random failures in any test, that results in 10 minutes of overhead for 1 test suite. We have 5 test suites, so the retry mechanism can potentially add ~50 minutes to our test run 🤷‍♂️ I think that is why we hit the limit in this particular test run.

One optimization I've been thinking of is reducing the timeout interval (from 5 minutes to 2.5 minutes). But I think we need to ask @hannojg why such a relatively big timeout was chosen for e2e tests.
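
For anyone following the arithmetic, here is a rough sketch of the worst-case overhead estimate described in this comment. The numbers (5 suites, 2 crashes per suite, 5-minute and 2.5-minute timeouts, the 5400s limit) are taken from the thread; the function and variable names are hypothetical and not part of the actual e2e runner.

```python
# Illustrative arithmetic only: the constants come from the thread,
# the names are hypothetical and do not exist in the e2e test runner.

RUN_LIMIT_SECONDS = 5400   # device farm run timeout at the time of the failure
SUITES = 5                 # number of test suites mentioned in the thread
CRASHES_PER_SUITE = 2      # assumed number of random crashes per suite


def retry_overhead_minutes(timeout_min: float, crashes_per_suite: int, suites: int) -> float:
    """Worst-case minutes spent waiting on crashed runs before force-quit."""
    return timeout_min * crashes_per_suite * suites


for timeout_min in (5, 2.5):
    overhead = retry_overhead_minutes(timeout_min, CRASHES_PER_SUITE, SUITES)
    share = overhead * 60 / RUN_LIMIT_SECONDS
    print(f"timeout {timeout_min} min: up to {overhead:g} min overhead "
          f"({share:.0%} of the {RUN_LIMIT_SECONDS}s limit)")
```

Under these assumptions, halving the per-test timeout halves the worst-case waiting time, which is the motivation for the proposed change.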

dangrous · 2024-09-30T17:58:36Z

oh okay that makes sense - yeah I feel like we could even go shorter than 2.5 mins - I feel like if something is hanging for more than, say, 1 minute, then something is wrong enough that we should look at it. But curious what @hannojg thinks. Or if he's still OOO I think we can close this in the meantime

hannojg · 2024-10-01T07:06:31Z

Agree, we can definitely make this timeout interval shorter!

dangrous · 2024-10-02T14:16:04Z

Great! @kirillzyusko do you want to put up a PR to drop that timeout, maybe start with 2.5 mins and we see how that one goes? Probably could go even shorter but maybe that's a good starting point

hannojg · 2024-10-03T14:34:53Z

Kiryl is OOO, and will be back next week to pick this one up!

melvin-bot · 2024-10-03T18:07:22Z

@dangrous, @kirillzyusko Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

melvin-bot · 2024-10-07T18:23:40Z

@dangrous, @kirillzyusko 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

dangrous · 2024-10-09T15:32:50Z

@kirillzyusko let us know when you're back and can knock out the timeout adjustment!

kirillzyusko · 2024-10-09T16:15:47Z

@dangrous here is a PR: #50512 👀

melvin-bot · 2024-10-15T03:17:50Z

Reviewing label has been removed, please complete the "BugZero Checklist".

melvin-bot · 2024-10-15T03:17:53Z

The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.48-2 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

[NoQA] fix: shorter e2e interaction timeout #50512

If no regressions arise, payment will be issued on 2024-10-22. 🎊

For reference, here are some details about the assignees on this issue:

@kirillzyusko does not require payment (Contractor)

dangrous · 2024-10-22T15:23:34Z

I'm going to close this, since no payment is required, and we'll see eventually if it helped out the failing jobs. Should be fine!

github-actions bot added Hourly KSv2 Workflow Failure labels Sep 9, 2024

github-actions bot assigned dangrous Sep 9, 2024

dangrous added Daily KSv2 and removed Hourly KSv2 labels Sep 13, 2024

melvin-bot bot added the Overdue label Sep 16, 2024

melvin-bot bot removed the Overdue label Sep 17, 2024

dangrous assigned kirillzyusko Sep 19, 2024

melvin-bot bot added the Overdue label Sep 23, 2024

melvin-bot bot removed the Overdue label Sep 24, 2024

melvin-bot bot added the Overdue label Oct 2, 2024

chiragsalian mentioned this issue Oct 8, 2024

Investigate workflow job failing on main: e2ePerformanceTests / Build apk from latest release as a baseline / Build Android app #50480

Closed

kirillzyusko mentioned this issue Oct 9, 2024

[NoQA] fix: shorter e2e interaction timeout #50512

Merged


melvin-bot bot added Reviewing, Weekly KSv2 and removed Daily KSv2, Overdue labels Oct 9, 2024

melvin-bot bot added Weekly KSv2, Awaiting Payment and removed Weekly KSv2 labels Oct 15, 2024

melvin-bot bot changed the title ~~Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm~~ [HOLD for payment 2024-10-22] Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm Oct 15, 2024

melvin-bot bot removed the Reviewing label Oct 15, 2024

melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Oct 21, 2024

dangrous closed this as completed Oct 22, 2024
