[HOLD for payment 2024-10-22] Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm #48824

Closed
github-actions bot opened this issue Sep 9, 2024 · 21 comments
Assignees
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Daily KSv2 Workflow Failure

Comments

@github-actions
Contributor

github-actions bot commented Sep 9, 2024

🚨 Failure Summary 🚨:

⚠️ Action Required ⚠️:

🛠️ A recent merge appears to have caused a failure in the job named e2ePerformanceTests / Run E2E tests in AWS device farm.
This issue has been automatically created and labeled with Workflow Failure for investigation.

👀 Please look into the following:

  1. Why the PR caused the job to fail?
  2. Address any underlying issues.

🐛 We appreciate your help in squashing this bug!

Current Issue Owner: @kirillzyusko
@dangrous
Contributor

Investigation in progress!

@dangrous dangrous added Daily KSv2 and removed Hourly KSv2 labels Sep 13, 2024
@dangrous
Contributor

Working on getting the logs. It's not related to the linked PR, but I'm keeping this open as a Daily for that investigation.

@melvin-bot melvin-bot bot added the Overdue label Sep 16, 2024

melvin-bot bot commented Sep 17, 2024

@dangrous Whoops! This issue is 2 days overdue. Let's get this updated quick!

@dangrous
Contributor

The Margelo team is on it, I believe, in that same Slack thread. @kirillzyusko let me know if you want me to assign you here!

@melvin-bot melvin-bot bot removed the Overdue label Sep 17, 2024
@kirillzyusko
Contributor

@dangrous yeah, feel free to assign me on this!

melvin-bot bot commented Sep 24, 2024

@dangrous, @kirillzyusko Eep! 4 days overdue now. Issues have feelings too...

@kirillzyusko
Contributor

It failed because of a timeout: we hit the limit of 5400 s (1.5 h).

I think we merged PR #47777, which increases it to 7200 s (2 h). Do you think we can close the issue?

@melvin-bot melvin-bot bot removed the Overdue label Sep 24, 2024
@dangrous
Contributor

From the screengrab it looks like it crashed, though, right? And that's what caused the timeout, since the app never reopened? We should see if we can figure out what that crash was.

@kirillzyusko
Contributor

@dangrous yeah, you are right, but from my observation:

  • these crashes happen only in e2e tests (most likely because we use the Flashlight tool);
  • I tried to read a stack trace, but it was C++ code with its weird stack traces, so I didn't get any useful insight into what is causing these crashes.

In fact, in our e2e tests we allow a test to crash 3 times during its 60 runs, and we rely on that. The problem is that when a test crashes, we wait 5 minutes before force-quitting it (each test has a 5-minute timeout). So if we get 2 random failures in a test, that adds 10 minutes of overhead for 1 test suite. We have 5 test suites, so the retry mechanism can potentially add ~50 minutes to our test run 🤷‍♂️ I think that's why we hit the limit in this particular run.
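
To make the crash allowance and its cost concrete, here is a minimal Python sketch. It is not the actual e2e runner: `run_suite`, `SuiteResult`, and the parameter names are hypothetical, and it assumes a crashed run is only noticed once the full per-test timeout has expired.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    completed_runs: int
    crashes: int
    wasted_seconds: float
    passed: bool


def run_suite(outcomes, max_crashes=3, timeout_s=300.0):
    """Replay per-run outcomes (True = finished, False = crashed or hung).

    Assumption: every crashed run costs the whole timeout, because the
    runner waits that long before force-quitting the app. The suite is
    marked as failed once the number of crashes exceeds the allowance.
    """
    completed = 0
    crashes = 0
    wasted = 0.0
    for ok in outcomes:
        if ok:
            completed += 1
            continue
        crashes += 1
        wasted += timeout_s
        if crashes > max_crashes:
            return SuiteResult(completed, crashes, wasted, False)
    return SuiteResult(completed, crashes, wasted, True)


# 60 runs with 2 crashes: the suite still passes, but 2 * 300 s are spent waiting.
outcomes = [True] * 29 + [False] + [True] * 29 + [False]
result = run_suite(outcomes)
print(result.passed, result.wasted_seconds / 60)  # True 10.0

# The same pattern in each of 5 suites adds up to 50 minutes of pure waiting.
print(5 * result.wasted_seconds / 60)  # 50.0
```

The point of the sketch is that the allowance keeps a suite green while hiding its price: each tolerated crash is paid for with a full timeout.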

One optimization I've been thinking of is reducing the timeout interval (from 5 minutes to 2.5 minutes). But I think we need to ask @hannojg why such a relatively big timeout was chosen for e2e tests.
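
As a rough illustration of what a shorter timeout would buy, the sketch below redoes the arithmetic from this comment for both values. The helper name and its defaults (2 crashes per suite, 5 suites) are assumptions taken from the scenario described here, not values read from the real test configuration.

```python
# Job limit after the increase mentioned earlier in the thread: 7200 s.
JOB_LIMIT_MIN = 7200 / 60


def retry_overhead_minutes(timeout_min, crashes_per_suite=2, suites=5):
    """Minutes spent waiting on hung tests (hypothetical helper).

    Each crash is assumed to cost one full per-test timeout.
    """
    return timeout_min * crashes_per_suite * suites


for timeout_min in (5.0, 2.5):
    overhead = retry_overhead_minutes(timeout_min)
    share = overhead / JOB_LIMIT_MIN
    print(f"timeout={timeout_min} min: overhead={overhead} min, "
          f"{share:.0%} of the job limit")
```

Under these assumptions the overhead scales linearly with the timeout, so halving the timeout from 5 to 2.5 minutes halves the waiting time from 50 to 25 minutes.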

@dangrous
Contributor

Oh okay, that makes sense. Yeah, I feel like we could even go shorter than 2.5 mins: if something is hanging for more than, say, 1 minute, then something is wrong enough that we should look at it. But curious what @hannojg thinks. Or, if he's still OOO, I think we can close this in the meantime.

@hannojg
Contributor

hannojg commented Oct 1, 2024

Agree, we can definitely make this timeout interval shorter!

@dangrous
Contributor

dangrous commented Oct 2, 2024

Great! @kirillzyusko do you want to put up a PR to drop that timeout, maybe start with 2.5 mins and we see how that one goes? Probably could go even shorter but maybe that's a good starting point

@melvin-bot melvin-bot bot added the Overdue label Oct 2, 2024
@hannojg
Contributor

hannojg commented Oct 3, 2024

Kiryl is OOO, and will be back next week to pick this one up!

melvin-bot bot commented Oct 3, 2024

@dangrous, @kirillzyusko Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

melvin-bot bot commented Oct 7, 2024

@dangrous, @kirillzyusko 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

@dangrous
Contributor

dangrous commented Oct 9, 2024

@kirillzyusko let us know when you're back and can knock out the timeout adjustment!

@melvin-bot melvin-bot bot added Reviewing Has a PR in review Weekly KSv2 and removed Daily KSv2 Overdue labels Oct 9, 2024
@kirillzyusko
Contributor

@dangrous here is a PR: #50512 👀

@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Oct 15, 2024
@melvin-bot melvin-bot bot changed the title from "Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm" to "[HOLD for payment 2024-10-22] Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm" Oct 15, 2024

melvin-bot bot commented Oct 15, 2024

Reviewing label has been removed, please complete the "BugZero Checklist".

@melvin-bot melvin-bot bot removed the Reviewing Has a PR in review label Oct 15, 2024

melvin-bot bot commented Oct 15, 2024

The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.48-2 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-10-22. 🎊

For reference, here are some details about the assignees on this issue:

@melvin-bot melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Oct 21, 2024
@dangrous
Contributor

I'm going to close this, since no payment is required, and we'll see eventually whether it helped with the failing jobs. Should be fine!
