[NoQA] Fix reauthentication #52727

Merged
merged 21 commits into from
Nov 22, 2024

Conversation

zirgulis
Contributor

@zirgulis zirgulis commented Nov 18, 2024

Explanation of Change

  1. Fix reauthentication (@neil-marcellini): "Fix re-authentication when it fails to fetch" #52228
  2. Add the improved reauth-when-offline test (original PR: "[No QA] [HOLD ON PR #52228] Improve simulating online/offline conditions in reauthentication test" #52165)
  3. Fix network tests flakiness

We were seeing a problem where the user's Auth token would expire, which triggers an Authenticate request to re-authenticate the user. However, if the user is on a bad connection it's possible this request fails to fetch. In that case we were catching the error and signing the user out, which means they could lose data in queued write requests, and it's annoying to have to sign in again.

To fix this, we add a retry mechanism with exponential backoff, using the same throttling mechanism we already have for the SequentialQueue. The throttle is now a class so we can have separate instances. If re-auth is still failing after the maximum number of retries, we log the user out, but that should be extremely rare. We verified that if the user is offline when re-authenticating, the retries are paused until they come back online, so failed-to-fetch errors are very unlikely to cause a logout now.
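
The idea of a throttle class with separate instances can be sketched roughly as follows. This is a minimal illustration, not the App's actual RequestThrottle: the class name `BackoffThrottle`, its method names, and the default constants (100 ms minimum wait, 10 s cap, 10 retries) are all assumptions made for the example.

```typescript
// Hypothetical sketch: the class name, method names, and constants are assumptions,
// not the App's actual RequestThrottle implementation.
class BackoffThrottle {
    private readonly minWaitMs: number;

    private readonly maxWaitMs: number;

    private readonly maxRetries: number;

    private waitMs: number;

    private retryCount = 0;

    constructor(minWaitMs = 100, maxWaitMs = 10_000, maxRetries = 10) {
        this.minWaitMs = minWaitMs;
        this.maxWaitMs = maxWaitMs;
        this.maxRetries = maxRetries;
        this.waitMs = minWaitMs;
    }

    /** Reset the state after a request finally succeeds. */
    clear(): void {
        this.retryCount = 0;
        this.waitMs = this.minWaitMs;
    }

    /**
     * Returns how long to wait before the next retry, or null once the
     * maximum number of retries has been used up (the caller would then give up).
     */
    nextDelay(): number | null {
        if (this.retryCount >= this.maxRetries) {
            return null;
        }
        this.retryCount += 1;
        const delay = this.waitMs;
        // Double the wait for the following retry, up to the cap.
        this.waitMs = Math.min(this.waitMs * 2, this.maxWaitMs);
        return delay;
    }
}

// Because the throttle is a class, re-authentication and the queue
// can each keep their own independent backoff state.
const reauthThrottle = new BackoffThrottle();
const queueThrottle = new BackoffThrottle();
```

With one instance per consumer, a burst of failed Authenticate requests does not lengthen the wait for unrelated queued requests, and vice versa.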

Finally, we fixed the re-auth tests and added one for this exact flow, verifying that they fail on main and pass only with the fix. Previously, the tests were invalid and passed as false positives.

Fixed Issues

$ #51707
PROPOSAL: N/A

Tests

  1. Merge in this PR, which has changes to test the exact scenario: "Manual test re-auth failing to fetch" #52281
  2. Log in with any validated account
  3. Open the JS console
  4. Send a message in any chat
  5. Search the console logs for Ndebug failing to fetch in Authenticate and verify that it's logged about 10 times and that Authenticate requests are made at increasingly long intervals
  6. Verify you are not logged out.
  • Verify that no errors appear in the JS console

Offline tests

  1. Merge in this PR, which has changes to test the exact scenario: "Manual test re-auth failing to fetch" #52281
  2. Log in with any validated account
  3. Open the JS console
  4. Send a message in any chat
  5. Go offline immediately after
  6. Search the console logs for Ndebug failing to fetch in Authenticate and verify that it's not logged repeatedly, signaling that the retry is paused while offline
  7. Verify you are not logged out
  8. Go online
  9. Search the console logs for Ndebug failing to fetch in Authenticate and verify that it's logged about 10 times and that Authenticate requests are made at increasingly long intervals
  10. Verify you are not logged out
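
The offline behavior checked in steps 5 to 9 (no retries while offline, retries resuming once online) can be pictured with a small sketch. It is only an illustration under assumptions: `OfflineAwareRetrier`, `setOffline`, and `schedule` are hypothetical names, and the real app drives this from its own network state rather than from a hand-set flag.

```typescript
// Hypothetical sketch of pausing retries while offline; not the App's actual code.
type RetryTask = () => void;

class OfflineAwareRetrier {
    private isOffline = false;

    private pending: RetryTask[] = [];

    /** Update connectivity. Coming back online runs every retry that was held back. */
    setOffline(isOffline: boolean): void {
        this.isOffline = isOffline;
        if (isOffline) {
            return;
        }
        const tasks = this.pending;
        this.pending = [];
        tasks.forEach((task) => task());
    }

    /** Run the retry now if online; otherwise hold it so it cannot fail to fetch. */
    schedule(task: RetryTask): void {
        if (this.isOffline) {
            this.pending.push(task);
            return;
        }
        task();
    }
}

// Usage: a retry scheduled while offline is not attempted until the app is online again.
let authenticateAttempts = 0;
const retrier = new OfflineAwareRetrier();
retrier.setOffline(true);
retrier.schedule(() => {
    authenticateAttempts += 1;
});
// authenticateAttempts is still 0 here: nothing was sent while offline.
retrier.setOffline(false);
// authenticateAttempts is now 1: the held retry ran once connectivity returned.
```

Holding retries this way means offline time does not eat into the retry budget, which is why the log line should not repeat while offline.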

QA Steps

No QA, this is a very specific situation that is very hard to reproduce.

  • Verify that no errors appear in the JS console

PR Author Checklist

  • I linked the correct issue in the ### Fixed Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I added steps for the expected offline behavior in the Offline steps section
    • I added steps for Staging and/or Production testing in the QA steps section
    • I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
    • MacOS: Desktop
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
      • If any non-english text was added/modified, I verified the translation was requested/reviewed in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Tests were only run on web since changes are platform independent.

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

Online

OnlineReauth2024-11-21_18-29-01.mp4

Offline

OfflineReAuth2024-11-21_18-33-46.mp4
MacOS: Desktop

@zirgulis
Contributor Author

I see ReportTest.ts is now failing for some reason, I will fix that tomorrow.

Contributor

@allgandalf allgandalf left a comment

thanks for this quick work, please let me know when this one is ready for review

@zirgulis
Contributor Author

@allgandalf it seems ReportTest.ts just timed out, and the failure went away after a typo-fix commit. Locally I was not able to make that test fail; all tests succeed.

@allgandalf
Contributor

Wow, that's awesome. I also noticed that your branch is 1000+ commits behind main, so it's better that we merge main. Can you do that, please?

@zirgulis
Contributor Author

@neil-marcellini Could you please share how to reproduce/test this reauth issue? I tried to reproduce this locally by having a bash script running which would periodically turn off and on my Mac's network. While doing that the local proxy server (npm run web) started to crash. Later I tried to do the same in staging env but without any luck, I didn't get logged out.

Contributor

@neil-marcellini neil-marcellini left a comment

Good progress so far. Let's make sure this is rock solid before we ship it. I'll start doing some manual testing.

src/libs/Middleware/Reauthentication.ts (outdated, resolved)
src/libs/Network/index.ts (resolved)
tests/unit/NetworkTest.ts (resolved)
tests/unit/NetworkTest.ts (outdated, resolved)
tests/unit/NetworkTest.ts (outdated, resolved)
tests/unit/NetworkTest.ts (resolved)
Contributor

Also let's please examine the other re-authentication test in this file, and make sure it's failing before any fixes from this PR. For example, I don't understand why we mock it to expect another call after failed re-authentication using the old authToken. It seems to me re-auth should fail to fetch, then get retried and succeed.

// Fail the call to re-authenticate
.mockImplementationOnce(actualXhr)
// The next call should still be using the old authToken
.mockImplementationOnce(() =>
    Promise.resolve({
        jsonCode: CONST.JSON_CODE.NOT_AUTHENTICATED,
    }),
)

It's also odd that we mock several responses but then only assert about two.

expect(callsToOpenPublicProfilePage.length).toBe(1);
expect(callsToAuthenticate.length).toBe(1);
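
To illustrate the flow I'd expect (re-auth fails to fetch, gets retried, and succeeds, with an assertion for every mocked response), here is a rough, simplified sketch. It is synchronous and does not use Jest or the App's real helpers: `createScriptedTransport`, `requestWithReauth`, and the `JSON_CODE` values are assumptions made up for the example.

```typescript
// Simplified, synchronous model of the expected flow; all names are hypothetical.
type ApiResponse = {jsonCode: number};

// Assumed values for illustration only.
const JSON_CODE = {SUCCESS: 200, NOT_AUTHENTICATED: 407};

/** A transport that replays scripted outcomes in order and records every command sent. */
function createScriptedTransport(script: Array<() => ApiResponse>) {
    const calls: string[] = [];
    const send = (command: string): ApiResponse => {
        calls.push(command);
        const next = script.shift();
        if (!next) {
            throw new Error(`Unexpected call to ${command}`);
        }
        return next();
    };
    return {send, calls, remaining: () => script.length};
}

/** Sends a command; on NOT_AUTHENTICATED, re-authenticates (retrying fetch failures) and resends. */
function requestWithReauth(send: (command: string) => ApiResponse, command: string, maxRetries: number): ApiResponse {
    const first = send(command);
    if (first.jsonCode !== JSON_CODE.NOT_AUTHENTICATED) {
        return first;
    }
    let isAuthenticated = false;
    for (let attempt = 0; attempt <= maxRetries && !isAuthenticated; attempt++) {
        try {
            send('Authenticate');
            isAuthenticated = true;
        } catch {
            // Failed to fetch: fall through and retry instead of signing the user out.
        }
    }
    if (!isAuthenticated) {
        throw new Error('Re-authentication failed after the maximum number of retries');
    }
    return send(command);
}

const transport = createScriptedTransport([
    // The original request comes back with an expired authToken
    () => ({jsonCode: JSON_CODE.NOT_AUTHENTICATED}),
    // The first call to re-authenticate fails to fetch
    () => {
        throw new TypeError('Failed to fetch');
    },
    // The retried call to re-authenticate succeeds
    () => ({jsonCode: JSON_CODE.SUCCESS}),
    // The original request is sent again with the new authToken
    () => ({jsonCode: JSON_CODE.SUCCESS}),
]);

const finalResponse = requestWithReauth(transport.send, 'OpenPublicProfilePage', 10);
```

In a test shaped like this, the assertions would cover the whole sequence: two calls to OpenPublicProfilePage, two calls to Authenticate, and no scripted responses left unused.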

Contributor Author

@neil-marcellini I didn't touch this test but this one is quite interesting. If we run this on main it will succeed but if we look at the logs it actually signs out the user, giving us a false positive.
image

Contributor

Yes exactly. The test doesn't really make sense. That's why I modified it in my PR. It still wasn't working quite right for me, but maybe with your latest changes it will.

Let's update the test, verify it's passing with the fix, then cherry pick the test to main and make sure it's failing.

Contributor Author

@neil-marcellini I can confirm that both added tests are failing on main:
image

@neil-marcellini
Contributor

> @neil-marcellini Could you please share how to reproduce/test this reauth issue? I tried to reproduce this locally by having a bash script running which would periodically turn off and on my Mac's network. While doing that the local proxy server (npm run web) started to crash. Later I tried to do the same in staging env but without any luck, I didn't get logged out.

Sure, yes, it's a bit tricky. I explained in this comment how I was able to manually reproduce the issue. I used a backend change to mark the auth token as expired when I add a comment, but you might be able to modify the app's network logic to mock that on the frontend. I modified the app to fail to fetch once when re-authenticating.

I'll see if I can get that working and write manual test steps into this PR description.

@neil-marcellini
Contributor

neil-marcellini commented Nov 19, 2024

You can merge the PR "Manual test re-auth failing to fetch" into this one with the fix locally, and verify whether it's working. That manual test PR currently fails on main.

Contributor

@neil-marcellini neil-marcellini left a comment

Your latest changes look good. Still a few more things to update as mentioned in Slack.

Contributor

@neil-marcellini neil-marcellini left a comment

Looks really great, thanks! I'll finish manually testing and then we should be good to go. Lmk if you want to address the non-blocking comments now, or in a follow up PR.

src/libs/Network/index.ts (outdated, resolved)
src/libs/Network/index.ts (outdated, resolved)
src/libs/RequestThrottle.ts (outdated, resolved)
src/libs/Network/SequentialQueue.ts (resolved)
src/libs/Middleware/Reauthentication.ts (outdated, resolved)
@zirgulis
Contributor Author

> Looks really great, thanks! I'll finish manually testing and then we should be good to go. Lmk if you want to address the non-blocking comments now, or in a follow up PR.

Yes, I will do that in this PR.


melvin-bot bot commented Nov 21, 2024

@ Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]

@melvin-bot melvin-bot bot removed the request for review from a team November 21, 2024 17:05
@neil-marcellini
Contributor

One more thing, maybe we should have a special max retry count for re-authentication? We reach the 10 retry limit pretty quickly, and since it's quite important that we don't sign people out, I think it would be good if the re-auth max throttle retry time was about a couple of minutes.

Also, it would be smart to manually test that we don't send reauth retries while offline, only once the app is back online, otherwise it's likely they will keep failing to fetch and hit the max retry count.
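
For a rough sense of the numbers, here is a small worked sketch of how long a doubling backoff takes in total. The constants (100 ms starting wait, 10 s cap) are placeholders I picked for illustration, not the values in the codebase, and the math ignores the time each request itself takes.

```typescript
// Hypothetical arithmetic sketch; the wait values are placeholders, not the App's constants.

/** Waits before each retry: start at minWaitMs and double up to maxWaitMs. */
function backoffDelays(retries: number, minWaitMs: number, maxWaitMs: number): number[] {
    const delays: number[] = [];
    let wait = minWaitMs;
    for (let i = 0; i < retries; i++) {
        delays.push(wait);
        wait = Math.min(wait * 2, maxWaitMs);
    }
    return delays;
}

/** Total time spent waiting across all retries. */
function totalBackoffMs(retries: number, minWaitMs: number, maxWaitMs: number): number {
    return backoffDelays(retries, minWaitMs, maxWaitMs).reduce((sum, delay) => sum + delay, 0);
}

/** Smallest number of retries whose combined waits reach targetMs. */
function retriesToCover(targetMs: number, minWaitMs: number, maxWaitMs: number): number {
    if (minWaitMs <= 0) {
        throw new Error('minWaitMs must be positive');
    }
    let total = 0;
    let wait = minWaitMs;
    let retries = 0;
    while (total < targetMs) {
        total += wait;
        wait = Math.min(wait * 2, maxWaitMs);
        retries += 1;
    }
    return retries;
}

// With the placeholder values, 10 retries wait 42.7 seconds in total.
const tenRetriesMs = totalBackoffMs(10, 100, 10_000); // 42700
// Covering two minutes would take 18 retries under the same placeholder values.
const retriesForTwoMinutes = retriesToCover(120_000, 100, 10_000); // 18
```

So whether 10 retries is "quick" depends entirely on the starting wait and the cap; a separate limit or a larger cap for re-auth would both stretch the window.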

Contributor

@tgolen tgolen left a comment

I agree with all of Neil's NAB comments, and I actually don't think they are NABs. I would like to have them be done. If not in this PR, then in a separate PR that comes right after this.

I also think we should have the separate retry limit as Neil suggests.

src/libs/Middleware/Reauthentication.ts (outdated, resolved)
src/libs/Network/index.ts (outdated, resolved)
@zirgulis
Contributor Author

> I also think we should have the separate retry limit as Neil suggests.

@tgolen I think this is out of scope for this PR, but I'm happy to tackle this in the next PR

@allgandalf
Contributor

@zirgulis when I merged https://github.com/Expensify/App/pull/52281/files and tested it as described in the testing steps, I did get logged out:

Screen.Recording.2024-11-22.at.1.36.17.AM.mov

I even got logged out when I sent a message:

Screen.Recording.2024-11-22.at.1.38.06.AM.mov

Contributor

@tgolen tgolen left a comment

Thank you!

Contributor

@neil-marcellini neil-marcellini left a comment

Thanks for your latest changes; it looks really solid to me now. 10 retries ends up with a pretty long delay. I also tested it while offline and verified that the retries are paused, which is great, because it makes it pretty unlikely for this to actually fail now. I updated the PR description for you, and I think this is good to go.

@allgandalf it looks like maybe you didn't pull the latest changes from remote before testing, because the log lines are different now.

@allgandalf
Contributor

> @allgandalf it looks like maybe you didn't pull the latest changes from remote before testing, because the log lines are different now.

Trying again now

@allgandalf
Contributor

allgandalf commented Nov 22, 2024

Reviewer Checklist

  • I have verified the author checklist is complete (all boxes are checked off).
  • I verified the correct issue is linked in the ### Fixed Issues section above
  • I verified testing steps are clear and they cover the changes made in this PR
    • I verified the steps for local testing are in the Tests section
    • I verified the steps for Staging and/or Production testing are in the QA steps section
    • I verified the steps cover any possible failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
  • I checked that screenshots or videos are included for tests on all platforms
  • I included screenshots or videos for tests on all platforms
  • I verified tests pass on all platforms & I tested again on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
    • MacOS: Desktop
  • If there are any errors in the console that are unrelated to this PR, I either fixed them (preferred) or linked to where I reported them in Slack
  • I verified proper code patterns were followed (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick).
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I verified that this PR follows the guidelines as stated in the Review Guidelines
  • I verified other components that can be impacted by these changes have been tested, and I retested again (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar have been tested & I retested again)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR reviewer checklist, including those that don't apply to this PR.

Screenshots/Videos

MacOS: Chrome / Safari

Online:

Screen.Recording.2024-11-22.at.7.03.04.PM.mov

Offline:

Screen.Recording.2024-11-22.at.7.23.25.PM.mov
MacOS: Desktop

Online:

Screen.Recording.2024-11-22.at.7.26.38.PM.mov

Offline:

Screen.Recording.2024-11-22.at.7.38.52.PM.mov
Android: Native
Screen.Recording.2024-11-22.at.7.43.50.PM.mov
Android: mWeb Chrome
Screen.Recording.2024-11-22.at.7.47.48.PM.mov
iOS: Native
Screen.Recording.2024-11-22.at.7.50.39.PM.mov
iOS: mWeb Safari
Screen.Recording.2024-11-22.at.7.52.00.PM.mov

Contributor

@allgandalf allgandalf left a comment

Tests well in both online and offline mode:

  • Verified that the re-authentication count stops at 10 and that the requests slow down until it reaches 10.

  • Verified that in offline mode, there is no call made, which means that the calls are paused until the user is online

  • Verified that we are not logged out in either case

@melvin-bot melvin-bot bot requested a review from neil-marcellini November 22, 2024 14:24
Contributor

@neil-marcellini neil-marcellini left a comment

Good to go! Thanks guys

@neil-marcellini neil-marcellini merged commit 8d69d60 into Expensify:main Nov 22, 2024
22 of 24 checks passed
@OSBotify
Contributor

✋ This PR was not deployed to staging yet because QA is ongoing. It will be automatically deployed to staging after the next production release.

Contributor

🚀 Deployed to staging by https://github.com/neil-marcellini in version: 9.0.66-0 🚀

platform result
🤖 android 🤖 success ✅
🖥 desktop 🖥 success ✅
🍎 iOS 🍎 success ✅
🕸 web 🕸 success ✅
🤖🔄 android HybridApp 🤖🔄 success ✅
🍎🔄 iOS HybridApp 🍎🔄 success ✅

Contributor

🚀 Deployed to production by https://github.com/mountiny in version: 9.0.66-8 🚀

platform result
🤖 android 🤖 success ✅
🖥 desktop 🖥 success ✅
🍎 iOS 🍎 success ✅
🕸 web 🕸 success ✅
🤖🔄 android HybridApp 🤖🔄 failure ❌
🍎🔄 iOS HybridApp 🍎🔄 failure ❌

5 participants