
Investigating Bugsnag OOM Crashes #1145

Open
Augustyniak opened this issue Jul 3, 2021 · 12 comments
Labels
backlog (We hope to fix this feature/bug in the future), feature request (Request for a new feature)

Comments

@Augustyniak

Description

It's hard to tell what the reason for a given Bugsnag OOM crash was - whether the crash was caused by a memory leak, a retain cycle, or simply high memory usage due to a lack of optimization.

Describe the solution you'd like
A way to tell whether a given OOM crash was a result of normal app usage (where the application just happens to consume too much memory because of a lack of optimizations) or whether there is a memory leak / retain cycle somewhere in the app.

Describe alternatives you've considered
Using Xcode Instruments to profile the app - one can profile only a small subset of all the possible application configurations that production users experience. Looking at a Bugsnag report - even with a lot of breadcrumbs in it - it's hard to replicate the state a user was in and tell whether their OOM crash was the result of a retain cycle / memory leak.

Additional context

I do not have a clear idea of how this could be implemented, but I wonder whether the Bugsnag team has any suggestions / tips / plans for features that could make it easier to detect whether a given OOM crash was a result of a memory leak or a retain cycle.

@mattdyoung

Hi @Augustyniak

Thanks for your thoughts. The challenge with OOM detection on iOS is that Apple doesn't provide an event hook for out-of-memory events. Bugsnag relies on our own heuristic to identify, during app re-launch, whether the previous termination of the app was an unexpected termination by the OS watchdog. We can identify several other known reasons the OS may kill the app, such as device reboot, app upgrade, or an app hang on the main thread, and we send an Out Of Memory error report indicating that the app was likely terminated by the operating system while in the foreground if all other detectable causes have been ruled out.
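
To illustrate the shape of that rule-out heuristic, a sketch like the following captures the idea (this is illustrative only, not Bugsnag's actual implementation; the specific checks and their ordering are assumptions):

```swift
// Illustrative sketch only - not Bugsnag's actual code.
// Run at app launch: decide whether the *previous* run ended in a likely OOM
// by ruling out every other termination cause that can be detected.
struct LastRunState {
    let crashReportFound: Bool      // a crash handler wrote a report last run
    let appWasInForeground: Bool    // last known app state before termination
    let appTerminatedNormally: Bool // applicationWillTerminate was observed
    let osRebooted: Bool            // system boot time changed since last run
    let appWasUpgraded: Bool        // bundle version changed since last run
    let debuggerWasAttached: Bool   // terminations in the debugger don't count
}

func previousLaunchLooksLikeOOM(_ s: LastRunState) -> Bool {
    if s.crashReportFound      { return false } // ordinary crash, not an OOM
    if s.appTerminatedNormally { return false } // user / OS quit cleanly
    if !s.appWasInForeground   { return false } // background evictions are expected
    if s.osRebooted            { return false } // device reboot killed the app
    if s.appWasUpgraded        { return false } // installer killed the app
    if s.debuggerWasAttached   { return false } // e.g. stopped from Xcode
    return true // nothing else explains the termination: report a likely OOM
}
```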

For an OOM crash, we can't run any code at crash time and it's hard to predict in advance when a termination is likely to happen. So in terms of capturing breadcrumbs leading up to an OOM, it's a balance between what would be useful for diagnosis and what would be resource-heavy to capture, as we don't want to significantly impact app performance in cases where the app may not terminate at all.

In theory leaks and retain cycles could be identified by scanning the heap memory, but we'd need to suspend threads while scanning, which would cause a noticeable hang, and we'd need to determine when to perform a scan (possibly relying on memory warning notifications). And even if we detect a leak, identifying the root cause might need the stack trace at allocation time and the memory graph, neither of which can reasonably be tracked in production apps.

We'd suggest continuing to consider which breadcrumbs are most useful to capture the application state, such as those to track the view controller lifecycle, and trying to replicate that state when profiling in Xcode.
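
For example, a manually recorded view-controller lifecycle breadcrumb might look roughly like this (a minimal sketch assuming bugsnag-cocoa's public Bugsnag.leaveBreadcrumb API; CheckoutViewController and its metadata are hypothetical, and Bugsnag can also record many of these breadcrumbs automatically):

```swift
import UIKit
import Bugsnag

class CheckoutViewController: UIViewController {
    override func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        // Record which screen became visible so OOM reports show the
        // navigation path leading up to the termination.
        Bugsnag.leaveBreadcrumb("CheckoutViewController appeared",
                                metadata: ["items": 3],
                                type: .navigation)
    }
}
```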

We're discussing this internally to consider whether there is any other information we could feasibly capture in OOM reports to help with diagnosing these issues.

@mattdyoung added the needs discussion (Requires internal analysis/discussion) label on Jul 7, 2021
@Augustyniak
Author

Augustyniak commented Jul 9, 2021

Thank you for your response @mattdyoung.

> Out Of Memory error report indicating that the app was likely terminated by the operating system while in the foreground if all other detectable causes have been ruled out

Do you have any estimate for the reliability of detection of "Out Of Memory" crashes? Do false positives happen and how often do they happen (percentage-wise)?

When it comes to scanning heap to detect cycles / leaks - thank you for the explanation. I agree that it's probably not worth it if it impacts the performance of an app.

Some of the ideas for how to increase the visibility into OOM crashes:

  1. Add information about how much memory the application uses to Bugsnag crash reports. This would allow us to compare the amount of memory used at the time of the OOM crash with the amount of memory we expect our app to use.
  2. Add information about how much memory the application uses to the "app received memory warning" breadcrumbs. This could be especially helpful when a crash's breadcrumbs contain multiple memory warnings, as it would show whether the application keeps consuming more and more memory.

I realize that both of these could be implemented by a customer of Bugsnag using your public API, but it may be worth adding them to the SDK itself if it improves the experience of working with OOM crashes.
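
For illustration, a customer-side sketch of idea 2 might look roughly like this (assuming Bugsnag's public leaveBreadcrumb API and the Mach phys_footprint counter as the memory metric; the helper names here are my own):

```swift
import UIKit
import Bugsnag

/// Resident memory footprint in bytes (the value Xcode's memory gauge tracks).
func currentMemoryFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(MemoryLayout<task_vm_info_data_t>.size /
                                       MemoryLayout<integer_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    return kr == KERN_SUCCESS ? info.phys_footprint : nil
}

private var memoryWarningObserver: NSObjectProtocol?

/// Call once at startup, e.g. from application(_:didFinishLaunchingWithOptions:).
func installMemoryWarningBreadcrumbs() {
    memoryWarningObserver = NotificationCenter.default.addObserver(
        forName: UIApplication.didReceiveMemoryWarningNotification,
        object: nil,
        queue: .main
    ) { _ in
        // Attach the current footprint so repeated warnings in an OOM report
        // show whether memory usage keeps climbing.
        let footprintMB = (currentMemoryFootprint() ?? 0) / 1_048_576
        Bugsnag.leaveBreadcrumb("Memory warning",
                                metadata: ["footprint_mb": footprintMB],
                                type: .state)
    }
}
```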

@mattdyoung

> Do you have any estimate for the reliability of detection of "Out Of Memory" crashes? Do false positives happen and how often do they happen (percentage-wise)?

No, we're not able to capture data on the different cases ourselves in real-world apps. We suspect there will be some terminations captured as "Out Of Memory" which aren't related to low memory e.g. the OS terminating the app due to the device overheating would have the same signature.

Thanks for the ideas! We are already considering what other diagnostics we can add to make OOM crashes more actionable, and we intend to add these to the default behavior of the SDK itself. This is likely to include additional breadcrumbs: since for OOMs we can't snapshot diagnostics at crash time, capturing memory usage or other state information in breadcrumbs leading up to a crash is likely to prove most useful.

@mattdyoung added the backlog (We hope to fix this feature/bug in the future) and feature request (Request for a new feature) labels and removed the needs discussion (Requires internal analysis/discussion) label on Jul 15, 2021
@sethfri

sethfri commented Aug 26, 2021

I'm curious to hear what the status is on improving OOM diagnostics. My team receives quite a lot of them but has had a difficult time root-causing many of them.

Part of the problem is that OOMs seem to be grouped together despite having wildly dissimilar stack traces. Have you considered modifying the grouping for these (app hangs have the same issue)?

@luke-belton
Member

Hey @sethfri

We've just released a new version of bugsnag-cocoa which detects Thermal Kill errors (where the OS terminates an app due to a device overheating). This was released in v6.12.0. These Thermal Kill errors would previously have been grouped together with OOMs, so you can now detect when devices have crashed as a result of a thermal critical condition.

When OOMs occur, we can't capture a stacktrace so generally we advise looking at breadcrumbs etc. to understand what was happening in the app in the lead up to the event.

For app hangs we do capture stacktraces and the events should be grouped accordingly. If you're seeing app hang events that you believe are not grouped correctly please could you write into [email protected] with links to some examples and we'd be happy to take a look for you?

@firatagdas

I think most of the OOM crashes on Bugsnag are not accurate. I know this because we used Firebase Crashlytics in parallel with Bugsnag: Crashlytics caught the exact crash, but Bugsnag only reported it as an OOM. I love Bugsnag, but we can't rely on it.

You may need to revisit OOM reports IMO.

@mattdyoung

@firatagdas
That sounds strange. Could you email [email protected] with details of this crash as captured by Crashlytics so we can investigate and try to reproduce the issue?

@hovox

hovox commented Oct 8, 2021

Hi @firatagdas. We are also using Crashlytics and Bugsnag; it would be good to know in which cases Crashlytics handles crashes and Bugsnag does not. We have tested different scenarios and they seem to work the same way for non-OOM crashes.

@firatagdas

Hello, @hovox and @mattdyoung. I'll prepare a case when I'm available, but I'm pretty busy at the moment.

I know one of the cases is accessing [[VungleSDK sharedSDK] currentSuperToken] on another thread without waiting for VungleSDK initialization.

Vungle SDK is an Ad SDK. I’ll try to reproduce the issue.

@hovox

hovox commented Dec 2, 2021

Hey Bugsnag team, maybe it would be reasonable to increase the maximum breadcrumb count (e.g. to 200) in the case of OOMs? Since we do not have stack traces, we may need more info and hence a longer breadcrumb history for OOMs.

@mattdyoung

> Hey Bugsnag team, maybe it would be reasonable to increase the maximum breadcrumb count (e.g. to 200) in the case of OOMs? Since we do not have stack traces, we may need more info and hence a longer breadcrumb history for OOMs.

Hi @hovox - we are considering improvements to allow more breadcrumbs in general in the future, so I've flagged this OOM use case to consider as part of that analysis.

@nickdowell
Contributor

> Hey Bugsnag team, maybe it would be reasonable to increase the maximum breadcrumb count (e.g. to 200) in the case of OOMs? Since we do not have stack traces, we may need more info and hence a longer breadcrumb history for OOMs.

In v6.22.0 we increased the default and maximum values for maxBreadcrumbs to 100 and 500, respectively.
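
For anyone wanting to take advantage of that, raising the limit at start-up looks roughly like this (a minimal sketch assuming bugsnag-cocoa 6.22.0 or later and the standard BugsnagConfiguration start-up flow):

```swift
import Bugsnag

// Load the configuration from Info.plist, raise the breadcrumb limit,
// then start Bugsnag as usual, e.g. in application(_:didFinishLaunchingWithOptions:).
let config = BugsnagConfiguration.loadConfig()
config.maxBreadcrumbs = 500   // maximum as of v6.22.0; the default is 100
Bugsnag.start(with: config)
```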
