[Hotfix] CrossRegionHedging: Fixes NullReference Exception Bug #4955
Closed
…bChannelState Object. (#4928)

This PR bumps the `Cosmos.Direct` package version to `3.37.3`, which is released on top of the `EN20241007` release. In a nutshell, the change focuses on cleaning up the unhealthy dangling connections that were created as part of the "Advanced Replica Selection" flow.

**Problem:** One of our recent conversations with the IC3 team helped us unveil a potential issue in the RNTBD connection creation and management flow in the `LoadBalancingPartition`. The team uses .NET SDK version `3.39.0-preview` to connect to Cosmos DB. They recently encountered a potential memory leak in some of their clusters, and upon investigation it appeared that one of the root causes is that the underlying `CosmosClient` keeps a high number of unhealthy, unused `LbChannelState` instances. In a nutshell, here are a few of the account configurations and facts:

- **account-1:** 9 partitions with 1 unique tenant and approximately 4 to 8 clients for that tenant (the client count is 2 × the number of replica regions). Connection warm-up is enabled on this account.
- **account-2:** 2592 partitions with 249 tenants/feds. Connections created in the happy-path scenario: 249 × Y (Y = number of active clients for that account). Connection warm-up is disabled on this account.
- **account-3:** 27 partitions with 13 tenants/feds, using `CreateAndInitialize`. Connection warm-up is enabled on this account.

To understand this in more detail, please take a look at the memory dump below.

![ic3_memory_dump_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/b78585de-5318-436a-8511-0107d570e8d7)
_[Fig-1: A snapshot of the memory dump taken for multiple accounts. It also unveils the potential memory leak caused by the unhealthy connections.]_

Upon further analysis of the memory dump, it is clear that:

- The number of stale unhealthy connections is higher in the accounts where replica validation is enabled along with connection warm-up.
- Without connection warm-up, the number of stale unhealthy connections is comparatively lower, but still enough to increase the memory footprint.

![ic3_new_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/ea3a4f82-daca-40f8-bcd6-d29da806a3c8)
_[Fig-2: The memory footprint increased over time along with incoming requests. The service finally had to be restarted to free up the memory.]_

- Even without the replica validation feature, the memory footprint showed a consistent increase over time.

![ic3_old_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/471a50c3-1786-4f12-b5e3-1ab192c6fef5)
_[Fig-3: Memory consumption of the IC3 partner-api service, which uses an older version (v3.25.0) of the .NET SDK; its memory consumption kept increasing over time.]_

**Analysis:** After digging further into the memory dump and reproducing the scenario locally, it was noted that:

- **With replica validation enabled:** each impacted `LoadBalancingPartition` was holding more than `1` unhealthy stale `LbChannelState` (a wrapper around the `Dispatcher` and a `Channel`) when the connection to the backend replica was closed deterministically.
- **With replica validation disabled:** each impacted `LoadBalancingPartition` was holding exactly `1` unhealthy stale `LbChannelState` when the connection to the backend replica was closed deterministically.

Let's take a look at the diagram below to understand this in more detail:

![image](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/f4e2d494-962b-41da-ae80-0d542d62accf)
_[Fig-4: An instance of the `LoadBalancingPartition` holding more than one entry of unhealthy `LbChannelState`.]_

The memory dump snapshot makes it clear that these stale `LbChannelState` entries are kept in the `LoadBalancingPartition` until they are removed from the `openChannels` list, which maintains the set of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly the case here), they claim extra memory. As the number of partitions and connections grows over time, these unused `LbChannelState` instances claim more and more memory, amounting to a memory leak. This is the potential root cause of the increased memory consumption.

**Proposed Solution:** The fix is discussed briefly below:

- During the replica validation phase, in `OpenConnectionAsync()`, proactively remove all `Unhealthy` connections from the `openChannels` list within the `LoadBalancingPartition`. This guarantees that any unhealthy `LbChannelState` is removed from the `LoadBalancingPartition`, freeing up the additional memory.

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #4467
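The proposed cleanup can be sketched roughly as follows. This is a minimal, self-contained illustration using simplified hypothetical types (`LbChannelState` here is just a flag holder); the actual `Cosmos.Direct` internals differ:

```csharp
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the SDK's channel-state wrapper.
public sealed class LbChannelState
{
    public bool IsHealthy { get; set; }
}

// Hypothetical, simplified stand-in for LoadBalancingPartition.
public sealed class LoadBalancingPartition
{
    private readonly object channelLock = new object();
    private readonly List<LbChannelState> openChannels = new List<LbChannelState>();

    public void Add(LbChannelState channel)
    {
        lock (this.channelLock)
        {
            this.openChannels.Add(channel);
        }
    }

    // In the real SDK this would run during replica validation, inside
    // OpenConnectionAsync(): proactively drop every unhealthy channel so
    // stale LbChannelState entries do not accumulate and leak memory.
    // Returns the number of entries removed.
    public int RemoveUnhealthyChannels()
    {
        lock (this.channelLock)
        {
            return this.openChannels.RemoveAll(channel => !channel.IsHealthy);
        }
    }
}
```

The key design point is that removal happens eagerly at validation time, rather than waiting for the unhealthy entries to be evicted as a side effect of later traffic.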
…ion Object (#4950)

# Pull Request Template

## Description

- Upgrade Resiliency: fixes the code to clean up the unhealthy connection object.
- Region Availability: adds SDK changes for upcoming regions.

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Ashutosh Kiran Pologi <[email protected]>
Co-authored-by: Ashutosh Kiran Pologi (APTLY TECHNOLOGY CORPORATION) (from Dev Box) <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
…#4953)

# Pull Request Template

## Description

Recently, during our v3 SDK CI rolling runs, we observed performance regressions of more than 5% on the `ItemStreamAsync()` APIs.

![image](https://github.com/user-attachments/assets/66cc4f01-2ec6-47e0-b885-5ad74e02bb63)

Upon further investigation, we found that in the non-binary flow we were converting the incoming stream into a `CloneableStream`, which is the likely cause of the regression. Note that this was not caught in the [original version of the binary encoding PR](#4652) because the performance test used to capture that PR's benchmark targeted a real Cosmos container, whereas the CI runs use mocked containers. This PR skips the `CloneableStream` conversion for the non-binary encoding flow. With this change in place, our CI builds started passing:

![image](https://github.com/user-attachments/assets/8293a6e5-6fbc-4953-9de0-37162a081194)

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #IssueNumber
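The shape of the fix can be sketched as follows. This is an illustration only, with hypothetical helper names rather than the SDK's actual internals; the point is that the buffering copy is paid only when binary encoding needs a rewindable stream:

```csharp
using System.IO;

// Hypothetical helper illustrating the idea behind the fix.
public static class RequestStreamHelper
{
    public static Stream PrepareRequestStream(Stream inputStream, bool isBinaryEncodingEnabled)
    {
        if (!isBinaryEncodingEnabled)
        {
            // Non-binary flow: pass the stream through untouched, skipping
            // the CloneableStream-style buffering that caused the regression.
            return inputStream;
        }

        // Binary flow: buffer into a rewindable copy so the payload can be
        // inspected, re-encoded, and replayed.
        MemoryStream buffered = new MemoryStream();
        inputStream.CopyTo(buffered);
        buffered.Position = 0;
        return buffered;
    }
}
```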
# Pull Request Template

## Description

Adds support for using hedging on write requests for multi-master accounts. Note that this comes with the caveat that the SDK will throw more 409 (Conflict) errors; this is expected, and applications that adopt this feature should be prepared to handle these exceptions. The feature is opt-in for availability strategies, and users must explicitly enable it in the availability strategy definition:

```csharp
ItemRequestOptions requestOptions = new ItemRequestOptions
{
    AvailabilityStrategy = new CrossRegionHedgingAvailabilityStrategy(
        threshold: TimeSpan.FromMilliseconds(100),
        thresholdStep: TimeSpan.FromMilliseconds(50),
        multiMasterWritesEnabled: true)
};
```

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #4457

---------

Co-authored-by: Fabian Meiswinkel <[email protected]>
Co-authored-by: Kevin Pilch <[email protected]>
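Because hedged writes can surface extra 409s, callers should treat Conflict as an expected outcome rather than a hard failure. A minimal sketch of that handling (assuming a `container` instance and an `item` with an `id` and partition key; `MyItem` and the property names are illustrative):

```csharp
using System.Net;
using Microsoft.Azure.Cosmos;

// Sketch only: with multi-master write hedging enabled, a create that
// succeeds in one region may race its hedged duplicate in another,
// surfacing a 409 Conflict to the caller.
try
{
    ItemResponse<MyItem> response = await container.CreateItemAsync(
        item,
        new PartitionKey(item.Pk),
        requestOptions);
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
{
    // The item already exists (possibly written by the hedged request in
    // another region); treat as success, or fall back to a read/upsert.
}
```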
# Pull Request Template

## Description

This change fixes a CosmosNullReferenceException that occurs when a client passes a CancellationToken along with a request. Please see #4737 for more information.

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #4737

---------

Co-authored-by: Arooshi Avasthy <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Nalu Tripician <[email protected]>
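For context, the triggering scenario is simply an item operation issued with a caller-supplied token. A sketch (assuming a `container` instance; `MyItem`, the id, and the partition key value are illustrative):

```csharp
using System;
using System.Threading;
using Microsoft.Azure.Cosmos;

// Scenario that previously surfaced the CosmosNullReferenceException:
// a request carrying a caller-supplied CancellationToken.
using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));

ItemResponse<MyItem> response = await container.ReadItemAsync<MyItem>(
    id: "item-id",
    partitionKey: new PartitionKey("pk-value"),
    cancellationToken: cts.Token);
```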
# Pull Request Template

## Description

This PR cherry-picks the following commits:

## Type of change

Please delete options that are not relevant.

## Closing issues

To automatically close an issue: closes #IssueNumber