[Hotfix] CrossRegionHedging: Fixes NullReference Exception Bug #4955

Closed
wants to merge 5 commits

Conversation

kundadebdatta
Member

Pull Request Template

Description

This PR cherry-picks the following commits:

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #IssueNumber

kundadebdatta and others added 5 commits December 23, 2024 16:14
…bChannelState Object. (#4928)

This PR bumps the `Cosmos.Direct` package version to `3.37.3`, which is
released on top of the `EN20241007` release. In a nutshell, the change
is focused on cleaning up unhealthy dangling connections that were
created as part of the "Advanced Replica Selection" flow.

**Problem:** One of our recent conversations with the IC3 team helped
us uncover a potential issue with the RNTBD connection creation and
management flow in the `LoadBalancingPartition`. The team is using
.NET SDK version `3.39.0-preview` to connect to Cosmos DB. They recently
encountered a potential memory leak in some of their clusters, and upon
investigation it appeared that one of the root causes is that the
underlying `CosmosClient` keeps a high number of unhealthy and unused
`LbChannelState`s.

In a nutshell, below are a few of the account configurations and facts:

• **account-1:** 9 partitions with 1 unique tenant. There are approximately
4 to 8 clients for this tenant (the client count is 2 × the number of
replica regions). Connection warm-up is enabled on this account.

• **account-2:** 2592 partitions with 249 tenants/feds. Connections
created in the happy-path scenario: 249 × Y, where Y is the number of
active clients for that account (a worked example follows this list).
Connection warm-up is disabled on this account.

• **account-3:** 27 partitions with 13 tenants/feds. Connection warm-up
(via `CreateAndInitialize`) is enabled on this account.
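
To make the scale concrete: taking account-2 above and assuming, purely for
illustration, Y = 4 active clients per tenant, the happy-path connection count
alone would be roughly 249 × 4 = 996 connections, before counting any stale
unhealthy channels that accumulate on top of them.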

To understand this in more detail, please take a look at the memory dump
below.

![ic3_memory_dump_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/b78585de-5318-436a-8511-0107d570e8d7)

_[Fig-1: The above figure shows a snapshot of the memory dump taken for
multiple accounts. It also unveils the potential memory leak caused by
the unhealthy connections.]_

Upon further analysis of the memory dump, it is clear that:

- The number of stale unhealthy connections is higher in the accounts
where replica validation is enabled along with connection warm-up.
- Without connection warm-up, the number of stale unhealthy connections
is comparatively lower, but still high enough to increase the memory
footprint.

![ic3_new_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/ea3a4f82-daca-40f8-bcd6-d29da806a3c8)

_[Fig-2: The above figure shows how the memory footprint increased over
time, along with incoming requests. The service eventually had to be
restarted to free up the memory.]_

- Even without the replica validation feature, the memory footprint
showed a consistent increase over time.

![ic3_old_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/471a50c3-1786-4f12-b5e3-1ab192c6fef5)

_[Fig-3: The above figure shows the memory consumption from the IC3
partner-api service, which is using an older version (v3.25.0) of the
.NET SDK; its memory consumption kept increasing over time.]_

**Analysis:** Upon digging further into the memory dump and reproducing
the scenario locally, it was noted that:

- **With Replica Validation Enabled:** Each of the impacted
`LoadBalancingPartition` was holding more than `1` Unhealthy stale
`LbChannelState` (which is a wrapper around the `Dispatcher` and a
`Channel`), when the connection to the backend replica was closed
deterministically.

- **With Replica Validation Disabled:** Each of the impacted
`LoadBalancingPartition` was holding exactly `1` Unhealthy stale
`LbChannelState` (which is a wrapper around the `Dispatcher` and a
`Channel`), when the connection to the backend replica was closed
deterministically.

Let's take a look at the below diagram to understand this in more
detail:

![image](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/f4e2d494-962b-41da-ae80-0d542d62accf)

_[Fig-4: The above figure shows an instance of the
`LoadBalancingPartition` holding more than one entry of unhealthy
`LbChannelState`.]_

Looking at the above memory dump snapshot, it is clear that these stale
`LbChannelState` entries are kept in the `LoadBalancingPartition` until
they are removed from the `openChannels` list, which is responsible for
maintaining the set of channels (healthy or unhealthy) for that
particular endpoint. If they are not cleaned up proactively (which is
exactly the case here), they end up claiming extra memory. As the number
of partitions and connections grows over time, things get worse, with all
these unused `LbChannelState`s claiming more and more memory and
effectively causing a memory leak. This is the potential root cause of
the increased memory consumption.

**Proposed Solution:**

A few changes are proposed to fix this scenario. They are discussed
briefly below:

- During the replica validation phase, in `OpenConnectionAsync()`,
proactively remove all the `Unhealthy` connections from the
`openChannels` list within the `LoadBalancingPartition`. This guarantees
that any unhealthy `LbChannelState`s are removed from the
`LoadBalancingPartition`, freeing up the additional memory (a rough
sketch of the idea follows below).
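
A minimal sketch of the cleanup idea is shown below. `LoadBalancingPartition`
and `LbChannelState` are internal SDK types, so the interface and member names
used here (`IChannelState`, `IsHealthy`, `Dispose`) are placeholders for
illustration only, not the SDK's actual API:

```
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in for the SDK's channel-state wrapper; illustrative only.
internal interface IChannelState : IDisposable
{
    bool IsHealthy { get; }
}

internal static class ChannelCleanupSketch
{
    // Idea: during OpenConnectionAsync, drop every unhealthy channel from the
    // partition's openChannels list so it can be disposed and garbage collected
    // instead of lingering and inflating the memory footprint.
    internal static void RemoveUnhealthyChannels(List<IChannelState> openChannels)
    {
        List<IChannelState> unhealthy = openChannels.Where(c => !c.IsHealthy).ToList();
        foreach (IChannelState channel in unhealthy)
        {
            channel.Dispose();            // release the underlying connection
            openChannels.Remove(channel); // stop tracking the stale channel state
        }
    }
}
```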

Please delete options that are not relevant.

- [x] Bug fix (non-breaking change which fixes an issue)

To automatically close an issue: closes #4467
…ion Object (#4950)

# Pull Request Template

## Description

- Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection Object
- Region Availability: Adding SDK changes for Upcoming Regions.

## Type of change

Please delete options that are not relevant.

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #IssueNumber

---------

Co-authored-by: Ashutosh Kiran Pologi <[email protected]>
Co-authored-by: Ashutosh Kiran Pologi (APTLY TECHNOLOGY CORPORATION) (from Dev Box) <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
…#4953)

# Pull Request Template

## Description

Recently, during our v3 SDK CI rolling runs, we observed some performance
regressions on the `ItemStreamAsync()` APIs. They regressed by more than 5%.



![image](https://github.com/user-attachments/assets/66cc4f01-2ec6-47e0-b885-5ad74e02bb63)

Upon further investigation, we figured out that during the non-binary
flow we end up converting the incoming stream into a `CloneableStream`,
which might be the reason for this regression. Please note that the
reason this was not caught in the [original version of the binary
encoding PR](#4652) is that the performance test used to capture the
benchmark for that PR targeted a real Cosmos container, whereas for the
CI runs we use our mocked containers.

This PR skips the `CloneableStream` conversion for the non-binary
encoding flow.
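
A rough sketch of the idea, with assumed flag and helper names (the actual
SDK internals differ, and `MemoryStream` stands in for `CloneableStream`):

```
using System.IO;

internal static class StreamPreparationSketch
{
    // Only pay the buffering cost when binary encoding is actually in use;
    // in the non-binary flow the caller's stream is passed through untouched.
    internal static Stream PrepareForTransport(Stream payload, bool binaryEncodingEnabled)
    {
        if (!binaryEncodingEnabled)
        {
            // Non-binary flow: skip the extra copy that caused the regression.
            return payload;
        }

        // Binary flow: buffer into a seekable copy so the payload can be
        // re-read during the binary conversion step.
        MemoryStream buffered = new MemoryStream();
        payload.CopyTo(buffered);
        buffered.Position = 0;
        return buffered;
    }
}
```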

With the above change in place, our CI builds started passing:



![image](https://github.com/user-attachments/assets/8293a6e5-6fbc-4953-9de0-37162a081194)

## Type of change

Please delete options that are not relevant.

- [x] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #IssueNumber

# Pull Request Template

## Description

Please include a summary of the change and which issue is fixed. Include
samples if adding new API, and include relevant motivation and context.
List any dependencies that are required for this change.

## Type of change

Please delete options that are not relevant.

- [x] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #IssueNumber
# Pull Request Template

## Description

Adds support for using hedging on write requests for multi-master
accounts. Note that this does come with the caveat that more 409 errors
will be thrown by the SDK. This is expected, and applications that adopt
this feature should be prepared to handle these exceptions.

This feature is also opt-in for Availability Strategies, and users will
need to explicitly specify it in the availability strategy definition.

```
ItemRequestOptions requestOptions = new ItemRequestOptions
{
    AvailabilityStrategy = new CrossRegionHedgingAvailabilityStrategy(
        threshold: TimeSpan.FromMilliseconds(100),
        thresholdStep: TimeSpan.FromMilliseconds(50),
        multiMasterWritesEnabled: true)
};
```
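
Because hedged writes on multi-master accounts can surface additional 409
(Conflict) responses, callers should handle that status explicitly. A minimal
sketch, assuming a `container`, an `item` of a placeholder `MyItem` type with a
`Pk` partition-key property, and the `requestOptions` defined above:

```
try
{
    ItemResponse<MyItem> response = await container.CreateItemAsync(
        item,
        new PartitionKey(item.Pk),
        requestOptions);
}
catch (CosmosException ex) when (ex.StatusCode == System.Net.HttpStatusCode.Conflict)
{
    // A hedged request may have already created the item in another region;
    // treat the 409 as "already exists" rather than a hard failure.
}
```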

## Type of change

Please delete options that are not relevant.

- [x] New feature (non-breaking change which adds functionality)

## Closing issues

To automatically close an issue: closes #4457

---------

Co-authored-by: Fabian Meiswinkel <[email protected]>
Co-authored-by: Kevin Pilch <[email protected]>
# Pull Request Template

## Description

This change fixes a `CosmosNullReferenceException` that happens when a
client passes a `CancellationToken` along with the request. Please see
#4737 for more information.
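
For context, a minimal sketch of the call pattern that could hit the exception,
assuming a `container`, a placeholder `MyItem` type, and request options carrying
a `CrossRegionHedgingAvailabilityStrategy` (all names are illustrative):

```
using System;
using System.Threading;

// Passing a CancellationToken together with a hedging availability strategy is
// the combination that previously triggered the CosmosNullReferenceException.
using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

ItemResponse<MyItem> response = await container.ReadItemAsync<MyItem>(
    id: "item-id",
    partitionKey: new PartitionKey("pk-value"),
    requestOptions: requestOptions,     // includes the hedging AvailabilityStrategy
    cancellationToken: cts.Token);
```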

## Type of change

Please delete options that are not relevant.

- [x] Bug fix (non-breaking change which fixes an issue)


## Closing issues

To automatically close an issue: closes #4737

---------

Co-authored-by: Arooshi Avasthy <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Nalu Tripician <[email protected]>
@microsoft-github-policy-service microsoft-github-policy-service bot deleted the users/kundadebdatta/hotfix/3.46.1_more branch January 8, 2025 20:30