Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(otelarrowreceiver): asynchronous stream operations #181

Merged

Conversation

moh-osman3
Copy link
Contributor

@moh-osman3 moh-osman3 commented Apr 23, 2024

This PR

@moh-osman3 moh-osman3 force-pushed the mohosman/apply-bounded-semaphore branch from 348bb99 to 366936e Compare April 25, 2024 13:59
@moh-osman3 moh-osman3 marked this pull request as ready for review April 25, 2024 16:20
collector/go.mod Outdated Show resolved Hide resolved
if err != nil {
response.bytesToRelease = int64(0)
} else {
response.bytesToRelease = uncompSize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better than before, thanks for working on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the helpful review!

Comment on lines +440 to +443
err = r.boundedQueue.Acquire(ctx, int64(prevAcquiredBytes))
if err != nil {
return fmt.Errorf("breaking stream: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking the stream when there are too many waiters -- LGTM.

@jmacd
Copy link
Contributor

jmacd commented May 7, 2024

@lquerel please take a look, thanks!

Copy link
Contributor

@lquerel lquerel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it seems good to me. However, I suggest improving the logs when the size specified in the header does not match the actual size, for the reasons mentioned in the comments.

// bounded queue to memory limit based on incoming uncompressed request size and waiters.
// Acquire will fail immediately if there are too many waiters,
// or will otherwise block until timeout or enough memory becomes available.
err = r.boundedQueue.Acquire(ctx, int64(prevAcquiredBytes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents a buggy (or misbehaving) client/exporter from setting the 'otlp-pdata-size' arbitrarily high? Without an upper limit or some other form of protection, could this create a type of DoS attack at minimal cost? Could we detect when the 'otlp-pdata-size' does not match the actual size of the message after decompression and then ban the sender for a period of time?

EDIT: Okay, there is detection in place with logging. Could we create some form of structured logs/events containing enough information to enable external (or internal) banning of the same clients that send us invalid data? The ban mechanism could be the topic of a future PR or be an external process triggered by these events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Yes we log in the case that the header otlp-pdata-size, but I think you are correct that we could face a DoS attack if we accept requests with headers that acquire a large portion of the semaphore. I can file an issue for this and would love to here more about how to get the appropriate client information that would allow us to block these clients.

Comment on lines 741 to 747
if sizeHeaderFound {
// a mismatch between header set by exporter and the uncompSize just calculated.
r.telemetry.Logger.Debug("mismatch between uncompressed size in receiver and otlp-pdata-size header", zap.Int("uncompsize", int(uncompSize)), zap.Int("otlp-pdata-size", int(response.bytesToRelease)))
} else if diff < 0 {
// proto.Size() on compressed request was greater than pdata uncompressed size.
r.telemetry.Logger.Debug("uncompressed size is less than compressed size", zap.Int("uncompressed", int(uncompSize)), zap.Int("compressed", int(response.bytesToRelease)))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we must ensure that these structured logs contain enough information to enable us to ban buggy or misbehaving clients/exporters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any significant work is needed for this, I'd like to file an issue and come back to it. OTLP doesn't have an in-band way to identify the sender -- I believe we have minimal information in the gRPC peer struct, plus any headers we add ourselves. If we could standardize on a useful header for the exporter to emit, then we could log it here.

Copy link
Contributor Author

@moh-osman3 moh-osman3 May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created this issue #185 to work on in a followup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However after thinking about this more I'm wondering how reliable the address is in the client.Info that we get from the incoming stream context. Does this address show up consistently https://github.com/open-telemetry/opentelemetry-collector/blob/main/client/client.go#L93

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be enough information to identify harmful clients and block them from sending requests to the receiver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lquerel @jmacd added the client.Info.Addr to the log message if it is available. This issue #185 can be used to track a followup to ban clients that are misusing the header, but I would like to know your thoughts on the best way to go about this. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if client.Info is reliable. But we can probably find a way to get the client IP address at the gRPC connection level.

@jmacd jmacd merged commit 62b0393 into open-telemetry:main May 9, 2024
2 checks passed
jmacd added a commit that referenced this pull request May 9, 2024
Included PRs:
- #174
- #181
- #184
- #186
- #188

---------

Co-authored-by: Joshua MacDonald <[email protected]>
jmacd added a commit that referenced this pull request May 10, 2024
This PR adds information about the new config options added in
#181 to the
otelarrowreceiver readme.

---------

Co-authored-by: Joshua MacDonald <[email protected]>
codeboten referenced this pull request in open-telemetry/opentelemetry-collector-contrib May 14, 2024
…or to v0.23.0 (#33055)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
|
[github.com/open-telemetry/otel-arrow/collector](https://togithub.com/open-telemetry/otel-arrow)
| `v0.22.0` -> `v0.23.0` |
[![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fopen-telemetry%2fotel-arrow%2fcollector/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/go/github.com%2fopen-telemetry%2fotel-arrow%2fcollector/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/go/github.com%2fopen-telemetry%2fotel-arrow%2fcollector/v0.22.0/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fopen-telemetry%2fotel-arrow%2fcollector/v0.22.0/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

> [!WARNING]
> Some dependencies could not be looked up. Check the Dependency
Dashboard for more information.

---

### Release Notes

<details>
<summary>open-telemetry/otel-arrow
(github.com/open-telemetry/otel-arrow/collector)</summary>

###
[`v0.23.0`](https://togithub.com/open-telemetry/otel-arrow/releases/tag/v0.23.0)

[Compare
Source](https://togithub.com/open-telemetry/otel-arrow/compare/v0.22.0...v0.23.0)

##### What's Changed

- Update go.mod files post v0.22, restore CI by
[@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/180](https://togithub.com/open-telemetry/otel-arrow/pull/180)
- (boundedqueue): new semaphore implementation by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/174](https://togithub.com/open-telemetry/otel-arrow/pull/174)
- (concurrentbatchprocessor): propagate metadataKeys correctly when
using `multiShardBatcher` by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/184](https://togithub.com/open-telemetry/otel-arrow/pull/184)
- (otelarrowreceiver): asynchronous stream operations by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/181](https://togithub.com/open-telemetry/otel-arrow/pull/181)
- Remove the FIFO prioritizer; use least-loaded over all streams by
default by [@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/186](https://togithub.com/open-telemetry/otel-arrow/pull/186)
- Lint fixes from open collector-contrib PRs. by
[@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/188](https://togithub.com/open-telemetry/otel-arrow/pull/188)
- Release otel-arrow v0.23.0 by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/187](https://togithub.com/open-telemetry/otel-arrow/pull/187)

**Full Changelog**:
open-telemetry/otel-arrow@v0.22.0...v0.23.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "on tuesday" (UTC), Automerge - At any
time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/open-telemetry/opentelemetry-collector-contrib).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zNTEuMiIsInVwZGF0ZWRJblZlciI6IjM3LjM1MS4yIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiLCJyZW5vdmF0ZWJvdCJdfQ==-->

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: opentelemetrybot <[email protected]>
Co-authored-by: Alex Boten <[email protected]>
codeboten referenced this pull request in open-telemetry/opentelemetry-collector-contrib May 16, 2024
…3.0 (#33050)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
|
[github.com/open-telemetry/otel-arrow](https://togithub.com/open-telemetry/otel-arrow)
| `v0.22.0` -> `v0.23.0` |
[![age](https://developer.mend.io/api/mc/badges/age/go/github.com%2fopen-telemetry%2fotel-arrow/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/go/github.com%2fopen-telemetry%2fotel-arrow/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/go/github.com%2fopen-telemetry%2fotel-arrow/v0.22.0/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/go/github.com%2fopen-telemetry%2fotel-arrow/v0.22.0/v0.23.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

> [!WARNING]
> Some dependencies could not be looked up. Check the Dependency
Dashboard for more information.

---

### Release Notes

<details>
<summary>open-telemetry/otel-arrow
(github.com/open-telemetry/otel-arrow)</summary>

###
[`v0.23.0`](https://togithub.com/open-telemetry/otel-arrow/releases/tag/v0.23.0)

[Compare
Source](https://togithub.com/open-telemetry/otel-arrow/compare/v0.22.0...v0.23.0)

##### What's Changed

- Update go.mod files post v0.22, restore CI by
[@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/180](https://togithub.com/open-telemetry/otel-arrow/pull/180)
- (boundedqueue): new semaphore implementation by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/174](https://togithub.com/open-telemetry/otel-arrow/pull/174)
- (concurrentbatchprocessor): propagate metadataKeys correctly when
using `multiShardBatcher` by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/184](https://togithub.com/open-telemetry/otel-arrow/pull/184)
- (otelarrowreceiver): asynchronous stream operations by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/181](https://togithub.com/open-telemetry/otel-arrow/pull/181)
- Remove the FIFO prioritizer; use least-loaded over all streams by
default by [@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/186](https://togithub.com/open-telemetry/otel-arrow/pull/186)
- Lint fixes from open collector-contrib PRs. by
[@&#8203;jmacd](https://togithub.com/jmacd) in
[https://github.com/open-telemetry/otel-arrow/pull/188](https://togithub.com/open-telemetry/otel-arrow/pull/188)
- Release otel-arrow v0.23.0 by
[@&#8203;moh-osman3](https://togithub.com/moh-osman3) in
[https://github.com/open-telemetry/otel-arrow/pull/187](https://togithub.com/open-telemetry/otel-arrow/pull/187)

**Full Changelog**:
open-telemetry/otel-arrow@v0.22.0...v0.23.0

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "on tuesday" (UTC), Automerge - At any
time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/open-telemetry/opentelemetry-collector-contrib).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zNTEuMiIsInVwZGF0ZWRJblZlciI6IjM3LjM2My41IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiLCJyZW5vdmF0ZWJvdCJdfQ==-->

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: opentelemetrybot <[email protected]>
Co-authored-by: Alex Boten <[email protected]>
jmacd added a commit that referenced this pull request May 31, 2024
…ments & restructuring (#205)

Restructure receiver code to improve readability. There are a number of
metrics that are incremented when a batch starts being processed and are
decremented when the batch is finished, but the control flow that
maintained the balance of these updates was convoluted.

The root-cause of #204 is that Arrow batches meant for a consumer to be
processed in order were processed out-of-order. There was a large
function body which served two purposes: consume Arrow data of the
appropriate kind, enter data for the pipeline to consume next. This had
to be split into two parts and should have been done as part of #181.
(I, as reviewer, missed this and find, in hindsight, that the code is
not easy to follow.)

This improves the code structure by moving all stateful aspects of
starting/finishing a request into a new `inFlightData` object which has
a deferrable method to finish the request. Here, we keep:

1. The `inFlightWG` done count
2. The active requests metric
3. The active items metric
4. The active bytes metric
5. The bytes-acquired from the semaphore
6. A per-request span covering Arrow decode
7. Netstat-related instrumentation

Authorization now happens before acquiring from the semaphore. 

A number of `fmt.Errorf()` calls are replaced with `status.Errorf(...)`
and a specific error code. The tests are updated to be more specific.
Several Arrow tests were accidentally canceling the test before an
expected error condition was actually tested, they have been audited and
improved.

One new concurrent-receiver test was added.

Fixes #204.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arrow receiver load handling improvements
3 participants