
[exporter/prometheusremotewrite] Fix data race in batch series state if called concurrently #36524

Conversation

ArthurSens
Member

Description

Fixes a data race when batching time series during export. The problem was that multiple goroutines could call PushMetrics simultaneously, and each call reused the same state address, which could be written without any locks or atomic operations [1][2][3].

Link to tracking issue

The problem was discovered during the review of #35184.

Testing

The first commit adds a test that calls PushMetrics from several goroutines. The second commit fixes the data race so the test passes.
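
For illustration, a minimal sketch of the shape of such a test (run with `go test -race` to surface the race); `newTestExporter` and `generateTestMetrics` are hypothetical stand-ins for the package's real test helpers:

```go
package prometheusremotewriteexporter

import (
	"context"
	"sync"
	"testing"
)

// Sketch only: newTestExporter and generateTestMetrics are hypothetical
// helpers, not the package's actual test setup.
func TestPushMetricsConcurrency(t *testing.T) {
	exporter := newTestExporter(t) // hypothetical: prwExporter wired to a test server

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			md := generateTestMetrics(100) // hypothetical: builds pmetric.Metrics
			if err := exporter.PushMetrics(context.Background(), md); err != nil {
				t.Error(err)
			}
		}()
	}
	wg.Wait()
}
```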

@ArthurSens
Member Author

I chose atomic instead of mutexes because locks are usually slower.

Int64 was chosen somewhat arbitrarily; I'm not sure which size to pick 😅
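
For illustration, a minimal sketch of what the atomic state could look like (the field name and methods are illustrative, not the exporter's actual code):

```go
package prometheusremotewriteexporter

import "sync/atomic"

// Sketch only: the field name is illustrative, not necessarily the exporter's.
type batchTimeSeriesState struct {
	// Size hint for the next request buffer, stored as an atomic.Int64 so
	// concurrent PushMetrics calls can read and update it without a race.
	nextRequestBufferSize atomic.Int64
}

func (s *batchTimeSeriesState) store(size int64) { s.nextRequestBufferSize.Store(size) }

func (s *batchTimeSeriesState) load() int64 { return s.nextRequestBufferSize.Load() }
```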

ArthurSens force-pushed the prwexporter-batchSeries-concurrencybug branch from 910ad54 to 54b9015 on November 25, 2024 18:52
ArthurSens changed the title from "Prwexporter batch series concurrencybug" to "[exporter/prometheusremotewrite] Fix data race in batch series state if called concurrently" on Nov 25, 2024
ArthurSens force-pushed the prwexporter-batchSeries-concurrencybug branch from 54b9015 to c79b58c on November 25, 2024 18:58
ArthurSens force-pushed the prwexporter-batchSeries-concurrencybug branch from c79b58c to e41c6a2 on November 25, 2024 19:21
@dashpole
Contributor

I chose atomic instead of mutexes because locks are usually slower.

Are you up for benchmarking that? I think we have benchmarks.

@@ -229,7 +228,7 @@ func (prwe *prwExporter) handleExport(ctx context.Context, tsMap map[string]*pro
 	}

 	// Calls the helper function to convert and batch the TsMap to the desired format
-	requests, err := batchTimeSeries(tsMap, prwe.maxBatchSizeBytes, m, &prwe.batchTimeSeriesState)
+	requests, err := batchTimeSeries(tsMap, prwe.maxBatchSizeBytes, m, prwe.batchTimeSeriesState)
Member

Does this actually fix the problem? If the single *batchTimeSeriesState value is used by multiple goroutines calling prwe.PushMetrics() concurrently, it seems that moving to atomic integer access will avoid the data races that would be detected by the runtime, but it wouldn't necessarily make the changes to that state valid. What happens if there are multiple batches processed concurrently that have significantly different sizes?

Member Author

What happens if there are multiple batches processed concurrently that have significantly different sizes?

Good point; they would still share the same state, and their results would conflict. I think I need to go back to the drawing board and think a bit more about how to solve this.

Member Author

I'm rereading the code, and my understanding is that concurrent requests with very distinct batch sizes would constantly fight for the size of the subsequent request.

Can we even do something useful with the batchStateSize if we allow multiple workers? It sounds like this optimization only works for a single worker scenario 🤔

Member

I'm rereading the code, and my understanding is that concurrent requests with very distinct batch sizes would constantly fight for the size of the subsequent request.

That's how I understand it, as well. I also think it likely that each of the three sizes tracked by this state would be decorrelated, though I'm not sure that's any more problematic.

Can we even do something useful with the batchStateSize if we allow multiple workers? It sounds like this optimization only works for a single worker scenario 🤔

I'm not sure this optimization is safe with multiple workers. Would it make more sense to use a sync.Pool of backing stores that can eventually grow to the needed size and get periodically reaped to avoid one-off large batches causing leaks? Similar to what is done in #35184?
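
For illustration, a rough sketch of the sync.Pool idea (the type, field, and helper here are illustrative, not the implementation that eventually landed):

```go
package prometheusremotewriteexporter

import "sync"

// Sketch only: the struct and field stand in for the exporter's batching state.
type batchTimeSeriesState struct {
	nextRequestBufferSize int
}

// Pool of batch states; each state keeps the sizes it has seen and is reused
// across exports instead of being shared between them.
var batchStatePool = sync.Pool{
	New: func() any {
		return &batchTimeSeriesState{nextRequestBufferSize: 128}
	},
}

// Each export borrows a private state from the pool and returns it when done,
// so concurrent PushMetrics calls never write the same size hints.
func withBatchState(export func(*batchTimeSeriesState) error) error {
	state := batchStatePool.Get().(*batchTimeSeriesState)
	defer batchStatePool.Put(state)
	return export(state)
}
```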

Member Author

Using a sync.Pool sounds worth exploring! We could also remove the state altogether and see how much the benchmarks regress.

I'm trying things out and running benchmarks; I'll open new PRs once I have something to show :)

ArthurSens added a commit to ArthurSens/opentelemetry-collector-contrib that referenced this pull request Nov 26, 2024
Signed-off-by: Arthur Silva Sens <[email protected]>
@ArthurSens
Member Author

I did some benchmarks comparing atomic vs. mutex, but they're probably not relevant anymore.

I chose atomic instead of mutexes because locks are usually slower.

Are you up for benchmarking that? I think we have benchmarks.

Yep, so I'm comparing two branches in this benchmark. The first column is this branch, and the second column uses mutexes.

arthursens$ benchstat base=atomic.txt new=mutex.txt 
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter
cpu: Apple M2 Pro
                          │     base     │                new                │
                          │    sec/op    │   sec/op     vs base              │
Execute/numSample=100-2     47.47µ ±  3%   47.00µ ± 6%       ~ (p=0.093 n=6)
Execute/numSample=1000-2    169.3µ ± 15%   163.7µ ± 8%       ~ (p=0.310 n=6)
Execute/numSample=10000-2   1.326m ± 10%   1.408m ± 5%       ~ (p=0.240 n=6)
geomean                     220.1µ         221.3µ       +0.54%

                          │     base     │                new                 │
                          │     B/op     │     B/op      vs base              │
Execute/numSample=100-2     6.081Ki ± 0%   6.081Ki ± 0%       ~ (p=1.000 n=6)
Execute/numSample=1000-2    22.12Ki ± 0%   22.12Ki ± 0%       ~ (p=0.675 n=6)
Execute/numSample=10000-2   38.52Ki ± 0%   38.51Ki ± 0%       ~ (p=0.121 n=6)
geomean                     17.30Ki        17.30Ki       -0.01%

                          │    base    │                new                 │
                          │ allocs/op  │ allocs/op   vs base                │
Execute/numSample=100-2     79.00 ± 0%   79.00 ± 0%       ~ (p=1.000 n=6) ¹
Execute/numSample=1000-2    83.00 ± 0%   83.00 ± 0%       ~ (p=1.000 n=6) ¹
Execute/numSample=10000-2   83.00 ± 0%   83.00 ± 0%       ~ (p=1.000 n=6) ¹
geomean                     81.64        81.64       +0.00%
¹ all samples are equal

The main difference between the mutex and atomic approaches is that the mutex blocks the whole batching operation for a few nanoseconds, so we never write to the same memory address simultaneously, while the atomic version uses CAS, ensuring we only overwrite the value if another goroutine hasn't changed it in the meantime.
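
As a rough illustration of the two approaches being compared (the names and the grow-only update policy are illustrative, not the exporter's actual logic):

```go
package prometheusremotewriteexporter

import (
	"sync"
	"sync/atomic"
)

// Mutex variant: the whole read-modify-write of the size hint is serialized,
// so two goroutines can never touch the field at the same time.
type mutexState struct {
	mu                    sync.Mutex
	nextRequestBufferSize int
}

func (s *mutexState) update(observed int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if observed > s.nextRequestBufferSize {
		s.nextRequestBufferSize = observed
	}
}

// Atomic variant: a compare-and-swap loop only overwrites the value if no
// other goroutine has changed it since we read it; otherwise it retries.
type atomicState struct {
	nextRequestBufferSize atomic.Int64
}

func (s *atomicState) update(observed int64) {
	for {
		current := s.nextRequestBufferSize.Load()
		if observed <= current {
			return
		}
		if s.nextRequestBufferSize.CompareAndSwap(current, observed) {
			return
		}
	}
}
```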

The results don't show much difference 🤷, so it's just a matter of preference for how the code looks.

Comparing atomic with main, there's no remarkable difference either.

arthursens$ benchstat base=main.txt new=atomic.txt 
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter
cpu: Apple M2 Pro
                          │     base     │                new                 │
                          │    sec/op    │    sec/op     vs base              │
Execute/numSample=100-2     47.28µ ±  1%   47.47µ ±  3%       ~ (p=0.485 n=6)
Execute/numSample=1000-2    168.3µ ±  6%   169.3µ ± 15%       ~ (p=0.937 n=6)
Execute/numSample=10000-2   1.447m ± 10%   1.326m ± 10%       ~ (p=0.240 n=6)
geomean                     225.8µ         220.1µ        -2.52%

                          │     base     │                new                 │
                          │     B/op     │     B/op      vs base              │
Execute/numSample=100-2     6.081Ki ± 0%   6.081Ki ± 0%       ~ (p=1.000 n=6)
Execute/numSample=1000-2    22.12Ki ± 0%   22.12Ki ± 0%       ~ (p=0.513 n=6)
Execute/numSample=10000-2   38.50Ki ± 0%   38.52Ki ± 0%       ~ (p=0.394 n=6)
geomean                     17.30Ki        17.30Ki       +0.02%

                          │    base    │                new                 │
                          │ allocs/op  │ allocs/op   vs base                │
Execute/numSample=100-2     79.00 ± 0%   79.00 ± 0%       ~ (p=1.000 n=6) ¹
Execute/numSample=1000-2    83.00 ± 0%   83.00 ± 0%       ~ (p=1.000 n=6) ¹
Execute/numSample=10000-2   83.00 ± 0%   83.00 ± 0%       ~ (p=1.000 n=6) ¹
geomean                     81.64        81.64       +0.00%
¹ all samples are equal

@ArthurSens
Member Author

Ok, I've created two alternative PRs to this one.

I couldn't see any relevant differences in the benchmarks, so I'd love to understand if I did anything wrong.

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions bot added the Stale label on Dec 14, 2024
@ArthurSens
Member Author

Superseded by #36601

ArthurSens closed this on Dec 16, 2024
ArthurSens deleted the prwexporter-batchSeries-concurrencybug branch on December 16, 2024 20:19
bogdandrutu pushed a commit that referenced this pull request Dec 26, 2024
… batch state (#36601)

#### Description
This is an alternative for
#36524
and #36600

This PR does a couple of things:

* Add a test written by @edma2 that shows a data race on the batch state
when running multiple consumers.
* Add a benchmark for PushMetrics, with options to run with a stable
number of metrics or varying metrics.
* Fix the data race by introducing a `sync.Pool` of batch states.

#### Benchmark results

results comparing `main`,
#36600
and this PR:

```console
arthursens$ benchstat main.txt withoutState.txt syncpool.txt 
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter
cpu: Apple M2 Pro
                            │  main.txt   │          withoutState.txt           │         syncpool.txt         │
                            │   sec/op    │    sec/op     vs base               │   sec/op     vs base         │
PushMetricsVaryingMetrics-2   8.066m ± 5%   13.821m ± 9%  +71.36% (p=0.002 n=6)   8.316m ± 6%  ~ (p=0.065 n=6)

                            │   main.txt   │           withoutState.txt            │            syncpool.txt            │
                            │     B/op     │     B/op       vs base                │     B/op      vs base              │
PushMetricsVaryingMetrics-2   5.216Mi ± 0%   34.436Mi ± 0%  +560.17% (p=0.002 n=6)   5.548Mi ± 0%  +6.36% (p=0.002 n=6)

                            │  main.txt   │       withoutState.txt       │         syncpool.txt         │
                            │  allocs/op  │  allocs/op   vs base         │  allocs/op   vs base         │
PushMetricsVaryingMetrics-2   56.02k ± 0%   56.05k ± 0%  ~ (p=0.721 n=6)   56.04k ± 0%  ~ (p=0.665 n=6)
```

---------

Signed-off-by: Arthur Silva Sens <[email protected]>