Use a bitmap inside OrderingSender instead of imposing a strict add order #948

Closed
akoshelev opened this issue Feb 8, 2024 · 2 comments

@akoshelev (Collaborator)

Currently, OrderingSender requires data to be inserted strictly in order: record 0 must be added first, then record 1, then record 2, and so on. This works fine with seq_join using a single thread. #742 allowed us to use more than one thread, and now we are seeing stalls on send because tasks within the active window are ready to insert their data but must wait until the next pointer moves ahead.

In a multi-threaded environment, a better option would be to keep track of non-empty slots with a bitmap: the contiguous region at the beginning of the buffer is ready to be sent, and a high watermark marks how far the filled slots extend. We used to have a send buffer implementation that worked that way, FixedSizeByteVec, but decided to go a different way. A rough sketch of the idea is below.
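Here is a minimal sketch of that idea in Rust (hypothetical names, not the actual OrderingSender API, and shown single-threaded for clarity; a real version would need atomics or per-slot synchronization for the bitmap):

```rust
// Writers may fill any slot inside the active window, a bitmap records which
// slots are occupied, and the sender drains only the contiguous prefix of
// filled slots, so record i's insert never waits for record i-1.
struct BitmapBuffer {
    slots: Vec<Option<Vec<u8>>>, // one slot per record in the active window
    filled: Vec<bool>,           // bitmap: slot i has data ready
    next_to_send: usize,         // low watermark: first record not yet flushed
}

impl BitmapBuffer {
    fn new(window: usize) -> Self {
        Self {
            slots: vec![None; window],
            filled: vec![false; window],
            next_to_send: 0,
        }
    }

    /// Insert record `i`'s payload; inserts can arrive in any order.
    fn insert(&mut self, i: usize, data: Vec<u8>) {
        let idx = i % self.slots.len();
        self.slots[idx] = Some(data);
        self.filled[idx] = true;
    }

    /// Drain the contiguous run of filled slots starting at `next_to_send`;
    /// this is the "ready to send" region described above.
    fn drain_ready(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        loop {
            let idx = self.next_to_send % self.slots.len();
            if !self.filled[idx] {
                break;
            }
            out.push(self.slots[idx].take().expect("filled slot has data"));
            self.filled[idx] = false;
            self.next_to_send += 1;
        }
        out
    }
}
```

The point is that only drain_ready cares about contiguity; individual writers never serialize on each other.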

We should consider bringing it back, at least when the multi-threading feature is enabled. With more than one thread, spawning up to active_work() multiplications in parallel means they all want to insert their data at approximately the same time, and serializing those writes is not particularly helpful. @andyleiserson and I were profiling the vectorization changes this morning and stumbled upon this issue as a potential source of contention that prevents us from getting the full benefit of parallelism here.

[Attached image: mul-tx-rx]

@andyleiserson (Collaborator)

After additional investigation, I no longer believe the evidence points to this being a performance bottleneck.

Here is some timing data (min/avg/max are in μs; total is in ms):

Elapsed time: 743ms
Total busy duration: 5910ms
Multiply: count 196608 min 22 avg 169 max 13109 total 33284
Tx: count 196608 min 0 avg 29 max 12947 total 5796
Rx: count 196608 min 1 avg 116 max 7880 total 22848
PRSS: count 196608 min 13 avg 14 max 376 total 2932
Circuit: count 48 min 627553 avg 706053 max 743124 total 33890

Key observations:

  • The total worker busy duration of 5.9s is 7.95 times the elapsed wall clock time, indicating good parallel CPU utilization (the arithmetic is checked in the snippet below this list).
  • 2.9s is spent in PRSS generation. This is roughly 50% of the total busy time, which is consistent with profiling.
  • The total time spent in multiply operations is a bit more than the sum of the PRSS, Tx, and Rx time. The tasks are blocked/parked for much of the Tx and Rx time, so the total measured time for these spans is several multiples of the worker busy time.
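For reference, a small Rust snippet that checks the arithmetic behind the observations (the numbers are copied from the table above; nothing here is part of the codebase):

```rust
fn main() {
    // Values copied from the timing table above, all in milliseconds.
    let elapsed = 743.0_f64;
    let busy = 5910.0_f64;
    let prss = 2932.0_f64;
    let (multiply, tx, rx) = (33284.0_f64, 5796.0_f64, 22848.0_f64);

    println!("busy / elapsed     = {:.2}", busy / elapsed);       // ~7.95
    println!("PRSS share of busy = {:.0}%", 100.0 * prss / busy); // ~50%
    println!("PRSS + Tx + Rx     = {}", prss + tx + rx);          // 31576
    println!("Multiply total     = {}", multiply);                // 33284, a bit more
}
```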

@akoshelev (Collaborator, Author)

This looks convincing, although the Rx numbers do not look spectacular, especially the average. The large p100 for Tx also indicates that active work needs some tuning (we probably set it too high), but we need to profile with real IO to conclude that it is a bottleneck.

I will probably close this issue tomorrow.
