Use a bitmap inside OrderingSender instead of imposing a strict add order #948

Closed
akoshelev opened this issue Feb 8, 2024 · 2 comments

@akoshelev (Collaborator)

Currently, OrderingSender requires data to be inserted strictly in order: record 0 must be added first, then record 1, then record 2, and so on. This works fine with seq_join using a single thread. #742 allowed us to use more than one thread, and now we are seeing stalls on send because tasks within the active window are ready to insert their data but must wait until the next pointer moves ahead.

In a multi-threaded environment, a better option would be to keep track of non-empty slots with a bitmap: the contiguous region at the beginning of the buffer is ready to be sent, and a high watermark marks how far the filled slots extend. We used to have a send buffer implementation that worked that way, FixedSizeByteVec, but decided to go a different way. A rough sketch of the idea is below.
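Here is a minimal sketch of that idea in Rust (hypothetical names, not the actual OrderingSender API, and shown single-threaded for clarity; a real version would need atomics or per-slot synchronization for the bitmap):

```rust
// Writers may fill any slot inside the active window, a bitmap records which
// slots are occupied, and the sender drains only the contiguous prefix of
// filled slots, so record i's insert never waits for record i-1.
struct BitmapBuffer {
    slots: Vec<Option<Vec<u8>>>, // one slot per record in the active window
    filled: Vec<bool>,           // bitmap: slot i has data ready
    next_to_send: usize,         // low watermark: first record not yet flushed
}

impl BitmapBuffer {
    fn new(window: usize) -> Self {
        Self {
            slots: vec![None; window],
            filled: vec![false; window],
            next_to_send: 0,
        }
    }

    /// Insert record `i`'s payload; inserts can arrive in any order.
    fn insert(&mut self, i: usize, data: Vec<u8>) {
        let idx = i % self.slots.len();
        self.slots[idx] = Some(data);
        self.filled[idx] = true;
    }

    /// Drain the contiguous run of filled slots starting at `next_to_send`;
    /// this is the "ready to send" region described above.
    fn drain_ready(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        loop {
            let idx = self.next_to_send % self.slots.len();
            if !self.filled[idx] {
                break;
            }
            out.push(self.slots[idx].take().expect("filled slot has data"));
            self.filled[idx] = false;
            self.next_to_send += 1;
        }
        out
    }
}
```

The point is that only drain_ready cares about contiguity; individual writers never serialize on each other.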

We should consider bringing it back, at least when the multi-threading feature is enabled. With more than one thread, spawning up to active_work() multiplications in parallel means they all want to insert their data at approximately the same time, and serializing those writes is not particularly helpful. @andyleiserson and I were profiling the vectorization changes this morning and stumbled upon this issue as a potential source of contention that prevents us from getting the full benefit of parallelism here.

[Attached image: mul-tx-rx]

@andyleiserson (Collaborator)

After additional investigation, I no longer believe the evidence points to this being a performance bottleneck.

Here is some timing data (min/avg/max are in μs; total is in ms):

Elapsed time: 743ms
Total busy duration: 5910ms
Multiply: count 196608 min 22 avg 169 max 13109 total 33284
Tx: count 196608 min 0 avg 29 max 12947 total 5796
Rx: count 196608 min 1 avg 116 max 7880 total 22848
PRSS: count 196608 min 13 avg 14 max 376 total 2932
Circuit: count 48 min 627553 avg 706053 max 743124 total 33890

Key observations:

  • The total worker busy duration of 5.9s is 7.95 times the elapsed wall clock time, indicating good parallel CPU utilization (the arithmetic is checked in the snippet below this list).
  • 2.9s is spent in PRSS generation. This is roughly 50% of the total busy time, which is consistent with profiling.
  • The total time spent in multiply operations is a bit more than the sum of the PRSS, Tx, and Rx time. The tasks are blocked/parked for much of the Tx and Rx time, so the total measured time for these spans is several multiples of the worker busy time.
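For reference, a small Rust snippet that checks the arithmetic behind the observations (the numbers are copied from the table above; nothing here is part of the codebase):

```rust
fn main() {
    // Values copied from the timing table above, all in milliseconds.
    let elapsed = 743.0_f64;
    let busy = 5910.0_f64;
    let prss = 2932.0_f64;
    let (multiply, tx, rx) = (33284.0_f64, 5796.0_f64, 22848.0_f64);

    println!("busy / elapsed     = {:.2}", busy / elapsed);       // ~7.95
    println!("PRSS share of busy = {:.0}%", 100.0 * prss / busy); // ~50%
    println!("PRSS + Tx + Rx     = {}", prss + tx + rx);          // 31576
    println!("Multiply total     = {}", multiply);                // 33284, a bit more
}
```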

@akoshelev (Collaborator, Author)

This looks convincing, although the Rx numbers do not look spectacular, especially the average. The large p100 for Tx also indicates that active work needs some tuning (we probably set it too high), but we need to profile with real IO to conclude that it is a bottleneck.

I will probably close this issue tomorrow.
