Currently, `OrderingSender` allows data to be inserted only in order: record 0 must be inserted first, then record 1, then record 2, and so on. This works fine with `seq_join` using a single thread. #742 allowed us to use more than one thread, and now we are seeing stalls on send because there are more tasks within the active window that are ready to insert data but must wait until the `next` pointer moves ahead.
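To make the stall concrete, here is a minimal blocking sketch of ordered insertion. This is a conceptual analogue only, with hypothetical names, not the actual async `OrderingSender` code:

```rust
use std::sync::{Condvar, Mutex};

struct OrderedSender {
    state: Mutex<State>,
    turn: Condvar,
}

struct State {
    next: usize, // index of the next record allowed to write
    buf: Vec<u8>, // serialized records, in order
}

impl OrderedSender {
    fn new() -> Self {
        Self {
            state: Mutex::new(State { next: 0, buf: Vec::new() }),
            turn: Condvar::new(),
        }
    }

    /// Blocks until it is record `i`'s turn, then appends its bytes.
    /// With many parallel tasks in flight, most of them sit here waiting
    /// for `next` to advance; this is the stall described above.
    fn send(&self, i: usize, data: &[u8]) {
        let mut st = self.state.lock().unwrap();
        while st.next != i {
            st = self.turn.wait(st).unwrap();
        }
        st.buf.extend_from_slice(data);
        st.next += 1;
        self.turn.notify_all();
    }
}

fn main() {
    let sender = OrderedSender::new();
    std::thread::scope(|s| {
        // Records arrive out of order; each one waits for its turn.
        for i in [2usize, 0, 1] {
            let sender = &sender;
            s.spawn(move || sender.send(i, &[i as u8]));
        }
    });
    assert_eq!(sender.state.lock().unwrap().buf, vec![0, 1, 2]);
}
```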
In a multi-threaded environment, a better option would be to keep track of non-empty slots: the contiguous region of filled slots at the beginning of the buffer is ready to be sent, bounded by a high watermark. We used to have an implementation of the send buffer that worked that way, `FixedSizeByteVec`, but decided to go a different way.

We should consider bringing it back, at least for cases when the `multi-threading` feature is enabled. When we have more than one thread, spawning up to `active_work()` multiplies in parallel will make them all want to insert their data at approximately the same time, and serializing the writes is not particularly helpful here. @andyleiserson and I were profiling vectorization changes this morning and stumbled upon this issue as a potential source of contention that prevents us from getting the full benefits of parallelism here.
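As a rough illustration of the slot-tracking alternative described above, here is a hedged sketch of a `FixedSizeByteVec`-style buffer. The names and structure are hypothetical, and a real implementation would need to integrate with the async send path:

```rust
use std::sync::Mutex;

struct SlotBuffer {
    inner: Mutex<Inner>,
}

struct Inner {
    base: usize,                 // record index stored in slots[0]
    slots: Vec<Option<Vec<u8>>>, // fixed-size window of pending records
}

impl SlotBuffer {
    fn new(window: usize) -> Self {
        Self { inner: Mutex::new(Inner { base: 0, slots: vec![None; window] }) }
    }

    /// Record `i` writes into its slot without waiting for earlier records,
    /// as long as it falls inside the active window.
    fn insert(&self, i: usize, data: Vec<u8>) -> Result<(), Vec<u8>> {
        let mut inner = self.inner.lock().unwrap();
        let offset = i.checked_sub(inner.base).expect("record already flushed");
        if offset >= inner.slots.len() {
            return Err(data); // outside the window; caller must retry later
        }
        inner.slots[offset] = Some(data);
        Ok(())
    }

    /// Drain the contiguous filled prefix (everything below the high
    /// watermark) so it can be sent over the network.
    fn take_ready(&self) -> Vec<u8> {
        let mut inner = self.inner.lock().unwrap();
        let mut out = Vec::new();
        while matches!(inner.slots.first(), Some(Some(_))) {
            let data = inner.slots.remove(0).unwrap();
            out.extend_from_slice(&data);
            inner.slots.push(None);
            inner.base += 1;
        }
        out
    }
}

fn main() {
    let buf = SlotBuffer::new(4);
    buf.insert(1, vec![11]).unwrap(); // record 1 does not wait for record 0
    buf.insert(0, vec![10]).unwrap();
    assert_eq!(buf.take_ready(), vec![10, 11]); // contiguous prefix: records 0 and 1
}
```

A production version would presumably use a ring buffer and finer-grained synchronization (e.g., per-slot atomics) rather than a single lock; the sketch only shows the split between out-of-order inserts and flushing the contiguous prefix.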
After additional investigation, I no longer believe the evidence points to this being a performance bottleneck.
Here is some timing data:

Elapsed time: 743ms
Total busy duration: 5910ms

| Operation | count | min (μs) | avg (μs) | max (μs) | total (ms) |
|-----------|-------|----------|----------|----------|------------|
| Multiply  | 196608 | 22 | 169 | 13109 | 33284 |
| Tx        | 196608 | 0 | 29 | 12947 | 5796 |
| Rx        | 196608 | 1 | 116 | 7880 | 22848 |
| PRSS      | 196608 | 13 | 14 | 376 | 2932 |
| Circuit   | 48 | 627553 | 706053 | 743124 | 33890 |
Key observations:
- The total worker busy duration of 5.9s is 7.95 times the elapsed wall clock time, indicating good parallel CPU utilization.
- 2.9s is spent in PRSS generation. This is roughly 50% of the total busy time, which is consistent with profiling.
- The total time spent in multiply operations is a bit more than the sum of PRSS, Tx, and Rx time. The tasks are blocked/parked for much of the Tx and Rx time, so the total measured time for these is several multiples of the worker busy time.
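For context, here is a sketch of how per-operation stats like those in the table could be collected. This is hypothetical and may not match the instrumentation that actually produced the numbers above:

```rust
use std::time::Instant;

#[derive(Default)]
struct Stats {
    count: u64,
    min_us: u128,
    max_us: u128,
    total_us: u128,
}

impl Stats {
    fn record(&mut self, us: u128) {
        self.count += 1;
        self.min_us = if self.count == 1 { us } else { self.min_us.min(us) };
        self.max_us = self.max_us.max(us);
        self.total_us += us;
    }

    /// Prints min/avg/max in μs and total in ms, matching the table format.
    fn report(&self, name: &str) {
        let avg = self.total_us / u128::from(self.count.max(1));
        println!(
            "{name}: count {} min {} avg {} max {} total {}",
            self.count, self.min_us, avg, self.max_us, self.total_us / 1000
        );
    }
}

fn main() {
    let mut multiply = Stats::default();
    for _ in 0..4 {
        let start = Instant::now();
        // ... the operation being timed would run here ...
        multiply.record(start.elapsed().as_micros());
    }
    multiply.report("Multiply");
}
```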
This looks convincing, although the Rx numbers do not look spectacular, especially the average. The large p100 for Tx also indicates that some tuning of active work is required; we probably set it too high, but we need to profile with real IO to conclude that it is a bottleneck.