-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCX plugin: Add multi-receive posting to avoid excessive flush #125
Conversation
src/ucx_plugin.c
Outdated
UCP_OP_ATTR_FLAG_NO_IMM_CMPL; | ||
params.cb.recv = recv_handler_nbx; | ||
params.user_data = &pending; | ||
params.datatype = ucp_dt_make_contig(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to pass datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed for all
src/ucx_plugin.c
Outdated
params.cb.recv = recv_handler_nbx; | ||
params.user_data = &pending; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need a callback - can't we just wait for the request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ucx_plugin.c
Outdated
UCP_OP_ATTR_FIELD_DATATYPE; | ||
params.cb.recv = recv_handler_nbx; | ||
params.user_data = &req->pending; | ||
params.datatype = ucp_dt_make_contig(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to pass datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ucx_plugin.c
Outdated
params.op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK | | ||
UCP_OP_ATTR_FIELD_USER_DATA | | ||
UCP_OP_ATTR_FIELD_DATATYPE; | ||
params.cb.recv = recv_handler_nbx; | ||
params.user_data = &req->pending; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be initialized outside the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ucx_plugin.c
Outdated
if (req->used == 0) { | ||
req->worker = comm->worker; | ||
req->pending = 0; | ||
req->count = 0; | ||
req->used = 1; | ||
return req; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems too slow to iterate over the array like that
can we use a freelist? or bitmap (FFS) if num of requests is <=64 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added freelist
req = ucx_request_get(comm); | ||
if (req == NULL) { | ||
return ncclInternalError; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe for send flow we don't need this since there is always a single buffer to send
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me know if this is what you meant but i think we need it as-is because:
- we want to return
*request = req
with always same type as other callbacks - api does not seem to guarantee that we have only one req instance in-flight, so we still need to allocate one
- our callback
->test()
that does progress will need to be called with a valid request
src/ucx_plugin.c
Outdated
if (size != NULL) { | ||
/* Posted receives have completed */ | ||
for (int i = 0; i < req->count; i++) { | ||
size[i] = req->size[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use memcpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
ucp_params.request_size = sizeof(ucx_request_t); | ||
ucp_params.request_init = request_init; | ||
ucp_params.field_mask = UCP_PARAM_FIELD_FEATURES; | ||
ucp_params.features = UCP_FEATURE_TAG | UCP_FEATURE_RMA; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are not using RMA in this plugin. is it better to drop this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brminich reminded me that we still need it for iflush/ucp_get
@@ -1,5 +1,5 @@ | |||
/************************************************************************* | |||
* Copyright (c) 2016-2020, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
* Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: worth updating in none or all files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ucx_plugin.c
Outdated
UCP_OP_ATTR_FIELD_DATATYPE; | ||
params.cb.send = check_handler; | ||
params.user_data = comm; | ||
params.datatype = ucp_dt_make_contig(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can avoid passing datatype here as well
(ucp_dt_make_contig(1)
is a default option when UCP_OP_ATTR_FIELD_DATATYPE
is not passed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
params.op_attr_mask |= UCP_OP_ATTR_FIELD_MEMORY_TYPE; | ||
params.memory_type = mh->mem_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we anyway use "modern" UCP API here why not to pass memory handle for preregistered buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's postpone this change, as it may imply additional limitations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plan is to check that independently from multi-recv fix. but for my understanding, currently, the memh is found internally through rcache from passed addr/len, is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it should be found by rcache
6884b03
to
264e50b
Compare
What?
Add to UCX plugin the capability to post multiple receives at once.
Why?
Posting 8 receives at once reduces the flushing by 8 times, which fixes perf on alltoall_perf (nccl-test).
Selene luna, alltoall_perf, 4 nodes, 32 gpus, tested 1 gpu x 8 processes on 4 nodes, tested 8 gpus x 1 process on 4 nodes
How?
Tested
nccl-test alltoall, allreduce
Versions:
Command: