Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCX plugin: Add multi-receive posting to avoid excessive flush #125

Merged
merged 11 commits into from
Oct 18, 2023

Conversation

tvegas1
Copy link
Collaborator

@tvegas1 tvegas1 commented Oct 6, 2023

What?

Add to UCX plugin the capability to post multiple receives at once.

Why?

Posting 8 receives at once reduces the flushing by 8 times, which fixes perf on alltoall_perf (nccl-test).

Selene luna, alltoall_perf, 4 nodes, 32 gpus, tested 1 gpu x 8 processes on 4 nodes, tested 8 gpus x 1 process on 4 nodes

  • Internal IB / UCX Before / UCX After: 31.84 / 24.27 / 31.89

How?

  • Add multi-request handling
  • Remove support for UCX <1.10 (was not building since 1+ year)
  • Merge comm descriptor

Tested

nccl-test alltoall, allreduce

Versions:

  • UCX 96dcb9fc20353cc59478256e33eba152751fa4b4
  • nccl-rdma-sharp-plugins d387b8a
  • nccl 4365458757e4107ecbf629b2fd6e0e19a5d237c2 (~v2.18.5-1)

Command:

NCCL_PLUGIN_P2P=ucx
NCCL_UCX_TLS=rc,cuda_copy,cuda_ipc
all_reduce_perf -b 8 -e 8G -f 2 -g 8 -t 1 -w 100

@tvegas1
Copy link
Collaborator Author

tvegas1 commented Oct 6, 2023

@bureddy, @brminich, @yosefe
please look and instruct for regs to run

src/ucx_plugin.c Outdated
UCP_OP_ATTR_FLAG_NO_IMM_CMPL;
params.cb.recv = recv_handler_nbx;
params.user_data = &pending;
params.datatype = ucp_dt_make_contig(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass datatype

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed for all

src/ucx_plugin.c Outdated
Comment on lines 615 to 616
params.cb.recv = recv_handler_nbx;
params.user_data = &pending;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need a callback - can't we just wait for the request?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/ucx_plugin.c Outdated
UCP_OP_ATTR_FIELD_DATATYPE;
params.cb.recv = recv_handler_nbx;
params.user_data = &req->pending;
params.datatype = ucp_dt_make_contig(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass datatype

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/ucx_plugin.c Outdated
Comment on lines 782 to 786
params.op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
UCP_OP_ATTR_FIELD_USER_DATA |
UCP_OP_ATTR_FIELD_DATATYPE;
params.cb.recv = recv_handler_nbx;
params.user_data = &req->pending;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be initialized outside the loop

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/ucx_plugin.c Outdated
Comment on lines 569 to 574
if (req->used == 0) {
req->worker = comm->worker;
req->pending = 0;
req->count = 0;
req->used = 1;
return req;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems too slow to iterate over the array like that
can we use a freelist? or bitmap (FFS) if num of requests is <=64 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added freelist

Comment on lines +719 to +722
req = ucx_request_get(comm);
if (req == NULL) {
return ncclInternalError;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe for send flow we don't need this since there is always a single buffer to send

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if this is what you meant but i think we need it as-is because:

  • we want to return *request = req with always same type as other callbacks
  • api does not seem to guarantee that we have only one req instance in-flight, so we still need to allocate one
  • our callback ->test() that does progress will need to be called with a valid request

src/ucx_plugin.c Outdated
if (size != NULL) {
/* Posted receives have completed */
for (int i = 0; i < req->count; i++) {
size[i] = req->size[i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use memcpy?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

ucp_params.request_size = sizeof(ucx_request_t);
ucp_params.request_init = request_init;
ucp_params.field_mask = UCP_PARAM_FIELD_FEATURES;
ucp_params.features = UCP_FEATURE_TAG | UCP_FEATURE_RMA;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not using RMA in this plugin. is it better to drop this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich reminded me that we still need it for iflush/ucp_get

@@ -1,5 +1,5 @@
/*************************************************************************
* Copyright (c) 2016-2020, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: worth updating in none or all files

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/ucx_plugin.c Outdated
UCP_OP_ATTR_FIELD_DATATYPE;
params.cb.send = check_handler;
params.user_data = comm;
params.datatype = ucp_dt_make_contig(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can avoid passing datatype here as well
(ucp_dt_make_contig(1) is a default option when UCP_OP_ATTR_FIELD_DATATYPE is not passed)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

params.op_attr_mask |= UCP_OP_ATTR_FIELD_MEMORY_TYPE;
params.memory_type = mh->mem_type;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we anyway use "modern" UCP API here why not to pass memory handle for preregistered buffers?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's postpone this change, as it may imply additional limitations

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plan is to check that independently from multi-recv fix. but for my understanding, currently, the memh is found internally through rcache from passed addr/len, is that correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should be found by rcache

@bureddy bureddy merged commit 7845726 into Mellanox:master Oct 18, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants