-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature split local and remote buffer tracking #272
Feature split local and remote buffer tracking #272
Conversation
…and-remote-buffer-tracking
…. Additionally adds back in the baseline YGM_COMM_BUFFER_SIZE enviornment variable if the user does not want to specify the local/remote buffers explicitly. Have tested locally, committing to test on LC when home from campus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Just a few non-technical suggestions.
comm_environment() { | ||
if (const char* cc = std::getenv("YGM_COMM_BUFFER_SIZE_KB")) { | ||
buffer_size = convert<size_t>(cc) * 1024; | ||
comm_environment(size_t nodes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this function take a reference to the layout
that can be used to get the number of nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, makes it more clear. (addressed in 79b8625)
} | ||
switch (routing) { | ||
case routing_type::NONE : | ||
local_buffer_size = round_to_nearest_kb(total_buffer_size / nodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convert to floating point arithmetic. Very unlikely to run into issues from having total_buffer_size < nodes
, but this will prevent any issues when someone tries testing something really weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should do the same for the rest of this section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 79b8625
if (const char* cc = std::getenv("YGM_COMM_LOCAL_BUFFER_SIZE_KB")) { | ||
local_buffer_size = convert<size_t>(cc) * 1024; | ||
if (std::getenv("YGM_COMM_REMOTE_BUFFER_SIZE_KB") == nullptr) | ||
std::cerr << "YGM_COMM_REMOTE_BUFFER_SIZE_KB not set, using default of" << remote_buffer_size << "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's an extra space after default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say "recommended value" instead of "default", since the default value will vary depending on other parameters the user provides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 79b8625.
} else { | ||
throw std::runtime_error("comm_enviornment -- unknown routing type"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put in a line break and a comment about the recommended local and remote buffer sizes coming from looking at the volume of traffic passing through local and remote buffers in an all-to-all pattern with uniform message sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot this would only display lines above the one I'm commenting on, but this is between setting the routing type and figuring out the buffer sizes to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments in 94a6f49 does this make it more clear?
include/ygm/comm.hpp
Outdated
@@ -189,6 +189,8 @@ class comm { | |||
|
|||
std::pair<uint64_t, uint64_t> barrier_reduce_counts(); | |||
|
|||
void queue_next_send(std::deque<int> &dest_queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this to something like flush_next_send
. We don't have a queue of sends that we're about to send out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 79b8625
Addressed the comments from Trevor. |
I ran this in some disjoint set code I have. On 8 nodes using This is a complicated and recursive communication pattern that seems to be working well with these changes. I want to finish my testing with NR routing later today before merging this in just to make sure I'm getting the expected performance. |
Summary
This feature aims to split how the comm buffer handles local and remote communications separately. This should allow for more fine-tuning in our environment parameters as we can set the node local and node remote target msg size separately.
Description of Changes
Unfinished items / Todo
comm::check_if_production_halt_required
. I have this as the pending isend_bytes being less than the sum of the two buffers. In my SHM version, this is just the remote_send_bytes, as we would expect the majority of communication to be through shared memory.