Proxy rr node #1084
Conversation
By moving ClockRRGraphBuilder earlier in the rr graph flow, several parts of ClockRRGraphBuilder::create_and_append_clock_rr_graph can be avoided, as they were duplicating work that the original build_rr_graph flow was already doing (init_fan, mapping arch switch to rr switch, partition_edges). This new code should also fully preallocate the rr_node array, though this is not required by the code. Signed-off-by: Keith Rothman <[email protected]>
This should have a negligible performance impact, but it enables future changes to modify how rr nodes and rr edges are stored. Signed-off-by: Keith Rothman <[email protected]>
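A minimal sketch of the proxy idea described above: call sites keep using a node-like object, while the backing storage is free to change layout later (for example, from array-of-structs to struct-of-arrays) without touching callers. The names `NodeStorage` and `NodeProxy` here are illustrative, not VPR's actual identifiers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical backing store: one column per field; the layout is an
// implementation detail hidden behind the proxy.
struct NodeStorage {
    std::vector<int16_t> xlow;
    std::vector<int16_t> ylow;
};

// Lightweight proxy handed out in place of a real node object. It holds
// only a pointer to the storage and the node's index.
class NodeProxy {
  public:
    NodeProxy(NodeStorage* storage, uint32_t id) : storage_(storage), id_(id) {}
    int16_t xlow() const { return storage_->xlow[id_]; }
    int16_t ylow() const { return storage_->ylow[id_]; }

  private:
    NodeStorage* storage_;
    uint32_t id_;
};
```

Because callers only see accessor methods, swapping the storage layout later is a local change to `NodeStorage` and the accessors.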
@tangxifan FYI, this pattern could be useful for your implementation as well.
So a preliminary examination of the nightly results shows that this PR has basically no change in QoR (as expected). I'll post comparisons later today. Weekly QoR data is running now against this PR. I don't have a baseline against 2780988, but if the weekly in #1082 completes, we could probably use that data.
Keith, doesn't this cost us an additional indirection on all rr_node accesses?
There is no extra indirection, because
Which is really just:
Old data access is then:
The new code we write is:
The new code returns the base pointer + offset, so the same code becomes:
New data access is then:
The compiler can then inline the rest, which is what we see. In cases where t_rr_node is passed, we now pass by value instead of by pointer, which consumes an additional register. However, I think register pressure is not a large effect.
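The no-extra-indirection argument can be sketched as follows (an illustration, not VPR's actual code): the old pattern indexes the array directly, and the proxy stores the base pointer and id and performs the same arithmetic, so after inlining both reduce to a single load at `base + id`.

```cpp
#include <cstddef>
#include <vector>

// A stand-in node with one field, for illustration only.
struct Node {
    int cost_index;
};

// Old access pattern: index the array directly.
inline int old_access(const std::vector<Node>& nodes, std::size_t id) {
    return nodes[id].cost_index;  // compiles to a load at base + id*sizeof(Node)
}

// Proxy holding (base pointer, id); the accessor does the same arithmetic,
// so an optimizing compiler inlines it to the identical load.
struct Proxy {
    const Node* base;
    std::size_t id;
    int cost_index() const { return base[id].cost_index; }
};

inline int new_access(const std::vector<Node>& nodes, std::size_t id) {
    return Proxy{nodes.data(), id}.cost_index();
}
```

Both functions dereference memory exactly once; the proxy adds no pointer-chasing hop, only (at worst) an extra register to carry the pair.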
@kmurray / @vaughnbetz It actually looks like the flyweight RR node is a slight win on the nightly QoR metrics! Please review when you get a chance. Weekly reg tests are still running.
From a code structure perspective this change looks good to me.
The QoR results @litghost provided also show no real run-time overhead on the VTR and 'other/small' Titan benchmarks.
I think we'll need to see the results on the full Titan23 set to really evaluate this from a QoR perspective, however.
@litghost any thoughts on the potential optimization below?
t_edge_size fan_in_ = 0;
uint16_t capacity_ = 0;
t_rr_node_storage* storage_;
RRNodeId id_;
A thought on a potential optimization.
VPR only has a single RRGraph at any point in time. Instead of paying the overhead for:
t_rr_node_storage* storage_;
in each t_rr_node proxy object, you could just replace it with a call to:
g_vpr_ctx.device().rr_nodes;
where needed, which would make sizeof(t_rr_node) == sizeof(RRNodeId).
While in the long-term I think we should really move to better encapsulating the RR graph as a whole this would be a temporary optimization.
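The suggested optimization can be sketched like this (hedged: `g_storage` and `SlimProxy` are hypothetical stand-ins for `g_vpr_ctx.device().rr_nodes` and the real proxy, not VPR's actual identifiers): because there is only one RR graph alive at a time, the proxy can fetch the storage from a global context rather than carrying a pointer, shrinking the proxy down to the size of its id.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical global storage, standing in for g_vpr_ctx.device().rr_nodes.
struct Storage {
    std::vector<int> fan_in;
};
inline Storage g_storage;  // C++17 inline variable

using RRNodeId = uint32_t;

// The proxy's only member is the node id; storage is looked up globally.
class SlimProxy {
  public:
    explicit SlimProxy(RRNodeId id) : id_(id) {}
    int fan_in() const { return g_storage.fan_in[id_]; }

  private:
    RRNodeId id_;
};

// The whole point of the optimization: the proxy is now just an id.
static_assert(sizeof(SlimProxy) == sizeof(RRNodeId),
              "proxy carries no storage pointer");
```

The trade-off the reviewer notes still applies: reaching through a global couples the proxy to the context singleton, which is why this is framed as a temporary optimization pending better RR graph encapsulation.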
I've hit a hiccup in the QoR comparisons, in that the quality is different on this PR versus baseline. This is unexpected, as there should be zero impact on quality (e.g. CPD). This is hard to explain right now, so I'm looking for answers. I suspect the use of unstable sorts and partitions is part of the issue, and I'm working on demonstrating that this is the case. Either way, we should hold off on merging this PR until I figure out what is going on.
OK. Although if unstable sorting/ordering is the cause I'm not too worried (assuming the QoR change is small of course!). Good to verify that is the case.
I believe I've isolated the QoR change, which was due to an old baseline. The new baseline is running, and I have full QoR from #1084 and #1085. #1096 is running right now, and is showing reasonable performance with good memory behavior. Once I have QoR results on all 4 (baseline, rr proxy, refactor edges, memory clean), we can talk about how to proceed.
Baseline QoR is in. The above file compares a baseline commit (2780988) with #1084 (proxy rr node) and #1085 (refactor edges). Summary:
Preliminary results from #1096 are in (bitcoin miner, and gaussian blur up through router lookahead and placer delay matrix):
This week I'm working on recovering some of that 5% CPU time increase; I believe reordering some of the memory loads to make the new memory pattern prefetchable will close the gap.
@kmurray / @vaughnbetz FYI
Description
This lays the groundwork for refactoring the rr node and rr edge memory layout and storage. It replaces the t_rr_node object with a proxy object. The proxy references a backing array that is identical to the current storage method.
Because of the switch to a proxy object, anything that held references into the rr_nodes array stopped working. This PR changes those call sites to use values instead of pointers/references.
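Why pass-by-value works here can be shown with a short sketch (illustrative names, not VPR's actual API): the proxy is just a pointer plus an id, two machine words, so copying it is as cheap as passing a reference, and copies stay valid even if code previously holding a `t_rr_node&` would have been invalidated.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical storage and proxy, mirroring the pattern in this PR.
struct Storage {
    std::vector<int> ptc;
};

struct Proxy {
    Storage* storage;  // two-word proxy: pointer + id
    uint32_t id;
    int ptc_num() const { return storage->ptc[id]; }
};

// Old style would have taken `const t_rr_node&`; with a two-word proxy,
// taking it by value is equally cheap and avoids dangling references.
inline int ptc_of(Proxy node) { return node.ptc_num(); }
```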
This PR builds on #1081
Related Issue
#1079
#1048
Motivation and Context
This change should be a no-op, which should be reflected in the QoR.
How Has This Been Tested?
Types of changes
Checklist: