
Proxy rr node #1084

Closed

Conversation

litghost
Collaborator

@litghost litghost commented Jan 24, 2020

Description

This is laying the groundwork for refactoring the rr node and rr edge memory layout and storage. It replaces the t_rr_node object with a proxy object; the proxy then references a backing array that is identical to the current storage method.

Because of the proxy object, any code that held references into the rr_nodes array stopped working. This PR changes that code to use values instead of pointers/references.

This PR builds on #1081

Related Issue

#1079
#1048

Motivation and Context

This change should be a no-op, which should be reflected in the QoR.

How Has This Been Tested?

  • Basic and strong CI passes
  • Run and check nightly and weekly QoR

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

By moving ClockRRGraphBuilder earlier in the rr graph flow, several
parts of ClockRRGraphBuilder::create_and_append_clock_rr_graph can be
avoided, as they were duplicating work that the original build_rr_graph
flow was already doing (init_fan, mapping arch switch to rr switch,
partition_edges).

This new code should also fully preallocate the rr_node array, though
this is not required by the code.

Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
This should have a negligible performance impact, but it enables future
changes to modify how rr nodes and rr edges are stored.

Signed-off-by: Keith Rothman <[email protected]>
@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jan 24, 2020
@litghost
Collaborator Author

@tangxifan FYI, this pattern could be useful for your implementation as well.

@litghost
Collaborator Author

litghost commented Jan 27, 2020

So a preliminary examination of the nightly results shows that this PR has basically no change in QoR (as expected). I'll post comparisons later today.

Weekly QoR data is running now against this PR. I don't have a baseline against 2780988, but if the weekly in #1082 completes, we could probably use that data.

@vaughnbetz
Contributor

Keith, doesn't this cost us an additional indirection on all rr_node accesses?
If we fly-weight the whole rr_node, it doesn't seem like we'll have any nodes that are the same in all respects, and hence the fly-weight would devolve to one unique entry per rr_node, but with an extra indirection to get at the data. Is that right, or am I missing something?

@litghost
Collaborator Author

litghost commented Jan 27, 2020

Keith, doesn't this cost us an additional indirection on all rr_node accesses?
If we fly-weight the whole rr_node, it doesn't seem like we'll have any nodes that are the same in all respects, and hence the fly-weight would devolve to one unique entry per rr_node, but with an extra indirection to get at the data. Is that right, or am I missing something?

There is no extra indirection, because t_rr_node is a value object. For example, whereas before we would write:

t_rr_node &node = rr_nodes[idx];

Which is really just:

t_rr_node *node = rr_nodes + idx;

Old data access is then:

*(data_type_t*)(node + data field offset)

The new code we write is:

t_rr_node node = rr_nodes[idx];

The new code returns the base pointer + offset, so the same code becomes:

(t_rr_node *first_node, size_t node_offset) = (rr_nodes, idx)

New data access is then:

*(data_type_t*)(first_node + node_offset + data field offset)

The compiler can then inline the rest, which is what we see. In cases where t_rr_node is passed, we now pass by value instead of by pointer, which consumes an additional register. However, I don't think register pressure has a large effect here.

@litghost
Collaborator Author

litghost commented Jan 27, 2020

@kmurray / @vaughnbetz It actually looks like the flyweight RR node is a slight win on the nightly QoR metrics! Please review when you get a chance. Weekly reg tests are still running.

@litghost litghost changed the title Flyweight rr node Proxy rr node Jan 30, 2020
@litghost litghost mentioned this pull request Jan 30, 2020
Contributor

@kmurray kmurray left a comment


From a code structure perspective this change looks good to me.

The QoR results @litghost provided also show no real run-time overhead on the VTR and 'other/small' Titan benchmarks.

I think we'll need to see the results on the full Titan23 set to really evaluate this from a QoR perspective, however.

@litghost any thoughts on the potential optimization below?

t_edge_size fan_in_ = 0;
uint16_t capacity_ = 0;
t_rr_node_storage* storage_;
RRNodeId id_;

A thought on a potential optimization.

VPR only has a single RRGraph at any point in time. Instead of paying the overhead for:

t_rr_node_storage* storage_;

in each t_rr_node proxy object, you could just replace it with a call to:

g_vpr_ctx.device().rr_nodes;

where needed, which would make sizeof(t_rr_node) == sizeof(RRNodeId).

While in the long term I think we should really move to better encapsulating the RR graph as a whole, this would be a reasonable temporary optimization.

@litghost
Collaborator Author

litghost commented Jan 31, 2020

I think we'll need to see the results on the full Titan23 set to really evaluate this from a QoR perspective, however.

I've hit a hiccup in the QoR comparisons, in that the quality is different on this PR versus the baseline. This is unexpected, as there should be zero impact on quality (e.g. CPD).

This is hard to explain right now, so I'm looking for answers. I suspect the use of unstable sorts and partitions is part of the issue, and I'm working on demonstrating that this is the case. Either way, we should hold off on merging this PR until I figure out what is going on.

@kmurray
Contributor

kmurray commented Jan 31, 2020

OK.

Although if unstable sorting/ordering is the cause, I'm not too worried (assuming the QoR change is small, of course!). Good to verify that is the case.

@litghost
Collaborator Author

litghost commented Feb 3, 2020

I've hit a hiccup in the QoR comparisons, in that the quality is different on this PR versus the baseline. This is unexpected, as there should be zero impact on quality (e.g. CPD).

I believe I've isolated the QoR change, which was due to an old baseline. The new baseline is running, and I have full QoR from #1084 and #1085. #1096 is running right now, and is showing reasonable performance with good memory behavior. Once I have QoR results on all 4 (baseline, rr proxy, refactor edges, memory clean), we can talk about how to proceed.

@litghost
Collaborator Author

litghost commented Feb 3, 2020

refactor_comp.xlsx

Baseline QoR is in. The above file compares a baseline commit (2780988) with #1084 (proxy rr node) and #1085 (refactor edges).

Summary:

Preliminary results from #1096 are in (bitcoin miner, and gaussian blur up through router lookahead and placer delay matrix):

This week I'm working on recovering some of that 5% CPU time increase; I believe reordering some of the memory loads to make the new memory pattern prefetchable will close the gap.

@litghost
Collaborator Author

litghost commented Feb 3, 2020

@kmurray / @vaughnbetz FYI

@litghost litghost closed this Feb 5, 2020
@litghost litghost deleted the flyweight_rr_node branch February 5, 2020 19:56