Proxy rr node #1084
Conversation
By moving ClockRRGraphBuilder earlier in the rr graph flow, several parts of ClockRRGraphBuilder::create_and_append_clock_rr_graph can be avoided, as they were duplicating work that the original build_rr_graph flow was already doing (init_fan, mapping arch switch to rr switch, partition_edges). This new code should also fully preallocate the rr_node array, though this is not required by the code. Signed-off-by: Keith Rothman <[email protected]>
This should have a negligible performance impact, but it enables future changes to modify how rr nodes and rr edges are stored. Signed-off-by: Keith Rothman <[email protected]>
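A minimal sketch of the proxy idea described above: call sites keep using a node-like object, while the backing storage is free to change layout later (for example, from array-of-structs to struct-of-arrays) without touching callers. The names `NodeStorage` and `NodeProxy` here are illustrative, not VPR's actual identifiers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical backing store: one column per field; the layout is an
// implementation detail hidden behind the proxy.
struct NodeStorage {
    std::vector<int16_t> xlow;
    std::vector<int16_t> ylow;
};

// Lightweight proxy handed out in place of a real node object. It holds
// only a pointer to the storage and the node's index.
class NodeProxy {
  public:
    NodeProxy(NodeStorage* storage, uint32_t id) : storage_(storage), id_(id) {}
    int16_t xlow() const { return storage_->xlow[id_]; }
    int16_t ylow() const { return storage_->ylow[id_]; }

  private:
    NodeStorage* storage_;
    uint32_t id_;
};
```

Because callers only see accessor methods, swapping the storage layout later is a local change to `NodeStorage` and the accessors.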
@tangxifan FYI, this pattern could be useful for your implementation as well.
So a preliminary examination of the nightly results shows that this PR has basically no change in QoR (as expected). I'll post comparisons later today. Weekly QoR data is running now against this PR. I don't have a baseline against 2780988, but if the weekly in #1082 completes, we could probably use that data.
Keith, doesn't this cost us an additional indirection on all rr_node accesses?
There is no extra indirection, because
Which is really just:
Old data access is then:
The new code we write is:
The new code returns the base pointer + offset, so the same code becomes:
New data access is then:
The compiler can then inline the rest, which is what we see. In cases where t_rr_node is passed, we now pass by value instead of by pointer, which consumes an additional register. However, I think register pressure is not a large effect.
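The no-extra-indirection argument can be sketched as follows (an illustration, not VPR's actual code): the old pattern indexes the array directly, and the proxy stores the base pointer and id and performs the same arithmetic, so after inlining both reduce to a single load at `base + id`.

```cpp
#include <cstddef>
#include <vector>

// A stand-in node with one field, for illustration only.
struct Node {
    int cost_index;
};

// Old access pattern: index the array directly.
inline int old_access(const std::vector<Node>& nodes, std::size_t id) {
    return nodes[id].cost_index;  // compiles to a load at base + id*sizeof(Node)
}

// Proxy holding (base pointer, id); the accessor does the same arithmetic,
// so an optimizing compiler inlines it to the identical load.
struct Proxy {
    const Node* base;
    std::size_t id;
    int cost_index() const { return base[id].cost_index; }
};

inline int new_access(const std::vector<Node>& nodes, std::size_t id) {
    return Proxy{nodes.data(), id}.cost_index();
}
```

Both functions dereference memory exactly once; the proxy adds no pointer-chasing hop, only (at worst) an extra register to carry the pair.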
@kmurray / @vaughnbetz It actually looks like the flyweight RR node is a slight win on the nightly QoR metrics! Please review when you get a chance. Weekly reg tests are still running.
From a code structure perspective this change looks good to me.
The QoR results @litghost provided also show no real run-time overhead on the VTR and 'other/small' Titan benchmarks.
I think we'll need to see the results on the full Titan23 set to really evaluate this from a QoR perspective, however.
@litghost any thoughts on the potential optimization below?
t_edge_size fan_in_ = 0;
uint16_t capacity_ = 0;
t_rr_node_storage* storage_;
RRNodeId id_;
A thought on a potential optimization.
VPR only has a single RRGraph at any point in time. Instead of paying the overhead for:
t_rr_node_storage* storage_;
in each t_rr_node proxy object, you could just replace it with a call to:
g_vpr_ctx.device().rr_nodes;
where needed, which would make sizeof(t_rr_node) == sizeof(RRNodeId).
While in the long-term I think we should really move to better encapsulating the RR graph as a whole this would be a temporary optimization.
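The suggested optimization can be sketched like this (hedged: `g_storage` and `SlimProxy` are hypothetical stand-ins for `g_vpr_ctx.device().rr_nodes` and the real proxy, not VPR's actual identifiers): because there is only one RR graph alive at a time, the proxy can fetch the storage from a global context rather than carrying a pointer, shrinking the proxy down to the size of its id.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical global storage, standing in for g_vpr_ctx.device().rr_nodes.
struct Storage {
    std::vector<int> fan_in;
};
inline Storage g_storage;  // C++17 inline variable

using RRNodeId = uint32_t;

// The proxy's only member is the node id; storage is looked up globally.
class SlimProxy {
  public:
    explicit SlimProxy(RRNodeId id) : id_(id) {}
    int fan_in() const { return g_storage.fan_in[id_]; }

  private:
    RRNodeId id_;
};

// The whole point of the optimization: the proxy is now just an id.
static_assert(sizeof(SlimProxy) == sizeof(RRNodeId),
              "proxy carries no storage pointer");
```

The trade-off the reviewer notes still applies: reaching through a global couples the proxy to the context singleton, which is why this is framed as a temporary optimization pending better RR graph encapsulation.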
I've hit a hiccup in the QoR comparisons, in that the quality is different on this PR versus baseline. This is unexpected, as there should be zero impact on quality (e.g. CPD). This is hard to explain right now, so I'm looking for answers. I suspect the use of unstable sorts and partitions is part of the issue, and I'm working on demonstrating that this is the case. Either way, we should hold off on merging this PR until I figure out what is going on.
OK. Although if unstable sorting/ordering is the cause I'm not too worried (assuming the QoR change is small of course!). Good to verify that is the case.
I believe I've isolated the QoR change, which was due to an old baseline. The new baseline is running, and I have full QoR from #1084 and #1085. #1096 is running right now, and is showing reasonable performance with good memory behavior. Once I have QoR results on all 4 (baseline, rr proxy, refactor edges, memory clean), we can talk about how to proceed.
Baseline QoR is in. The above file compares a baseline commit (2780988) with #1084 (proxy rr node) and #1085 (refactor edges). Summary:
Preliminary results from #1096 are in (bitcoin miner, and gaussian blur up through router lookahead and placer delay matrix):
This week I'm working on recovering some of that 5% CPU time increase; I believe reordering some of the memory loads to make the new memory pattern prefetchable will close the gap.
@kmurray / @vaughnbetz FYI
Description
This lays the groundwork for refactoring the rr node and rr edge memory layout and storage. It replaces the t_rr_node object with a proxy object. The proxy references a backing array that is identical to the current storage method.
Because of the switch to a proxy object, anything that held references into the rr_nodes array stopped working. This PR changes those call sites to use values instead of pointers/references.
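Why pass-by-value works here can be shown with a short sketch (illustrative names, not VPR's actual API): the proxy is just a pointer plus an id, two machine words, so copying it is as cheap as passing a reference, and copies stay valid even if code previously holding a `t_rr_node&` would have been invalidated.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical storage and proxy, mirroring the pattern in this PR.
struct Storage {
    std::vector<int> ptc;
};

struct Proxy {
    Storage* storage;  // two-word proxy: pointer + id
    uint32_t id;
    int ptc_num() const { return storage->ptc[id]; }
};

// Old style would have taken `const t_rr_node&`; with a two-word proxy,
// taking it by value is equally cheap and avoids dangling references.
inline int ptc_of(Proxy node) { return node.ptc_num(); }
```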
This PR builds on #1081
Related Issue
#1079
#1048
Motivation and Context
This change should be a no-op, which should be reflected in the QoR.
How Has This Been Tested?
Types of changes
Checklist: