
RR graph edge storage refactoring #1079

Open · litghost opened this issue Jan 23, 2020 · 12 comments


Proposed Behaviour

The edge storage should be optimized for router behavior. Two concerns drive this: the current allocation strategy has poor memory locality, and it requires a separate allocation (and therefore a pointer) per node.

Current Behaviour

Currently rr graph edges are stored in per-node allocations. This is a reasonable solution while edges are being mutated, but once edges are fixed (either after rr graph construction or after the rr graph is read from file), having one allocation per node is overly flexible.

Possible Solution

Assuming that node edges are only written (never read) during construction, and only read (never written) afterwards, a two-phase data structure can be used.
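As a rough sketch of what such a two-phase container could look like (all names below are made up for illustration; this is not VPR's actual API): edges are appended freely during construction, then a freeze step sorts them by source node and builds a per-node offset table, after which access is read-only.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative two-phase edge storage; names are hypothetical, not VPR's API.
struct EdgeStorage {
    struct Edge {
        uint32_t src_node;
        uint32_t sink_node;
        uint16_t switch_id;
    };

    std::vector<Edge> edges_;           // single flat edge array
    std::vector<uint32_t> first_edge_;  // per-node offset into edges_, size num_nodes + 1

    // Phase 1: write-only, edges may arrive in any order.
    void add_edge(uint32_t src, uint32_t sink, uint16_t sw) {
        edges_.push_back({src, sink, sw});
    }

    // Phase 2: sort edges by source node and build the offset table.
    void freeze(uint32_t num_nodes) {
        std::sort(edges_.begin(), edges_.end(),
                  [](const Edge& a, const Edge& b) { return a.src_node < b.src_node; });
        first_edge_.assign(num_nodes + 1, 0);
        for (const Edge& e : edges_) {
            first_edge_[e.src_node + 1]++;        // count edges per node
        }
        for (uint32_t i = 0; i < num_nodes; ++i) {
            first_edge_[i + 1] += first_edge_[i]; // prefix sum -> start offsets
        }
    }

    // Phase 3: read-only; a node's outgoing edges form a contiguous span.
    const Edge* begin(uint32_t node) const { return edges_.data() + first_edge_[node]; }
    const Edge* end(uint32_t node) const { return edges_.data() + first_edge_[node + 1]; }
};

After freeze() there is exactly one edge allocation for the whole graph and a node's edges are contiguous, which addresses both the locality and the pointer-per-node concerns above.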

Context

@litghost (Collaborator, Author) commented Jan 23, 2020

FYI @vaughnbetz / @kmurray this is the issue we were talking about

@tangxifan This issue is relevant to your refactoring. In particular, your first iteration (https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/device/rr_graph_obj.h#L804-L808) preserves the same one-allocation-per-rr-node pattern.

I'm going to be making several further posts fleshing out my notes on this change.

@litghost (Collaborator, Author) commented Jan 23, 2020

Current usage of the edge interfaces

All rr node edge access is controlled by the t_rr_node object.

There are 3 classes of methods:

  • Get node to edge relationships (6 methods)
  • Get data for an edge (2 methods)
  • Node to edge mutation (5 methods)

Methods: Get node to edge relationships

There are 6 methods in this category:

  • t_rr_node::edges
  • t_rr_node::configurable_edges
  • t_rr_node::non_configurable_edges
  • t_rr_node::num_edges
  • t_rr_node::num_configurable_edges
  • t_rr_node::num_non_configurable_edges

Usage:

  • num_edges is used heavily (57 callsites).
  • Neither num_non_configurable_edges nor num_configurable_edges is ever called outside of t_rr_node itself.
  • configurable_edges and non_configurable_edges are used infrequently:
src/route/route_tree_timing.cpp
388:        for (int iedge : device_ctx.rr_nodes[rr_node].non_configurable_edges()) {

src/route/check_rr_graph.cpp
128:        for (auto edge : device_ctx.rr_nodes[inode].configurable_edges()) {
135:        for (auto edge : device_ctx.rr_nodes[inode].non_configurable_edges()) {

src/route/route_common.cpp
640:    std::vector<t_edge_size> unvisited_non_configurable_edges;
642:    for (auto iedge : device_ctx.rr_nodes[node].non_configurable_edges()) {
648:            unvisited_non_configurable_edges.push_back(iedge);
652:    if (unvisited_non_configurable_edges.size() == 0) {
667:        for (auto iedge : unvisited_non_configurable_edges) {

Notes:

  • The public versions of t_rr_node::num_configurable_edges and t_rr_node::num_non_configurable_edges can be made (more) expensive, as they are never used outside of t_rr_node to begin with. Specifically, storing the number of non-configurable edges solely to keep num_non_configurable_edges and num_configurable_edges fast does not make sense.
  • configurable_edges and non_configurable_edges usage in check_rr_graph is slow path code
  • The non_configurable_edges usage in route_tree_timing.cpp and route_common.cpp is not in the router expansion inner loop; it is called per-sink afterwards. Making this call slower is likely okay, but needs to be watched.

Conclusions:

  • Storing the number of non-configurable edges and/or configurable edges is likely not required. The method non_configurable_edges() can likely compute the start of the partition on the fly, by starting at the first node edge and advancing to end() or the first non-configurable edge (see the sketch below). This change needs to be profiled, and can be done before other changes.
  • If a slower non_configurable_edges is not acceptable, the offset from the first edge to the first non-configurable edge can be stored separately from t_rr_node and the core edge data.
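For reference, one hedged way to compute that boundary without storing a count: since a node's edges are partitioned with configurable edges first, the boundary can be found with std::partition_point (or a plain std::find_if scan). The EdgeRef record and its field below are made up for illustration.

#include <algorithm>

// Hypothetical edge record; only the field needed for this example is shown.
struct EdgeRef {
    bool non_configurable;
};

// Given a node's edge span [begin, end) where configurable edges precede
// non-configurable ones, the partition boundary requires no stored count.
const EdgeRef* first_non_configurable(const EdgeRef* begin, const EdgeRef* end) {
    return std::partition_point(begin, end,
                                [](const EdgeRef& e) { return !e.non_configurable; });
}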

@litghost (Collaborator, Author)

Methods: Get data for an edge

There are 2 methods in this category:

  • t_rr_node::edge_sink_node
  • t_rr_node::edge_switch

These methods are currently keyed by (RRNodeId, edge offset). That is not strictly required (or desirable). They should simply be keyed by RREdgeId, with the caveat that RREdgeId is only available after mutation and edge partitioning are complete.
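To illustrate (hypothetical names, not the actual VPR classes), edge attributes keyed directly by RREdgeId reduce to plain indexed loads into flat, parallel vectors:

#include <cstdint>
#include <vector>

using RREdgeId = uint32_t;  // linear index into the flat edge arrays

// Illustrative only: parallel vectors of edge attributes, indexed by RREdgeId.
struct EdgeData {
    std::vector<uint32_t> sink_node_;  // indexed by RREdgeId
    std::vector<uint16_t> switch_id_;  // indexed by RREdgeId

    uint32_t edge_sink_node(RREdgeId edge) const { return sink_node_[edge]; }
    uint16_t edge_switch(RREdgeId edge) const { return switch_id_[edge]; }
};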

@litghost (Collaborator, Author)

Methods: Node to edge mutation

There are 5 methods in this category:

  • partition_edges
  • add_edge
  • set_num_edges
  • set_edge_sink_node
  • set_edge_switch

Usage:

  • partition_edges is used exactly once, during global rr graph node/edge partitioning.
  • set_num_edges and set_edge_sink_node are used in two places:
    • alloc_and_load_edges, which is taking a complete edge list and copying it into the t_rr_node data structure
    • process_edges, which is taking a complete edge list from disk and copying it into the t_rr_node data structure
  • set_edge_switch is used in three locations:
    • alloc_and_load_edges, which is taking a complete edge list and copying it into the t_rr_node data structure
    • process_edges, which is taking a complete edge list from disk and copying it into the t_rr_node data structure
    • ClockRRGraphBuilder::add_rr_switches_and_map_to_nodes, which does a strange switch index remap from arch_switch_idx to rr_switch_idx
  • add_edge is used in 6 locations:
    • ClockRib::create_rr_nodes_and_internal_edges_for_one_instance
    • ClockSpine::create_rr_nodes_and_internal_edges_for_one_instance
    • RoutingToClockConnection::create_switches
    • ClockToClockConneciton::create_switches
    • ClockToPinsConnection::create_switches
    • alloc_and_load_edges calls add_edge in some circumstance when t_rr_node already has edges allocated (when???)

Notes:

  • The two fast paths (alloc_and_load_edges and process_edges) are bulk edge insertions, and can easily be rewritten against any new implementation
  • The clock builder code uses an incremental edge accretion pattern (relying on add_edge)

Conclusions:

  • The number of places that mutate the edge list is very small:
    1. RR graph reader
    2. alloc_and_load_edges
    3. The rr clock builder

@litghost (Collaborator, Author) commented Jan 23, 2020

Proposed refactoring

There are several parts to this, but here is the key idea:

  • t_rr_node will never own edge storage
  • There will be one entity responsible for edge storage
  • Write methods (Node to edge mutation) will only be allowed prior to partitioning
  • Read methods (Get node to edge relationships and Get data for an edge) will be forbidden or expensive prior to partitioning
  • Read methods will be cheap after partitioning

Implementation details (post-partitioning)

The implementation details post-partitioning are likely the easiest to describe. Edge data will be keyed by RREdgeId, which is simply a linear index into one or more vectors of edge data. t_rr_node will store the first RREdgeId that belongs to that RRNodeId. The end() RREdgeId for a t_rr_node will simply be this[1].first_edge_id. This does imply a dummy t_rr_node at the end of the rr_node vector.

During edge partitioning, the flat edge data array will be sorted by the tuple(RRNodeId, IsNonConfigurable). This means that a particular RRNodeId's edge data is consecutive. It also means that configurable edges for a particular RRNodeId will appear before the first non-configurable edge (if any).

To make this concrete in some pseudo-code:

// A node's edges occupy the contiguous range [this[0].first_edge_id, this[1].first_edge_id),
// with configurable edges appearing before non-configurable ones.
iterator t_rr_node::edge_begin() {
  return g_edges.begin(this[0].first_edge_id);
}
iterator t_rr_node::edge_end() {
  return g_edges.begin(this[1].first_edge_id);
}
iterator t_rr_node::non_configurable_edge_begin() {
  return std::find_if(edge_begin(), edge_end(), [](const V& v) {
    return g_edges.is_non_configurable(v);
  });
}
iterator t_rr_node::non_configurable_edge_end() {
  return edge_end();
}
iterator t_rr_node::configurable_edge_begin() {
  return edge_begin();
}
iterator t_rr_node::configurable_edge_end() {
  return non_configurable_edge_begin();
}
difference_type t_rr_node::num_edges() {
  return std::distance(edge_begin(), edge_end());
}
difference_type t_rr_node::num_configurable_edges() {
  return std::distance(configurable_edge_begin(), configurable_edge_end());
}
difference_type t_rr_node::num_non_configurable_edges() {
  return std::distance(non_configurable_edge_begin(), non_configurable_edge_end());
}
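The accessors above rely on the partitioning invariant. A minimal sketch of the one-time sort that could establish it (the flat edge record and its fields are made up for illustration):

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical flat edge record used only for this illustration.
struct FlatEdge {
    uint32_t src_node;
    uint32_t sink_node;
    uint16_t switch_id;
    bool configurable;
};

// Sort by (src_node, !configurable): each node's edges become contiguous, and
// its configurable edges come before its non-configurable ones.
void partition_edges(std::vector<FlatEdge>& edges) {
    std::sort(edges.begin(), edges.end(), [](const FlatEdge& a, const FlatEdge& b) {
        return std::make_tuple(a.src_node, !a.configurable)
             < std::make_tuple(b.src_node, !b.configurable);
    });
}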

@litghost (Collaborator, Author)

There are effectively 3 edge mutation paths, and the confusing thing is that the RR clock graph generator is located in an odd place and does the graph mutation in a weird way.

The 3 edge mutation implementations are:

  • alloc_and_load_rr_graph, which does several semi-batch edge insertions via t_rr_edge_info_set and alloc_and_load_edges. The reason I call this path semi-batch is that alloc_and_load_edges has a fallback path that uses add_edge if an rr node has already been preallocated.
  • The rr graph reader's process_edges, which is fully batch edge insertion.
  • The code path starting from ClockRRGraphBuilder::create_and_append_clock_rr_graph, which does purely incremental edge insertions, and then a "weird" arch switch index to rr switch index mapping in clock_graph.add_rr_switches_and_map_to_nodes

In general the ClockRRGraphBuilder code is problematic for the two-phase approach, but I think its current problems are simply bugs, and they should be fixed before any refactoring is started. I'll create a new issue documenting the problems that I see, and propose some solutions.

@kmurray (Contributor) commented Jan 29, 2020

Proposed refactoring

There are several parts to this, but here is the key idea:

* `t_rr_node` will never own edge storage

* There will be one entity responsible for edge storage

* Write methods (`Node to edge mutation`) will only be allowed prior to partitioning

* Read methods (`Get node to edge relationships` and `Get data for an edge`) will be forbidden or expensive prior to partitioning

* Read methods will be cheap after partitioning

This all sounds good to me!

The end() RREdgeId for a t_rr_node will simply be this[1].first_edge_id. This does imply a dummy t_rr_node at the end of the rr_node vector.

If we take this approach, we'll have to think carefully about how to handle the dummy. There are a number of places which do loops like:

for (size_t inode = 0; inode < rr_nodes.size(); ++inode) {

which could be tripped up by the dummy at the end.
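One possible mitigation, purely illustrative and not a committed design: hide the sentinel inside the node storage so that size() only counts real nodes, leaving index-based loops like the one above untouched.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative node record and storage wrapper (hypothetical names).
struct NodeRecord {
    uint32_t first_edge_ = 0;  // first RREdgeId owned by this node
};

class NodeStorage {
  public:
    // Excludes the sentinel, so "for (inode = 0; inode < size(); ++inode)" is safe.
    size_t size() const { return nodes_.empty() ? 0 : nodes_.size() - 1; }
    NodeRecord& operator[](size_t i) { return nodes_[i]; }

    // Append the dummy terminator once all real nodes and edges are known;
    // it acts as the end() of the last real node's edge range.
    void finalize(size_t num_nodes, uint32_t total_edges) {
        nodes_.resize(num_nodes + 1);
        nodes_.back().first_edge_ = total_edges;
    }

  private:
    std::vector<NodeRecord> nodes_;
};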

iterator t_rr_node::edge_begin() {
return g_edges.begin(this[0].first_edge_id);
}
iterator t_rr_node::edge_end() {
return g_edges.begin(this[1].first_edge_id);
}

This is a pretty clever way to eliminate the storage of num_edges_ in t_rr_node. However it has two drawbacks which will need to be addressed:

  1. Using array indexing on this (e.g. this[1]) is rather non-intuitive. This will need to be well commented to outline what it's doing.
  2. This couples the RR node ordering and RR edge ordering together fairly strongly (e.g. re-ordering RR nodes independently of the edges would break edge iteration). We'll need to very clearly outline (and likely write some automated validators for) the assumptions/invariants this imposes; a sketch of such a validator follows below. (In contrast, storing a first_edge_id + num_edges_ keeps the node/edge ordering independent.)

My thoughts on whether the this[] or num_edges_ approach makes the most sense will likely depend on whether it actually slims down the size of t_rr_node, or we just end up with empty padding when eliminating num_edges_.
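To illustrate what such an automated validator might check (hypothetical layout: a first_edge offset table with a sentinel, plus a per-edge configurability flag): offsets must be non-decreasing, the sentinel must equal the total edge count, and within each node's range configurable edges must precede non-configurable ones.

#include <cstdint>
#include <vector>

// Sketch of an invariant validator for the this[1] scheme (names hypothetical).
bool validate_edge_partitioning(const std::vector<uint32_t>& first_edge,    // size = num_nodes + 1
                                const std::vector<bool>& non_configurable)  // size = num_edges
{
    if (first_edge.empty() || first_edge.back() != non_configurable.size()) return false;

    for (size_t inode = 0; inode + 1 < first_edge.size(); ++inode) {
        if (first_edge[inode] > first_edge[inode + 1]) return false;  // offsets must not decrease

        bool seen_non_configurable = false;
        for (uint32_t iedge = first_edge[inode]; iedge < first_edge[inode + 1]; ++iedge) {
            if (non_configurable[iedge]) {
                seen_non_configurable = true;
            } else if (seen_non_configurable) {
                return false;  // configurable edge found after a non-configurable one
            }
        }
    }
    return true;
}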

* `t_rr_node` will never own edge storage

return g_edges.begin(this[0].first_edge_id);

The issue of who actually owns the edge data, and where it's located, is another thing to consider as well.

To be consistent with the current implementation, it would make sense to stick it alongside device_ctx.rr_nodes, so something like device_ctx.rr_edges. Although then that will expose the rr_edges implementation to broader VPR (where it is currently 'hidden' within t_rr_node).

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

@litghost (Collaborator, Author)

The issue of who actually owns the edge data, and where it's located, is another thing to consider as well.

To be consistent with the current implementation, it would make sense to stick it alongside device_ctx.rr_nodes, so something like device_ctx.rr_edges. Although then that will expose the rr_edges implementation to broader VPR (where it is currently 'hidden' within t_rr_node).

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

I've sidestepped this entire problem by implementing a flyweight RR node and an rr node storage class (see #1084), and then having the rr node storage class handle this gracefully (#1085).
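For readers unfamiliar with the pattern, a bare-bones illustration of the flyweight idea (this is not the actual #1084/#1085 code): the node handed to callers is a thin (storage, id) handle, and all data lives in the storage class.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only; names and layout are hypothetical.
class RRNodeStorage {
  public:
    size_t num_edges(uint32_t node) const {
        return first_edge_[node + 1] - first_edge_[node];
    }
    std::vector<uint32_t> first_edge_;  // size = num_nodes + 1 (includes sentinel)
};

// Flyweight node: owns no edge storage, forwards every query to the storage class.
class RRNodeView {
  public:
    RRNodeView(const RRNodeStorage* storage, uint32_t id) : storage_(storage), id_(id) {}
    size_t num_edges() const { return storage_->num_edges(id_); }

  private:
    const RRNodeStorage* storage_;
    uint32_t id_;
};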

@litghost (Collaborator, Author)

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

No question. I've attempted to take an approach to the rr graph refactoring that is more incremental and allows comparisons at each change along the way.

@tangxifan (Contributor)

FYI @vaughnbetz / @kmurray this is the issue we were talking about

@tangxifan This issue is relevant to your refactoring. In particular, your first iteration (https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/device/rr_graph_obj.h#L804-L808) preserves the same one-allocation-per-rr-node pattern.

I'm going to be making several further posts fleshing out my notes on this change.

Thanks for pointing this out. If separate storage of rr_node and rr_edge can bring benefits, we can experiment with it in another version of the RRGraph object.
This is why I am in favor of a unified object for the routing resource graph: if you want to rework the internal organization, you do not need to bother other parts as long as you keep the APIs unchanged.

litghost mentioned this issue Jan 30, 2020

@vaughnbetz (Contributor)

@tangxifan: @litghost had another good point in a recent meeting; possibly has already been discussed with you or posted, but in case not:

Your WIP rr-graph data structure refactoring proposal would result in one more level of indirection when going from a node to its outgoing edges (the key operation in the router): node->start_edge_index->edges vs. the current node->edges (variable names made up here, but hopefully they get the point across).
We'll almost certainly want to avoid adding more levels of indirection, so it is worth rethinking that design choice (which can be avoided with some data structure changes).

@tangxifan (Contributor)

@tangxifan: @litghost had another good point in a recent meeting; possibly has already been discussed with you or posted, but in case not:

Your WIP rr-graph data structure refactoring proposal would result in one more level of indirection when going from a node to its outgoing edges (the key operation in the router): node->start_edge_index->edges vs. the current node->edges (variable names made up here, but hopefully they get the point across).
We'll almost certainly want to avoid adding more levels of indirection, so it is worth rethinking that design choice (which can be avoided with some data structure changes).

Got it. Let me try to apply this to the RRGraph data structure.
