
RR graph edge storage refactoring #1079

Open · litghost opened this issue Jan 23, 2020 · 12 comments


Proposed Behaviour

The edge storage should be optimized for router behavior. Two concerns drive this: the current allocation strategy has poor memory locality, and it requires a separate allocation (and therefore a pointer) per node.

Current Behaviour

Currently rr graph edges are stored in per-node allocations. This is a reasonable solution while edges are being mutated, but once edges are fixed (either after rr graph construction or after the rr graph is read from file), having one allocation per node is overly flexible.

Possible Solution

Assuming that node edges are only written (never read) during construction, and only read (never written) afterwards, a two-phase data structure can be used.
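As a rough sketch of what such a two-phase container could look like (all names below are made up for illustration; this is not VPR's actual API): edges are appended freely during construction, then a freeze step sorts them by source node and builds a per-node offset table, after which access is read-only.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative two-phase edge storage; names are hypothetical, not VPR's API.
struct EdgeStorage {
    struct Edge {
        uint32_t src_node;
        uint32_t sink_node;
        uint16_t switch_id;
    };

    std::vector<Edge> edges_;           // single flat edge array
    std::vector<uint32_t> first_edge_;  // per-node offset into edges_, size num_nodes + 1

    // Phase 1: write-only, edges may arrive in any order.
    void add_edge(uint32_t src, uint32_t sink, uint16_t sw) {
        edges_.push_back({src, sink, sw});
    }

    // Phase 2: sort edges by source node and build the offset table.
    void freeze(uint32_t num_nodes) {
        std::sort(edges_.begin(), edges_.end(),
                  [](const Edge& a, const Edge& b) { return a.src_node < b.src_node; });
        first_edge_.assign(num_nodes + 1, 0);
        for (const Edge& e : edges_) {
            first_edge_[e.src_node + 1]++;        // count edges per node
        }
        for (uint32_t i = 0; i < num_nodes; ++i) {
            first_edge_[i + 1] += first_edge_[i]; // prefix sum -> start offsets
        }
    }

    // Phase 3: read-only; a node's outgoing edges form a contiguous span.
    const Edge* begin(uint32_t node) const { return edges_.data() + first_edge_[node]; }
    const Edge* end(uint32_t node) const { return edges_.data() + first_edge_[node + 1]; }
};

After freeze() there is exactly one edge allocation for the whole graph and a node's edges are contiguous, which addresses both the locality and the pointer-per-node concerns above.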

Context

@litghost (Collaborator, Author) commented Jan 23, 2020

FYI @vaughnbetz / @kmurray this is the issue we were talking about

@tangxifan This issue is relevant to your refactoring. In particular, your first iteration (https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/device/rr_graph_obj.h#L804-L808) preserves the same one-allocation-per-rr-node pattern.

I'm going to be making several further posts fleshing out my notes on this change.

@litghost (Collaborator, Author) commented Jan 23, 2020

Current usage of the edge interfaces

All rr node edge access is controlled by the t_rr_node object.

There are 3 classes of methods:

  • Get node to edge relationships (6 methods)
  • Get data for an edge (2 methods)
  • Node to edge mutation (5 methods)

Methods: Get node to edge relationships

There are 6 methods in this category:

  • t_rr_node::edges
  • t_rr_node::configurable_edges
  • t_rr_node::non_configurable_edges
  • t_rr_node::num_edges
  • t_rr_node::num_configurable_edges
  • t_rr_node::num_non_configurable_edges

Usage:

  • num_edges is used heavily (57 callsites).
  • Neither num_non_configurable_edges nor num_configurable_edges is ever called outside of t_rr_node itself.
  • configurable_edges and non_configurable_edges are used infrequently:
src/route/route_tree_timing.cpp
388:        for (int iedge : device_ctx.rr_nodes[rr_node].non_configurable_edges()) {

src/route/check_rr_graph.cpp
128:        for (auto edge : device_ctx.rr_nodes[inode].configurable_edges()) {
135:        for (auto edge : device_ctx.rr_nodes[inode].non_configurable_edges()) {

src/route/route_common.cpp
640:    std::vector<t_edge_size> unvisited_non_configurable_edges;
642:    for (auto iedge : device_ctx.rr_nodes[node].non_configurable_edges()) {
648:            unvisited_non_configurable_edges.push_back(iedge);
652:    if (unvisited_non_configurable_edges.size() == 0) {
667:        for (auto iedge : unvisited_non_configurable_edges) {

Notes:

  • The public versions of t_rr_node::num_configurable_edges and t_rr_node::num_non_configurable_edges can be made (more) expensive, as they are never used outside of t_rr_node to begin with. Specifically, storing the number of non-configurable edges solely to keep num_non_configurable_edges and num_configurable_edges fast does not make sense.
  • configurable_edges and non_configurable_edges usage in check_rr_graph is slow path code
  • The non_configurable_edges usage in route_tree_timing.cpp and route_common.cpp is not in the router expansion inner loop; it is called per-sink afterwards. Making this call slower is likely okay, but needs to be watched.

Conclusions:

  • Storing the number of non-configurable edges and/or configurable edges is likely not required. The method non_configurable_edges() can likely compute the start of the partition on the fly, by starting at the first node edge and advancing to end() or the first non-configurable edge (see the sketch below). This change needs to be profiled, and can be done before other changes.
  • If a slower non_configurable_edges is not acceptable, the offset from the first edge to the first non-configurable edge can be stored separately from t_rr_node and the core edge data.
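For reference, one hedged way to compute that boundary without storing a count: since a node's edges are partitioned with configurable edges first, the boundary can be found with std::partition_point (or a plain std::find_if scan). The EdgeRef record and its field below are made up for illustration.

#include <algorithm>

// Hypothetical edge record; only the field needed for this example is shown.
struct EdgeRef {
    bool non_configurable;
};

// Given a node's edge span [begin, end) where configurable edges precede
// non-configurable ones, the partition boundary requires no stored count.
const EdgeRef* first_non_configurable(const EdgeRef* begin, const EdgeRef* end) {
    return std::partition_point(begin, end,
                                [](const EdgeRef& e) { return !e.non_configurable; });
}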

@litghost (Collaborator, Author)

Methods: Get data for an edge

There are 2 methods in this category:

  • t_rr_node::edge_sink_node
  • t_rr_node::edge_switch

These methods are currently keyed by (RRNodeId, edge offset). That is not strictly required (or desirable). They should simply be keyed by RREdgeId, with the caveat that RREdgeId is only available after mutation and edge partitioning are complete.
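To illustrate (hypothetical names, not the actual VPR classes), edge attributes keyed directly by RREdgeId reduce to plain indexed loads into flat, parallel vectors:

#include <cstdint>
#include <vector>

using RREdgeId = uint32_t;  // linear index into the flat edge arrays

// Illustrative only: parallel vectors of edge attributes, indexed by RREdgeId.
struct EdgeData {
    std::vector<uint32_t> sink_node_;  // indexed by RREdgeId
    std::vector<uint16_t> switch_id_;  // indexed by RREdgeId

    uint32_t edge_sink_node(RREdgeId edge) const { return sink_node_[edge]; }
    uint16_t edge_switch(RREdgeId edge) const { return switch_id_[edge]; }
};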

@litghost (Collaborator, Author)

Methods: Node to edge mutation

There are 5 methods in this category:

  • partition_edges
  • add_edge
  • set_num_edges
  • set_edge_sink_node
  • set_edge_switch

Usage:

  • partition_edges is used exactly once, during global rr graph node/edge partitioning.
  • set_num_edges and set_edge_sink_node are used in two places:
    • alloc_and_load_edges, which is taking a complete edge list and copying it into the t_rr_node data structure
    • process_edges, which is taking a complete edge list from disk and copying it into the t_rr_node data structure
  • set_edge_switch is used in three locations:
    • alloc_and_load_edges, which is taking a complete edge list and copying it into the t_rr_node data structure
    • process_edges, which is taking a complete edge list from disk and copying it into the t_rr_node data structure
    • ClockRRGraphBuilder::add_rr_switches_and_map_to_nodes, which does a strange switch index remap from arch_switch_idx to rr_switch_idx
  • add_edge is used in 6 locations:
    • ClockRib::create_rr_nodes_and_internal_edges_for_one_instance
    • ClockSpine::create_rr_nodes_and_internal_edges_for_one_instance
    • RoutingToClockConnection::create_switches
    • ClockToClockConneciton::create_switches
    • ClockToPinsConnection::create_switches
    • alloc_and_load_edges calls add_edge in some circumstance when t_rr_node already has edges allocated (when???)

Notes:

  • The two fast paths (alloc_and_load_edges and process_edges) are bulk edge insertions, and can easily be rewritten against any new implementation
  • The clock builder code uses an incremental edge accretion pattern (relying on add_edge)

Conclusions:

  • The number of places that mutate the edge list is very small:
    1. RR graph reader
    2. alloc_and_load_edges
    3. The rr clock builder

@litghost (Collaborator, Author) commented Jan 23, 2020

Proposed refactoring

There are several parts to this, but here is the key idea:

  • t_rr_node will never own edge storage
  • There will be one entity responsible for edge storage
  • Write methods (Node to edge mutation) will only be allowed prior to partitioning
  • Read methods (Get node to edge relationships and Get data for an edge) will be forbidden or expensive prior to partitioning
  • Read methods will be cheap after partitioning

Implementation details (post-partitioning)

The implementation details post-partitioning are likely the easiest to describe. Edge data will be keyed by RREdgeId, which is simply a linear index into one or more vectors of edge data. t_rr_node will store the first RREdgeId that belongs to that RRNodeId. The end() RREdgeId for a t_rr_node will simply be this[1].first_edge_id. This does imply a dummy t_rr_node at the end of the rr_node vector.

During edge partitioning, the flat edge data array will be sorted by the tuple(RRNodeId, IsNonConfigurable). This means that a particular RRNodeId's edge data is consecutive. It also means that configurable edges for a particular RRNodeId will appear before the first non-configurable edge (if any).

To make this concrete in some pseudo-code:

// A node's edges occupy the contiguous range [this[0].first_edge_id, this[1].first_edge_id),
// with configurable edges appearing before non-configurable ones.
iterator t_rr_node::edge_begin() {
  return g_edges.begin(this[0].first_edge_id);
}
iterator t_rr_node::edge_end() {
  return g_edges.begin(this[1].first_edge_id);
}
iterator t_rr_node::non_configurable_edge_begin() {
  return std::find_if(edge_begin(), edge_end(), [](const V& v) {
    return g_edges.is_non_configurable(v);
  });
}
iterator t_rr_node::non_configurable_edge_end() {
  return edge_end();
}
iterator t_rr_node::configurable_edge_begin() {
  return edge_begin();
}
iterator t_rr_node::configurable_edge_end() {
  return non_configurable_edge_begin();
}
difference_type t_rr_node::num_edges() {
  return std::distance(edge_begin(), edge_end());
}
difference_type t_rr_node::num_configurable_edges() {
  return std::distance(configurable_edge_begin(), configurable_edge_end());
}
difference_type t_rr_node::num_non_configurable_edges() {
  return std::distance(non_configurable_edge_begin(), non_configurable_edge_end());
}
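The accessors above rely on the partitioning invariant. A minimal sketch of the one-time sort that could establish it (the flat edge record and its fields are made up for illustration):

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical flat edge record used only for this illustration.
struct FlatEdge {
    uint32_t src_node;
    uint32_t sink_node;
    uint16_t switch_id;
    bool configurable;
};

// Sort by (src_node, !configurable): each node's edges become contiguous, and
// its configurable edges come before its non-configurable ones.
void partition_edges(std::vector<FlatEdge>& edges) {
    std::sort(edges.begin(), edges.end(), [](const FlatEdge& a, const FlatEdge& b) {
        return std::make_tuple(a.src_node, !a.configurable)
             < std::make_tuple(b.src_node, !b.configurable);
    });
}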

@litghost (Collaborator, Author)

There are effectively 3 edge mutation paths, and the confusing thing is that the RR clock graph generator is located in an odd place and does the graph mutation in a weird way.

The 3 edge mutation implementations are:

  • alloc_and_load_rr_graph, which does several semi-batch edge insertions via t_rr_edge_info_set and alloc_and_load_edges. The reason I call this path semi-batch is that alloc_and_load_edges has a fallback path that uses add_edge if an rr node has already been preallocated.
  • The rr graph reader's process_edges, which is fully batch edge insertion.
  • The code path starting from ClockRRGraphBuilder::create_and_append_clock_rr_graph, which does purely incremental edge insertions, and then a "weird" arch switch index to rr switch index mapping in clock_graph.add_rr_switches_and_map_to_nodes

In general the ClockRRGraphBuilder code is problematic for the two-phase approach, but I think its current problems are simply bugs, and they should be fixed before any refactoring is started. I'll create a new issue documenting the problems that I see, and propose some solutions.

@kmurray (Contributor) commented Jan 29, 2020

Proposed refactoring

There are several parts to this, but here is the key idea:

* `t_rr_node` will never own edge storage

* There will be one entity responsible for edge storage

* Write methods (`Node to edge mutation`) will only be allowed prior to partitioning

* Read methods (`Get node to edge relationships` and `Get data for an edge`) will be forbidden or expensive prior to partitioning

* Read methods will be cheap after partitioning

This all sounds good to me!

The end() RREdgeId for a t_rr_node will simply be this[1].first_edge_id. This does imply a dummy t_rr_node at the end of the rr_node vector.

If we take this approach, we'll have to think carefully about how to handle the dummy. There are a number of places which do loops like:

for (size_t inode = 0; inode < rr_nodes.size(); ++inode) {

which could be tripped up by the dummy at the end.
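One possible mitigation, purely illustrative and not a committed design: hide the sentinel inside the node storage so that size() only counts real nodes, leaving index-based loops like the one above untouched.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative node record and storage wrapper (hypothetical names).
struct NodeRecord {
    uint32_t first_edge_ = 0;  // first RREdgeId owned by this node
};

class NodeStorage {
  public:
    // Excludes the sentinel, so "for (inode = 0; inode < size(); ++inode)" is safe.
    size_t size() const { return nodes_.empty() ? 0 : nodes_.size() - 1; }
    NodeRecord& operator[](size_t i) { return nodes_[i]; }

    // Append the dummy terminator once all real nodes and edges are known;
    // it acts as the end() of the last real node's edge range.
    void finalize(size_t num_nodes, uint32_t total_edges) {
        nodes_.resize(num_nodes + 1);
        nodes_.back().first_edge_ = total_edges;
    }

  private:
    std::vector<NodeRecord> nodes_;
};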

iterator t_rr_node::edge_begin() {
return g_edges.begin(this[0].first_edge_id);
}
iterator t_rr_node::edge_end() {
return g_edges.begin(this[1].first_edge_id);
}

This is a pretty clever way to eliminate the storage of num_edges_ in t_rr_node. However it has two drawbacks which will need to be addressed:

  1. Using array indexing on this (e.g. this[1]) is rather non-intuitive. This will need to be well commented to outline what it's doing.
  2. This couples the RR node ordering and RR edge ordering together fairly strongly (e.g. re-ordering RR nodes independently of the edges would break edge iteration). We'll need to very clearly outline (and likely write some automated validators for) the assumptions/invariants this imposes; a sketch of such a validator follows below. (In contrast, storing a first_edge_id + num_edges_ keeps the node/edge ordering independent.)

My thoughts on whether the this[] or num_edges_ approach makes the most sense will likely depend on whether it actually slims down the size of t_rr_node, or we just end up with empty padding when eliminating num_edges_.
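To illustrate what such an automated validator might check (hypothetical layout: a first_edge offset table with a sentinel, plus a per-edge configurability flag): offsets must be non-decreasing, the sentinel must equal the total edge count, and within each node's range configurable edges must precede non-configurable ones.

#include <cstdint>
#include <vector>

// Sketch of an invariant validator for the this[1] scheme (names hypothetical).
bool validate_edge_partitioning(const std::vector<uint32_t>& first_edge,    // size = num_nodes + 1
                                const std::vector<bool>& non_configurable)  // size = num_edges
{
    if (first_edge.empty() || first_edge.back() != non_configurable.size()) return false;

    for (size_t inode = 0; inode + 1 < first_edge.size(); ++inode) {
        if (first_edge[inode] > first_edge[inode + 1]) return false;  // offsets must not decrease

        bool seen_non_configurable = false;
        for (uint32_t iedge = first_edge[inode]; iedge < first_edge[inode + 1]; ++iedge) {
            if (non_configurable[iedge]) {
                seen_non_configurable = true;
            } else if (seen_non_configurable) {
                return false;  // configurable edge found after a non-configurable one
            }
        }
    }
    return true;
}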

* `t_rr_node` will never own edge storage

return g_edges.begin(this[0].first_edge_id);

The issue of who actually owns the edge data, and where it's located, is another thing to consider as well.

To be consistent with the current implementation, it would make sense to stick it alongside device_ctx.rr_nodes, so something like device_ctx.rr_edges. Although then that will expose the rr_edges implementation to broader VPR (where it is currently 'hidden' within t_rr_node).

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

@litghost (Collaborator, Author)

The issue of who actually owns the edge data, and where it's located, is another thing to consider as well.

To be consistent with the current implementation, it would make sense to stick it alongside device_ctx.rr_nodes, so something like device_ctx.rr_edges. Although then that will expose the rr_edges implementation to broader VPR (where it is currently 'hidden' within t_rr_node).

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

I've sidestepped this entire problem by implementing a flyweight RR node and an rr node storage class (see #1084), and then having the rr node storage class handle this gracefully (#1085).
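For readers unfamiliar with the pattern, a bare-bones illustration of the flyweight idea (this is not the actual #1084/#1085 code): the node handed to callers is a thin (storage, id) handle, and all data lives in the storage class.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustration only; names and layout are hypothetical.
class RRNodeStorage {
  public:
    size_t num_edges(uint32_t node) const {
        return first_edge_[node + 1] - first_edge_[node];
    }
    std::vector<uint32_t> first_edge_;  // size = num_nodes + 1 (includes sentinel)
};

// Flyweight node: owns no edge storage, forwards every query to the storage class.
class RRNodeView {
  public:
    RRNodeView(const RRNodeStorage* storage, uint32_t id) : storage_(storage), id_(id) {}
    size_t num_edges() const { return storage_->num_edges(id_); }

  private:
    const RRNodeStorage* storage_;
    uint32_t id_;
};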

@litghost (Collaborator, Author)

This ownership issue was part of the original motivation for the RRGraph object refactoring, where the graph owns both the nodes and edges, and hides the details of the underlying data layout.

No question. I've attempted to take an approach to the rr graph refactoring that is more incremental and allows comparisons at each change along the way.

@tangxifan (Contributor)

FYI @vaughnbetz / @kmurray this is the issue we were talking about

@tangxifan This issue is relevant to your refactoring. In particular, your first iteration (https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/device/rr_graph_obj.h#L804-L808) preserves the same one-allocation-per-rr-node pattern.

I'm going to be making several further posts fleshing out my notes on this change.

Thanks for pointing this out. If separate storage of rr_node and rr_edge can bring benefits, we can experiment with it in another version of the RRGraph object.
This is why I am in favor of a unified object for the routing resource graph: if you want to rework the internal organization, you do not need to bother other parts as long as you keep the APIs unchanged.

litghost mentioned this issue Jan 30, 2020

@vaughnbetz (Contributor)

@tangxifan: @litghost had another good point in a recent meeting; possibly has already been discussed with you or posted, but in case not:

Your WIP rr-graph data structure refactoring proposal would result in one more level of indirection when going from a node to its outgoing edges (the key operation in the router): node->start_edge_index->edges vs. the current node->edges (variable names made up here, but hopefully they get the point across).
We'll almost certainly want to avoid adding more levels of indirection, so it is worth rethinking that design choice (which can be avoided with some data structure changes).

@tangxifan (Contributor)

@tangxifan: @litghost had another good point in a recent meeting; possibly has already been discussed with you or posted, but in case not:

Your WIP rr-graph data structure refactoring proposal would result in one more level of indirection when going from a node to its outgoing edges (the key operation in the router): node->start_edge_index->edges vs. the current node->edges (variable names made up here, but hopefully they get the point across).
We'll almost certainly want to avoid adding more levels of indirection, so it is worth rethinking that design choice (which can be avoided with some data structure changes).

Got it. Let me try to apply this to the RRGraph data structure.
