Converter from rr_nodes to RRGraph object #1048
Conversation
Why do the golden files change?
This is mainly temporary, during our code construction process. To enforce incremental pull requests, we will first adapt the downstream routers to use RRGraph objects and then refactor the builders for RRGraph (you can see more details in our design document: https://docs.google.com/document/d/1LMIlpYoppFtuSi_OZZJNLZ7edWFuuC0js4hm1Ah0n2M/edit?usp=sharing). Therefore, at this stage, I will create a … Actually, I opened this point for discussion. Feel free to share how you would like the current golden results to be kept. |
I quickly ran this PR on one of the Titan23 designs; the results imply that the new object is using substantially more memory (5.7x) than the previous data structure. We probably need to look into why this is. I would expect the new RR graph object to use somewhat more (maybe 20-30%) but not this much more. My best guess is that some memory fragmentation is occurring, which is the problem. (I've previously seen these types of effects with the old RR graph as well.) Looking at the creation code, it seems you aren't doing any reservations for the per-node edge lists. While you do reserve the edge elements themselves, we don't reserve the RRGraph::node_in_edges_ or RRGraph::node_out_edges_ sub-vectors. That is probably the first thing I would try, as they will otherwise be constantly reallocated whenever we call RRGraph::add_edge(). After that, I've had good success using Valgrind's 'Massif' heap profiler to figure out memory usage hot spots. |
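A minimal self-contained sketch of the reservation idea (ToyGraph and its members are simplified stand-ins, not the real RRGraph API): pre-sizing each node's edge list from its known fan-out means the subsequent add_edge() calls never trigger a reallocation.

```cpp
#include <cstdio>
#include <vector>

// Simplified stand-in for the per-node edge lists discussed above.
struct ToyGraph {
    std::vector<std::vector<int>> node_out_edges_; // per-node out-edge lists

    // Reserve each node's edge list up front from its known fan-out,
    // so add_edge() never has to grow the sub-vectors incrementally.
    void reserve_nodes(const std::vector<int>& fanouts) {
        node_out_edges_.resize(fanouts.size());
        for (size_t i = 0; i < fanouts.size(); ++i) {
            node_out_edges_[i].reserve(fanouts[i]);
        }
    }

    void add_edge(int src, int edge_id) {
        node_out_edges_[src].push_back(edge_id); // no reallocation after reserve
    }
};

int main() {
    ToyGraph g;
    g.reserve_nodes({2, 1}); // node 0 has fan-out 2, node 1 has fan-out 1
    g.add_edge(0, 0);
    g.add_edge(0, 1);
    g.add_edge(1, 2);
    std::printf("node 0 fan-out: %zu\n", g.node_out_edges_[0].size());
}
```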
Thanks for the insights. Very helpful! I will follow this idea and optimize the memory usage. |
Yes, I believe it does. Since rr_node_indices is created at the same time as the RR graph (in the traditional RR graph builder), it should be included in the +612MiB.
Yes, they will have some overhead. I'd focus on the edge lists first, since I expect they will be the bigger issue. You are currently using:

```cpp
std::map<int, RRNodeId> old_to_new_rr_node;
```

Since both the key and value are small (2x32 bits), the overhead of a std::map BST node is large (at least 2x64-bit pointers per key-value pair). Since we know the old RR node indices are contiguous integers, you could achieve the same look-up effect with:

```cpp
std::vector<RRNodeId> old_to_new_rr_node(rr_nodes.size(), RRNodeId(-1));
```

which would be much lower overhead (no pointer overhead per node). Note that we sized it exactly to avoid dynamically growing it, which can cause memory fragmentation.

The edge look-up is more challenging, since the key is not a contiguous ID range:

```cpp
std::map<std::pair<int, int>, RREdgeId> old_to_new_rr_edge; //Key: (old node index, edge index)
```

Since we know precisely how many edges there are in the graph, you could do something like:

```cpp
std::vector<std::pair<std::pair<int, short>, RREdgeId>> old_to_new_rr_edges_vec;
old_to_new_rr_edges_vec.reserve(num_edges); //Exactly sized, never grows dynamically
for (size_t inode = 0; inode < rr_nodes.size(); ++inode) {
    for (short iedge : rr_nodes[inode].edges()) {
        RREdgeId edge_id = rr_graph.create_edge(/*src, sink, switch*/);
        old_to_new_rr_edges_vec.push_back({{(int)inode, iedge}, edge_id});
    }
}
vtr::flat_map<std::pair<int, short>, RREdgeId> old_to_new_rr_edges(std::move(old_to_new_rr_edges_vec));
```

which uses a vtr::flat_map (a map implemented as a sorted vector, rather than a BST), constructed directly from an exactly sized vector. That again avoids the 2x64-bit pointer overhead of each BST node used by std::map. |
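Lookups on the resulting flat_map then work the same way as on a std::map; a minimal sketch, assuming vtr::flat_map mirrors std::map's find()/end() interface:

```cpp
// Translate an old (node index, edge index) pair to the new RREdgeId.
auto iter = old_to_new_rr_edges.find({inode, iedge});
if (iter != old_to_new_rr_edges.end()) {
    RREdgeId new_edge_id = iter->second;
}
```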
Re-running on the same design: the object data structure now only uses 2x the baseline RR graph memory (down from 5.7x). |
Indeed, I found that the … However, we have to admit that … Any thoughts are warmly welcomed. |
Yes, this is true. Previously, when I've looked at this, I've found it helpful to work out the memory usage on a spreadsheet. It's much easier to experiment with a spreadsheet than to re-code everything. Take a look at this spreadsheet, which works out roughly where we spend memory. The major culprits are things which are O(edges), as there are far more edges than anything else. Optimizations which seem to be no-brainers:
```cpp
/* Fast look-up to search a node by its type, coordinate and ptc_num
 * Indexing of fast look-up: [0..xmax][0..ymax][0..NUM_TYPES-1][0..ptc_max][0..NUM_SIDES-1]
 */
typedef std::vector<std::vector<std::vector<std::vector<std::vector<RRNodeId>>>>> NodeLookup;
mutable NodeLookup node_lookup_;
```

to

```cpp
/* Fast look-up to search a node by its type, coordinate and ptc_num
 * Indexing of fast look-up: [0..xmax][0..ymax][0..NUM_TYPES-1][0..ptc_max][0..NUM_SIDES-1]
 */
typedef vtr::NdMatrix<std::vector<std::vector<RRNodeId>>, 3> NodeLookup;
mutable NodeLookup node_lookup_;
```

since the width, height and number of RR types are fixed (i.e. known) at RR graph construction time. This would avoid the std::vector overhead (24 bytes for each instance). We'd want to keep the inner two dimensions as std::vector, since they are not fully dense.
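With the dimensions known up front, the dense part can be sized once at construction; a sketch, assuming vtr::NdMatrix's dimension-array constructor and the usual device grid accessors:

```cpp
// Size the three dense dimensions once at RR graph build time; the inner
// ptc/side dimensions stay as std::vector since they are not fully dense.
node_lookup_ = NodeLookup({grid.width(), grid.height(), size_t(NUM_RR_TYPES)});
```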
Potential functionality preserving optimizations:
Since NUM_SIDES=4, we can get away with using 1 byte to store the size of the inner dimension, and since the number of PTCs is at most a few hundred, we can use 2 bytes for the size (see the sketch below).
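A sketch of this size compaction (names are illustrative stand-ins, not the actual VPR types): the innermost [ptc][side] lookup carries explicit compact sizes instead of the 24-byte header each nested std::vector would add.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

using RRNodeId = std::uint32_t; // stand-in for the strong node id type

struct PtcSideLookup {
    std::unique_ptr<RRNodeId[]> ids; // flattened [ptc][side] storage
    std::uint16_t num_ptc = 0;       // at most a few hundred -> 2 bytes suffice
    std::uint8_t num_sides = 0;      // NUM_SIDES = 4 -> 1 byte suffices

    RRNodeId at(std::uint16_t ptc, std::uint8_t side) const {
        return ids[std::size_t(ptc) * num_sides + side];
    }
};
```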
Potential Optimizations with more trade-offs:
Small optimizations which probably don't offer a big pay-off:
|
If I had to push for the simplest path forward it would be to just apply the obvious optimizations, followed by:
and if that is insufficient:
|
Discussing this with Kevin ... Checking if an edge is valid: check that the id is in range (0 <= id < edge_ids.size()) and probe the unordered_map to see if it is invalid. Compressing the rr-graph to remove invalid edges: walk through the hash table and put the invalid edge ids in a vector, sort it, then use that to walk (once) through the edge_ids and reindex them, since we can determine immediately by how much to reduce the value of every edge id. We should do the same thing for node ids, as it gives savings too. Expected savings from the spreadsheet Kevin attached: 11.8% from edges + 1.5% from nodes = 13.3% of the memory footprint. This should have no impact on runtime. |
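A sketch of the reindexing pass, assuming ids are plain integer indices (the real code would use the strong RREdgeId/RRNodeId types): given the sorted invalid ids, every surviving id is decremented by the count of invalid ids below it, in a single walk over the id space.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<std::size_t> build_reindexing(std::size_t num_ids,
                                          std::vector<std::size_t> invalid_ids) {
    std::sort(invalid_ids.begin(), invalid_ids.end());
    std::vector<std::size_t> new_id(num_ids);
    std::size_t num_removed = 0; // invalid ids passed so far
    for (std::size_t old_id = 0; old_id < num_ids; ++old_id) {
        if (num_removed < invalid_ids.size() && invalid_ids[num_removed] == old_id) {
            ++num_removed; // invalid: gets no new id (slot left unused)
        } else {
            new_id[old_id] = old_id - num_removed;
        }
    }
    return new_id;
}
```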
From the spreadsheet, bidirectional edges are costing about 21.8% of the memory footprint. They are convenient, though. I suggest optimizing everything else first and then thinking about that one, as deleting them is a real trade-off. |
I like the single node edge array + count delimiter option: 13% expected memory reduction, with no impact on typical users of the rr-graph. We lose constant-time insertion of an arbitrary type of edge, as we don't have unique vectors for each category, but only the rr-graph builder would care about that; it could just set up the edge_* arrays, and at the end the rr-graph builder could walk them to set up the node_edges* data. |
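A rough sketch of one possible single-array layout with count delimiters (field names and the sub-range ordering are illustrative assumptions, not the actual implementation): storing the sub-range sizes lets the per-category edge accessors be returned as offsets into one allocation, with no per-category vectors.

```cpp
#include <cstdint>

using RREdgeId = std::uint32_t; // stand-in for the strong edge id type

// edges: [ incoming | outgoing-configurable | outgoing-nonconfigurable ]
struct NodeEdgeStorage {
    RREdgeId* edges = nullptr;           // raw pointer: no capacity/size header
    std::uint16_t num_edges = 0;         // total entries in 'edges'
    std::uint16_t num_in_edges = 0;      // delimiter between incoming/outgoing
    std::uint16_t num_out_nonconfig = 0; // tail sub-range of the outgoing edges

    const RREdgeId* in_begin() const { return edges; }
    const RREdgeId* in_end() const { return edges + num_in_edges; }
    const RREdgeId* out_begin() const { return edges + num_in_edges; }
    const RREdgeId* out_end() const { return edges + num_edges; }
};
```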
… to identify groups
This avoids exposing the details of how we are tracking invalid edges throughout the RRGraph implementation code.
Previously used a std::unordered_map but the value was unused.
This should be more run-time and memory efficient than creating a vector of the entire range and returning it.
Users of the RRGraph shouldn't care how the edge/node iteration is implemented so move the implementation below the public methods to improve readability.
We now use a single array to store the edges associated with a node. The array is sorted into sub-ranges (e.g. incoming/outgoing, configurable/nonconfigurable), which still allows efficient iteration through the edge sub-ranges. We store the sizes of the various sub-ranges to enable quick determination of the delimiters between them. We also use a raw pointer for the edge array (rather than a vtr::vector), which further saves memory. The create_edge() routine no longer inserts the edge into the node data. Instead, a single call is made to rebuild_node_edges(), which walks the various RRGraph::edge_* members to precisely allocate the relevant node_edges_ and node_num_*_edges_ data members. The trade-off of this change is that the various node_*_edges() accessors will not work properly after an edge is added to the RR graph until rebuild_node_edges() is called. Currently it rebuilds the node edges from scratch, so calls to it should be minimized (the preferred approach is to add all edges and then call rebuild_node_edges() once). If needed, support for incrementally rebuilding node edges could be added (while the use of a single array per node in general requires insertion not at the end of the array, an O(n) operation, the number of edges per node is typically bounded by a small constant, so it would still be reasonably efficient); it is not currently required and so is left as *potential* future work should the need for it arise.
Instead clients should use the .size() member of the relevant range.
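A simplified stand-in for the id ranges described in these commit messages (the real code presumably uses vtr::Range over strong-id iterators): the range is iterated lazily instead of materializing a std::vector of every id, and clients query .size() from the bounds.

```cpp
#include <cstddef>

class IdRange {
  public:
    class iterator {
      public:
        explicit iterator(std::size_t v) : v_(v) {}
        std::size_t operator*() const { return v_; }
        iterator& operator++() { ++v_; return *this; }
        bool operator!=(const iterator& other) const { return v_ != other.v_; }
      private:
        std::size_t v_;
    };

    IdRange(std::size_t begin, std::size_t end) : begin_(begin), end_(end) {}
    iterator begin() const { return iterator(begin_); }
    iterator end() const { return iterator(end_); }
    std::size_t size() const { return end_ - begin_; } // what clients should query
  private:
    std::size_t begin_, end_;
};

// Usage: for (std::size_t id : IdRange(0, num_nodes)) { ... } -- O(1) memory.
```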
It seems that we have some errors in getting the Titan benchmarks.
|
All the benchmarks passed the … I am reworking the spreadsheet to record the golden memory results and the current memory usage. |
Yes, that seems reasonable. Also, your changes in 11f719a put the … |
Thanks for the advice. Just made the modification. |
We create a detailed routing (wires, switches) rr-graph for the placer, so we can profile / search the routing graph to produce delay lookups that the placer can quickly access. |
I have summarized the peak memory usage of the VPR flow and rr_graph in the spreadsheet. In the spreadsheet, you may notice that even when I turn on router-only mode, the memory statistics on … |
For the Titan benchmarks, what is the difference between the two router-only runs? I see two columns, with and without refactoring, for the router only runs. |
Sorry, that is a typo. The left part is the full VPR flow run, while the right part is the router-only run. |
Hi all, I understand that the memory footprint has become a critical concern here. To address that, I suggest unit tests before deployment. My plan is as follows:
As such, we can be confident before deployment that …
After this, we can continue our incremental refactoring. |
Hi all,
|
Continued effort on refactoring the Routing Resource Graph (RRG) to a unified RRGraph object. This change brings the RRGraph to the DeviceContext of the VPR engine, and it creates an RRGraph object by porting the rr_nodes information (the classic routing resource graph) to the refactored object. This change does not impact the routers, analyzers and drawers, which are the major clients of the routing resource graph; changes to these downstream functions will be made later.
Detailed design document:
https://docs.google.com/document/d/1LMIlpYoppFtuSi_OZZJNLZ7edWFuuC0js4hm1Ah0n2M/edit?usp=sharing
Description
1. An RRGraph object has been added to the DeviceContext data structure, in parallel to the classic rr_nodes. This object will replace the legacy data structures, i.e., rr_nodes, rr_node_indices, rr_switches, rr_segments, etc., when refactoring is done.
2. A loading function create_rr_graph() is introduced, which loads the rr_node information into the RRGraph object. This function duplicates the rr_node information.
3. Multiple bug fixes to the checking code for RRGraph.
4. Added default constructors in vtr_geometry.h to resolve compatibility issues with Clang compilers.
5. Changes in golden_results for the vtr_strong regression tests, mainly to relax the memory usage. This is temporary: since we duplicate the routing resource graph, the memory usage is expected to grow significantly. On my local machine, four tests in vtr_strong fail:
fails:stratixiv_arch.timing.xml/ucsb_152_tap_fir_stratixiv_arch_timing.blif/common max_vpr_mem: previous_golden = 1011288 current_golden = 1368492
non_column.xml/raygentop.v/common max_vpr_mem: previous_golden = 91316 current_golden = 193908
non_column_tall_aspect_ratio.xml/raygentop.v/common max_vpr_mem: previous_golden = 87860 current_golden = 191944
multiple_io_types.xml/raygentop.v/common max_vpr_mem: previous_golden = 417392 current_golden = 1258700
The peak memory usage will exceed the current QoR because the routing resource graph is one of the most memory-consuming data structures in VPR. During refactoring, we duplicate the data structure, and this leads to memory overhead.
To ease the integration, I have done a run to rewrite the golden results. Feel free to push back if you want only a few of them to be relaxed.
The golden results will certainly improve when refactoring is done and the routing resource graph is no longer duplicated.
Related Issue
Continued Effort on refactoring the Routing Resource Graph (RRG) to a unified RRGraph object.
#990
Motivation and Context
The aim is to group the discrete routing-related data structures into a unified object, which can ease router development and improve runtime and memory usage.
The RRGraph object is designed to describe, in a general way, how routing resources are connected in an FPGA fabric, replacing the classical rr_node and rr_edge data structures.
How Has This Been Tested?
Passed basic and strong regression tests with relaxation on memory usage. (See details in Description section 5)
CentOS, g++-8.2
TravisCI