Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wire lookahead runtime scales poorly with number of switch/segment types #2811

Open
petergrossmann21 opened this issue Nov 15, 2024 · 5 comments

Comments

@petergrossmann21
Copy link
Contributor

During testing of the delay modeling approach I am developing (related to #2664, although I am not using that feature just yet to help stage the debugging process), my testing flow has matured to the point where I can comment more definitively on the impact of adding a large number of switch and segment types on router lookahead runtime. This was flagged as a possible concern by @rs-dhow in previous discussions.

It appears that when the router lookahead algorithm has a large number of switch/segment types to manage (I am generating segments and switches in matched pairs, so the number of segment types and switch types in my architecture models are equal), the lookahead time does indeed increase. I'm putting together some slides and additional experiments to describe the results in more detail.

Expected Behaviour

For a small architecture (1500 LUTs), I would expect little to no increase in router lookahead runtime if I increase the segment count by a significant amount.

Current Behaviour

For a small architecture (~1500 LUTs), I observe that when the switch list and segment lists counts increase from 2 to ~200, the router lookahead runtime increases from ~6 seconds to ~50 seconds and dominates total router runtime. I expect to be able to show further degradation with larger architectures; experiments to quantify such a trend are in flight.

Possible Solution

I have not looked into possible solutions yet, but this is central enough to my IC development needs that I could invest time into developing a PR to resolve the issue. A collaboration with more seasoned VPR developers is likely the most effective approach, even if I do most of the heavy lifting.

Steps to Reproduce

  1. Please contact me to make arrangments for test case sharing, as I have not yet open-sourced all components used in generating the results described above.

Context

The goal of the delay model I am attempting to generate is to harvest ASIC EDA software output for use in delay annotation of the components of an eFPGA that will be productized. The delay information is unique on a per tile-type, per routing channel basis, so the number of unique delay parameters is O(1K) to O(10K). For accurate delay modeling, VPR's algorithms thus need to scale well as the number of segment types and switch types become large.

Your Environment

  • VTR revision used: de31f09
  • Operating System and version: Ubuntu 20.04
  • Compiler version: GCC 10.5.0
@soheilshahrouz
Copy link
Contributor

The size of map router lookahead grows lineaerly with the number of segment types.

/* provides delay/congestion estimates to travel specified distances
* in the x/y direction */
// This is a 6D array storing the cost to travel from a node of type CHANX/CHANY to a point that is dx, dy further, and is on the "layer_num" layer.
// To store this information, the first index is the layer number that the node under consideration is on, the second index is the layer number of the target node, the third index represents the type of channel (X/Y)
// that the node under consideration belongs to, the forth is the segment type (specified in the architecture file under the "segmentlist" tag), the fourth is the
// target "layer_num" mentioned above, the fifth is dx, and the last one is dy.
typedef vtr::NdMatrix<util::Cost_Entry, 6> t_wire_cost_map; //[0..num_layers][0..num_layers][0..1][[0..num_seg_types-1][0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1]
//[0..1] entry distinguish between CHANX/CHANY start nodes respectively
// The first index is the layer number that the node under consideration is on, and the second index
// is the layer number that the target node is on.

Router lookahead computation involves iterating over all segment types, sampling nodes for each type, and building cost maps. This likely explains why computation time increases as the number of segment types grows.

//Profile each wire segment type
for (int from_layer_num = 0; from_layer_num < grid.get_num_layers(); from_layer_num++) {
for (const auto& segment_inf : segment_inf_vec) {
std::vector<e_rr_type> chan_types;
if (segment_inf.parallel_axis == X_AXIS)
chan_types.push_back(CHANX);
else if (segment_inf.parallel_axis == Y_AXIS)
chan_types.push_back(CHANY);
else //Both for BOTH_AXIS segments and special segments such as clock_networks we want to search in both directions.
chan_types.insert(chan_types.end(), {CHANX, CHANY});
for (e_rr_type chan_type : chan_types) {
util::t_routing_cost_map routing_cost_map = util::get_routing_cost_map(longest_seg_length,
from_layer_num,
chan_type,
segment_inf,
std::unordered_map<int, std::unordered_set<int>>(),
true);
if (routing_cost_map.empty()) {
continue;
}
/* boil down the cost list in routing_cost_map at each coordinate to a representative cost entry and store it in the lookahead
* cost map */
set_lookahead_map_costs(from_layer_num, segment_inf.seg_index, chan_type, routing_cost_map);
/* fill in missing entries in the lookahead cost map by copying the closest cost entries (cost map was computed based on
* a reference coordinate > (0,0) so some entries that represent a cross-chip distance have not been computed) */
fill_in_missing_lookahead_entries(segment_inf.seg_index, chan_type);
}
}
}
}

Two potential solutions are:

  • Compute cost maps for different segment types in parallel.
  • Use a smaller set of representative segment types to generate the lookahead.

With the first solution, the memory usage of the router lookahead still increases linearly with the number of segment types. However, if each segment type is confined to a limited area of the device, this approach reduces reliance on the translation-invariance assumption in the router lookahead.

@vaughnbetz
Copy link
Contributor

If the key reason for having many different segment types (hundreds to thousands) is to have different delays on them, the most direct solution would be to allow the RRgraph to store a different delay (or delay offset) per wire (not wire type). We would hide the details behind a rr_node.delay() or some such method. This would allow you to use a small number of truly distinct segment types, each of which gives the average delay of a wire of that type (this makes things backward-compatible: if you don't need per wire delays, just don't specify them as the average == the delay of node(i)).

Then we'd need to make some method(s) to load those delay offsets per rr_node. They could be stored in a vector from [0..rr_node.size()-1] so only those who want them pay the memory for them. One method could load every delay for every rr_node. Another might load them from a smaller file that gives a list of segment type / track number and delay offsets.

Do you need R and C to be different for each of these wire types, or just the delay? If we need more per wire data, we might want to have rr_node.phys_info_id instead of delay_offset, but then use a similar approach to load all the per wire unique data, with the segment level data again being defined to be the average.

@petergrossmann21
Copy link
Contributor Author

@vaughnbetz a couple of follow-on notes:

  1. Upon further review of my setup, it turns out that the reported runtimes are for a build of vpr with BUILD_TYPE=debug. I will need to do some reporting on results with a normal build as well; early indicators are that a standard build mutes this issue, although to @soheilshahrouz 's point the linear scaling will always be there.
  2. That said, I think that from a hardware perspective allowing the RR graph to store different delays per wire rather than per wire type is much more intuitive. The solution I have been prototyping feels like a workaround to cope with your proposed feature not being implemented.
  3. I personally am not using R and C because my EDA flow for delay parameter collection lumps gate delays and wire delays together, so if I set R and C to a nonzero value I end up counting wire delay twice. I can imagine that this may not be generally true. A user wishing to adopt a per-wire delay annotation method where the intrinsic gate delay and wire delay are captured separately would want to be able to set R and C values on individual wires (for example, by parsing a SPEF file and then transferring its RC data to VPR).

@tangxifan
Copy link
Contributor

Upvote for the feature where each MUX has individual RC values. Prefer to have Tdel which offers an easy way to annotate. Using RC modeling will cause extra effort and increase difficulty level when performing timing extraction from layouts.

@vaughnb-cerebras
Copy link

This should be 8x faster in recent builds (we changed the router lookahead builder config. But the discussion of how to handle many different wire delays (due to post-layout-extracted delays) is still a good one that we'll keep talking about in the weekly developer meeting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants