NoC Turn Model Routing Algorithms #2496

Merged
merged 55 commits into from
Apr 1, 2024

soheilshahrouz
Contributor

Description

This PR adds four new NoC routing algorithms from the Turn Model paper. The XY routing algorithm can only generate a single route for a given source/destination pair. The new algorithms exploit path diversity in a mesh topology and can generate multiple routes. This PR also adds a new class that checks generated NoC routes for deadlock freedom.
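For illustration, each turn-model algorithm can be viewed as a function that returns the legal minimal output directions at the current router. This is a minimal sketch, not VPR's implementation: it assumes +x is east and +y is north, takes `dx`/`dy` as destination minus current coordinates, and uses a hypothetical `Dir` enum and function names. Each algorithm prohibits certain turns (west-first: any turn into west; north-last: any turn out of north; negative-first: any turn from east/north into west/south), which forces some hops to be taken first or last.

```cpp
#include <vector>

// Hypothetical direction type; VPR's own names may differ.
enum class Dir { East, West, North, South };

// All minimal (distance-reducing) directions toward the destination.
std::vector<Dir> minimal_dirs(int dx, int dy) {
    std::vector<Dir> d;
    if (dx > 0) d.push_back(Dir::East);
    if (dx < 0) d.push_back(Dir::West);
    if (dy > 0) d.push_back(Dir::North);
    if (dy < 0) d.push_back(Dir::South);
    return d;
}

// West-first: turns into West are prohibited, so all westward hops come first.
std::vector<Dir> west_first(int dx, int dy) {
    if (dx < 0) return {Dir::West};
    return minimal_dirs(dx, dy);
}

// North-last: turns out of North are prohibited, so northward hops come last.
std::vector<Dir> north_last(int dx, int dy) {
    if (dy > 0 && dx != 0) return {dx > 0 ? Dir::East : Dir::West};
    return minimal_dirs(dx, dy);
}

// Negative-first: turns from a positive (E/N) to a negative (W/S) direction
// are prohibited, so all negative-direction hops come first.
std::vector<Dir> negative_first(int dx, int dy) {
    std::vector<Dir> neg;
    if (dx < 0) neg.push_back(Dir::West);
    if (dy < 0) neg.push_back(Dir::South);
    if (!neg.empty()) return neg;
    return minimal_dirs(dx, dy);
}
```

Whenever more than one direction is returned, the router may pick either, which is where the path diversity comes from; XY routing always yields exactly one direction.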

Motivation and Context

After NoC congestion modeling was added to VPR, it turned out that the XY routing algorithm cannot eliminate congestion in some benchmarks. Turn Model routing algorithms exploit path diversity and distribute traffic flows more evenly across NoC links. Combined with NoC congestion modeling, these algorithms are more effective at reducing NoC link congestion.

The main reason we do not route traffic flows arbitrarily is to avoid deadlock. Turn Model routing algorithms guarantee deadlock freedom; however, VPR does not currently check for it. This PR implements a class representing a channel dependency graph, which is used to detect possible deadlocks in a NoC routing solution.
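A channel dependency graph (CDG) has one node per NoC link (channel) and an edge from link a to link b whenever some route uses a immediately followed by b; a cycle in this graph means packets could end up waiting on each other in a loop. The sketch below is a minimal, hypothetical version of that check, not VPR's `ChannelDependencyGraph` class: links are assumed to be plain integer IDs, routes are plain vectors of link IDs, and cycle detection uses a standard three-color depth-first search.

```cpp
#include <cstddef>
#include <vector>

// A route is the ordered list of link IDs it traverses (IDs assumed 0..num_links-1).
using Route = std::vector<int>;

// Edge a -> b whenever some route uses link a immediately before link b.
std::vector<std::vector<int>> build_cdg(int num_links, const std::vector<Route>& routes) {
    std::vector<std::vector<int>> adj(num_links);
    for (const Route& r : routes)
        for (std::size_t i = 0; i + 1 < r.size(); ++i)
            adj[r[i]].push_back(r[i + 1]);
    return adj;
}

// Recursive DFS; reaching a node that is still on the current path (gray) means a cycle.
bool dfs_finds_cycle(int u, const std::vector<std::vector<int>>& adj, std::vector<int>& color) {
    color[u] = 1;  // gray: on the current DFS path
    for (int v : adj[u]) {
        if (color[v] == 1) return true;
        if (color[v] == 0 && dfs_finds_cycle(v, adj, color)) return true;
    }
    color[u] = 2;  // black: fully explored, no cycle through u
    return false;
}

bool has_cycle(const std::vector<std::vector<int>>& adj) {
    std::vector<int> color(adj.size(), 0);  // 0 = white (unvisited)
    for (std::size_t u = 0; u < adj.size(); ++u)
        if (color[u] == 0 && dfs_finds_cycle(static_cast<int>(u), adj, color)) return true;
    return false;
}
```

For example, four flows whose routes chain links 0→1, 1→2, 2→3 and 3→0 form a dependency cycle, so `has_cycle(build_cdg(4, routes))` returns true, while routes 0→1 and 1→2 alone do not.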

This PR also updates the traffic flow files for the old synthetic benchmarks. In some of them, a congestion-free solution was impossible because some traffic flows had a bandwidth higher than the link bandwidth. The traffic flow bandwidths in these files are updated so that a congestion-free solution is feasible.

New synthetic benchmarks have been added to represent designs that access HBM channels. In these benchmarks, some NoC routers are locked at the boundary of the device to imitate HBM channels that are accessible through a NoC router.

How Has This Been Tested?

QoR has been measured on modified synthetic NoC benchmarks.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

std::hash<vtr::StrongID>() was implemented as the identity function. The generated hash values were very small, causing the west-first algorithm to almost always choose the right direction when down/up was also available.
…elect_next_direction.

The size of size_t is machine and implementation dependent.
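The identity-hash and `size_t` notes above can be illustrated with a small, hypothetical example: a strong-ID wrapper whose `std::hash` specialization passes the ID through the splitmix64 finalizer instead of returning it unchanged, so consecutive small IDs spread across the full 64-bit range. `DemoId` and `mix64` are illustrative names, not the `vtr::StrongId` implementation, and the mixing is done on `std::uint64_t` explicitly because the width of `size_t` is platform dependent.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical strong-ID wrapper used only for this illustration.
struct DemoId {
    std::uint64_t value;
};

// splitmix64 finalizer: a bijective mix that spreads small inputs over all 64 bits.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

namespace std {
template <>
struct hash<DemoId> {
    std::size_t operator()(const DemoId& id) const noexcept {
        // Mixing is done in fixed 64-bit arithmetic; the result is then narrowed
        // to size_t (a no-op on 64-bit platforms).
        return static_cast<std::size_t>(mix64(id.value));
    }
};
}  // namespace std
```

With the identity hash, every ID below 2^63 hashes into the lower half of the 64-bit range; after mixing, roughly half of the IDs 1..1000 land in the upper half, so any decision derived from the hash bits is no longer biased toward one outcome.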
A cycle in NoC routing configuration may cause deadlock when packets wait on each other in a cycle.
Odd-even algorithm needs to know where the source router was, so we need to add it as a new argument to get_legal_directions.
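As a sketch of why the source column matters, the function below follows the ROUTE procedure of Chiu's odd-even turn model: east-to-north/south turns are forbidden in even columns and north/south-to-west turns are forbidden in odd columns, but at the source column an eastbound packet may still move vertically because it has not made an eastward hop yet. Columns are assumed to be numbered from 0 with +y as north; the `Dir` enum and the `odd_even_dirs` signature are hypothetical and need not match VPR's `get_legal_directions`.

```cpp
#include <vector>

// Hypothetical direction type; VPR's own names may differ.
enum class Dir { East, West, North, South };

// xs: source column; (xc, yc): current router; (xd, yd): destination router.
std::vector<Dir> odd_even_dirs(int xs, int xc, int yc, int xd, int yd) {
    std::vector<Dir> dirs;
    const int e0 = xd - xc;  // remaining horizontal offset
    const int e1 = yd - yc;  // remaining vertical offset
    if (e0 == 0) {
        // Already in the destination column: only vertical moves remain.
        if (e1 > 0) dirs.push_back(Dir::North);
        else if (e1 < 0) dirs.push_back(Dir::South);
    } else if (e0 > 0) {  // eastbound
        if (e1 == 0) {
            dirs.push_back(Dir::East);
        } else {
            // E->N / E->S turns are only allowed in odd columns, or at the
            // source column where no eastward hop has been taken yet.
            if (xc % 2 == 1 || xc == xs) dirs.push_back(e1 > 0 ? Dir::North : Dir::South);
            // Moving east is allowed unless it would reach an even destination
            // column with vertical hops still pending (that would need an E->N/S turn there).
            if (xd % 2 == 1 || e0 != 1) dirs.push_back(Dir::East);
        }
    } else {  // westbound
        dirs.push_back(Dir::West);
        // N->W / S->W turns are forbidden in odd columns, so vertical moves
        // are only offered while in an even column.
        if (xc % 2 == 0 && e1 != 0) dirs.push_back(e1 > 0 ? Dir::North : Dir::South);
    }
    return dirs;
}
```

For a flow from (0,0) to (2,2), the source router may go north or east, but after one eastward hop to odd column 1 the only legal move is north, because continuing east would arrive at even column 2 with vertical hops still left.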
I assumed the placer would place 64 logical routers in an 8x8 grid. There are 64*63 = 4032 traffic flows. Within the 8x8 grid, there are 112 links. If these traffic flows are routed with minimal paths, the generated routes would traverse 21504 edges. Ideally, the placement algorithm evenly distributes this aggregate bandwidth of 21504*BW over the 112 links.
I forgot that there are two links between two neighboring routers, so the total number of edges within an 8x8 grid is 224.
To minimize aggregate bandwidth, the placer would place 32 routers in a 6x6 grid. In such a placement, the traffic flow routes traverse 3680 links. There are 120 links within a 6x6 grid. Assuming the aggregate bandwidth of 3680*BW is evenly distributed among all 120 links, we can increase each traffic flow's bandwidth up to 120 * 1e6 / 3680 ≈ 3.26e4 (for a link bandwidth of 1e6) without causing congestion.
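The numbers in the notes above can be reproduced with a short calculation. This sketch assumes a full n x n mesh, minimal (Manhattan) routes, two directed links between neighboring routers, and the link bandwidth of 1e6 stated in the next note; it takes the 3680-link total for the 32-router placement as given, since that depends on the exact placement. The function names are illustrative.

```cpp
#include <cmath>
#include <cstdlib>

// Sum of |dx| + |dy| over all ordered (source, destination) router pairs
// in an n x n mesh; pairs with source == destination contribute 0.
long long total_minimal_hops(int n) {
    long long sum = 0;
    for (int x1 = 0; x1 < n; ++x1)
        for (int y1 = 0; y1 < n; ++y1)
            for (int x2 = 0; x2 < n; ++x2)
                for (int y2 = 0; y2 < n; ++y2)
                    sum += std::abs(x1 - x2) + std::abs(y1 - y2);
    return sum;
}

// An n x n mesh has 2*n*(n-1) neighbor pairs, each joined by two directed links.
int directed_links(int n) { return 2 * 2 * n * (n - 1); }

// Largest equal per-flow bandwidth if the aggregate load total_hops * bw
// were spread perfectly evenly over all links.
double max_flow_bandwidth(int links, double link_bw, long long total_hops) {
    return links * link_bw / static_cast<double>(total_hops);
}
```

`total_minimal_hops(8)` gives 21504, `directed_links(8)` gives 224, `directed_links(6)` gives 120, and `max_flow_bandwidth(120, 1e6, 3680)` gives about 32608.7, i.e. the 3.26e4 bandwidth that appears in the updated flow files.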
The link bandwidth is 1e6, and the central router sends data to the other 31 routers. Ideally, the total traffic bandwidth is divided equally over the 4 links of the central router. If each of these 4 links carries data for 8 routers (31/4 rounded up), we need to divide the link bandwidth by 8.

1e6 / 8 = 1.25e5
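The same estimate written out, assuming the 31 destinations are split as evenly as possible over the central router's 4 outgoing links:

```latex
\left\lceil \tfrac{31}{4} \right\rceil = 8 \ \text{flows per link}, \qquad
BW_{\text{flow}} \le \frac{10^{6}}{8} = 1.25 \times 10^{5}
```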
@github-actions github-actions bot added the VPR (VPR FPGA Placement & Routing Tool), lang-cpp (C/C++ code), and libvtrutil labels Mar 5, 2024
@soheilshahrouz
Contributor Author

This is the link to the QoR spreadsheet for the synthetic benchmarks. It is still being updated with new results; once it is complete, I'll post a summary here.

@vaughnbetz
Contributor

Looks good.
Should compare to the master branch on Titan with all this code, and on the NoC benchmarks.

@vaughnbetz
Contributor

Also nice to give a normalized / summary tab.

Contributor

@vaughnbetz vaughnbetz left a comment

Partly done ... let's edit more tomorrow or Wed.

vpr/src/noc/channel_dependency_graph.h (outdated, resolved)
vpr/src/noc/channel_dependency_graph.h (outdated, resolved)
vpr/src/noc/channel_dependency_graph.cpp (resolved)
vpr/src/noc/turn_model_routing.h (outdated, resolved)
vpr/src/noc/turn_model_routing.h (outdated, resolved)
vpr/src/noc/turn_model_routing.h (outdated, resolved)
vpr/src/noc/turn_model_routing.h (outdated, resolved)
vpr/src/noc/turn_model_routing.h (resolved)
vpr/src/noc/turn_model_routing.h (resolved)
vpr/src/noc/turn_model_routing.h (outdated, resolved)
@soheilshahrouz
Contributor Author

Aggregate bandwidth compared to master branch without congestion modeling

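| Congestion Weighting Factor | 0.0 | 0.25 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|---|
| XY | 1.00 | 1.04 | 1.04 | 1.04 | 1.06 | 1.08 |
| West First | 1.00 | 1.08 | 1.08 | 1.10 | 1.11 | 1.12 |
| Odd Even | 1.00 | 1.04 | 1.06 | 1.07 | 1.08 | 1.09 |
| North Last | 1.00 | 1.07 | 1.09 | 1.11 | 1.12 | 1.14 |
| Negative First | 1.00 | 1.07 | 1.10 | 1.11 | 1.12 | 1.13 |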

For all algorithms, the aggregate bandwidth increases as the congestion weighting factor grows. To avoid congestion, the placement engine places some routers farther apart to expose more path diversity, which makes certain traffic flows traverse more links and thus increases the overall bandwidth.

When the congestion weighting factor is zero, all algorithms have the same aggregate bandwidth. This is because they all use minimal routing and therefore generate routes with the same number of links. Since the other NoC-related cost terms depend only on the total number of links traversed by each traffic flow, regardless of which links form the route, all algorithms behave the same as XY routing on master without any congestion modeling.
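A minimal sketch of this argument, assuming aggregate bandwidth is the sum over flows of bandwidth times the number of links on the flow's route: under minimal routing every route has exactly |dx| + |dy| links, so the sum depends only on where routers are placed and not on which minimal route a turn-model algorithm picks. The `Flow` struct and function name are illustrative, not VPR's data structures.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical flow record: source/destination router coordinates and bandwidth.
struct Flow {
    int src_x, src_y, dst_x, dst_y;
    double bandwidth;
};

// Under minimal routing every route has |dx| + |dy| links, whichever minimal
// route is chosen, so this total is fixed once the placement is fixed.
double aggregate_bandwidth_minimal(const std::vector<Flow>& flows) {
    double total = 0.0;
    for (const Flow& f : flows)
        total += f.bandwidth * (std::abs(f.dst_x - f.src_x) + std::abs(f.dst_y - f.src_y));
    return total;
}
```

A flow from (0,0) to (2,1) with bandwidth 10 contributes 30 whether it is routed XY, west-first, or odd-even; only a different placement (or non-minimal routing) changes the total.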

@soheilshahrouz
Contributor Author

Aggregate latency compared to master branch without congestion modeling

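| Congestion Weighting Factor | 0.0 | 0.25 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|---|
| XY | 1.00 | 1.01 | 1.02 | 1.02 | 1.03 | 1.04 |
| West First | 1.00 | 1.07 | 1.07 | 1.08 | 1.08 | 1.10 |
| Odd Even | 1.00 | 1.03 | 1.05 | 1.06 | 1.07 | 1.07 |
| North Last | 1.00 | 1.04 | 1.06 | 1.07 | 1.09 | 1.10 |
| Negative First | 1.00 | 1.06 | 1.08 | 1.09 | 1.10 | 1.11 |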

The aggregate latency behaves similarly to the aggregate bandwidth, so the same arguments apply to it.

@soheilshahrouz
Contributor Author

Normalized average congestion ratio

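| Congestion Weighting Factor | 0.0 | 0.25 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|---|
| XY | 1.00 | 0.35 | 0.35 | 0.35 | 0.36 | 0.34 |
| West First | 0.88 | 0.11 | 0.10 | 0.08 | 0.08 | 0.06 |
| Odd Even | 0.79 | 0.10 | 0.11 | 0.10 | 0.08 | 0.06 |
| North Last | 0.83 | 0.16 | 0.16 | 0.13 | 0.11 | 0.13 |
| Negative First | 0.84 | 0.10 | 0.09 | 0.08 | 0.08 | 0.07 |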

Even when the congestion weighting factor is zero, Turn Model algorithms have a lower congestion ratio. Although these algorithms have the same placement as the XY algorithm, they can exploit path diversity to some extent and reduce congestion, whereas XY routing cannot benefit from path diversity at all.

Setting the congestion weighting factor to a non-zero value helps the placement algorithm reduce congestion for all routing algorithms. Increasing the weighting factor further improves congestion only slightly, and this comes at the cost of higher aggregate bandwidth and latency.

The odd-even algorithm performs better than the other algorithms. At a weighting factor of 5.0, it reduces congestion by 94% while increasing the aggregate bandwidth by only 9%. Some other algorithms achieve the same congestion improvement but increase the aggregate bandwidth by at least 12%.
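The congestion ratio itself is not defined in this thread. For intuition only, one plausible definition (an assumption, not necessarily the metric VPR reports) is each link's overload relative to its capacity, max(0, load - capacity) / capacity, averaged over all links. The sketch accumulates each flow's bandwidth on every link of its route and averages that ratio; `RoutedFlow` and `average_congestion_ratio` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical routed flow: the link IDs on its route and its bandwidth.
struct RoutedFlow {
    std::vector<int> links;
    double bandwidth;
};

// Assumed metric: mean over all links of max(0, load - capacity) / capacity.
double average_congestion_ratio(int num_links, double link_capacity,
                                const std::vector<RoutedFlow>& flows) {
    std::vector<double> load(num_links, 0.0);
    for (const RoutedFlow& f : flows)
        for (int l : f.links) load[l] += f.bandwidth;  // every link on the route carries the flow
    double sum = 0.0;
    for (double x : load) sum += std::max(0.0, x - link_capacity) / link_capacity;
    return num_links > 0 ? sum / num_links : 0.0;
}
```

With a capacity of 1e6, a link carrying 1.5e6 has a ratio of 0.5 and an under-utilized link has 0, so spreading flows over alternative links (path diversity) lowers the average even when total traffic is unchanged.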

@soheilshahrouz
Contributor Author

Placement time compared to master without congestion modeling

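| Congestion Weighting Factor | 0.0 | 0.25 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|---|
| XY | 1.06 | 1.07 | 1.10 | 1.10 | 1.09 | 1.09 |
| West First | 1.12 | 1.11 | 1.11 | 1.11 | 1.15 | 1.18 |
| Odd Even | 1.16 | 1.17 | 1.18 | 1.16 | 1.17 | 1.16 |
| North Last | 1.15 | 1.13 | 1.14 | 1.14 | 1.16 | 1.16 |
| Negative First | 1.15 | 1.16 | 1.16 | 1.14 | 1.15 | 1.16 |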

Congestion modeling and virtual method overrides in the new routing algorithms are the main reasons for the runtime increase. The number of swaps during placement stays almost constant.

@vaughnbetz
Contributor

Thanks, these results look good.

Contributor

@vaughnbetz vaughnbetz left a comment

Looks good ... some commenting suggestions and a few namespace suggestions attached.

vpr/src/noc/negative_first_routing.h (resolved)
vpr/src/noc/odd_even_routing.h (resolved)
vpr/src/noc/odd_even_routing.h (resolved)
vpr/src/noc/odd_even_routing.cpp (resolved)
vpr/src/noc/odd_even_routing.cpp (resolved)
@@ -0,0 +1,994 @@
<traffic_flows>
<single_flow src=".*noc_router_adapter_block_1[^\d].*" dst=".*noc_router_adapter_block_2[^\d].*" bandwidth="3.26e4" latency_cons="21e-9" />
Contributor

Comment what the purpose / style of this benchmark is.
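The `[^\d]` after `noc_router_adapter_block_1` in the flow pattern keeps `block_1` from also matching `block_10` through `block_19`. A quick way to check such a pattern is sketched here with `std::regex_match`, which requires the whole block name to match (mirroring the leading and trailing `.*`); this is only a stand-in and assumes nothing about how VPR itself applies these patterns. The helper name and the block names in the usage note are made up.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the entire block name matches the flow pattern.
bool matches_flow_pattern(const std::string& pattern, const std::string& block_name) {
    return std::regex_match(block_name, std::regex(pattern));
}
```

With the pattern `.*noc_router_adapter_block_1[^\d].*`, a name such as `top|noc_router_adapter_block_1|inst` matches, while `top|noc_router_adapter_block_12|inst` does not.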

@soheilshahrouz
Contributor Author

@vaughnbetz
I applied your comments. The modified code is ready for review. Please take a look at it when you have time.

@vaughnbetz
Contributor

Thanks. Can you also confirm there is no CPU time, memory use or quality change when NoC optimization is off? That should be the case, but I like to make sure. A run on one significant size circuit under controlled circumstances should be enough to confirm.

@soheilshahrouz
Contributor Author

I added a new sheet comparing the QoR of the Titan benchmarks on this branch with master. Here is the summary:

| Branch | pack_time | place_time | place_wl | place_cpd | place_mem |
|---|---|---|---|---|---|
| master | 522.97 | 1182.15 | 2567328.32 | 16.16 | 5249.09 |
| this branch | 516.90 | 1149.20 | 2567328.32 | 16.16 | 5251.66 |
| ratio | 0.988 | 0.972 | 1 | 1 | 1.00049 |

@vaughnbetz vaughnbetz merged commit 633193b into master Apr 1, 2024
100 checks passed
@vaughnbetz vaughnbetz deleted the noc_turn_model_routing branch April 1, 2024 22:11
Labels
external_libs, lang-cpp (C/C++ code), libvtrutil, VPR (VPR FPGA Placement & Routing Tool)
2 participants