Routing tables: A proposal #4
One drawback of this approach is that it doesn't directly allow address modification. This is great for switches that don't need to modify their network address, but not so great for routers between networks. Hence, if a packet comes into the router from IP1 on network 1 destined for IP2 on network 2, then the network addresses should be changed as the packet switches from network 1 to network 2. Specifically, the network addresses (i.e. Ethernet MAC) from the prior link need to be adjusted to those of the new link, where the source would now be the router's MAC address and the destination would be that of IP2 (assuming it exists on network 2). While it would be possible to map the "target" field to control a set of address modifiers and thus "correct" this lack, such packet modifications would need to be external to this routing algorithm at present.
A second drawback is found in the "Increment" instruction. For one, it scatters the performance metrics all over the table's memory area. This could be "fixed" by replacing the immediate field in the increment instruction with a performance table address. Likewise, there's no condition on the increment instruction--such as "If Z, increment." Hence, it would be difficult to capture metrics from the RZ and RNZ instructions separately from any metrics available for the RA instruction. Fixing this would require two more variants of the increment instruction.
As I see this in a block diagram sense, it would look something like the following: packets would come in using the AXIN protocol. A small buffer would exist within the router to buffer the packet header until the destination target had been determined. If this buffer ever overflowed before the target was known, the packet would be discarded and an error condition would be generated--in much the same way as with illegal instructions.
Dan
Our Wireguard-1GE is actually, for multiple reasons, using 128 bits in the Data Plane Engine (DPE) core. We have in the past used even 1K-bit wide tables of the keys to search for, so that multiple addresses can be compared on the same clock cycle. The search key capacity and time to complete are critical parameters of a real-life networking product. We have for that reason also always used the BRAM for the search key storage. Reaching out to an external DRAM is typically too expensive for performance when one has to match against hundreds of keys in the time allotment of a minimum Ethernet packet or, even worse, an ATM cell. A good state diagram for this processor would be worth a thousand words 😉
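The wide-table idea above can be modeled in a few lines. This is a hypothetical sketch, assuming a 1K-bit BRAM word packing eight 128-bit search keys; the names and widths are illustrative, not taken from any actual RTL.

```python
KEY_BITS = 128
KEYS_PER_WORD = 8  # 1024-bit BRAM word / 128-bit key

def match_wide_word(word_keys, search_key):
    """Model one 'clock cycle': compare the search key against every
    key packed into the wide BRAM word; return the first hit's slot."""
    for slot, key in enumerate(word_keys):
        if key == search_key:
            return slot
    return None

# Eight key slots in one wide word; unused slots hold zero here.
table_word = [0x863324_17BFDD, 0xDEADBEEF, 0x1234, 0, 0, 0, 0, 0]
assert match_wide_word(table_word, 0x1234) == 2
assert match_wide_word(table_word, 0xFFFF) is None
```

In hardware the eight comparators would of course run in parallel; the loop here only models which comparisons share one BRAM read.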
How about a "SORT" instruction? Many search algorithms count on a certain table organization, and typically must reorganize the table whenever an entry is inserted or deleted. Here is one of them.
"SORT" is a rather complicated operation for hardware, but much easier in software. It makes sense that the table might require some form of sorting within it, but it makes more sense for that sorting to be done before the table is made active to the hardware.
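That software-side split can be sketched as follows. This is a hedged illustration, not a proposed API: the CPU keeps a shadow copy of the table sorted (cheap in software), then publishes it in one step, so the hardware FSM only ever binary-searches a fully sorted table.

```python
import bisect

def cpu_insert(shadow, key, target):
    """CPU side: insert while keeping the shadow table sorted by key."""
    bisect.insort(shadow, (key, target))

def hw_lookup(active, key):
    """Hardware side: binary search the (already sorted) active table."""
    i = bisect.bisect_left(active, (key,))
    if i < len(active) and active[i][0] == key:
        return active[i][1]
    return None

shadow = []
cpu_insert(shadow, 0x0A000001, 3)
cpu_insert(shadow, 0x0A000002, 1)
active = list(shadow)   # "activate": swap the table over to the hardware
assert hw_lookup(active, 0x0A000002) == 1
assert hw_lookup(active, 0x0A000003) is None
```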
That's exactly why we are rooting for the Harvard architecture of this Search-and-Route (SAR) processor! By clearly separating Data from Instructions space, the design would facilitate co-processing scenarios, where a standard CPU would occasionally come in to make the life of the HW FSM easier. These interventions would likely be for the following reasons:
Then, we could take the Instructions to a high level that even forgoes the need for a compiler... such as by promoting them to Commands:
Config settings:
and Status:
This setup is more amenable to co-processing arrangements, effectively transforming the SAR Proc into a SAR CoProc, or Accelerator. That would arguably lead to a better performing system, getting away from instruction fetches, caches, bug catches and other degrading complications that come from programmability at the machine level. This also plays nicely with the FPGA kind of programmability, which can be exploited as an alternative to a SAR Proc program, or can reduce the SAR CoProc config options to only Table_Depth. Everything else can be a different FPGA program compiled from a pin-compatible Plug-and-Play RTL module that the user can swap in or out of their SOC by linking in the corresponding file. We should also define the capacity and performance objectives for the Search part of the SAR CoProc. While commercial implementations accommodate thousands of entries, we could start with a modest 1K at 10GE arrival rates...
I'm not sure I follow. Can you slow down, back up, and explain your thoughts please? Thank you.
Rather than designing a specialized processor (Master), look at this design task through the lens of a co-processor (intelligent and capable peripheral, but still a Slave). Forget about instruction fetches, assembler, compiler, and all that jazz, and design it as an FSM that an external CPU can:
and which can:
Make it operate directly on local BRAM-based tables that the external CPU can manage and sort in order to facilitate the searches. The bulk of its programmability, that is adaptability to the given use-case, comes from the softness of FPGA gates.
@eniokaljic, while the maximalist target capacity of Header Lookup Engine for Wireguard VPN apps in the 1Gbps performance bracket is
A high level of programmability is typically not a trait of embedded systems. They tend to be specific to the problem at hand. Especially so in the consumer space. On the other hand, the proposed Routing Processor seems to be catering to the P4 (Programming Protocol-independent Packet Processing) DSL kind of use cases. If that's indeed the goal, reviewing this reference might provide further guidance:
Indeed, the programmability of the lookup engine is imperative for the programmable data plane, because it increases the flexibility of the lookup implementation with the high granularity that reaches the application level. SDN (Software-Defined Networks) advocates moving programmability from the data plane to the control plane, while P4 tries to keep programmability within the data plane for better performance, i.e. faster packet processing and lower communication overhead between the data and control planes. There are many approaches to this problem (survey) and some are based on instruction-driven processing of packet headers like the one proposed in this project. For a 10G Ethernet switch, that level of programmability is not necessary, and more attention should be paid to performance. A 10GbE interface achieves a peak packet rate of 14 million pps (calculations). If we assume that the 10GbE switch has at least 4 ports and is implemented in an architecture with a shared lookup engine, the expected lookup engine load is 59.5 million pps. This leaves the lookup engine with a time budget of 16.8 ns. As the number of switch ports increases, the problem is exacerbated. For this reason, such switches use an architecture with dedicated lookup engines for each line interface. However, this opens a new set of problems related to the caching and synchronization of lookup tables. Perhaps solutions based on TCAM (associative lookup) could be considered for this 10GbE switch.
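The time-budget figures above can be reproduced directly. A minimum 10GbE frame occupies 64 B of frame plus 20 B of preamble and inter-frame gap on the wire:

```python
# Reproducing the lookup-engine time-budget arithmetic.
LINK_BPS = 10e9
MIN_FRAME_BYTES = 64 + 20          # 84 bytes on the wire per minimum frame
PORTS = 4

pps_per_port = LINK_BPS / (MIN_FRAME_BYTES * 8)   # peak packet rate per port
shared_pps = PORTS * pps_per_port                 # load on a shared engine
budget_ns = 1e9 / shared_pps                      # time budget per lookup

assert round(pps_per_port / 1e6, 2) == 14.88      # ~14.88 Mpps per port
assert round(shared_pps / 1e6, 1) == 59.5         # ~59.5 Mpps shared
assert round(budget_ns, 1) == 16.8                # ~16.8 ns per lookup
```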
... @ZipCPU it is worth taking time to read fully through the above two references. They provide both the big picture / context, and implementation options. When it comes to the latter, target capacity and performance are the first things you ought to define. That will drive further design decisions. As @eniokaljic noted:
P4 in action
While the current routing algorithm used by the eth10g controller is blazingly fast, it is not very configurable. Worse, it requires a lot of logic to manage. Further, the current routing method will only work on Ethernet headers, not IPv6 or IPv4 addressing, or IP/UDP or IP/TCP protocol information (ports). For this reason, a highly configurable memory-based routing table is of interest.
Many routing methods exist in the literature, often parameterized by how the table is structured. A user configurable structure, which can be optimized in many ways, might therefore be valuable.
I therefore propose the following routing table approach. It's based upon 32-bit "instruction" words, which will feed a small FSM based router.
The router shall maintain 3 registers: a "program counter", controlling where the next table word is read from, a "shift register", containing a 128b window into the current packet, and a 128b accumulator containing the results from comparing table values with the 128b window into the current packet. Once comparisons are complete, the router can either 1) jump to a new instruction on a match, 2) jump to a new instruction on a failed match, 3) route to a given address a) always, b) on a match, or c) on a failed match.
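The table walk described above can be sketched as a small software model. This is a hedged sketch, not the proposed encoding: only RA, RZ, and RNZ are named in this thread, so the "CMP", "JZ", and "JNZ" opcodes, the decoded-tuple form, and the rule that routing/jumping resets the match flag are all illustrative assumptions.

```python
def run_table(table, window):
    """Walk 'table' (a list of decoded instructions) against a packet
    window; return the chosen target, or None if the table runs out."""
    pc, match = 0, True
    while pc < len(table):
        op = table[pc]
        pc += 1
        if op[0] == "CMP":            # compare masked window bits
            _, mask, value = op
            match = match and ((window & mask) == value)
        elif op[0] == "JZ":           # jump on a match
            if match:
                pc = op[1]
            match = True
        elif op[0] == "JNZ":          # jump on a failed match
            if not match:
                pc = op[1]
            match = True
        elif op[0] == "RZ":           # route on a match
            if match:
                return op[1]
            match = True
        elif op[0] == "RNZ":          # route on a failed match
            if not match:
                return op[1]
            match = True
        elif op[0] == "RA":           # route always
            return op[1]
    return None

# A toy table: match the window's low byte, else fall through to a catch-all.
table = [
    ("CMP", 0xFF, 0x42),
    ("RZ", 8),            # matched: route to target 8
    ("RA", 0),            # otherwise: catch-all target 0
]
assert run_table(table, 0x42) == 8
assert run_table(table, 0x17) == 0
```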
The current draft instruction set is shown below:
MAC Example
To match against an Ethernet MAC 86:33:24:17:bf:dd and send it to target "8", the table would have instructions:
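As a rough illustration of what that match amounts to, the 48-bit MAC can be checked as a masked compare against the 128-bit window. The bit positions here are assumptions for illustration (destination MAC placed in the window's top 48 bits), not the actual table encoding:

```python
# Hypothetical masked compare for MAC 86:33:24:17:bf:dd.
MAC = 0x863324_17BFDD
MASK = 0xFFFF_FFFF_FFFF << 80      # select the assumed destination-MAC bits
VALUE = MAC << 80
TARGET = 8

def matches(window):
    return (window & MASK) == VALUE

window = (MAC << 80) | 0x1234      # a packet addressed to that MAC
assert matches(window)
assert not matches(window ^ (1 << 127))   # flip one MAC bit: no match
```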
IP Example
To match against an IP packet and route subnetwork 192.168.3.X to target 3, the table would use two steps. The first would determine if the packet were (or were not) an IP packet:
The IP table would then have:
Routing to a second destination, such as 192.168.17.X, to target 1 would only take two further instructions:
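The overall two-stage decision described above can be summarized as follows. The field extraction is an illustrative assumption (ethertype and destination IP passed in separately, rather than masked out of the 128-bit window):

```python
# Hedged sketch of the two-stage IP routing decision described above.
def route(ethertype, dst_ip):
    if ethertype != 0x0800:            # stage 1: is it an IPv4 packet at all?
        return 0                       # no: catch-all target 0
    if (dst_ip >> 8) == 0xC0A803:      # stage 2a: 192.168.3.X -> target 3
        return 3
    if (dst_ip >> 8) == 0xC0A811:      # stage 2b: 192.168.17.X -> target 1
        return 1
    return 0                           # everything else to target 0

assert route(0x0800, 0xC0A80305) == 3     # 192.168.3.5
assert route(0x0800, 0xC0A8112A) == 1     # 192.168.17.42
assert route(0x86DD, 0xC0A80305) == 0     # non-IPv4 frame falls through
```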
All other packets might then be routed to target 0.
RA 0. (Route Always to target zero)
In these examples, the "targets" will be defined with local meaning. For example, target zero might be set up as a catch all to be sent to a CPU for inspection, and potentially updating the routing table.
Performance monitoring
Performance counters are maintained within the table. Upon hitting a performance counter instruction, the performance counter will be re-read and atomically incremented. How this happens will be dependent upon the bus in use. For example, on WB, the Wishbone cycle line will be held high while the performance counter is read, one is added to it, and the counter is written back to memory. With AXI, an exclusive access operation can be used to guarantee an atomic increment. This helps to ensure that both CPU and router, and potentially other concurrent routers, will always have a consistent view of the performance counters.
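The exclusive-access pattern can be modeled as a compare-and-swap retry loop. This is a sketch of the read-modify-write semantics only; `cas` stands in for the AXI exclusive read/write pair (on Wishbone, holding CYC high across the read-modify-write achieves the same effect directly):

```python
def cas(mem, addr, expected, new):
    """Model of a compare-and-swap primitive: write only if unchanged."""
    if mem[addr] == expected:
        mem[addr] = new
        return True
    return False

def atomic_increment(mem, addr):
    """Retry until our exclusive write succeeds without interference."""
    while True:
        old = mem[addr]                      # exclusive read
        if cas(mem, addr, old, old + 1):     # exclusive write; retry on loss
            return old + 1

mem = {0x40: 7}
assert atomic_increment(mem, 0x40) == 8
assert mem[0x40] == 8
```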
Potential issues
The most obvious "issue" with this router is that it is memory based. Memory will need to be read to determine the optimal routing. While this is taking place, the packet will be forced to "stall" and wait for routing to complete. As long as packets are stored in virtual FIFOs (as they are within this design), such stalls shouldn't cause packets to drop, but they will delay packets moving through the system.
Because this approach is designed to look and feel like an instruction stream, a standard CPU instruction fetch may be used to feed it. This opens up the possibility of repurposing a pre-built cache.
As with any CPU, branches will cause pipeline stalls and hence delays.
There is no requirement in this proposal that any specific type of memory be used. The memory used could therefore come from on-chip memory, or equivalently an off-chip SDRAM type of memory, with expected performance advantages and disadvantages naturally following.
Conclusions
This table-based "instruction-set" solution solves several problems with the current approach, and so I offer it here for anyone to review and/or discuss.
Dan