
Routing tables: A proposal #4

Open
ZipCPU opened this issue Nov 12, 2024 · 14 comments

@ZipCPU
Owner

ZipCPU commented Nov 12, 2024

While the current routing algorithm used by the eth10g controller is blazing fast, it is not very configurable. Worse, it requires a lot of logic to manage. Further, the current routing method will only work on Ethernet headers, not on IPv6 or IPv4 addressing, nor on IP/UDP or IP/TCP protocol information (ports). For this reason, a highly configurable memory-based routing table is of interest.

Many routing methods exist in the literature, often parameterized by how the table is structured. A user-configurable structure, which can be optimized in many ways, might therefore be valuable.

I therefore propose the following routing table approach. It's based upon 32-bit "instruction" words, which will feed a small FSM-based router.

The router shall maintain three registers: a "program counter", controlling where the next table word is read from; a "shift register", containing a 128b window into the current packet; and a 128b accumulator, containing the results of comparing table values against that window. Once comparisons are complete, the router can either 1) jump to a new instruction on a match, 2) jump to a new instruction on a failed match, or 3) route to a given address a) always, b) on a match, or c) on a failed match.
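To make the intended FSM concrete, here is a minimal Python model of these semantics. It is a sketch under several assumptions: instructions are modeled as tuples rather than the 32-bit encoding, fields are selected as window bits [8*shift +: 24] (the convention the IP example below states explicitly), the window's alignment relative to the packet start is a guess, and jump targets are table indices rather than memory addresses.

```python
MASK24 = 0xFFFFFF   # the examples below use 24b immediate fields

def route(table, packet):
    """Run a routing table (a list of decoded instructions) over a packet
    (bytes); return the chosen target, or None if the table runs out."""
    pc = 0       # "program counter": index of the next table word
    offset = 0   # bytes already STEPped past
    acc = 0      # accumulator; zero means "all comparisons matched"
    while pc < len(table):
        op, arg = table[pc]
        pc += 1
        # 128b window into the packet at the current offset
        window = int.from_bytes(packet[offset:offset + 16].ljust(16, b"\0"), "big")
        if op == "CMP":        # begin a new comparison
            shift, imm = arg
            acc = ((window >> (8 * shift)) & MASK24) ^ imm
        elif op == "OR":       # fold another field into the comparison
            shift, imm = arg
            acc |= ((window >> (8 * shift)) & MASK24) ^ imm
        elif op == "MASK":     # discard bits of the prior comparison
            _shift, imm = arg  # assumed to mirror the prior CMP's shift
            acc &= imm
        elif op == "STEP":     # shift bytes through, advancing the window
            offset += arg
        elif op == "JZ" and acc == 0:    # jump on match
            pc = arg
        elif op == "JNZ" and acc != 0:   # jump on mismatch
            pc = arg
        elif op == "RZ" and acc == 0:    # route on match
            return arg
        elif op == "RNZ" and acc != 0:   # route on mismatch
            return arg
        elif op == "RA":                 # route always
            return arg
    return None
```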

The current draft instruction set is shown below:

[Image: tblrouter — draft instruction-set table]

MAC Example

To match against an Ethernet MAC 86:33:24:17:bf:dd and send it to target "8", the table would have instructions:

CMP (shift=12, IMM=0x863324) -- match against the first 24 bits
OR (shift=9, IMM=0x17bfdd) -- continue matching the second set of 24 bits
RZ 8 -- on "zero" (i.e. a match), route this packet to target "8"
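As a quick sanity check of the two immediates (a hypothetical helper, not part of the proposal), the 48b MAC splits into its 24b halves like so:

```python
mac = 0x863324_17bfdd              # 86:33:24:17:bf:dd as a 48b integer
assert mac >> 24 == 0x863324       # CMP immediate: first 24 bits
assert mac & 0xFFFFFF == 0x17bfdd  # OR immediate: second 24 bits
```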

IP Example

To match against an IP packet and route subnetwork 192.168.3.X to target 3, the table would use two steps. The first would determine whether or not the packet is an IP packet:

CMP (shift=2, IMM=0x0800)
MASK (shift=2, IMM=0x00ffff) -- make sure bits [23:16] of the prior command don't affect the comparison
JZ (ip_table) -- to jump to an IPv4 processing table

The IP table would then have:

STEP 14 (Shift 14 bytes through, bypassing the rest of the Ethernet header)
CMP (shift=1, IMM=192.168.3) (Compare the IPv4 destination, bits [8 +: 24], against the given address)
RZ 3 (Route on a successful comparison to target "3")

Routing a second destination, such as 192.168.17.X, to target 1 would take only two further instructions:

CMP (shift=1, IMM=192.168.17)  (Compare bits[8 +: 24] against the given destination address)
RZ 1 (Route on zero to target 1)

All other packets might then be routed to target 0.

RA 0 (Route always to target zero)

In these examples, the "targets" are defined with local meaning. For example, target zero might be set up as a catch-all to be sent to a CPU for inspection, and potentially for updating the routing table.
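Taken together, the example tables above amount to the following plain-Python equivalent, assuming a raw, untagged Ethernet frame (an illustration of intent, not the hardware):

```python
def lookup(pkt):
    """pkt: raw Ethernet frame (bytes). EtherType sits at bytes 12-13;
    the IPv4 destination address sits at frame bytes 30-33."""
    if int.from_bytes(pkt[12:14], "big") == 0x0800:  # CMP/MASK/JZ: IPv4?
        dst = pkt[30:34]                             # reached after STEP 14
        if dst[:3] == bytes([192, 168, 3]):          # CMP 192.168.3 ...
            return 3                                 # ... RZ 3
        if dst[:3] == bytes([192, 168, 17]):         # CMP 192.168.17 ...
            return 1                                 # ... RZ 1
    return 0                                         # RA 0 (catch-all)
```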

Performance monitoring

Performance counters are maintained within the table. Upon hitting a performance counter instruction, that instruction will be re-read and atomically incremented. How this happens will depend upon the bus in use. On WB, for example, the Wishbone cycle line will be held high while the performance counter is read, one is added to it, and the counter is written back to memory. With AXI, an exclusive access operation can be used to guarantee an atomic increment. This helps to ensure that the CPU and router, and potentially other concurrent routers, will always have a consistent view of the performance counters.
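As a sketch of the intended atomicity, an AXI exclusive-access increment would take the retry-loop form below; the bus primitives are hypothetical stand-ins (on Wishbone, the same effect comes from holding CYC through the read-modify-write):

```python
def atomic_increment(bus, addr):
    while True:
        count = bus.load_exclusive(addr)          # read; set a reservation
        if bus.store_exclusive(addr, count + 1):  # fails if another master wrote
            return count + 1
        # the CPU or a second router intervened: retry the increment
```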

Potential issues

The most obvious "issue" with this router is that it is memory based. Memory will need to be read to determine the optimal routing. While this is taking place, the packet will be forced to "stall" and wait for routing to complete. As long as packets are stored in virtual FIFOs (as they are within this design), such stalls shouldn't cause packets to drop, but they will delay packets moving through the system.

Because this approach is designed to look and feel like an instruction stream, a standard CPU instruction fetch may be used to feed it. This opens up the possibility of repurposing a pre-built cache.

As with any CPU, branches will cause pipeline stalls and hence delays.

There is no requirement in this proposal that any specific type of memory be used. The memory could therefore come from on-chip memory or, equivalently, from an off-chip SDRAM type of memory, with the expected performance advantages and disadvantages naturally following.

Conclusions

This table-based "instruction-set" solution solves several problems with the current approach, and so I offer it here for anyone to review and/or discuss.

Dan

@ZipCPU ZipCPU added the enhancement New feature or request label Nov 12, 2024
@ZipCPU ZipCPU self-assigned this Nov 12, 2024
@ZipCPU
Owner Author

ZipCPU commented Nov 13, 2024

One drawback of this approach is that it doesn't directly allow address modification. This is fine for switches, which don't need to modify network addresses, but not so great for routers between networks. If a packet comes into the router from IP1 on network 1, destined for IP2 on network 2, then the network addresses should be changed as the packet crosses from network 1 to network 2. Specifically, the link-layer addresses (i.e., Ethernet MACs) from the prior link need to be adjusted to those of the new link, where the source would now be the router's MAC address, and the destination would be that of IP2 (assuming it exists on network 2).
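For illustration, the rewrite in question might look like the following sketch; the names are hypothetical, and the next-hop MAC would come from an ARP or neighbor lookup of IP2 on network 2:

```python
def rewrite_ethernet(pkt, router_mac, next_hop_mac):
    """pkt: raw Ethernet frame (bytes); both MACs are 6-byte addresses.
    The Ethernet header holds the destination MAC in bytes 0-5 and the
    source MAC in bytes 6-11."""
    return next_hop_mac + router_mac + pkt[12:]
```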

While it would be possible to map the "target" field to control a set of address modifiers and thus "correct" this lack, such packet modifications would need to be external to this routing algorithm at present.

A second drawback is found in the "Increment" instruction. For one, it scatters the performance metrics all over the table's memory area. This could be "fixed" by replacing the immediate field in the increment instruction with a performance table address. Likewise, there's no condition on the increment instruction--such as "if Z, increment." Hence, it would be difficult to capture metrics from the RZ and RNZ instructions separately from any metrics available for the RA instruction. Fixing this would require two more variants of the increment instruction.

@chili-chips-ba

Assuming that this is a typo
[Image: excerpt of the draft instruction table, showing a 30-bit width]
and that the processor is not using 30-bit, but rather 32-bit instructions, we wonder why only 32 bits, given that the accumulator is 128 bits. With a 128-bit VLIW, the processor would be able to load the complete address, along with the shift value, all in one cycle.

We may also consider a bit in this long instruction word for controlling RxFIFO advancement. Consider that the RxFIFO may come in different form factors, such as 32 or 64 bits, where some of the address shifts happen not in space but in time, by advancing the RxFIFO.

Also consider expanding the instruction set so that this processor can transfer the complete packet from the RxFIFO to the calculated destination, that is, serve as a mini DMA.

What's the vision for the compiler framework of this network processor? Do you see it programmed manually, in machine code, or in some kind of assembly, or in a new custom DSL, or in one of the already existing open-source DSLs with a networking focus?

@ZipCPU
Owner Author

ZipCPU commented Nov 14, 2024

  1. Yes, thank you, that was a typo. All instructions in my vision are 32b. The JZ and JNZ instructions therefore support 30b offsets into memory. This effectively allows the router to access up to 1GB of memory. If we forced all accesses to be aligned (a reasonable assumption), this might allow the router access to a full 4GB of memory (see the short check below this list).
  2. Why not 128b VLIW instructions? Well, first, I hadn't thought about going that wide. Thinking about it now, I would note: A) 1GbE only receives and processes 8b at a time. 128b VLIW instructions would be overkill for such an approach, and would therefore limit this solution to wider pipelines. B) Yes, this particular Eth10G design is set up for 10GbE, and therefore processes either 64b/clock at 156MHz, 128b/clock at 78MHz, or 512b/clock at the memory rate of 100MHz. I see this as working on the memory bus clock, so it would be receiving and processing either 512b every five clocks or so, or 128b/clock, depending on which side of the width adjuster it got placed on. C) Yes, a buffer/FIFO would be necessary. D) A 128b VLIW wouldn't be as useful as it sounds, since you still wouldn't be able to match all 128b, and since you'd need a masking flag. Hence, you might only match 96b with a 128b VLIW. Perhaps more interesting would be a variable-length instruction word, allowing an 8b prefix followed by an immediate of up to 128b. Whether or not this would be "better" would still be up for analysis. E) Many architectures I've worked with don't have 128b paths to memory. This particular architecture has a 512b path, so ... it is a bit unusual in that respect.
  3. This router would work in tandem with the RxFIFO. If the RxFIFO fed into the router, then the RxFIFO would nominally advance on every STEP instruction, and packets would be flushed through the system on either a ROUTE (RZ, RNZ, RA) or an illegal instruction (ILL).
  4. I don't really see a need for a mini-DMA here. As I just mentioned, the ROUTE and ILLegal instructions already serve the purpose of advancing the entire packet through the system.
  5. As for the compiler, I foresee a couple of implementations. A) Initially, the tool set would consist of a basic assembler--such as one might write in an afternoon with FLEX. This assembler would also have a disassembly option. This initial tool would serve for initial bring-up and subsequent debugging--at least to the point where you knew the router worked. B) After working with the router for a while, I foresee the creation of a software library that can write and compile instructions from a higher-level language. C) Eventually, I'd expect the software library to translate tables in other formats into this machine language. D) The language proposed here is not really meant for the user, but rather for the router. As such, it fills the role of a machine language: nobody would write in this language directly. An assembler would be available for debugging. Eventually, other higher-level languages would do all the work.
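Regarding item 1, the address-range arithmetic checks out:

```python
GiB = 1 << 30
print(2**30 // GiB, "GiB reachable with 30b byte offsets")              # -> 1
print(2**30 * 4 // GiB, "GiB reachable with 30b word-aligned offsets")  # -> 4
```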

As I see this in a block diagram sense, it would look something like the following:

[Image: 20241114-prorouter-blockd — proposed router block diagram]

Packets would come in using the AXIN protocol. A small buffer would exist within the router to hold the packet header until the destination target had been determined. If this buffer ever overflowed before the target was known, the packet would be discarded and an error condition would be generated--in much the same way as with illegal instructions.
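A minimal sketch of that buffering policy, with a hypothetical name and depth:

```python
from collections import deque

class HeaderBuffer:
    """Holds beats of the incoming AXIN stream until the routing FSM
    names a target; reports overflow so the packet can be discarded."""
    def __init__(self, depth_beats=64):
        self.depth = depth_beats
        self.beats = deque()

    def push(self, beat):
        if len(self.beats) >= self.depth:
            return False      # overflow: drop the packet, flag an error
        self.beats.append(beat)
        return True

    def drain(self):          # target known: release the buffered beats
        while self.beats:
            yield self.beats.popleft()
```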

Dan

@chili-chips-ba

"... 1GbE only receives and processes 8b at a time. 128b VLIW instructions would be overkill for such an approach..."

Our Wireguard-1GE is actually, for multiple reasons, using 128 bits in the Data Plane Engine (DPE) core.


We have in the past used tables of search keys as wide as 1K bits, so that multiple addresses can be compared in the same clock cycle.

The search key capacity and the time to complete a lookup are critical parameters of a real-life networking product. For that reason we have also always used BRAM for search key storage. Reaching out to an external DRAM is typically too expensive performance-wise when one has to match against hundreds of keys within the time allotment of a minimum Ethernet packet or, even worse, an ATM cell.
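For a sense of scale, that time allotment works out as follows, counting a 64B minimum frame plus its 8B preamble and 12B inter-frame gap:

```python
wire_bytes = 64 + 8 + 12   # minimum frame + preamble + inter-frame gap
for rate_gbps in (1, 10):
    print(f"{rate_gbps} GbE: {wire_bytes * 8 / rate_gbps:.1f} ns per minimum packet")
# -> 1 GbE: 672.0 ns per minimum packet
# -> 10 GbE: 67.2 ns per minimum packet
```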


A good state diagram for this processor would be worth a thousand words 😉

@chili-chips-ba

chili-chips-ba commented Nov 16, 2024

How about a "SORT" instruction?

Many search algorithms count on a certain table organization, and typically must reorganize the table whenever an entry is inserted or deleted. Here is one of them.

@ZipCPU
Owner Author

ZipCPU commented Nov 16, 2024

"SORT" is a rather complicated operation for hardware, but much easier in software. It makes sense that the table might require some form or kind of sorting within it, but it also makes more sense that said sorting be done before the table is made active to the hardware.

@chili-chips-ba

chili-chips-ba commented Nov 16, 2024

That's exactly why we are rooting for a Harvard architecture for this Search-and-Route (SAR) processor!

By clearly separating the Data space from the Instruction space, the design would facilitate co-processing scenarios, where a standard CPU would occasionally come in to make the life of the HW FSM easier.

These interventions would likely be for the following reasons:

  1. add or delete a Data Table entry
  2. sort the Data Table afterwards, per the requirements of the SEARCH algorithm used by the HW FSM of the SAR Proc.

Then, we could take the Instructions to a high level that even forgoes the need for a compiler... such as by promoting them to

Commands:

  • Reset
  • Run / Pause_N

Config settings:

  • Table_Depth
  • (optional) Net_Protocol (MAC, IP, VLAN, ...)
  • (optional) Search_Algo (LPM, BBT, ...)

and Status:

  • Ready / Busy_N

This setup is more amenable to co-processing arrangements, effectively transforming the SAR Proc into a SAR CoProc, or Accelerator.

That would arguably lead to a better-performing system, doing away with instruction fetches, caches, bug catches, and other degrading complications that come with programmability at the machine level.

This also plays nicely with the FPGA's own kind of programmability, which can be exploited as an alternative to a SAR Proc program, or can reduce the SAR CoProc config options to only Table_Depth. Everything else can be a different FPGA program, compiled from a pin-compatible, plug-and-play RTL module that the user can swap in or out of his SOC by linking in the corresponding file.


We should also define the capacity and performance objectives for the Search part of the SAR CoProc. While commercial implementations accommodate thousands of entries, we could start with a modest 1K at 10GE arrival rates...

@ZipCPU
Owner Author

ZipCPU commented Nov 17, 2024

I'm not sure I follow. Can you slow down, back up, and explain your thoughts please?

Thank you.

@chili-chips-ba

chili-chips-ba commented Nov 17, 2024

Rather than designing a specialized processor (a Master), look at this design task through the lens of a co-processor (an intelligent and capable peripheral, but still a Slave).

Forget about instruction fetches, assembler, compiler, and all that jazz, and design it as an FSM that an external CPU can:

  1. minimally configure
  2. reset, start and stop

and which can:

  1. return "busy" to the CPU.

Make it operate directly on local BRAM-based tables that the external CPU can manage and sort in order to facilitate the searches.

The bulk of its programmability, that is, its adaptability to the given use case, comes from the softness of FPGA gates.

@chili-chips-ba

chili-chips-ba commented Nov 23, 2024

@eniokaljic, while the maximalist target capacity of Header Lookup Engine for Wireguard VPN apps in the 1Gbps performance bracket is 128 entries (see this), what would it be for an ordinary Ethernet Switch in the 10Gbps performance bracket?

Tip: WireGuard does not natively support DHCP. It is therefore primarily used in static configurations, that is, with a fixed assignment of peers.

That's why, contrary to an ordinary Ethernet Switch, a WireGuard VPN Switch does not require an especially high number of entries in its header lookup tables.

This ordinary-vs-VPN distinction aside, a 10GE Switch must also handle many more users than a 1GE one is exposed to. That alone creates another strong pressure point on the number of entries, that is, on the header-lookup capacity.

Moreover, the absolute time allowance for the header lookup is now smaller, as 10GE packets arrive faster than 1GE packets. The 10GE solution should therefore also be able to handle a long stream of the shortest possible Ethernet packets at close to wire speed. There is a limit to how much the VirtualFIFO can help, beyond which it will eventually overrun. What may come to the rescue is that, contrary to ATM, Ethernet is more tolerant of data loss. That, however, impacts the overall Switch performance, and would further degrade its already pathetic 3Gbps throughput...

@chili-chips-ba

A high level of programmability is typically not a trait of embedded systems. They tend to be specific to the problem at hand, especially so in the consumer space.

On the other hand, the proposed Routing Processor seems to be catering to the P4 (Programming Protocol-independent Packet Processing) DSL kind of use cases.

If that's indeed the goal, reviewing this reference might provide further guidance:

@eniokaljic

eniokaljic commented Nov 24, 2024

Indeed, the programmability of the lookup engine is imperative for a programmable data plane, because it increases the flexibility of the lookup implementation with a granularity that reaches the application level. SDN (Software-Defined Networking) advocates moving programmability from the data plane to the control plane, while P4 tries to keep programmability within the data plane for better performance, i.e., faster packet processing and lower communication overhead between the data and control planes. There are many approaches to this problem (survey), and some are based on instruction-driven processing of packet headers, like the one proposed in this project.

For a 10G Ethernet switch, that level of programmability is not necessary, and more attention should be paid to performance. A 10GbE interface achieves a peak packet rate of 14.88 million pps (calculations). If we assume that the 10GbE switch has at least 4 ports and is implemented in an architecture with a shared lookup engine, the expected lookup engine load is 59.5 million pps. This leaves the lookup engine with a time budget of 16.8 ns. As the number of switch ports increases, the problem is exacerbated. For this reason, such switches use an architecture with dedicated lookup engines for each line interface. However, this opens a new set of problems related to the caching and synchronization of lookup tables. Perhaps solutions based on TCAM (associative lookup) could be considered for this 10GbE switch.
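Those figures can be re-derived from the minimum frame size (64B) plus preamble (8B) and inter-frame gap (12B):

```python
wire_bits = (64 + 8 + 12) * 8   # one minimum packet's time on the wire, in bits
pps = 10e9 / wire_bits          # peak packet rate of a single 10GbE port
shared = 4 * pps                # 4 ports feeding one shared lookup engine
print(f"{pps / 1e6:.2f} Mpps/port, {shared / 1e6:.1f} Mpps shared, "
      f"{1e9 / shared:.1f} ns per lookup")
# -> 14.88 Mpps/port, 59.5 Mpps shared, 16.8 ns per lookup
```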

@chili-chips-ba

chili-chips-ba commented Nov 24, 2024

... @ZipCPU it is worth taking the time to read fully through the above two references. They provide both the big picture / context and the implementation options.

When it comes to the latter, target capacity and performance are the first things you ought to define. That will drive further design decisions.

As @eniokaljic noted:

  • more ports mean more trouble: with 3 ports being the bare minimum for a Switch, its current 4 ports are not impressive

  • raw performance comes first: 3Gbps for a nominally 10Gbps Switch is not acceptable.

@chili-chips-ba

P4 in action
