The bmv2 framework lets developers implement their own P4-programmable architecture as a software switch. The simple_switch architecture is the de facto architecture for most users, as it is roughly equivalent to the "abstract switch model" described in the P4_14 spec.
The P4_16 language has been designed such that there can be multiple architectures, e.g. one or more architectures for a switch device, one or more architectures for a NIC device, etc. The v1model architecture is included with the p4c compiler, and was designed to be identical to the P4_14 switch architecture, enabling straightforward auto-translation from P4_14 programs to P4_16 programs that use the v1model architecture. There are a few tiny differences between P4_14 and v1model called out below, primarily in the names of a few metadata fields.
The P4_16 language also now has a Portable Switch Architecture (PSA) defined in its own specification. As of January 2022, a partial implementation of the PSA architecture has been done, but it is not yet complete. It is being implemented in a separate executable program named `psa_switch`, separate from the `simple_switch` program described here.
This document aims to provide P4 programmers with important information about the simple_switch architecture.
The simple_switch process was developed first, and supports a Thrift-based control API that was custom developed for BMv2, and to our knowledge is not supported by other switches, unless they are based on simple_switch.
The simple_switch_grpc process was developed based on simple_switch,
and supports the P4Runtime API, as a server/switch.
simple_switch_grpc can optionally also support the simple_switch
Thrift API (see
here).
Enabling this option might be useful because the Thrift API supports some configuration changes that the P4Runtime API does not, e.g. the `set_queue_rate` command (among others).
Thus if you wish to use the P4Runtime API to control the configuration of the switch, you must use simple_switch_grpc.
See the following presentation for more details on how the P4Runtime API is implemented in simple_switch_grpc:
- "P4 Runtime Implementation in P4.org projects" by Antonin Bas on April 7, 2021
For a P4_16 program using the v1model architecture and including the file `v1model.p4`, all of the fields below are part of the struct with type `standard_metadata_t`.

For a P4_14 program, all of the fields below are part of the header named `standard_metadata`, which is predefined for you.
The following abbreviations are used to mark field names below, to indicate where they appear:
- sm14 - the field is defined in v1.0.4 of the P4_14 language specification, Section 6 titled "Standard Intrinsic Metadata".
- v1m - the field is defined in the include file
p4include/v1model.p4
of the p4c repository, intended to be included in P4_16 programs compiled for the v1model architecture.
Here are the fields:
- `ingress_port` (sm14, v1m) - For new packets, the number of the ingress port on which the packet arrived at the device. Read only.
- `packet_length` (sm14, v1m) - The length of the packet in bytes.
- `egress_spec` (sm14, v1m) - Can be assigned a value in ingress code to control which output port a packet will go to (a short usage sketch appears after this list). The P4_14 primitive `drop`, and the v1model primitive action `mark_to_drop`, have the side effect of assigning an implementation specific value DROP_PORT to this field (511 decimal for simple_switch by default, but this can be changed through the `--drop-port` target-specific command-line option), such that if `egress_spec` has that value at the end of ingress processing, the packet will be dropped and not stored in the packet buffer, nor sent to egress processing. See the "after-ingress pseudocode" for the relative priority of this vs. other possible packet operations at the end of ingress. If your P4 program assigns a value of DROP_PORT to `egress_spec`, it will still behave according to the "after-ingress pseudocode", even if you never call `mark_to_drop` (P4_16) or `drop` (P4_14). Note that if you never assign a value to this field, nor drop the packet, during ingress processing, its default initial value is 0, and the packet will be unicast to output port 0.
- `egress_port` (sm14, v1m) - Only intended to be accessed during egress processing, read only. While you can read its value during ingress processing, you should think of its value as uninitialized garbage during ingress. This field is assigned a predictable value just before the packet begins egress processing, equal to the output port that this packet is destined to (if it is not dropped during egress processing). There is no compile-time error or warning if you do assign to this field, but if you wish your P4 code to be easier to port to other programmable switches, do not ever assign it a value. See Appendix D.2 "No output port change during egress" of the PNA specification for some explanation of why.
- `egress_instance` (sm14) - Renamed `egress_rid` in simple_switch. See `egress_rid` below.
- `instance_type` (sm14, v1m) - Contains a value that can be read by your P4 code. In ingress code, the value can be used to distinguish whether the packet is newly arrived from a port (`NORMAL`), it was the result of a resubmit primitive action (`RESUBMIT`), or it was the result of a recirculate primitive action (`RECIRC`). In egress processing, it can be used to determine whether the packet was produced as the result of an ingress-to-egress clone primitive action (`INGRESS_CLONE`), an egress-to-egress clone primitive action (`EGRESS_CLONE`), multicast replication specified during ingress processing (`REPLICATION`), or none of those, i.e. a normal unicast packet from ingress (`NORMAL`). Until such time as similar constants are pre-defined for you, you may copy this list of constants into your code.
- `parser_status` (sm14) or `parser_error` (v1m) - `parser_status` is the name in the P4_14 language specification. It has been renamed to `parser_error` in v1model. The value 0 (sm14) or error.NoError (P4_16 + v1model) means no error. Otherwise, the value indicates what error occurred during parsing.
- `parser_error_location` (sm14) - Not present in v1model.p4, and not implemented in simple_switch.
- `checksum_error` (v1m) - Read only. 1 if a call to the `verify_checksum` primitive action finds a checksum error, otherwise 0. Calls to `verify_checksum` should be in the `VerifyChecksum` control in v1model, which is executed after the parser and before ingress.
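For illustration, here is a minimal sketch of a v1model ingress control that either assigns `egress_spec` or drops the packet. The `headers_t`, `metadata_t`, `hdr.ethernet`, and action/table names are assumptions made for this example, not definitions from `v1model.p4`.

```
control MyIngress(inout headers_t hdr,
                  inout metadata_t meta,
                  inout standard_metadata_t standard_metadata) {
    action set_egress_port(bit<9> port) {
        // Choose the output port for a normal unicast packet.
        standard_metadata.egress_spec = port;
    }
    action drop_packet() {
        // Sets egress_spec to the implementation-specific DROP_PORT value.
        mark_to_drop(standard_metadata);
    }
    table l2_fwd {
        key = { hdr.ethernet.dstAddr : exact; }
        actions = { set_egress_port; drop_packet; }
        default_action = drop_packet();
    }
    apply {
        l2_fwd.apply();
    }
}
```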
Each architecture usually defines its own intrinsic metadata fields, which are used in addition to the standard metadata fields to offer more advanced features. In the case of simple_switch, we have two separate intrinsic metadata headers. These headers are not strictly required by the architecture, as it is possible to write a P4 program and run it through simple_switch without them being defined. However, their presence is required to enable some features of simple_switch. For most of these fields, there is no strict requirement as to the bitwidth, but we recommend that you follow our suggestions below. Some of these intrinsic metadata fields can be accessed (read and/or written) directly; others should only be accessed through primitive actions.
For a P4_16 program using the v1model architecture and including the file `v1model.p4`, all of the fields below are part of the struct with type `standard_metadata_t`. There is no need to define your own struct type for these fields.
For a P4_14 program, we recommend that you define and instantiate this metadata header for every P4 program that you write for the simple_switch architecture, with the following code:
header_type intrinsic_metadata_t {
fields {
ingress_global_timestamp : 48;
egress_global_timestamp : 48;
mcast_grp : 16;
egress_rid : 16;
}
}
metadata intrinsic_metadata_t intrinsic_metadata;
- `ingress_global_timestamp`: a timestamp, in microseconds, set when the packet shows up on ingress. The clock is set to 0 every time the switch starts. This field can be read directly from either pipeline (ingress and egress) but should not be written to. See the section on BMv2 timestamp implementation notes for more details.
- `egress_global_timestamp`: a timestamp, in microseconds, set when the packet starts egress processing. The clock is the same as for `ingress_global_timestamp`. This field should only be read, from the egress pipeline, and should not be written to.
- `mcast_grp`: needed for the multicast feature. This field needs to be written in the ingress pipeline when you wish the packet to be multicast (a short usage sketch appears after this list). A value of 0 means no multicast, and calling `mark_to_drop` will set the value to 0. A nonzero value should be that of a valid multicast group configured through the BMv2 runtime interfaces. See the "after-ingress pseudocode" for the relative priority of this vs. other possible packet operations at the end of ingress.
- `egress_rid`: needed for the multicast feature. This field is only valid in the egress pipeline and is read only. It is used to uniquely identify multicast copies of the same ingress packet.
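The division of labor between the two multicast-related fields is easiest to see in a small sketch; the control names, `headers_t`, `metadata_t`, and the group number below are hypothetical choices for this example. Ingress assigns `mcast_grp`, and egress tells the replicas apart using `egress_rid`.

```
control MCastIngress(inout headers_t hdr,
                     inout metadata_t meta,
                     inout standard_metadata_t standard_metadata) {
    apply {
        // A nonzero mcast_grp asks the traffic manager to replicate the
        // packet to the (egress_port, egress_rid) pairs that the control
        // plane configured for multicast group 1.
        standard_metadata.mcast_grp = 1;
    }
}

control MCastEgress(inout headers_t hdr,
                    inout metadata_t meta,
                    inout standard_metadata_t standard_metadata) {
    apply {
        // Every replica begins egress processing with the egress_rid value
        // configured for it in the multicast group, so replicas can be
        // processed differently from one another.
        if (standard_metadata.egress_rid != 0) {
            // per-replica header rewrites would go here
        }
    }
}
```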
For a P4_16 program using the v1model architecture and including the file `v1model.p4`, all of the fields below are part of the struct with type `standard_metadata_t`. There is no need to define your own struct type for these fields.
For a P4_14 program, you only need to define this P4 header if you want to access queueing information - as a reminder, the packet is queued between the ingress and egress pipelines. Note that this header is "all or nothing": either you do not define it, or you define it with all of its fields (there are 5 of them). We recommend that you use the following P4_14 code:
header_type queueing_metadata_t {
fields {
enq_timestamp : 48;
enq_qdepth : 19;
deq_timedelta : 32;
deq_qdepth : 19;
qid : 8;
}
}
metadata queueing_metadata_t queueing_metadata;
Of course, all of these fields can only be accessed from the egress pipeline and they are read-only.
- `enq_timestamp`: a timestamp, in microseconds, set when the packet is first enqueued.
- `enq_qdepth`: the depth of the queue when the packet was first enqueued, in units of number of packets (not the total size of the packets).
- `deq_timedelta`: the time, in microseconds, that the packet spent in the queue.
- `deq_qdepth`: the depth of the queue when the packet was dequeued, in units of number of packets (not the total size of the packets).
- `qid`: when there are multiple queues servicing each egress port (e.g. when priority queueing is enabled), each queue is assigned a fixed unique id, which is written to this field. Otherwise, this field is set to 0. If priority queueing is enabled, the qid also indicates the priority level of each queue, starting at 0 for the lowest priority and going up to `number_of_priority_queues - 1` for the highest priority. The number of priority queues for each port can be set by adding the `--priority-queues` option when running `simple_switch`. (A usage sketch for the queueing fields appears after the note below.)
TBD: qid
is not currently part of type standard_metadata_t
in v1model.
Perhaps it should be added?
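As an example of how egress code might consume these fields, the sketch below copies some of them into a telemetry header. The `queue_report_t` header and the surrounding names are hypothetical, chosen only to match the field widths listed above.

```
// Hypothetical telemetry header; widths match the queueing metadata fields.
header queue_report_t {
    bit<32> deq_timedelta;   // time spent in the queue, in microseconds
    bit<19> enq_qdepth;      // queue depth at enqueue time, in packets
    bit<19> deq_qdepth;      // queue depth at dequeue time, in packets
    bit<2>  pad;             // pad the header to a whole number of bytes
}

// Inside the egress control, assuming hdr.queue_report is a member of headers_t:
action report_queueing() {
    hdr.queue_report.setValid();
    hdr.queue_report.deq_timedelta = standard_metadata.deq_timedelta;
    hdr.queue_report.enq_qdepth    = standard_metadata.enq_qdepth;
    hdr.queue_report.deq_qdepth    = standard_metadata.deq_qdepth;
    hdr.queue_report.pad           = 0;
}
```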
We mostly support the standard P4_14 primitive actions. One difference is that
optional parameters are not supported in bmv2, so all parameters are always
required (see resubmit
for example).
The full list of primitives can be seen in this C++ source
file.
After-ingress pseudocode - the short version:
if (a clone primitive action was called) {
create clone(s) of the packet with details configured for the clone session
}
if (digest to generate) { // because your code called generate_digest
send a digest message to the control plane software
}
if (resubmit was called) {
start ingress processing over again for the original packet
} else if (mcast_grp != 0) { // because your code assigned a value to mcast_grp and
// did not call mark_to_drop afterwards
multicast the packet to the output port(s) configured for group mcast_grp
} else if (egress_spec == DROP_PORT) { // e.g. because your code called drop/mark_to_drop
Drop packet.
} else {
unicast the packet to the port equal to egress_spec
}
After-ingress pseudocode - for determining what happens to a packet after ingress processing is complete. The longer, more detailed version:
if (a clone primitive action was called) {
// This condition will be true if your code called the `clone` or
// `clone_preserving_field_list` extern function from a P4_16
// program, or the `clone_ingress_pkt_to_egress` primitive action
// in a P4_14 program, during ingress processing.
Create zero or more clones of the packet. The cloned packet(s)
will be enqueued in the packet buffer, destined for the egress
port(s) configured in the clone session whose numeric id was given
as the value of the `session` parameter when the last clone
primitive action was called.
Each cloned packet will later perform egress processing,
independently from whatever the original packet does next, and if
multiple cloned packets are created via the same clone operation,
independently from each other.
The contents of the cloned packet(s) will be the same as the
packet when it most recently began ingress processing, where the
clone operation was performed, without any modifications that may
have been made during the execution of that ingress code. (That
may not be the packet as originally received by the switch, if the
packet reached this occurrence of ingress processing via a
recirculate operation, for example.)
If it was a clone_preserving_field_list (P4_16) or
clone_ingress_pkt_to_egress (P4_14) action, also preserve the
final ingress values of the metadata fields specified in the field
list argument. The value of instance_type will be
PKT_INSTANCE_TYPE_INGRESS_CLONE for the cloned packet(s).
Each cloned packet will be processed by your parser code again.
In many cases this will result in exactly the same headers being
parsed as when the packet was most recently parsed, but if your
parser code uses the value of standard_metadata.instance_type to
affect its behavior, it could be different.
After this parsing is done, each clone will continue processing at
the beginning of your egress code.
// fall through to code below
}
if (digest to generate) {
// This condition will be true if your code called the
// generate_digest primitive action during ingress processing.
Send a digest message to the control plane that contains the
values of the fields in the specified field list.
// fall through to code below
}
if (resubmit was called) {
// This condition will be true if your code called the
// `resubmit_preserving_field_list` extern function from a P4_16
// program, or the `resubmit` primitive from a P4_14 program,
// during ingress processing.
Start ingress over again for this packet, with its original
unmodified packet contents and metadata values. Preserve the
final ingress values of any user-defined metadata fields
determined by the field list given as an argument to the last
resubmit operation called. The resubmitted packet will have
instance_type equal to PKT_INSTANCE_TYPE_RESUBMIT.
} else if (mcast_grp != 0) {
// This condition will be true if your code made an assignment to
// `standard_metadata.mcast_grp` during ingress processing AND did
// not call `mark_to_drop` afterwards.
// There are no special primitive actions built in to simple_switch
// for you to call to do this -- use a normal P4_16 assignment
// statement, or P4_14 `modify_field()` primitive action.
Make 0 or more copies of the packet (including all metadata) based upon the list of
(egress_port, egress_rid) values configured by the control plane
for the mcast_grp value. Enqueue each one in the appropriate
packet buffer queue. The instance_type of each will be
PKT_INSTANCE_TYPE_REPLICATION.
} else if (egress_spec == DROP_PORT) {
// This condition will be true if your code called the
// mark_to_drop (P4_16) or drop (P4_14) primitive action during
// ingress processing.
Drop packet.
} else {
Enqueue one copy of the packet destined for egress_port equal to
egress_spec.
}
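As a concrete illustration of the clone branch above, here is a hedged sketch of an ingress action that requests an ingress-to-egress clone. The action name and session number are arbitrary choices for this example, and the control plane must configure clone session 5 before any clones are actually produced.

```
// Inside the ingress control:
action mirror_packet() {
    // Request an ingress-to-egress clone using clone session 5.  The clone
    // is created at the end of ingress, with the packet contents as they
    // were when ingress began, and it starts egress processing with
    // instance_type == PKT_INSTANCE_TYPE_INGRESS_CLONE.
    clone(CloneType.I2E, 5);
}
```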
After-egress pseudocode - the short version:
if (a clone primitive action was called) {
create clone(s) of the packet with details configured for the clone session
}
if (egress_spec == DROP_PORT) { // e.g. because your code called drop/mark_to_drop
Drop packet.
} else if (recirculate was called) {
after deparsing, deparsed packet starts over again as input to parser
} else {
Send the packet to the port in egress_port.
}
After-egress pseudocode - for determining what happens to a packet after egress processing is complete. The longer, more detailed version:
if (a clone primitive action was called) {
// This condition will be true if your code called the `clone` or
// `clone_preserving_field_list` extern function from a P4_16
// program, or the `clone_egress_pkt_to_egress` primitive action
// in a P4_14 program, during egress processing.
Create zero or more clones of the packet. The cloned packet(s)
will be enqueued in the packet buffer, destined for the egress
port(s) configured in the clone session whose numeric id was given
as the value of the `session` parameter when the last clone
primitive action was called.
Each cloned packet will later perform egress processing,
independently from whatever the original packet does next, and if
multiple cloned packets are created via the same clone operation,
independently from each other.
The contents of the cloned packet(s) will be as they are at the
end of egress processing, including any changes made to the values
of fields in headers, and whether headers were made valid or
invalid. Your deparser code will _not_ be executed for
egress-to-egress cloned packets, nor will your parser code be
executed for them.
If it was a clone_preserving_field_list (P4_16) or
clone_egress_pkt_to_egress (P4_14) action, also preserve the final
egress values of the metadata fields specified in the field list
argument. The value of instance_type will be
PKT_INSTANCE_TYPE_EGRESS_CLONE for the cloned packet(s). Each
cloned packet will have its standard_metadata fields
initialized in the same way that all packets that begin egress
processing do, e.g. egress_port, egress_spec,
egress_global_timestamp, etc.
Each cloned packet will continue processing at the beginning of
your egress code.
// fall through to code below
}
if (egress_spec == DROP_PORT) {
// This condition will be true if your code called the
// mark_to_drop (P4_16) or drop (P4_14) primitive action during
// egress processing.
Drop packet.
} else if (recirculate was called) {
// This condition will be true if your code called the
// `recirculate_preserving_field_list` extern function from a
// P4_16 program, or the `recirculate` primitive action from a
// P4_14 program, during egress processing.
Start processing the packet over again. That is, take the packet
as constructed by the deparser, with any modifications made to the
packet during both ingress and egress processing, and give that
packet as input to the parser. Preserve the final egress values
of any fields whose field list id is given as an argument to the
last recirculate primitive action called. The instance_type of
the recirculated packet will be PKT_INSTANCE_TYPE_RECIRC.
} else {
Send the packet to the port in egress_port. Since egress_port is
read only during egress processing, note that its value must have
been determined during ingress processing for normal packets. One
exception is that a packet cloned by a clone primitive action executed
during egress processing will have its egress_port determined by the port
that the control plane configured for that clone session.
}
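For the recirculate branch above, a sketch like the following shows the typical pattern in egress code. The `my_tunnel` header, the `recirc_count` field used as a loop guard, and the field list index are hypothetical choices for this example.

```
// Hypothetical user metadata; only fields annotated with @field_list(2)
// survive the recirculation.
struct metadata_t {
    @field_list(2)
    bit<8> recirc_count;
    bit<8> scratch;          // not preserved
}

control RecircEgress(inout headers_t hdr,
                     inout metadata_t meta,
                     inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.my_tunnel.isValid() && meta.recirc_count < 3) {
            meta.recirc_count = meta.recirc_count + 1;
            // After deparsing, the packet re-enters the parser with
            // instance_type == PKT_INSTANCE_TYPE_RECIRC and with the
            // final egress value of recirc_count preserved.
            recirculate_preserving_field_list(2);
        }
    }
}
```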
simple_switch
supports table key fields with any of the following match_kind
values:
- `exact` - from the P4_16 language specification
- `lpm` - from the P4_16 language specification
- `ternary` - from the P4_16 language specification
- `optional` - defined in `v1model.p4`
- `range` - defined in `v1model.p4`
- `selector` - defined in `v1model.p4`
selector
is only supported for tables with an action selector
implementation.
If a table has more than one `lpm` key field, it is rejected by the p4c BMv2 back end. This could be generalized slightly, as described below, but that restriction is in place as of the January 2022 version of p4c.
If a table has at least one range
field, it is implemented internally as a
range
table in BMv2. Because a single search key could match multiple entries,
every entry must be assigned a numeric priority by the control plane software
when it is installed. If multiple installed table entries match the same search
key, the one among them with the maximum numeric priority will "win", and its action
will be performed.
Note that the winner is the one with the maximum numeric priority value if you use the P4Runtime API to specify the numeric priorities. Check the documentation of your control plane API if you use a different one, as some might choose the convention that minimum numeric priority values win over larger ones.
A range
table may have an lpm
field. If so, the prefix length is used to
determine whether a search key matches the entry, but the prefix length does
not determine the relative priority among multiple matching table
entries. Only the numeric priority supplied by the control plane software
determines that. Because of this, it would be reasonable for a range
table to
support multiple lpm
key fields, but as of January 2022 this is not supported.
If a range table has entries defined via a `const entries` table property, then the relative priorities of the entries go from highest priority first to lowest priority last, based upon the order in which they appear in the P4 program.
If a table has no range
field, but at least one ternary
or optional
field,
it is implemented internally as a ternary
table in BMv2. As for range
tables, a single search key can be matched by multiple table entries, and thus
every entry must have a numeric priority assigned by the control plane
software. The same note about `lpm` fields described above for `range` tables also applies to `ternary` tables, as does the note about entries specified via `const entries`.
If a table has no `range`, `ternary`, or `optional` fields, but at least one `lpm` field, there must be exactly one `lpm` field. There can be 0 or more `exact` fields. While there can be multiple installed table entries that match a
single search key, with these restrictions there can be at most one matching
table entry of each possible prefix length of the lpm
field (because no two
table entries installed at the same time are allowed to have the same search
key). The matching entry with the longest prefix length is always the
winner. The control plane cannot specify a priority when installing entries for
such a table -- it is always determined by the prefix length.
If a longest prefix match table has entries defined via a `const entries` table property, then the relative priorities of the entries are determined by the prefix lengths, not by the order in which they appear in the P4 program.
If a table has only exact
fields, it is implemented internally as an exact
table in BMv2. For any search key, there can be at most one matching table
entry, because duplicate search keys are not allowed to be installed. Thus no
numeric priority is ever needed to determine the "winning" matching table entry.
BMv2 (and many other P4 implementations) implements the match portion of such a
table's functionality using a hash table.
If an exact match table has entries defined via a const entries
table
property, there can be at most one matching entry for any search key, so the
relative order that entries appear in the P4 program is unimportant in
determining which entry will win.
The table below shows, for each match_kind
value of P4_16 table key fields,
which syntax is allowed for specifying the set of matching field values for a
table entry in a const entries
list.
A restriction on all allowed cases is that `lo`, `hi`, `val`, and `mask` must each be a legal value for the field, i.e. not outside of that field's range of values. All of them can be arithmetic expressions containing compile time constant values.
|                  | `range`      | `ternary`    | `optional`   | `lpm`        | `exact` |
| ---------------- | ------------ | ------------ | ------------ | ------------ | ------- |
| `lo .. hi`       | yes (Note 1) | no           | no           | no           | no      |
| `val &&& mask`   | no           | yes (Note 2) | no           | yes (Note 3) | no      |
| `val`            | yes (Note 4) | yes (Note 5) | yes          | yes (Note 5) | yes     |
| `_` or `default` | yes (Note 6) | yes (Note 6) | yes (Note 6) | yes (Note 6) | no      |
Note 1: Restriction: `lo <= hi`. A run time search key value `k` matches if `lo <= k <= hi`.

Note 2: Restriction: `val == (val & mask)`. Bit positions of `mask` equal to 1 are exact match bit positions, and bit positions where `mask` is 0 are "wild card" or don't-care bit positions. A run time search key value `k` matches if `(k & mask) == (val & mask)`.
Note 3: Restriction: `val == (val & mask)` and `mask` is a "prefix mask", i.e. it has all of its 1 bit positions consecutive and in the most significant bit positions of the field. WARNING: If you attempt to specify a prefix as `val / prefix_length` in a P4_16 program (the syntax used by some command line interfaces for specifying a prefix, such as `simple_switch_CLI`), that is actually an arithmetic expression that divides `val` by `prefix_length`, and thus falls into the `val` case, which is exact match. There is no warning from the compiler, because it is perfectly legal syntax for a division operation.
Note 4: Equivalent to the range `val .. val`, so it behaves as exact match on `val`.

Note 5: Equivalent to `val &&& mask` where `mask` is 1 in all bit positions of the field, so it behaves as exact match on `val`.
Note 6: Matches any allowed value for the field. Equivalent to `min_possible_field_value .. max_possible_field_value` for `range` fields, or `0 &&& 0` for `ternary` and `lpm` fields.
Below is a portion of a P4_16 program that demonstrates most of the allowed
combinations of match_kind
and syntax for specifying matching sets of values.
header h1_t {
bit<8> f1;
bit<8> f2;
}
struct headers_t {
h1_t h1;
}
// ... later ...
control ingress(inout headers_t hdr,
inout metadata_t m,
inout standard_metadata_t stdmeta)
{
action a(bit<9> x) { stdmeta.egress_spec = x; }
table t1 {
key = { hdr.h1.f1 : range; }
actions = { a; }
const entries = {
1 .. 8 : a(1);
6 .. 12 : a(2); // ranges are allowed to overlap between entries
15 .. 15 : a(3);
17 : a(4); // equivalent to 17 .. 17
// It is not required to have a "match anything" rule in a table,
// but it is allowed (except for exact match fields), and several of
// these examples have one.
_ : a(5);
}
}
table t2 {
key = { hdr.h1.f1 : ternary; }
actions = { a; }
// There is no requirement to specify ternary match criteria using
// hexadecimal values. I personally prefer it to make the mask bit
// positions more obvious.
const entries = {
0x04 &&& 0xfc : a(1);
0x40 &&& 0x72 : a(2);
0x50 &&& 0xff : a(3);
0xfe : a(4); // equivalent to 0xfe &&& 0xff
_ : a(5);
}
}
table t3 {
key = {
hdr.h1.f1 : optional;
hdr.h1.f2 : optional;
}
actions = { a; }
const entries = {
// Note that when there are two or more fields in the key of a
// table, const entries key select expressions must be surrounded by
// parentheses.
(47, 72) : a(1);
( _, 72) : a(2);
( _, 75) : a(3);
(49, _) : a(4);
_ : a(5);
}
}
table t4 {
key = { hdr.h1.f1 : lpm; }
actions = { a; }
const entries = {
0x04 &&& 0xfc : a(1);
0x40 &&& 0xf8 : a(2);
0x04 &&& 0xff : a(3);
0xf9 : a(4); // equivalent to 0xf9 &&& 0xff
_ : a(5);
}
}
table t5 {
key = { hdr.h1.f1 : exact; }
actions = { a; }
const entries = {
0x04 : a(1);
0x40 : a(2);
0x05 : a(3);
0xf9 : a(4);
}
}
// ... more code here ...
}
This section is up to date as long as you are using a version of p4c after this pull request was merged:
- p4lang/p4c#2902
- Merged with this commit on 2021-Dec-06: https://github.com/p4lang/p4c/commit/2be448f018e88095a13e8a68e1c7cc1641038cd8
After those changes, all recirculate, resubmit, and clone operations should work correctly, even if they specify user-defined metadata fields to be preserved.
An example use of the `@field_list(index_number)` annotation, required in order to cause user-defined metadata fields to be preserved with one of these operations, is given in the `v1model.p4` include file that is part of `p4c`. See the documentation of the `resubmit_preserving_field_list` extern function.
Note that the @field_list
annotation is only supported for
user-defined metadata fields. It is not supported for parsed packet
header fields, nor for standard metadata fields. If you wish to
preserve any of these other values, you should copy their values to
user-defined metadata fields that have the @field_list
annotation on
them.
Also note that there may be implementation-specific oddities in p4c
if you preserve user-defined metadata fields that are inside of
structs or headers that are nested one or more levels deep within the
user-defined metadata struct.
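The overall pattern looks roughly like the sketch below. The struct, field, and action names are invented for illustration; `v1model.p4` remains the authoritative reference for the annotation and the extern functions.

```
struct metadata_t {
    // Preserved when field list index 1 is passed to one of the
    // *_preserving_field_list extern functions.
    @field_list(1)
    bit<16> selected_vrf;
    // Not annotated, so NOT preserved for the resubmitted packet.
    bit<16> scratch;
}

// Inside the ingress control:
action resubmit_with_vrf(bit<16> vrf) {
    meta.selected_vrf = vrf;
    // The resubmitted packet restarts ingress with its original packet
    // contents, instance_type == PKT_INSTANCE_TYPE_RESUBMIT, and the
    // final ingress value of meta.selected_vrf preserved.
    resubmit_preserving_field_list(1);
}
```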
This section is an incomplete description of the v1model architecture for P4_16. Much of the earlier part of this document is also a description of the P4_16 v1model architecture, but in some cases it also serves to document the P4_14 behavior of `simple_switch`. This section is only for things specific to P4_16 plus the v1model architecture.
In some cases, the details documented here are for how the v1model architecture is implemented in BMv2 simple_switch. Such details are prefaced with the phrase "BMv2 v1model implementation". There may be other implementations of the v1model architecture that make different implementation choices or restrictions.
There is a type called `H` that is a parameter to the package `V1Switch` definition in `v1model.p4`, excerpted below:
package V1Switch<H, M>(Parser<H, M> p,
VerifyChecksum<H, M> vr,
Ingress<H, M> ig,
Egress<H, M> eg,
ComputeChecksum<H, M> ck,
Deparser<H> dep
);
This type `H` is restricted to be a struct that contains member fields of one of the following types, and no others:
- header
- header_union
- header stack
The P4_16 language specification version 1.1 does not support if
statements inside of parsers. Because the p4c
compiler in-lines
calls to functions at the location of the call, this restriction
extends to function bodies of functions called from a parser. There
are two potential workarounds:
- If the conditional execution can be done just as effectively by waiting to do it during ingress control processing, you can use `if` statements there.
- In some cases you can implement the desired conditional execution effect using `transition select` statements to branch to different parser states, each of which can have its own distinct code (see the sketch after this list).
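For example, instead of an `if` statement on a parsed field, the same decision can be expressed as a `select` on that field. The header field and state names below are hypothetical, and `parse_ipv4_options` is assumed to be defined elsewhere in the parser.

```
state parse_ipv4 {
    packet.extract(hdr.ipv4);
    // Equivalent in effect to branching on whether hdr.ipv4.ihl == 5.
    transition select(hdr.ipv4.ihl) {
        5       : accept;               // no IPv4 options present
        default : parse_ipv4_options;   // handle options in a separate state
    }
}
```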
In the v1model
architecture, a packet reaching the reject
parser
state, e.g. because it failed a verify
call, is not automatically
dropped. Such a packet will begin Ingress
processing with the value
of the parser_error
standard metadata field equal to the error that
occurred. Your P4 code can direct such packets to be dropped if you
wish, but you may also choose to write code that will do other things
with such packets, e.g. send a clone of them to a control CPU for
further analysis or error logging.
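A minimal sketch of one such policy, assuming the choice is simply to drop packets that had a parsing error, is to check `parser_error` at the top of ingress:

```
// At the beginning of the ingress control's apply block:
apply {
    if (standard_metadata.parser_error != error.NoError) {
        // Here the choice is simply to drop, but the packet could instead
        // be cloned to a CPU port for logging or further analysis.
        mark_to_drop(standard_metadata);
        exit;
    }
    // ... normal ingress processing continues here ...
}
```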
`p4c` plus `simple_switch` does not support transitioning to the `reject` state explicitly in the source code. The `reject` state can only be reached via a failing call to `verify`.
The VerifyChecksum
control is executed just after the Parser
completes, and just before the Ingress
control begins. p4c
plus
simple_switch
only support a sequence of calls to the
verify_checksum
or verify_checksum_with_payload
extern functions
inside such controls. See the v1model.p4
include file for their
definition. The first argument to these functions is a boolean, which
can be an arbitrary boolean condition used to make the checksum
calculation conditional on that expression being true.
In the v1model
architecture, a packet having an incorrect checksum
is not automatically dropped. Such a packet will begin Ingress
processing with the value of the checksum_error
standard metadata
field equal to 1. If your program has multiple calls to
verify_checksum
and/or verify_checksum_with_payload
, there is no
way supported to determine which of the calls was the one that found
an incorrect checksum. Your P4 code can direct such packets to be
dropped if you wish, but this will not automatically be done for you.
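As a hedged example, assuming an IPv4 header `hdr.ipv4` with the usual field names, a VerifyChecksum control might look like this:

```
control MyVerifyChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply {
        // The first argument makes the check conditional; here the
        // checksum is verified only if the IPv4 header is valid.
        verify_checksum(hdr.ipv4.isValid(),
            { hdr.ipv4.version, hdr.ipv4.ihl, hdr.ipv4.diffserv,
              hdr.ipv4.totalLen, hdr.ipv4.identification, hdr.ipv4.flags,
              hdr.ipv4.fragOffset, hdr.ipv4.ttl, hdr.ipv4.protocol,
              hdr.ipv4.srcAddr, hdr.ipv4.dstAddr },
            hdr.ipv4.hdrChecksum,
            HashAlgorithm.csum16);
    }
}
```

In ingress, `if (standard_metadata.checksum_error == 1) { mark_to_drop(standard_metadata); }` is one possible policy, but not the only one.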
simple_switch
does not support doing apply
on the same table more
than once for a single execution of the Ingress
control, nor for a
single execution of the Egress
control.
In some cases, you can work around this restriction by having multiple tables with similar definitions, and apply each of them once per execution of the Ingress control (or Egress control). If you want two such tables to contain the same set of table entries and actions, then you must write your control plane software to keep their contents the same, e.g. always add an entry to `T2` every time you add an entry to `T1`, etc.
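A sketch of the workaround, with hypothetical table, action, and header names, is shown below; the control plane must keep `t1` and `t1_copy` populated with the same entries.

```
control DoubleLookup(inout headers_t hdr,
                     inout metadata_t meta,
                     inout standard_metadata_t standard_metadata) {
    action set_nhop(bit<9> port) { standard_metadata.egress_spec = port; }
    action my_drop() { mark_to_drop(standard_metadata); }

    table t1 {
        key = { hdr.ipv4.dstAddr : lpm; }
        actions = { set_nhop; my_drop; }
        default_action = my_drop();
    }
    // Identical definition; its entries must be kept in sync with t1
    // by the control plane software.
    table t1_copy {
        key = { hdr.ipv4.dstAddr : lpm; }
        actions = { set_nhop; my_drop; }
        default_action = my_drop();
    }
    apply {
        t1.apply();
        // ... code that may change hdr.ipv4.dstAddr, e.g. decapsulation ...
        t1_copy.apply();
    }
}
```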
This restriction is one that could be generalized in simple_switch
,
but realize that some high speed ASIC implementations of P4 may also
impose this same restriction, because the restriction can be imposed
by the design of the hardware.
After 2024-Nov-08, when this change was made to the `p4c` BMv2 back end, arbitrary `if` statements, even `if` statements nested within other `if` statements, should be supported correctly.
Before that change was made to `p4c`, there were significant restrictions on the kinds of `if` statements that were supported within the bodies of a P4 `action` when compiling for BMv2. For example, the following `action` definition was not supported:
action foo() {
meta.b = meta.b + 5;
if (hdr.ethernet.etherType == 7) {
hdr.ethernet.dstAddr = 1;
} else {
mark_to_drop(standard_metadata);
}
}
The ComputeChecksum
control is executed just after the Egress
control completes, and just before the Deparser
control begins.
p4c
plus simple_switch
only support a sequence of calls to the
update_checksum
or update_checksum_with_payload
extern functions
inside such controls. The first argument to these functions is a
boolean, which can be an arbitrary boolean condition used to make the
checksum update action conditional on that expression being true.
The Deparser
control is restricted to contain only a sequence of
calls to the emit
method of the packet_out
object.
The most straightforward approach to avoiding the restrictions in the
ComputeChecksum
and Deparser
controls is to write the more general
code you want near the end of the Egress
control.
Both p4c and simple_switch support register arrays with elements that are arbitrary values of type `bit<W>` or `int<W>`, but not other types; e.g. structs would be convenient for this purpose in some programs, but they are not supported.
As an example of a way to work around this limitation, suppose you wanted a struct with three fields x, y, and z with types `bit<8>`, `bit<3>`, and `bit<6>`. You can emulate this by making a register array with elements of type `bit<17>` (the total width of all 3 fields), using P4_16 bit slicing operations to separate the 3 fields out of the 17-bit value after reading from the register array, and P4_16 bit vector concatenation operations to combine the 3 fields together, forming a 17-bit result, before writing the 17-bit value back to the register array.
register< bit<17> >(512) my_register_array;
bit<9> index;
// ... other code here ...
// This example action is written assuming that some code executed
// earlier (not shown here) has assigned a correct desired value
// to the variable 'index'.
action update_fields() {
bit<17> tmp;
bit<8> x;
bit<3> y;
bit<6> z;
my_register_array.read(tmp, (bit<32>) index);
// Use bit slicing to extract out the logically separate parts
// of the 17-bit register entry value.
x = tmp[16:9];
y = tmp[8:6];
z = tmp[5:0];
// Whatever modifications you wish to make to x, y, and z go
// here. This is just an example code snippet, not known to
// do anything useful for packet processing.
if (y == 0) {
x = x + 1;
y = 7;
z = z ^ hdr.ethernet.etherType[5:0];
} else {
// leave x unchanged
y = y - 1;
z = z << 1;
}
// Concatenate the updated values of x, y, and z back into a
// 17-bit value for writing back to the register.
tmp = x ++ y ++ z;
my_register_array.write((bit<32>) index, tmp);
}
While p4c and simple_switch support as wide a bit width W as you wish for array elements for packet processing, the Thrift API (used by `simple_switch_CLI`, and perhaps some switch controller software) only supports control plane read and write operations for array elements up to 64 bits wide (see the type `BmRegisterValue` in file `standard.thrift`, which is a 64-bit integer as of January 2022). The P4Runtime API does not have this limitation, but there is no P4Runtime implementation of register read and write operations yet for simple_switch: p4lang/PI#376
The BMv2 v1model implementation supports parallel execution. It uses locking of all register objects accessed within an action to guarantee that the execution of all steps within an action is atomic, relative to other packets executing the same action, or any action that accesses some of the same register objects. You need not use the `@atomic` annotation in your P4_16 program in order for this level of atomicity to be guaranteed for you.
BMv2 v1model as of January 2022 ignores @atomic
annotations in
your P4_16 programs. Thus even if you use such annotations, this does
not cause BMv2 to treat any block of code larger than one action call
as an atomic transaction.
The BMv2 v1model implementation of the `random` function supports the `lo` and `hi` parameters being run-time variables, i.e. they need not be compile time constants. Also, they need not be limited to the constraint that `(hi - lo + 1)` is a power of 2. Type `T` is restricted to be `bit<W>` for `W <= 64`.
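A call such as the following, placed inside an action or an `apply` block, is a minimal sketch; `meta.rand_port` is assumed to be a `bit<16>` user metadata field.

```
// Writes a uniformly distributed value in [1024, 65535] to meta.rand_port.
// Neither bound needs to be a compile time constant, and the size of the
// range need not be a power of 2.
random(meta.rand_port, (bit<16>) 1024, (bit<16>) 65535);
```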
The BMv2 v1model implementation of the `hash` function supports the `base` and `max` parameters being run-time variables, i.e. they need not be compile time constants. Also, `max` need not be limited to the constraint that it is a power of 2.

Call the hash value that is calculated from the data H. The value written to the out parameter named `result` is `(base + (H % max))` if `max >= 1`, otherwise `base`.

The type `O` of `result`, `T` of `base`, and `M` of `max` are restricted to be `bit<W>` for `W <= 64`.
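A minimal sketch of a call, assuming `meta.flow_index` is a `bit<16>` field and that `hdr.ipv4` and `hdr.tcp` are parsed headers with the usual field names:

```
// With base = 0 and max = 1024, the value written to meta.flow_index is
// (H % 1024), i.e. in the range [0, 1023], where H is the CRC-16 hash of
// the listed fields.
hash(meta.flow_index,
     HashAlgorithm.crc16,
     (bit<16>) 0,                           // base
     { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol,
       hdr.tcp.srcPort, hdr.tcp.dstPort },  // data to hash
     (bit<32>) 1024);                       // max
```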
If a table `t` has a `direct_counter` object `c` associated with it by having a table property `counters = c` in its definition, the BMv2 v1model implementation behaves as if every action for that table contains exactly one call to `c.count()`, whether it actually contains none, one, or more than one such call.
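A sketch of the usual way to attach a direct counter; the table, action, and counter names here are hypothetical.

```
direct_counter(CounterType.packets_and_bytes) per_entry_stats;

action permit() {
    per_entry_stats.count();
}
action deny() {
    // No explicit count() call, but the BMv2 v1model implementation still
    // updates the counter for entries that use this action.
    mark_to_drop(standard_metadata);
}
table acl {
    key = { hdr.ipv4.srcAddr : ternary; }
    actions = { permit; deny; }
    counters = per_entry_stats;
    default_action = deny();
}
```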
If a table `t` has a `direct_meter` object `m` associated with it by having a table property `meters = m` in its definition, then at least one of that table's actions must have a call `m.read(result_field);`, for some field `result_field` with type `bit<W>` where `W >= 2`.

When this is done, the BMv2 v1model implementation behaves as if all actions for table `t` have such a call in them, even if they do not.

It is not supported, and the `p4c` compiler gives an error message, if you have two actions for table `t` where one has the call `m.read(result_field1)` and another has the call `m.read(result_field2)`, or if both of those calls are in the same action. All calls to the `read()` method of `m` must have the same result parameter, where the result is written.
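A sketch of the pattern; the names are hypothetical, and `meta.meter_color` is assumed to be declared as `bit<32>` user metadata.

```
direct_meter<bit<32>>(MeterType.packets) rate_limiter;

action check_rate() {
    // Every action of this table that reads the meter must write the
    // result into this same field.
    rate_limiter.read(meta.meter_color);
}
table police {
    key = { standard_metadata.ingress_port : exact; }
    actions = { check_rate; }
    meters = rate_limiter;
    default_action = check_rate();
}

// Later in the ingress apply block, a common (but not mandatory) policy:
//     police.apply();
//     if (meta.meter_color != 0) { mark_to_drop(standard_metadata); }
```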
All timestamps in BMv2 as of 2022-Jan-18 are in units of
microseconds, and they all begin at 0 when the process begins,
e.g. when simple_switch
or simple_switch_grpc
begins.
Thus if one starts multiple simple_switch
processes, either on the
same system, or different systems, it is highly unlikely that their
timestamps will be within 1 microsecond of each other. Timestamp
values from switch #1 could easily be X microseconds later than, or
earlier than, switch #2, and this time difference between pairs of
switches could change from one time of starting a multi-switch system,
versus another time. While it is possible to start pairs of
simple_switch
processes where X would be in the range -10 to +10
microseconds, that would almost certainly be a lucky and unlikely
chance that leads to such close timestamps. Differences within plus
or minus 2,000,000 microseconds should be fairly straightforward to
achieve, if you go to at least a little bit of effort to start the
processes at the same second as each other.
Many commercial switch and NIC vendors implement standards like PTP
(Precision Time
Protocol), but
simple_switch
does not.
If you want to use `simple_switch` and have some form of more closely synchronized timestamps between different switches, you have several options:
- Modify `simple_switch` code so that the `ingress_global_timestamp` and other timestamp values are calculated from the local system time, e.g. perhaps using a system call like `gettimeofday`. Many systems today use NTP (Network Time Protocol) to synchronize their time to some time server, and according to the Wikipedia article it can achieve better than one millisecond accuracy on local area networks under ideal conditions, and to within tens of milliseconds over the public Internet. If your use case can accommodate this much difference in the timestamps on different switches, this method should require fairly low development effort to implement.
- Implement PTP in some combination of P4 program and control plane software, and run that on simple_switch. This is likely to be a significant amount of time and effort to achieve, and given the variable latencies in software processing of packets in virtual and/or physical Ethernet interfaces on typical hosts, it seems unlikely that synchronization closer than tens or hundreds of microseconds would be achievable. Commercial PTP implementations can achieve tighter synchronization because the PTP packet handling is implemented within the Ethernet packet processing hardware, very close to the physical transmission and reception of the data between switches.
- Implement a newer synchronization algorithm like HUYGENS. This is likely to be at least as much development effort as implementing PTP, but seems likely to be able to achieve tighter time synchronization.
The BMv2 v1model implementation supports the table property `support_timeout`. If not specified for a table, its value is false. If you assign this table property the value true when defining a table, as shown in the table definition below, then BMv2 will maintain independent state for each entry of the table to detect when it has been longer than a configured time interval since the entry was last matched.
// Definitions of actions have been omitted in this code snippet.
// Their definitions do not affect the behavior of the
// `support_timeout` table property.
table mac_da_fwd {
key = {
hdr.ethernet.dstAddr: exact;
}
actions = {
set_port;
my_drop;
}
support_timeout = true;
default_action = my_drop;
}
See Sections
"Idle-timeout"
and "Table Idle Timeout
Notification"
in the P4Runtime Specification for how a controller can configure the
time interval for each entry it adds to such a table, and the contents
of IdleTimeoutNotification messages sent from the switch to a
P4Runtime controller when a table entry has not been matched for
longer than its configured idle_timeout_ns
duration.
The BMv2 v1model implementation of this feature does a "background
sweep" of all entries in all tables with support_timeout = true
every 1 second, so effectively idle timeout configurations are rounded
up to the next multiple of 1 second.
If a table entry has not been matched for long enough, and thus BMv2
has sent an IdleTimeoutNotification message to the controller for that
entry, and the entry continues not to be matched after that, BMv2 will
attempt to send another IdleTimeoutNotification for the same entry
every 2 seconds afterwards (every other sweep interval), regardless of
the idle_timeout_ns
configured for the entry.
Note: This section is for historical reference purposes. It applies for versions of p4c before this pull request was merged:
- p4lang/p4c#2902
- Merged with this commit on 2021-Dec-06: https://github.com/p4lang/p4c/commit/2be448f018e88095a13e8a68e1c7cc1641038cd8
Recirculate, resubmit, and clone operations that do not preserve metadata, i.e. they have an empty list of fields whose value should be preserved, work as expected when using p4c and simple_switch, with either P4_14 as the source language, or P4_16 plus the v1model architecture.
Unfortunately, at least some of these operations that attempt to preserve metadata do not work correctly -- they still cause the packet to be recirculated, resubmitted, or cloned, but they do not preserve the desired metadata field values.
This is a known issue that has been discussed at length by the p4c developers and P4 language design work group, and the decision made is:
- Long term, the P4_16 Portable Switch Architecture uses a different mechanism for specifying metadata to preserve than the v1model architecture uses, and should work correctly. As of October 2019 the implementation of PSA is not complete, so this does not help one write working code today.
- If someone wishes to volunteer to make changes to p4c and/or
simple_switch for improving this situation, please contact the P4
language design work group and coordinate with them.
- The preferred form of help would be to complete the Portable Switch Architecture implementation.
Note that this issue affects not only P4_16 programs using the v1model architecture, but also P4_14 programs compiled using the `p4c` compiler, because internally `p4c` translates P4_14 to P4_16 code before reaching the part of the compiler that causes the problem.
The fundamental issue is that P4_14 has the field_list
construct,
which is a restricted kind of named reference to other fields. When
these field lists are used in P4_14 for recirculate, resubmit, and
clone operations that preserve metadata, they direct the target not
only to read those field values, but also later write them (for the
packet after recirculating, resubmitting, or cloning). P4_16 field
lists using the syntax {field_1, field_2, ...}
are restricted to
accessing the current values of the fields at the time the statement
or expression is executed, but do not represent any kind of reference,
and by the P4_16 language specification cannot cause the values of
those fields to change at another part of the program.
Below are some details for anyone trying, by hook or by crook, to make preservation of metadata work with the current p4c implementation, without the future improvements mentioned above.
There is no known simple rule to determine which invocations of these operations will preserve metadata correctly and which will not. If you want at least some indication, examine the BMv2 JSON file produced as output from the compiler.
The Python program bmv2-json-check.py attempts to determine whether any field list used to preserve metadata fields for recirculate, resubmit, or clone operations looks like it has the name of a compiler-generated temporary variable. Warning: its methods are fairly basic, and thus there is no guarantee that if the script says there is no problem, metadata preservation will work correctly, or conversely that if the script says there are suspicious things found, metadata preservation will not work correctly. It automates what is described below.
Every recirculate, resubmit, and clone operation is represented in JSON data inside the BMv2 JSON file. Search for one of these strings, with the double quote marks around them:
"recirculate"
- only parameter is a field_list id"resubmit"
- only parameter is a field_list id"clone_ingress_pkt_to_egress"
- second parameter is a field_list id"clone_egress_pkt_to_egress"
- second parameter is a field_list id
You can find the fields in each field list in the section of the BMv2 JSON file with the key `"field_lists"`. Whatever the field names are there, simple_switch will preserve the values of fields with those names. The primary issue is whether those fields correspond to the same storage locations that the compiler uses to represent the fields you want to preserve.
In some cases they are, which is often the case if the field names in the BMv2 JSON file look the same as, or similar to, the field names in your P4 source code.
In some cases they represent different storage locations,
e.g. compiler-generated temporary variables used to hold a copy of the
field you want to preserve. A field name beginning with tmp.
is a
hint of this as of p4c 2019-Oct up through 2021-Dec-05, but this is a
p4c implementation detail that could change.