ZSON Specification

1. Introduction

ZSON is the human-readable, text-based serialization format of the super-structured Zed data model.

ZSON builds upon the elegant simplicity of JSON with "type decorators". Where the type of a value is not implied by its syntax, a parenthesized type decorator is appended to the value, thus establishing a well-defined type for every value expressed in ZSON text.

ZSON is also a superset of JSON in that all JSON documents are valid ZSON values.

2. The ZSON Format

A ZSON text is a sequence of UTF-8 characters organized either as a bounded input or an unbounded stream.

The input text is organized as a sequence of one or more Zed values optionally separated by and interspersed with whitespace. Single-line (//) and multi-line (/* ... */) comments are treated as whitespace and ignored.

All subsequent references to characters and strings in this section refer to the Unicode code points that result when the stream is decoded. If a ZSON input includes data that is not valid UTF-8, the input is invalid.

2.1 Names

ZSON names encode record fields, enum symbols, and named types. A name is either an identifier or a quoted string. Names are referred to as <name> below.

An identifier is case-sensitive and can contain Unicode letters, $, _, and digits (0-9), but may not start with a digit. An identifier cannot be true, false, or null.
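The identifier rule above can be sketched as a small check. This is only an illustration: the function name is_identifier is hypothetical, and Python's str.isalpha() is assumed to be a close enough stand-in for "Unicode letters".

```python
RESERVED = {"true", "false", "null"}
ASCII_DIGITS = "0123456789"


def is_identifier(name):
    """Return True if name follows the identifier rules of Section 2.1."""
    if not name or name in RESERVED:
        return False
    # An identifier may contain digits but may not start with one.
    if name[0] in ASCII_DIGITS:
        return False
    # Assumption: str.isalpha() approximates "Unicode letters".
    return all(ch.isalpha() or ch in "$_" or ch in ASCII_DIGITS for ch in name)


print(is_identifier("city_schema"))
print(is_identifier("1st"))
```

A name that fails this check can still be used as a ZSON name by writing it as a quoted string.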

2.2 Type Decorators

A value may be explicitly typed by tagging it with a type decorator. The syntax for a decorator is a parenthesized type:

<value> ( <type> )

For union values, multiple decorators might be required to distinguish the union-member type from the possible set of union types when there is ambiguity, as in

123 (float64) (int64,float64)

In contrast, this union value is unambiguous:

123.0 (int64,float64)

The syntax of a union value decorator is

<value> ( <type> ) [ ( <type> ) ...]

where the rightmost type must be a union type if more than one decorator is present.

A decorator may also define a named type:

<value> ( =<name> )

which declares a new type with the indicated type name using the implied type of the value.

It is an error for the decorator to be type-incompatible with its referenced value.

Note that the = sigil here distinguishes the case in which a new type is defined, which may override a previous definition of a different type with the same name, from the case in which an existing named type merely decorates the value.

2.3 Primitive Values

The type names and formats for Zed primitive values are as follows:

| Type | Value Format |
|------|--------------|
| uint8 | decimal string representation of any unsigned, 8-bit integer |
| uint16 | decimal string representation of any unsigned, 16-bit integer |
| uint32 | decimal string representation of any unsigned, 32-bit integer |
| uint64 | decimal string representation of any unsigned, 64-bit integer |
| uint128 | decimal string representation of any unsigned, 128-bit integer |
| uint256 | decimal string representation of any unsigned, 256-bit integer |
| int8 | decimal string representation of any signed, 8-bit integer |
| int16 | decimal string representation of any signed, 16-bit integer |
| int32 | decimal string representation of any signed, 32-bit integer |
| int64 | decimal string representation of any signed, 64-bit integer |
| int128 | decimal string representation of any signed, 128-bit integer |
| int256 | decimal string representation of any signed, 256-bit integer |
| duration | a duration string representing signed 64-bit nanoseconds |
| time | an RFC 3339 UTC date/time string representing signed 64-bit nanoseconds from epoch |
| float16 | a non-integer string representing an IEEE-754 binary16 value |
| float32 | a non-integer string representing an IEEE-754 binary32 value |
| float64 | a non-integer string representing an IEEE-754 binary64 value |
| float128 | a non-integer string representing an IEEE-754 binary128 value |
| float256 | a non-integer string representing an IEEE-754 binary256 value |
| decimal32 | a non-integer string representing an IEEE-754 decimal32 value |
| decimal64 | a non-integer string representing an IEEE-754 decimal64 value |
| decimal128 | a non-integer string representing an IEEE-754 decimal128 value |
| decimal256 | a non-integer string representing an IEEE-754 decimal256 value |
| bool | the string true or false |
| bytes | a sequence of bytes encoded as a hexadecimal string prefixed with 0x |
| string | a double-quoted or backtick-quoted UTF-8 string |
| ip | a string representing an IP address in IPv4 or IPv6 format |
| net | a string in CIDR notation representing an IP address and prefix length as defined in RFC 4632 and RFC 4291 |
| type | a string in canonical form as described in Section 2.5 |
| null | the string null |

The format of a duration string is an optionally-signed concatenation of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m", representing a 64-bit nanosecond value. Valid time units are "ns" (nanosecond), "us" (microsecond), "ms" (millisecond), "s" (second), "m" (minute), "h" (hour), "d" (day), "w" (7 days), and "y" (365 days). Note that each of these time units accurately represents its calendar value, except for the "y" unit, which does not reflect leap years and so forth. Instead, "y" is defined as the number of nanoseconds in 365 days.
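As a rough illustration of the duration format, the following sketch converts a duration string into signed 64-bit nanoseconds. The function name parse_duration and the table NS_PER_UNIT are hypothetical, and two details are assumptions rather than part of the text above: a leading + sign is accepted alongside -, and a fractional nanosecond remainder is truncated.

```python
import re
from fractions import Fraction

SECOND = 1_000_000_000  # nanoseconds per second

# Nanoseconds per unit; "y" is exactly 365 days, as stated above.
NS_PER_UNIT = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": SECOND,
    "m": 60 * SECOND,
    "h": 3_600 * SECOND,
    "d": 86_400 * SECOND,
    "w": 7 * 86_400 * SECOND,
    "y": 365 * 86_400 * SECOND,
}

# One term: a decimal number with optional fraction, then a unit suffix.
# Two-letter units are listed first so "ms" is not read as "m".
_TERM = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration string such as "2h45m" to nanoseconds."""
    sign, body = 1, text
    if body[:1] in ("+", "-"):  # assumption: "+" is allowed as well as "-"
        sign = -1 if body[0] == "-" else 1
        body = body[1:]
    if not body:
        raise ValueError("empty duration")
    total, pos = Fraction(0), 0
    while pos < len(body):
        match = _TERM.match(body, pos)
        if match is None:
            raise ValueError(f"invalid duration term at offset {pos}: {text!r}")
        # Fraction keeps values like "1.5" exact, avoiding float rounding.
        total += Fraction(match.group(1)) * NS_PER_UNIT[match.group(2)]
        pos = match.end()
    nanos = sign * int(total)
    if not -(2**63) <= nanos < 2**63:
        raise ValueError("duration does not fit in signed 64 bits")
    return nanos


print(parse_duration("300ms"))  # 300000000
print(parse_duration("-1.5h"))  # -5400000000000
print(parse_duration("2h45m"))  # 9900000000000
```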

The format of floating point values is a non-integer string conforming to any floating point representation that cannot be interpreted as an integer, e.g., 1. or 1.0 instead of 1, or 1e3 instead of 1000. Unlike JSON, a floating point number can also be one of: Inf, +Inf, -Inf, or Nan.

A floating point value may be expressed with an integer string provided a type decorator is applied, e.g., 123 (float64).

Decimal values require type decorators.

A string may be backtick-quoted with the backtick character `. None of the text between backticks is escaped, but by default, any newlines followed by whitespace are converted to a single newline and the first newline of the string is deleted. To avoid this automatic deletion and preserve indentation, the backtick-quoted string can be preceded with =>.

Of the 30 primitive types, eleven represent implied-type values: int64, time, duration, float64, bool, bytes, string, ip, net, type, and null. Values for these types are determined by the format of the value and thus do not need decorators to clarify the underlying type, e.g.,

123 (int64)

is the same as 123.
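To make the notion of an implied type concrete, here is a minimal sketch that classifies a single undecorated token by its syntax. The function name implied_type is hypothetical, and the sketch makes simplifying assumptions: it covers only nine of the eleven implied types (time and duration are omitted), it accepts only a leading minus sign on numbers, and it relies on Python's ipaddress module to recognize ip and net values.

```python
import ipaddress
import re

_INT = re.compile(r"-?\d+")
_FLOAT = re.compile(r"-?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")
_BYTES = re.compile(r"0x([0-9a-fA-F]{2})*")
_SPECIAL_FLOATS = ("Inf", "+Inf", "-Inf", "Nan")


def implied_type(token):
    """Return the implied type name for a token, or None if unrecognized."""
    if token == "null":
        return "null"
    if token in ("true", "false"):
        return "bool"
    if len(token) >= 2 and token[0] == token[-1] and token[0] in '"`':
        return "string"
    if len(token) >= 2 and token[0] == "<" and token[-1] == ">":
        return "type"
    if _BYTES.fullmatch(token):
        return "bytes"
    # Integers are checked first, so only non-integer strings reach _FLOAT.
    if _INT.fullmatch(token):
        return "int64"
    if token in _SPECIAL_FLOATS or _FLOAT.fullmatch(token):
        return "float64"
    try:
        if "/" in token:
            ipaddress.ip_network(token, strict=False)
            return "net"
        ipaddress.ip_address(token)
        return "ip"
    except ValueError:
        return None


for token in ["123", "1.", "1e3", "10.1.1.2", "10.1.1.0/24", "0x0a0b"]:
    print(token, "->", implied_type(token))
```

A value such as 123 that should be a uint16 cannot be recognized this way, which is why such values need a decorator or a typed context.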

Values that do not have implied types must include a type decorator to clarify their type or appear in a context for which their type is defined (i.e., as a field value in a record, as an element in an array, etc.).

While a type value may represent a complex type, the value itself is a singleton and thus always a primitive type. A type value is encoded as:

  • a left angle bracket <, followed by
  • a type as encoded below, followed by
  • a right angle bracket >.

A time value corresponds to 64-bit Unix epoch nanoseconds and thus not all possible RFC 3339 date/time strings are valid. In addition, nanosecond epoch times overflow on April 11, 2262. For the world of 2262, a new epoch can be created well in advance, and the old and new time epochs can live side by side, with the new epoch time expressed as a named type defined in terms of the old time type. An app that requires more than 64 bits of timestamp precision can always use a typedef of a bytes type and do its own conversions to and from the corresponding bytes values.
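The overflow date can be checked with a short calculation. This sketch converts the largest signed 64-bit nanosecond count into a calendar date; the helper name ns_to_datetime is hypothetical, and because Python's datetime has only microsecond resolution, the last three digits of the nanosecond value are dropped.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_NS = 2**63 - 1  # largest signed 64-bit value


def ns_to_datetime(ns):
    """Convert epoch nanoseconds to a UTC datetime, truncated to microseconds."""
    return EPOCH + timedelta(microseconds=ns // 1_000)


last = ns_to_datetime(MAX_NS)
print(last.date())  # 2262-04-11
```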

2.3.1 Strings

Double-quoted string syntax is the same as that of JSON as described in RFC 8259. Notably, the following escape sequences are recognized:

Sequence Unicode Character
\" quotation mark U+0022
\\ reverse solidus U+005C
\/ solidus U+002F
\b backspace U+0008
\f form feed U+000C
\n line feed U+000A
\r carriage return U+000D
\t tab U+0009
\uXXXX U+XXXX

In \uXXXX sequences, each X is a hexadecimal digit, and letter digits may be uppercase or lowercase.

The behavior of an implementation that encounters an unrecognized escape sequence in a string type is undefined.

\u followed by anything that does not conform to the above syntax is not a valid escape sequence. The behavior of an implementation that encounters such invalid sequences in a string type is undefined.

These escaping rules apply also to quoted field names in record values and record types as well as enum symbols.
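Because double-quoted strings share JSON's syntax, a JSON decoder can serve as a quick way to experiment with these escapes. The sketch below is an illustration only; the function name decode_double_quoted is hypothetical, and it does not handle backtick-quoted strings. Note that Python's decoder rejects unrecognized escapes, which is one permissible choice given that the behavior is undefined here.

```python
import json


def decode_double_quoted(literal):
    """Decode a double-quoted string literal using JSON string rules."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a double-quoted string literal")
    value = json.loads(literal)  # raises a ValueError subclass on bad input
    if not isinstance(value, str):
        raise ValueError("literal did not decode to a string")
    return value


print(repr(decode_double_quoted(r'"tab\there \u0041"')))
```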

2.4 Complex Values

Complex values are built from primitive values and/or other complex values and conform to the Zed data model's complex types: record, array, set, map, union, enum, and error.

Complex values have an implied type when their constituent values all have implied types.

2.4.1 Record Value

A record value has the form:

{ <name> : <value>, <name> : <value>, ... }

where <name> is a ZSON name and <value> is any optionally-decorated ZSON value inclusive of other records. Each name/value pair is called a field. There may be zero or more fields.

2.4.2 Array Value

An array value has the form:

[ <value>, <value>, ... ]

If the elements of the array are not of uniform type, then the implied type of the array elements is a union of the types present.

An array value may be empty. An empty array value without a type decorator is presumed to be an empty array of type null.
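The rules for an array's implied type can be sketched as follows, working on element type names rather than parsed values. The function name implied_array_type is hypothetical, and listing union members in order of first appearance is an assumption made for this illustration; the text above does not specify an ordering.

```python
def implied_array_type(element_types):
    """Return the implied array type for a list of element type names."""
    # Deduplicate while keeping first-seen order (an assumption, see above).
    distinct = list(dict.fromkeys(element_types))
    if not distinct:
        return "[null]"  # empty array without a decorator
    if len(distinct) == 1:
        return f"[{distinct[0]}]"  # uniform element type
    return "[(" + ",".join(distinct) + ")]"  # union of the types present


print(implied_array_type(["int64", "int64"]))
print(implied_array_type(["int64", "string", "int64"]))
print(implied_array_type([]))
```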

2.4.3 Set Value

A set value has the form:

|[ <value>, <value>, ... ]|

where the indicated values must be distinct.

If the elements of the set are not of uniform type, then the implied type of the set elements is a union of the types present.

A set value may be empty. An empty set value without a type decorator is presumed to be an empty set of type null.

2.4.4 Map Value

A map value has the form:

|{ <key> : <value>, <key> : <value>, ... }|

where zero or more comma-separated, key/value pairs are present.

Whitespace around keys and values is generally optional, but to avoid ambiguity, whitespace must separate an IPv6 key from the colon that follows it.

An empty map value without a type decorator is presumed to be an empty map of type |{null: null}|.

2.4.5 Union Value

A union value is a value that conforms to one of the types within a union type. If the value appears in a context in which the type is unknown or ambiguous, then the value must be decorated as described above.

2.4.6 Enum Value

An enum type represents a symbol from a finite set of symbols referenced by name.

An enum value is indicated with the sigil % and has the form

%<name>

where the <name> is a ZSON name.

An enum value must appear in a context where the enum type is known, i.e., with an explicit enum type decorator or within a complex type where the contained enum type is defined by the complex type's decorator.

A sequence of enum values might look like this:

%HEADS (flip=(enum(HEADS,TAILS)))
%TAILS (flip)
%HEADS (flip)

2.4.7 Error Value

An error value has the form:

error(<value>)

where <value> is any ZSON value.

2.5 Types

A primitive type is simply the name of the primitive type, i.e., string, uint16, etc. Complex types are defined as follows.

2.5.1 Record Type

A record type has the form:

{ <name> : <type>, <name> : <type>, ... }

where <name> is a ZSON name and <type> is any type.

The order of the record fields is significant, e.g., type {a:int32,b:int32} is distinct from type {b:int32,a:int32}.

2.5.2 Array Type

An array type has the form:

[ <type> ]

2.5.3 Set Type

A set type has the form:

|[ <type> ]|

2.5.4 Map Type

A map type has the form:

|{ <key-type>: <value-type> }|

where <key-type> is the type of the keys and <value-type> is the type of the values.

2.5.5 Union Type

A union type has the form:

( <type>, <type>, ... )

where there are at least two types in the list.

2.5.6 Enum Type

An enum type has the form:

enum( <name>, <name>, ... )

where <name> is a ZSON name. Each enum name must be unique and the order is not significant, e.g., enum type enum(HEADS,TAILS) is equal to type enum(TAILS,HEADS).
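The contrast between enum types, where symbol order is not significant, and record types (Section 2.5.1), where field order is significant, can be modeled with ordered and unordered containers. The classes RecordType and EnumType below are hypothetical illustrations, not part of any ZSON implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordType:
    fields: tuple  # ordered (name, type) pairs, so order affects equality


@dataclass(frozen=True)
class EnumType:
    symbols: frozenset  # unordered, so order does not affect equality

    @classmethod
    def of(cls, *names):
        if len(set(names)) != len(names):
            raise ValueError("enum names must be unique")
        return cls(frozenset(names))


ab = RecordType((("a", "int32"), ("b", "int32")))
ba = RecordType((("b", "int32"), ("a", "int32")))
print(ab == ba)  # False
print(EnumType.of("HEADS", "TAILS") == EnumType.of("TAILS", "HEADS"))  # True
```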

2.5.7 Error Type

An error type has the form:

error( <type> )

where <type> is the type of the underlying ZSON values wrapped as an error.

2.5.8 Named Type

A named type has the form:

<name> = ( <type> )

where a new type is defined with the given name and type.

When a named type appears in a complex value, the new type name may be referenced by any subsequent value in left-to-right depth-first order.

For example,

{p1:80 (port=(uint16)), p2: 8080 (port)}

is valid but

{p1:80 (port), p2: 8080 (port=(uint16))}

is invalid.

Named types may be redefined, in which case subsequent references resolve to the most recent definition according to

  • sequence order across values, or
  • left-to-right depth-first order within a complex value.
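The resolution rule can be sketched with a simple table of definitions that is updated as decorators are encountered. In this illustration the class TypeContext and the function resolve_all are hypothetical, and the decorators are assumed to have already been flattened into sequence order and left-to-right depth-first order.

```python
class TypeContext:
    """Tracks named-type definitions in the order they are encountered."""

    def __init__(self):
        self._defs = {}

    def define(self, name, type_text):
        # A redefinition simply replaces the earlier definition.
        self._defs[name] = type_text

    def resolve(self, name):
        if name not in self._defs:
            raise LookupError(f"named type {name!r} referenced before definition")
        return self._defs[name]


def resolve_all(decorators):
    """Resolve ("def", name, type) and ("ref", name) items in order."""
    ctx, resolved = TypeContext(), []
    for item in decorators:
        if item[0] == "def":
            ctx.define(item[1], item[2])
        resolved.append(ctx.resolve(item[1]))
    return resolved


# Mirrors {p1:80 (port=(uint16)), p2: 8080 (port)}
print(resolve_all([("def", "port", "uint16"), ("ref", "port")]))
```

Passing the reference before the definition raises LookupError, which corresponds to the invalid example above.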

2.6 Null Value

The null value is represented by the string null.

A value of any type can be null. It is up to an implementation to decide how external data structures map into and out of null values of different types. Typically, a null value means either the zero value or, in the case of record fields, an optional field whose value is not present, though these semantics are not explicitly defined by ZSON.

3. Examples

The simplest ZSON value is a single value, perhaps a string like this:

"hello, world"

There's no need for a type declaration here. It's explicitly a string.

A relational table might look like this:

{ city: "Berkeley", state: "CA", population: 121643 (uint32) } (=city_schema)
{ city: "Broad Cove", state: "ME", population: 806 (uint32) } (=city_schema)
{ city: "Baton Rouge", state: "LA", population: 221599 (uint32) } (=city_schema)

This ZSON text depicts three record values. It defines a type called city_schema, whose inferred type has the signature:

{ city:string, state:string, population:uint32 }

When all the values in a sequence have the same record type, the sequence can be interpreted as a table, where the ZSON record values form the rows and the fields of the records form the columns. In this way, these three records form a relational table conforming to the schema city_schema.
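The table interpretation can be illustrated by modeling each record as a Python dict and checking that every record has the same fields in the same order. This is a simplified sketch: the function name as_table is hypothetical, and it compares field names only, whereas a full check would also compare field types.

```python
records = [
    {"city": "Berkeley", "state": "CA", "population": 121643},
    {"city": "Broad Cove", "state": "ME", "population": 806},
    {"city": "Baton Rouge", "state": "LA", "population": 221599},
]


def as_table(rows):
    """Return (columns, row values) if all rows share one field layout."""
    if not rows:
        return [], []
    columns = list(rows[0])
    for row in rows:
        # Field order is significant, so compare the ordered key lists.
        if list(row) != columns:
            raise ValueError("records do not share a single record type")
    return columns, [[row[name] for name in columns] for row in rows]


columns, values = as_table(records)
print(columns)
for line in values:
    print(line)
```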

In contrast, a ZSON text representing a semi-structured sequence of log lines might look like this:

{
    info: "Connection Example",
    src: { addr: 10.1.1.2, port: 80 (uint16) } (=socket),
    dst: { addr: 10.0.1.2, port: 20130 (uint16) } (=socket)
} (=conn)
{
    info: "Connection Example 2",
    src: { addr: 10.1.1.8, port: 80 (uint16) } (=socket),
    dst: { addr: 10.1.2.88, port: 19801 (uint16) } (=socket)
} (=conn)
{
    info: "Access List Example",
    nets: [ 10.1.1.0/24, 10.1.2.0/24 ]
} (=access_list)
{ metric: "A", ts: 2020-11-24T08:44:09.586441-08:00, value: 120 }
{ metric: "B", ts: 2020-11-24T08:44:20.726057-08:00, value: 0.86 }
{ metric: "A", ts: 2020-11-24T08:44:32.201458-08:00, value: 126 }
{ metric: "C", ts: 2020-11-24T08:44:43.547506-08:00, value: { x:10, y:101 } }

In this case, the first record defines not just a record type with named type conn, but also a second embedded record type called socket. The parenthesized decorators are used where a type is not inferred from the value itself:

  • socket is a record with typed fields addr and port where port is an unsigned 16-bit integer, and
  • conn is a record with typed fields info, src, and dst.

The subsequent value defines a type called access_list. In this case, the nets field is an array of networks and illustrates the helpful range of primitive types in ZSON. Note that the syntax here implies the type of the array, as it is inferred from the type of the elements.

Finally, there are four more values that show ZSON's efficacy for representing metrics. Here, there are no type decorators as all of the field types are implied by their syntax, and hence, the top-level record type is implied. For instance, the ts field is an RFC 3339 date and time string, unambiguously the primitive type time. Further, note that the value field takes on different types, including a complex record type on the last line. Each of the three distinct types of the value field (int64, float64, and a record) implies a different top-level record type.
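To see why the four metric values imply three record types, the following sketch models them as Python values and derives a type signature for each. The function implied_type_of is hypothetical, and the mapping from Python types to ZSON type names (int to int64, float to float64, datetime to time) is an assumption made for illustration.

```python
from datetime import datetime


def implied_type_of(value):
    """Derive a ZSON-style type signature from a Python value."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, datetime):
        return "time"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{implied_type_of(v)}" for k, v in value.items())
        return "{" + fields + "}"
    raise TypeError(f"unsupported value: {value!r}")


def ts(text):
    return datetime.fromisoformat(text)


metrics = [
    {"metric": "A", "ts": ts("2020-11-24T08:44:09.586441-08:00"), "value": 120},
    {"metric": "B", "ts": ts("2020-11-24T08:44:20.726057-08:00"), "value": 0.86},
    {"metric": "A", "ts": ts("2020-11-24T08:44:32.201458-08:00"), "value": 126},
    {"metric": "C", "ts": ts("2020-11-24T08:44:43.547506-08:00"),
     "value": {"x": 10, "y": 101}},
]

signatures = [implied_type_of(m) for m in metrics]
for signature in sorted(set(signatures)):
    print(signature)
```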

4. Grammar

Here is a left-recursive pseudo-grammar of ZSON. Note that not all acceptable inputs are semantically valid as type mismatches may arise. For example, union and enum values must both appear in a context that defines their type.

<zson> = <zson> <eos> <value> | <zson> <value> | <value>

<eos> = .

<value> = <any> | <any> <val-typedef> | <any> <decorators>

<val-typedef> = "(" "=" <name> ")"

<decorators> = "(" <type> ")" | <decorators> "(" <type> ")"

<any> = <primitive> | <type-value> | <record> | <array> | <set> | <map> | <enum> | <error-value>

<primitive> = primitive value as defined above

<record> = "{" <flist> "}"  |  "{"  "}"

<flist> = <flist> "," <field> | <field>

<field> = <name> ":" <value>

<name> = <identifier> | <quoted-string>

<quoted-string> = quoted string as defined above

<identifier> = as defined above

<array> = "[" <vlist> "]"  |  "["  "]"

<vlist> = <vlist> "," <value> | <value>

<set> = "|[" <vlist> "]|"  |  "|["  "]|"

<enum> = "%" <name>

<map> = "|{" <mlist> "}|"  |  "|{"  "}|"

<mlist> = <mvalue> | <mlist> "," <mvalue>

<mvalue> = <value> ":" <value>

<type-value> = "<" <type> ">"

<error-value> = "error(" <value> ")"

<type> = <primitive-type> | <record-type> | <array-type> | <set-type> |
            <union-type> | <enum-type> | <map-type> |
            <type-def> | <name> | <error-type>

<primitive-type> = uint8 | uint16 | etc. as defined above

<record-type> = "{" <tflist> "}"  |  "{" "}"

<tflist> = <tflist> "," <tfield> | <tfield>

<tfield> = <name> ":" <type>

<array-type> = "[" <type> "]"  |  "[" "]"

<set-type> = "|[" <type> "]|"  |  "|[" "]|"

<union-type> = "(" <type> "," <tlist> ")"

<tlist> = <tlist> "," <type> | <type>

<enum-type> = "enum(" <nlist> ")"

<nlist> = <nlist> "," <name> | <name>

<map-type> = "|{" <type> ":" <type> "}|"

<type-def> = <name> "=" "(" <type> ")"

<name> = as defined above

<error-type> = "error(" <type> ")"
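As an illustration of how the type productions might be turned into code, here is a minimal recursive-descent sketch for type text, following the type forms given in Section 2.5. It is not a reference implementation: all function names are hypothetical, and it handles only a subset (ASCII identifiers, no quoted names, no named-type definitions, no empty array or set types).

```python
import re

# Two-character delimiters come first so "|[" is not split into pieces.
_TOKEN = re.compile(
    r"\s*(\|\[|\]\||\|\{|\}\||[{}\[\](),:]|[A-Za-z_$][A-Za-z0-9_$]*)\s*")


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_type(text):
    """Parse type text into nested tuples; plain names stay strings."""
    tokens = tokenize(text)
    node, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return node


def _expect(tokens, pos, want):
    if pos >= len(tokens) or tokens[pos] != want:
        raise ValueError(f"expected {want!r} at token {pos}")
    return pos + 1


def _parse_name(tokens, pos):
    if pos >= len(tokens) or not (tokens[pos][0].isalpha() or tokens[pos][0] in "_$"):
        raise ValueError(f"expected a name at token {pos}")
    return tokens[pos], pos + 1


def _parse_field(tokens, pos):
    name, pos = _parse_name(tokens, pos)
    typ, pos = _parse(tokens, _expect(tokens, pos, ":"))
    return (name, typ), pos


def _parse_list(tokens, pos, parse_item):
    # One or more comma-separated items.
    items = []
    while True:
        item, pos = parse_item(tokens, pos)
        items.append(item)
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
        else:
            return tuple(items), pos


def _parse(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    has_paren = pos + 1 < len(tokens) and tokens[pos + 1] == "("
    if tok == "[":
        elem, pos = _parse(tokens, pos + 1)
        return ("array", elem), _expect(tokens, pos, "]")
    if tok == "|[":
        elem, pos = _parse(tokens, pos + 1)
        return ("set", elem), _expect(tokens, pos, "]|")
    if tok == "|{":
        key, pos = _parse(tokens, pos + 1)
        val, pos = _parse(tokens, _expect(tokens, pos, ":"))
        return ("map", key, val), _expect(tokens, pos, "}|")
    if tok == "(":
        members, pos = _parse_list(tokens, pos + 1, _parse)
        if len(members) < 2:
            raise ValueError("a union type needs at least two types")
        return ("union", members), _expect(tokens, pos, ")")
    if tok == "{":
        if pos + 1 < len(tokens) and tokens[pos + 1] == "}":
            return ("record", ()), pos + 2
        fields, pos = _parse_list(tokens, pos + 1, _parse_field)
        return ("record", fields), _expect(tokens, pos, "}")
    if tok == "enum" and has_paren:
        names, pos = _parse_list(tokens, pos + 2, _parse_name)
        return ("enum", names), _expect(tokens, pos, ")")
    if tok == "error" and has_paren:
        inner, pos = _parse(tokens, pos + 2)
        return ("error", inner), _expect(tokens, pos, ")")
    return _parse_name(tokens, pos)


print(parse_type("{a:int32,b:[string]}"))
print(parse_type("|{string:(int64,float64)}|"))
```

The parser uses loops for comma-separated lists instead of the left-recursive productions shown in the grammar, since recursive descent cannot handle left recursion directly.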