Typed Map Keys

Joe Betz edited this page Jun 26, 2015 · 9 revisions

All Pegasus maps are string keyed, and the Java map bindings for Pegasus all extend Map<String, V>. Courier also generates string keyed map bindings, e.g. Map[String, V].

We would like to improve Courier to support typed map keys, e.g. Map[Int, Boolean] or Map[UserId, User].

Our goals are:

  • Allow any Pegasus type to be used as a map key. While we consider primitive types, records and enums to be the most useful, we will support maps, arrays and unions as well.
  • The generated classes will be named like <Key>To<Value>Map, except string keyed maps, which will continue to be named like <Value>Map for backward compatibility reasons.

A simple example:

Pegasus map definition:

{ "type": "map", "keys": "int", "values": "boolean" }

Courier generated class:

class IntToBooleanMap extends Map[Int, Boolean]

Example JSON data:

{ "1": true, "2": false }
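As a rough illustration of how the typed binding relates to the JSON data above, the sketch below converts between a Map[Int, Boolean] and its string keyed form. The object and helper names (TypedKeySketch, toWire, fromWire) are hypothetical and are not part of Courier's generated code; the sketch also assumes every key is a well-formed int.

```scala
object TypedKeySketch {
  // Hypothetical helper: produce the string keyed form that appears in the JSON data.
  def toWire(typed: Map[Int, Boolean]): Map[String, Boolean] =
    typed.map { case (key, value) => key.toString -> value }

  // Hypothetical helper: read string keys back as Ints.
  // Assumes every key is a well-formed int; toInt throws otherwise.
  def fromWire(wire: Map[String, Boolean]): Map[Int, Boolean] =
    wire.map { case (key, value) => key.toInt -> value }
}
```

With these helpers, Map(1 -> true, 2 -> false) corresponds to the string keyed data { "1": true, "2": false } shown above.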

The "keys" field

The "keys" field in a Pegasus map definition is a new field being introduced by Courier. It is not part of Avro or Pegasus schemas, and the Java binding generator (or any other Pegasus data binding generator, for that matter) will simply ignore it.

For example:

{ "type": "map", "keys": "int", "values": "boolean" }

Will be interpreted by Avro and Pegasus as:

{ "type": "map", "values": "boolean" }

Compatibility with string keys

To retain compatibility with the existing Avro and Pegasus data representation, all typed map keys will be represented as a string in JSON (and all other messaging protocols).

Keys will be serialized to strings using InlineStringCodec. For example, if a map is defined as:

{ "type": "map", "keys": "CourseSessionId", "values": "boolean" }

and a CourseSessionId serializes to JSON as:

{
  "courseId": 1,
  "sessionId": 1000
}

Then a CourseSessionId keyed map would serialize to JSON as:

{
  "(courseId~1,sessionId~1000)": true
}

Where (courseId~1,sessionId~1000) is the InlineStringCodec serialization of the CourseSessionId.
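The key format in this example can be sketched for the one record shown. This is not the actual InlineStringCodec: it is a simplified stand-in that handles only a flat record with two int fields and does no escaping of reserved characters. The case class and the encode/decode helpers are assumptions made for illustration.

```scala
import scala.util.Try

object InlineKeySketch {
  // Stand-in for the generated record binding; not Courier's actual generated class.
  final case class CourseSessionId(courseId: Int, sessionId: Int)

  // Simplified encoder producing "(name~value,name~value)".
  def encode(id: CourseSessionId): String =
    s"(courseId~${id.courseId},sessionId~${id.sessionId})"

  // Matches only the exact field order and int values produced by encode.
  private val KeyPattern = """\(courseId~(-?\d+),sessionId~(-?\d+)\)""".r

  // Returns None when the string is not a key in the expected format
  // or when a number does not fit in an Int.
  def decode(key: String): Option[CourseSessionId] = key match {
    case KeyPattern(course, session) =>
      Try(CourseSessionId(course.toInt, session.toInt)).toOption
    case _ => None
  }
}
```

Here encode(CourseSessionId(1, 1000)) yields the key string used in the JSON above, and decode returns None for a string that does not follow the format.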

Limitations

Data written using the Avro and Pegasus representations of the schemas may contain key strings that are incompatible with the generated Courier bindings, since those bindings expect each key string to contain data that InlineStringCodec can deserialize to a specific type.

For example, even if a map is keyed by "int", code using the Java Pegasus bindings might write:

{ "newKey": "value" }

because it is not aware that the map keys are supposed to be ints.

This is not a new problem; it has existed all along. When using string keyed maps, it is common for readers and writers to treat the string as "dynamically typed". A reader might expect the keys to be of a particular format (e.g. an int) and fail or reject the data if a key is not deserializable to the expected format. The addition of the "keys" field simply formalizes this expectation that readers and writers already have and provides a uniform way of expressing it.

Predef

All possible map types for primitive types will be generated in Predef. There are seven primitive types, so in theory we will need to generate 49 predef map classes. This is a bit unfortunate. If we want, we can be selective and not generate predef maps keyed, for example, by the bytes type. But given that the set of primitive types is well defined and unlikely to change, we may just generate all 49 and be done with it.
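To make the count concrete, the sketch below enumerates the key/value combinations and applies the naming rule from the goals. The list of seven primitive names is an assumption (this page does not spell them out), and mapClassName is a hypothetical helper, not Courier's generator code.

```scala
object PredefNamesSketch {
  // Assumed list of the seven primitive types; the page itself does not enumerate them.
  val primitives: Seq[String] =
    Seq("int", "long", "float", "double", "boolean", "string", "bytes")

  // Naming rule from the goals: <Key>To<Value>Map,
  // except string keyed maps, which keep the <Value>Map name.
  def mapClassName(key: String, value: String): String =
    if (key == "string") s"${value.capitalize}Map"
    else s"${key.capitalize}To${value.capitalize}Map"

  // Every key type paired with every value type: 7 * 7 = 49 names.
  val allNames: Seq[String] =
    for (key <- primitives; value <- primitives) yield mapClassName(key, value)
}
```

Dropping one key type, such as bytes, would remove seven of these classes and leave 42.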

Validation

We will upgrade our validation systems to be aware of typed map keys. They will return appropriate validation errors for malformed keys.
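One possible shape for such a check, for an int keyed map, is sketched below. KeyError and validateIntKeys are hypothetical names; Courier's actual validation types and error messages may look quite different.

```scala
import scala.util.Try

object KeyValidationSketch {
  // Hypothetical error type, for illustration only.
  final case class KeyError(key: String, message: String)

  // Either returns the map with typed keys, or every malformed key that was found.
  def validateIntKeys[V](wire: Map[String, V]): Either[Seq[KeyError], Map[Int, V]] = {
    val parsed = wire.toSeq.map { case (key, value) =>
      Try(key.toInt).toOption
        .map(_ -> value)
        .toRight(KeyError(key, s"expected an int key but found '$key'"))
    }
    val errors = parsed.collect { case Left(error) => error }
    if (errors.nonEmpty) Left(errors)
    else Right(parsed.collect { case Right(entry) => entry }.toMap)
  }
}
```

Collecting all errors rather than stopping at the first lets a caller report every malformed key in one pass; a fail-fast variant would be equally reasonable.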
