Typed Map Keys
All Pegasus maps are string keyed, and the Java bindings for Pegasus maps all extend Map<String, V>. Courier also generates string-keyed map bindings, e.g. Map[String, V].
We would like to improve Courier to support typed map keys, e.g. Map[Int, Boolean] or Map[UserId, User].
Our goals are:
- Allow any Pegasus type to be used as a map key. While we consider primitive types, records and enums to be the most useful, we will support maps, arrays and unions as well.
- The generated classes will be named like <Key>To<Value>Map, except string-keyed maps, which will continue to be named like <Value>Map for backward compatibility reasons.
A simple example:
Pegasus map definition:
{ "type": "map", "keys": "int", "values": "boolean" }
Courier generated class:
class IntToBooleanMap extends Map[Int, Boolean]
Example JSON data:
{ "1": true, "2": false }
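To picture how the typed binding relates to the string-keyed data, here is a minimal sketch. It is not Courier's generated code: `IntKeyedMapSketch`, `toWire` and `fromWire` are hypothetical names, and the Int keys are converted with plain `toString` and `toInt`.

```scala
// Hypothetical sketch, not Courier's generated binding: converts between a
// typed Map[Int, Boolean] and the string-keyed form that appears in JSON.
object IntKeyedMapSketch {
  // Typed map -> string-keyed data, e.g. Map(1 -> true) becomes Map("1" -> true).
  def toWire(typed: Map[Int, Boolean]): Map[String, Boolean] =
    typed.map { case (key, value) => key.toString -> value }

  // String-keyed data -> typed map. Throws NumberFormatException if a key
  // is not a valid Int; a real binding would need to report this properly.
  def fromWire(wire: Map[String, Boolean]): Map[Int, Boolean] =
    wire.map { case (key, value) => key.toInt -> value }
}
```

The point of the sketch is that only the in-memory key type changes; the data on the wire stays string keyed.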
The "keys" field in a Pegasus map definition is a new field being introduced by Courier. It is not part of Avro or Pegasus schemas, and the Java binding generator (or any other Pegasus data binding generator, for that matter) will simply ignore it.
For example:
{ "type": "map", "keys": "int", "values": "boolean" }
will be interpreted by Avro and Pegasus as:
{ "type": "map", "values": "boolean" }
To retain compatibility with the existing Avro and Pegasus data representation, all typed map keys will be represented as a string in JSON (and all other messaging protocols).
Keys will be serialized to strings using InlineStringCodec. For example, if a map is defined as:
{ "type": "map", "keys": "CourseSessionId", "values": "boolean" }
and a CourseSessionId serializes to JSON as:
{
"courseId": 1,
"sessionId": 1000
}
then a CourseSessionId-keyed map would serialize to JSON as:
{
"(courseId~1,sessionId~1000)": true
}
Here, (courseId~1,sessionId~1000) is the InlineStringCodec serialization of the CourseSessionId.
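The key format in this example can be imitated with a small sketch. This is not Courier's InlineStringCodec: `RecordKeySketch` and its methods are hypothetical, cover only this one flat record with Int fields, and do no escaping or nesting.

```scala
// Hypothetical sketch: mimics the key format shown above for a flat record
// with Int fields. It is NOT Courier's InlineStringCodec and handles no
// escaping, nesting, or other types.
object RecordKeySketch {
  final case class CourseSessionId(courseId: Int, sessionId: Int)

  // CourseSessionId(1, 1000) -> "(courseId~1,sessionId~1000)"
  def encode(id: CourseSessionId): String =
    s"(courseId~${id.courseId},sessionId~${id.sessionId})"

  // Regex extractors match the whole string, so partial matches are rejected.
  private val KeyPattern = """\(courseId~(-?\d+),sessionId~(-?\d+)\)""".r

  // Returns None when the string does not match the expected key format
  // or a field does not fit in an Int.
  def decode(key: String): Option[CourseSessionId] = key match {
    case KeyPattern(course, session) =>
      scala.util.Try(CourseSessionId(course.toInt, session.toInt)).toOption
    case _ => None
  }
}
```

Returning an Option from `decode` is one possible design; it lets a reader decide whether to fail or skip when a key written by another binding does not parse.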
Data written using the Avro and Pegasus representations of the schemas may contain keys that are incompatible with the generated Courier bindings, since the Courier bindings expect each key string to contain data that InlineStringCodec can deserialize to a specific type.
For example, even if a map is keyed by "int", a writer using the Java Pegasus bindings might write:
{ "newKey": "value" }
It does so because it is not aware that the map keys are supposed to be ints.
This is not a new problem; it has existed all along. When using string-keyed maps, it is common for readers and writers to treat the string as "dynamically typed". A reader might expect the keys to be of a particular format (e.g. an int) and fail or reject the data if a key is not deserializable to the expected format. The addition of the "keys" field simply formalizes this expectation that readers and writers already have and provides a uniform way of expressing it.
All possible map types for primitive types will be generated in Predef. There are seven primitive types, so in theory we will need to generate 49 predef map classes. This is a bit unfortunate. If we want, we can be selective and not generate predef maps keyed, for example, by the bytes type. But given that the set of primitive types is well defined and unlikely to change, we may just generate all 49 and be done with it.
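The count of 49 follows from pairing each of the seven primitive types with each of the seven as key and value. A rough sketch of the naming rule applied to those pairs is below; the list of type names and the simple capitalization are assumptions for illustration, and the names the generator actually emits (for bytes in particular) may differ.

```scala
// Sketch of the naming rule applied to the primitive types. The type list
// and capitalization are assumptions for illustration, not generator output.
object PredefMapNames {
  val primitives: Seq[String] =
    Seq("int", "long", "float", "double", "boolean", "string", "bytes")

  // <Key>To<Value>Map, except string-keyed maps, which stay <Value>Map.
  def className(key: String, value: String): String =
    if (key == "string") s"${value.capitalize}Map"
    else s"${key.capitalize}To${value.capitalize}Map"

  // Every key/value pairing of the seven primitives: 7 * 7 = 49 names.
  val all: Seq[String] =
    for (k <- primitives; v <- primitives) yield className(k, v)
}
```

Dropping a key type such as bytes from `primitives` on the key side would remove seven of the 49 classes.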
We will upgrade our validation systems to be aware of map key types. They will return appropriate validation errors for malformed keys.
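As a rough illustration of what key-aware validation could look like for an int-keyed map, consider the sketch below. `KeyValidationSketch`, `validateIntKeys` and the error message format are hypothetical and do not reflect Courier's validation API.

```scala
// Hypothetical sketch of key validation for an int-keyed map; the names and
// error message format are illustrative, not Courier's validation API.
object KeyValidationSketch {
  // Returns one error message per key that cannot be read as an Int.
  // Keys are sorted so the order of reported errors is deterministic.
  def validateIntKeys(data: Map[String, Any]): Seq[String] =
    data.keys.toSeq.sorted.flatMap { key =>
      scala.util.Try(key.toInt).toOption match {
        case Some(_) => None
        case None    => Some(s"""Malformed map key "$key": expected int""")
      }
    }
}
```

Collecting every malformed key, rather than failing on the first, is a design choice that makes it easier to diagnose data written by a binding unaware of the "keys" field.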