
Scope of JSON Schema Validation

Austin Wright edited this page Dec 16, 2019 · 12 revisions

This page fleshes out some of the considerations on what is in scope for JSON Schema Validation, and relationships to other features of JSON Schema.

Validation vs. linting

Across the range of documents that can possibly be parsed, there's a variety of passes that may need to be done over the document in order to parse it for information.

In the simplest form, a state machine consumes each character of a document. At the end of the document, the state of the parser will provide the application with the data necessary for it to do its job. If there were no errors—no illegal characters encountered—the document is considered valid.

However, there may be multiple ways to encode the same data, from the standpoint of the application. For example, JavaScript and JSON can encode strings in multiple ways: "A" and "\u0041" form the same string, and there is no difference to the running application (unless the application is introspecting its own source code, which is not recommended). If the application developer wishes to express a preference for one form or the other, this can be tested with a linter.

JSON Schema does not offer a way to distinguish between or require one form or the other, since according to the application, the data is the same. This is instead a task for a linter.
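The equivalence is easy to confirm with a standard JSON parser; a minimal sketch in Python:

```python
import json

# Two different source encodings of the same string value.
literal = json.loads('"A"')
escaped = json.loads('"\\u0041"')

# From the application's point of view the data is identical,
# so JSON Schema cannot (and should not) tell them apart.
print(literal == escaped)  # True
```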

Types of Validation

Validation ensures that the document (including its data) is in a form suitable for consumption by the application. It can be broken down into two broad categories:

Syntactic validation ensures that the document parser can reach the end of the document while fully and unambiguously understanding its syntax.

Semantic Validation ensures that the data within the document is within the boundaries that the application will understand.

Since JSON Schema only operates on a valid JSON document, syntactic validation is implicitly part of JSON Schema validation.
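For example, with Python's standard json module, a syntactic error stops processing before any schema could even be consulted (a sketch):

```python
import json

try:
    json.loads('{"name": }')  # illegal token: no value after the colon
except json.JSONDecodeError as e:
    print("syntactic error:", e.msg)

# Only a document that parses cleanly proceeds to semantic
# (schema-based) validation.
data = json.loads('{"name": "Alice"}')
```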

Types of Semantic Validation

Structural

I believe this term was invented by JSON Schema.

In short, structural validation is concerned with describing the outermost limits of what a document must conform to, including values that might only become sensible in the future. For example, Elbonia is not a country right now, but it is feasible that a country named Elbonia could form some time in the future. Therefore, it would not make sense to hard-code a list of countries as part of structural validation.

Formally, structural validation is concerned with placing assertions on a single value. For example, "Value is greater than zero", "Value is an array", "Value is nonempty", or "Value has a property with the key name".
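Those four assertions map onto standard JSON Schema keywords (exclusiveMinimum, type, minItems, required). As a sketch, here is a tiny hand-rolled evaluator for just those keywords; it is illustrative, not a real JSON Schema implementation:

```python
def satisfies(value, schema):
    """Evaluate only exclusiveMinimum, type:"array", minItems,
    and required against a single value (a sketch)."""
    if "exclusiveMinimum" in schema:
        if not (isinstance(value, (int, float)) and value > schema["exclusiveMinimum"]):
            return False
    if schema.get("type") == "array" and not isinstance(value, list):
        return False
    if "minItems" in schema:
        if not isinstance(value, list) or len(value) < schema["minItems"]:
            return False
    if "required" in schema:
        if not isinstance(value, dict) or any(k not in value for k in schema["required"]):
            return False
    return True

print(satisfies(3, {"exclusiveMinimum": 0}))             # True
print(satisfies([], {"minItems": 1}))                    # False
print(satisfies({"name": "x"}, {"required": ["name"]}))  # True
```

Note that every check above looks at one value in isolation, which is exactly the boundary of structural validation.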

Consistency

Consistency validation ensures that references between data make sense. For example, "Value is an ID string found in the users database", "Value A is less than value B", or "Value is an ID number that must be distinct from other ID numbers".

There are multiple ways to verify data consistency: far too many mechanisms to incorporate into JSON Schema. Validating data consistency may involve:

  • Scanning the rest of the document for a referenced value
  • Making a network request for a referenced document
  • Querying a database for a record with the given value
  • Sending a message to the referenced address, and verifying it was received

In short, data consistency is tested by actually exercising the intended action. For looking up a user, you must actually query the users database. For testing an email address, you must actually send an email.
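As a sketch (Python's stdlib sqlite3, with a made-up users table), a consistency check that actually queries the store:

```python
import sqlite3

# A consistency check must exercise the real action: to verify
# "value is an ID found in the users database", query that database.
# The in-memory database and table layout are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
db.execute("INSERT INTO users VALUES ('u-1')")

def user_exists(user_id):
    row = db.execute("SELECT 1 FROM users WHERE id = ?", (user_id,)).fetchone()
    return row is not None

print(user_exists("u-1"))    # True
print(user_exists("u-999"))  # False
```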

I/O bound tests vs. compute-bound tests

An important performance characteristic of JSON Schema validation is that it can run synchronously, without relying on I/O such as networking or the filesystem. Excluding data consistency tests from the scope of JSON Schema validation has the effect of excluding tests that rely on I/O, such as tests that must query a database to verify that a referenced ID exists.

Not all data consistency tests rely on I/O, however. Many applications simply wish to test that an ID is defined in another part of the same document. Even though this case would be compute-bound, it is still outside the scope of JSON Schema validation for several reasons:

  1. JSON Schema validation does not assume that the entire JSON document is buffered in memory. JSON Schema validation is designed to be compatible with streaming parsers, which retain only limited state.

  2. Supporting all the different ways that people can implement same-document references would still be extremely complicated. This is functionality best left to an actual scripting language (like Lua or ECMAScript), rather than re-designing a new language in JSON.

  3. JSON Schema should not express a preference on the best way to validate references to data. The mere existence of such a keyword may encourage people to change the structure of their JSON documents in order to use it. It would be unfortunate if an application started embedding more data into its JSON documents just because JSON Schema supported compute-bound tests but not I/O-bound tests.
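For contrast, a same-document reference check is straightforward ordinary application code, which is one reason to leave it there. The document shape below is hypothetical:

```python
# Verify that every "author" field refers to a declared user id.
doc = {
    "users": [{"id": "u-1"}, {"id": "u-2"}],
    "posts": [{"author": "u-1"}, {"author": "u-3"}],
}

known_ids = {user["id"] for user in doc["users"]}
dangling = [post["author"] for post in doc["posts"]
            if post["author"] not in known_ids]

# A consistency error that structural validation cannot see.
print(dangling)  # ['u-3']
```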

Inference

The primary job of an inference engine is to determine additional data about a resource based on what is known. Inference typically has some way to perform validation, which checks that there are no contradictions between pieces of data. For example, an assertion that a resource cannot be both a Cat and a Dog.

This is the type of validation used by RDF Schema and OWL.

Context

Different applications might have different validation constraints at different stages of the application. A few examples:

Regular Expressions

Strings within JSON Schema may be required to match a regular expression. Validating a string against a regular expression is a kind of syntactic validation: the matcher keeps state, and earlier characters may change the outcome of parsing for later characters.

This is a warranted feature in JSON Schema because there will always be data inside a document that can be broken apart into multiple values but is not well represented by JSON. For example, a date has year, month, and day components. There are already good standards for encoding dates as strings, so inventing a JSON encoding for dates is not really necessary.
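A sketch of the kind of check the pattern keyword performs, here a rough shape test for an RFC 3339 full-date:

```python
import re

# Deliberately loose: this checks only the YYYY-MM-DD shape, so it
# would accept impossible dates like "2019-99-99". Tighter patterns
# (or "format": "date") are possible, but this is the structural idea.
full_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")

print(bool(full_date.match("2019-12-16")))   # True
print(bool(full_date.match("Dec 16, 2019")))  # False
```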

RPC

RPC calls for data update and retrieval are one case where the same resource might have different schemas applied to it at different points in its life-cycle. At creation time, the resource might exclude an ID, since it is to be assigned by the server. But for updates, an ID might be mandatory, so the server knows which document in the collection to update. These technically call for two separate schemas.
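A sketch of those two schemas, plus a minimal check of just the required keyword; the payload shapes are illustrative:

```python
# The same resource gets two schemas at different life-cycle stages:
# on create, the server assigns "id", so the payload must not carry
# one; on update, "id" is mandatory to locate the record.
create_schema = {
    "type": "object",
    "required": ["name"],
    "properties": {"id": False},  # boolean false schema: "id" may not appear
}
update_schema = {
    "type": "object",
    "required": ["id", "name"],
}

def missing_required(doc, schema):
    """Check only the "required" keyword (a sketch, not a validator)."""
    return [key for key in schema.get("required", []) if key not in doc]

print(missing_required({"name": "Widget"}, update_schema))  # ['id']
```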

Related article: https://martinfowler.com/bliki/CQRS.html

Annotations

JSON Schema can still be used to declare relationships between data; verifying that the data is consistent falls outside the scope of a typical validator, and onto applications that must look for and process declared data relationships.

For example, a form generation library might look at a "range" property, and use it to auto-complete usernames from the database of legal names.