JSON extractor

Extractor module provides circuits to generate proofs of arbitrary values in a JSON file. To achieve this, proof generation is broken into following components:

parser: state parser based on a stack machine
interpreter: high-level interpretation of JSON state
codegen: extractor circuit generation
extractor: extracting value for a specific key inside a JSON

Parser

Parser is divided into three files:

Language: JSON language syntax
Parser: initialises the parser and parse individual bytes
Machine: stack machine responsible for updating state

State of JSON parser consists of:

stack with a maximum MAX_STACK_HEIGHT argument
parsing_string
parsing_number

Let's take a simple example: { "k": "v" }. Parser initialises the stack with [0, 0] and starts iterating through each byte one-by-one.

0: detects START_BRACKET: {. so, we're inside a key and updates stack to [1, 0]
3: detects a QUOTE:" and toggles parsing_string to 1
4: detects another QUOTE and toggles parsing_string back to 0
5: detects COLON and updates stack to [1, 1] which means we're now inside a value
7: detects a QUOTE again and toggles parsing_string which is toggled back on 9
11: detects CLOSING_BRACKET: } and resets stack back to [0, 0]

State[ 0 ].byte =  123
State[ 0 ].stack[ 0 ]     = [ 1 ][ 0 ]
State[ 0 ].parsing_string =  0
State[ 0 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 1 ].byte =  32
State[ 1 ].stack[ 0 ]     = [ 1 ][ 0 ]
State[ 1 ].parsing_string =  0
State[ 1 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 2 ].byte =  34
State[ 2 ].stack[ 0 ]     = [ 1 ][ 0 ]
State[ 2 ].parsing_string =  1
State[ 2 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 3 ].byte =  107
State[ 3 ].stack[ 0 ]     = [ 1 ][ 0 ]
State[ 3 ].parsing_string =  1
State[ 3 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 4 ].byte =  34
State[ 4 ].stack[ 0 ]     = [ 1 ][ 0 ]
State[ 4 ].parsing_string =  0
State[ 4 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 5 ].byte =  58
State[ 5 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 5 ].parsing_string =  0
State[ 5 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 6 ].byte =  32
State[ 6 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 6 ].parsing_string =  0
State[ 6 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 7 ].byte =  34
State[ 7 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 7 ].parsing_string =  1
State[ 7 ].parsing_number =  0
mask 34
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 8 ].byte =  118
State[ 8 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 8 ].parsing_string =  1
State[ 8 ].parsing_number =  0
mask 118
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 9 ].byte =  34
State[ 9 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 9 ].parsing_string =  0
State[ 9 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 10 ].byte =  32
State[ 10 ].stack[ 0 ]     = [ 1 ][ 1 ]
State[ 10 ].parsing_string =  0
State[ 10 ].parsing_number =  0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 11 ].stack[ 0 ]     = [ 0 ][ 0 ]
State[ 11 ].parsing_string =  0
State[ 11 ].parsing_number =  0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
value_starting_index 7
value[ 0 ]= 118

Logic for parser:

Iterate through each byte.
Determine current byte token, and form instruction.
- determine what is the current byte in signals: start brace, end brace, start bracket, end bracket, colon, comma, number, quote, other (whitespace)
- create an instruction by multiplying arrays together
form a state mask based on current state
multiply instruction and mask together to calculate whether reading or writing value to stack.
rewrite stack using new instruction
- stack[0] can change when pushing (read start brace or bracket) / popping (read end brace or bracket)
- stack[1] can change when readColon / readComma

Let's deep dive into interpreter and extractor.

Interpreter

Interpreter builds high-level circuits on top of stack to understand state better. It provides following templates:

InsideKey
InsideValueAtTop & InsideValue
InsideArrayIndexAtTop & InsideArrayIndex
NextKVPair & NextKVPairAtDepth
KeyMatch & KeyMatchAtDepth

Codegen

To handle arbitrary depth JSON key, we need to generate circuits on-the-fly using some metadata.

{
    "keys": [
        "a"
    ],
    "value_type": "string"
}

Each new key in keys is associated with depth in parser stack, i.e. key a has depth 0, and the value type of a is a string. Using this, a rust program generates circuit that can extract any key at depth 0 (and not just key a) whose value type is a string.

Extractor

To extract a key at specific depth and value type, we provide

arguments:

DATA_BYTES: data length in bytes
MAX_STACK_HEIGHT: maximum stack height possible during parsing of data. Equal to maximum open brackets {, [ in data.
keyLen{i}: ith key length in bytes, if key is a string
index{i}: ith key array index
depth{i}: ith key's stack depth
maxValueLen: maximum value length

inputs:

data: data in bytes array of DATA_BYTES length
key{i}: key i in bytes array of keyLen{i} length

output:

value: value of the specified key

Extractor performs following operations:

parse data byte-by-byte using parser
use interpreter to gather more information on current state, i.e. whether we're parsing key or value
if parsing_key, then it matches each key in is_key{i}_match signal
if parsing_value, then it checks whether we're inside correct values at each depth, i.e.
- if the key looks like a.0.b.0 then, value of stack at depth 0 should be [1, 1], and depth 1 should be [2, 0], and so on.
if the key matches, then we need to propogate this result to the value of that key.
- We use interpreter's NextKVPair template to determine when we start parsing next key pair again in is_next_pair_at_depth{i}
In previous example,
- key match (byte = 107) happened at state 3. so we toggle is_key1_match_for_value[3] true.
- At state 4, is_key1_match[4] will return false, but, since we're not parsing next key pair again, we want is_key1_match_for_value[4]=true as well.
- So, we just use previous index's is_key1_match_for_value value, i.e. is_key1_match_for_value[4] = is_key1_match_for_value[3] * is_next_pair[4]
- as soon as we hit next pair, we toggle this bit again to 0, and wait for key match again.
To extract the value, we create a mask around that value.
- mask[i] = data[i] * parsing_value[i] * is_value_match[i], i.e. we're inside the correct value and the key matched for this value.
Then, we just shift data by value_starting_bytes to the left and truncate data length to maxValueLen.

We encourage you to look at tests, if you need deeper understanding of examples.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json.md

json.md

JSON extractor

Parser

Interpreter

Codegen

Extractor

Files

json.md

Latest commit

History

json.md

File metadata and controls

JSON extractor

Parser

Interpreter

Codegen

Extractor