Extractor module provides circuits to generate proofs of arbitrary values in a JSON file. To achieve this, proof generation is broken into following components:
- parser: state parser based on a stack machine
- interpreter: high-level interpretation of JSON state
- codegen: extractor circuit generation
- extractor: extracting value for a specific key inside a JSON
Parser is divided into three files:
- Language: JSON language syntax
- Parser: initialises the parser and parse individual bytes
- Machine: stack machine responsible for updating state
State of JSON parser consists of:
stack
with a maximumMAX_STACK_HEIGHT
argumentparsing_string
parsing_number
Let's take a simple example: { "k": "v" }
. Parser initialises the stack with [0, 0]
and starts iterating through each byte one-by-one.
0
: detectsSTART_BRACKET: {
. so, we're inside a key and updates stack to[1, 0]
3
: detects aQUOTE:"
and togglesparsing_string
to1
4
: detects anotherQUOTE
and togglesparsing_string
back to0
5
: detectsCOLON
and updates stack to[1, 1]
which means we're now inside a value7
: detects aQUOTE
again and togglesparsing_string
which is toggled back on9
11
: detectsCLOSING_BRACKET: }
and resets stack back to[0, 0]
State[ 0 ].byte = 123
State[ 0 ].stack[ 0 ] = [ 1 ][ 0 ]
State[ 0 ].parsing_string = 0
State[ 0 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 1 ].byte = 32
State[ 1 ].stack[ 0 ] = [ 1 ][ 0 ]
State[ 1 ].parsing_string = 0
State[ 1 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 2 ].byte = 34
State[ 2 ].stack[ 0 ] = [ 1 ][ 0 ]
State[ 2 ].parsing_string = 1
State[ 2 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 3 ].byte = 107
State[ 3 ].stack[ 0 ] = [ 1 ][ 0 ]
State[ 3 ].parsing_string = 1
State[ 3 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 4 ].byte = 34
State[ 4 ].stack[ 0 ] = [ 1 ][ 0 ]
State[ 4 ].parsing_string = 0
State[ 4 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 5 ].byte = 58
State[ 5 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 5 ].parsing_string = 0
State[ 5 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 6 ].byte = 32
State[ 6 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 6 ].parsing_string = 0
State[ 6 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 7 ].byte = 34
State[ 7 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 7 ].parsing_string = 1
State[ 7 ].parsing_number = 0
mask 34
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 8 ].byte = 118
State[ 8 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 8 ].parsing_string = 1
State[ 8 ].parsing_number = 0
mask 118
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 9 ].byte = 34
State[ 9 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 9 ].parsing_string = 0
State[ 9 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 10 ].byte = 32
State[ 10 ].stack[ 0 ] = [ 1 ][ 1 ]
State[ 10 ].parsing_string = 0
State[ 10 ].parsing_number = 0
mask 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
State[ 11 ].stack[ 0 ] = [ 0 ][ 0 ]
State[ 11 ].parsing_string = 0
State[ 11 ].parsing_number = 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
value_starting_index 7
value[ 0 ]= 118
Logic for parser:
- Iterate through each byte.
- Determine current byte token, and form instruction.
- determine what is the current byte in signals: start brace, end brace, start bracket, end bracket, colon, comma, number, quote, other (whitespace)
- create an instruction by multiplying arrays together
- form a state mask based on current state
- multiply instruction and mask together to calculate whether reading or writing value to stack.
- rewrite stack using new instruction
- stack[0] can change when pushing (read start brace or bracket) / popping (read end brace or bracket)
- stack[1] can change when readColon / readComma
Let's deep dive into interpreter and extractor.
Interpreter builds high-level circuits on top of stack to understand state better. It provides following templates:
InsideKey
InsideValueAtTop
&InsideValue
InsideArrayIndexAtTop
&InsideArrayIndex
NextKVPair
&NextKVPairAtDepth
KeyMatch
&KeyMatchAtDepth
To handle arbitrary depth JSON key, we need to generate circuits on-the-fly using some metadata.
{
"keys": [
"a"
],
"value_type": "string"
}
Each new key in keys
is associated with depth in parser stack, i.e. key a
has depth 0
, and the value type of a
is a string
.
Using this, a rust program generates circuit that can extract any key at depth 0 (and not just key a
) whose value type is a string.
To extract a key at specific depth and value type, we provide
arguments:
DATA_BYTES
: data length in bytesMAX_STACK_HEIGHT
: maximum stack height possible during parsing ofdata
. Equal to maximum open brackets{, [
in data.keyLen{i}
: ith key length in bytes, if key is a stringindex{i}
: ith key array indexdepth{i}
: ith key's stack depthmaxValueLen
: maximum value length
inputs:
data
: data in bytes array ofDATA_BYTES
lengthkey{i}
: key i in bytes array ofkeyLen{i}
length
output:
value
: value of the specified key
Extractor performs following operations:
- parse data byte-by-byte using parser
- use interpreter to gather more information on current state, i.e. whether we're parsing key or value
- if
parsing_key
, then it matches each key inis_key{i}_match
signal - if
parsing_value
, then it checks whether we're inside correct values at each depth, i.e.- if the key looks like
a.0.b.0
then, value of stack at depth0
should be[1, 1]
, and depth1
should be[2, 0]
, and so on.
- if the key looks like
- if the key matches, then we need to propogate this result to the value of that key.
- We use interpreter's
NextKVPair
template to determine when we start parsing next key pair again inis_next_pair_at_depth{i}
- We use interpreter's
- In previous example,
- key match (
byte = 107
) happened at state 3. so we toggleis_key1_match_for_value[3]
true. - At state 4,
is_key1_match[4]
will return false, but, since we're not parsing next key pair again, we wantis_key1_match_for_value[4]=true
as well. - So, we just use previous index's
is_key1_match_for_value
value, i.e.is_key1_match_for_value[4] = is_key1_match_for_value[3] * is_next_pair[4]
- as soon as we hit next pair, we toggle this bit again to 0, and wait for key match again.
- key match (
- To extract the value, we create a
mask
around that value.mask[i] = data[i] * parsing_value[i] * is_value_match[i]
, i.e. we're inside the correct value and the key matched for this value.
- Then, we just shift
data
byvalue_starting_bytes
to the left and truncatedata
length tomaxValueLen
.
We encourage you to look at tests, if you need deeper understanding of examples.