Make maximum depth configurable via CLI option #3063
base: master
Conversation
Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).
Should add CLI documentation to manual.yml
src/execute.c
```c
void jq_set_parser_maxdepth(jq_state *jq, int parser_maxdepth)
{
  jq->parser_maxdepth = parser_maxdepth;
}
```
Missing new lines?
Whoops, added it in 586e96d! (Would you want a PR that adds this as a CI lint?)
Maybe that would be good; I like linters. Let's see what the other maintainers think.
In 3b0d520.
Thanks! I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector. It might make sense to recommend the
docs/content/manual/manual.yml
```diff
@@ -269,6 +269,13 @@ sections:
   available in the program and has a string whose contents are to the texts
   in the file named `bar`.
 
+  * `--depth n`:
+
+    This option sets the maximum parsing depth (of objects and lists) to `n`.
```
maybe array instead of lists?
Fixed in c4e6358.
👍
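As an aside on the `--depth n` manual entry above, a hypothetical invocation could look like this (the filter and file name are placeholders, not taken from this PR):

```sh
# Raise the maximum parsing depth to 1024 for a trusted, deeply nested input.
jq --depth 1024 '.' deeply-nested.json
```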
Ok, let's see what @nicowilliams thinks
Yeap, sounds like a good idea
You probably want to do
BTW https://greenberg.science/ layout is 🥇😄
Do y'all have a sense of the timeline for this merging (and the next release)? (Not asking to rush you, just for planning on our end.)
@mgree Hey, I pinged the other maintainers on Discord yesterday about this PR and some other ones, so hopefully they review it soon. As it adds a new CLI argument, it's probably good that at least two others approve it.
Looks good to me. The only non-blocking thing for me is whether the default should be unbounded instead.
Some basic scripts for analyzing queries from `bin/mzexplore extract`. These scripts will fail on extracted JSON ASTs without a patched version of jq <jqlang/jq#3063>.
Looking at the issue @mgree ran into in MaterializeInc/materialize#25972 with the 256 depth limit, I would say it's a good argument for increasing it or just making it unbounded by default.
For Materialize's use case, we would want no depth limit: we're looking at non-adversarial inputs that we know to be deeply nested (ASTs with a particularly nested JSON representation). But we're happy to use `--depth`.

A quick experiment shows it's probably pretty safe to have no limit by default from a time-performance perspective, but maybe not from a memory or DoS perspective. Objects and lists nested 262,144 levels deep (or deeper) segfault. At a minimum, I'd guess these segfaults should be caught and turned into proper aborts. At worst, maybe they're bugs? (Maybe with printing? Because a lot of work seems to happen before things crash...) It might be kindest to users to have a maximum depth by default, so that users don't discover the hard way that adversarial files can use lots of memory and/or crash `jq`.

Here are scripts for making deep JSON objects and lists (EDIT 2024-04-03: noticed a bug in deep_list.sh 🙄):

```sh
#!/bin/sh
# deep_obj.sh
jot -b '{"a":' -s '' $1 >obj$1.json
echo 0 >>obj$1.json
jot -b '}' -s '' $1 >>obj$1.json
```

```sh
#!/bin/sh
# deep_list.sh
jot -b '[' -s '' $1 >list$1.json
echo 0 >>list$1.json
jot -b ']' -s '' $1 >>list$1.json
```

And here's a run on my not-fancy macOS laptop (EDIT 2024-04-03: re-ran with the bug fixed; it didn't change the overall results):

```
$ for x in 257 1024 $((1024 * 64)) $((1024 * 256)) $((1024 * 1024))
do
  ./deep_list.sh $x
  ./deep_obj.sh $x
  echo \# $x
  echo \#\# list
  command time -l jq --depth 0 . list$x.json >/dev/null
  echo \#\# obj
  command time -l jq --depth 0 . obj$x.json >/dev/null
done
# 257
## list
0.01 real 0.00 user 0.00 sys
2101248 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
722 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
7 involuntary context switches
76583002 instructions retired
35534464 cycles elapsed
1290240 peak memory footprint
## obj
0.01 real 0.00 user 0.00 sys
2007040 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
701 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
14 involuntary context switches
77345793 instructions retired
35134742 cycles elapsed
1191936 peak memory footprint
# 1024
## list
0.01 real 0.00 user 0.00 sys
2322432 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
776 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
20 involuntary context switches
77963756 instructions retired
37622876 cycles elapsed
1273856 peak memory footprint
## obj
0.01 real 0.00 user 0.00 sys
2375680 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
789 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
12 involuntary context switches
79654892 instructions retired
36269047 cycles elapsed
1228800 peak memory footprint
# 65536
## list
0.03 real 0.02 user 0.01 sys
24293376 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
6139 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
37 involuntary context switches
199628394 instructions retired
114804163 cycles elapsed
23146496 peak memory footprint
## obj
0.05 real 0.03 user 0.01 sys
36016128 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
9003 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
32 involuntary context switches
289287462 instructions retired
161012325 cycles elapsed
32870400 peak memory footprint
# 262144
## list
time: command terminated abnormally
0.07 real 0.05 user 0.02 sys
86921216 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21431 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
63 involuntary context switches
387970009 instructions retired
252783614 cycles elapsed
85774336 peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
0.13 real 0.09 user 0.04 sys
138883072 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
35396 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
63 involuntary context switches
712225649 instructions retired
442663148 cycles elapsed
126955520 peak memory footprint
Segmentation fault: 11
# 1048576
## list
time: command terminated abnormally
0.26 real 0.17 user 0.08 sys
327397376 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
82700 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
82 involuntary context switches
1311119871 instructions retired
847893675 cycles elapsed
321007616 peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
0.46 real 0.30 user 0.15 sys
514150400 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
130856 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
199 involuntary context switches
2446990455 instructions retired
1517820450 cycles elapsed
502243328 peak memory footprint
Segmentation fault: 11
```
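A portability note on the generator scripts above: `jot` is a BSD utility that isn't available by default on most Linux systems. A rough `awk`-based equivalent of `deep_obj.sh` might look like the sketch below (my own hypothetical variant, not part of this PR):

```sh
#!/bin/sh
# deep_obj_awk.sh -- hypothetical awk-based stand-in for deep_obj.sh above.
# Usage: ./deep_obj_awk.sh N   (writes objN.json nested N levels deep)
n=$1
awk -v n="$n" 'BEGIN {
  for (i = 0; i < n; i++) printf "{\"a\":"   # n opening object wrappers
  printf "0"                                 # innermost value
  for (i = 0; i < n; i++) printf "}"         # n closing braces
  printf "\n"
}' > "obj$n.json"
```

The same pattern with `[` and `]` reproduces `deep_list.sh`.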
Hey, I wanted to check in again. Would you like me to file correctness issues for these segfaults? Is there anything I can do to help with the review of this patch?
It's not sufficient to just add the state to the parser, since we might call `jv_load_file` from a variety of other functions. A few helper functions defer to `DEFAULT_MAX_PARSING_DEPTH`, as they are given neither a `jq_state` nor a `jv_parser`. Added tests in `tests/shtest` to confirm that the maximum depth limit is appropriately enforced.
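For reference, here is a minimal sketch of the kind of check such a shell test could perform. This is hypothetical and not copied from `tests/shtest`; the `$JQ` variable, the exit-status behavior, and the exact depth accounting are assumptions on my part:

```sh
# A 3-deep object should be rejected with --depth 2 but accepted with --depth 3.
deep='{"a":{"b":{"c":0}}}'
if echo "$deep" | $JQ --depth 2 . >/dev/null 2>&1; then
  echo "expected --depth 2 to reject a 3-deep object" 1>&2
  exit 1
fi
echo "$deep" | $JQ --depth 3 . >/dev/null || exit 1
```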
@mgree force-pushed the branch from c4e6358 to e3ba70a, and later from e3ba70a to 6c6e47f.
@mgree force pushed: rebase on master that fixes the conflict in manual.yml
@itchyny @emanuele6 Looks OK to merge? Fixes #2846, which is on the 1.8 milestone. The only thing left for me would be whether we should increase the default.
Addresses #2846. (I didn't see any contributor guidelines/linting info... please let me know if you'd like me to change anything!)

This PR adds state to two places to track the maximum parser depth: in `jq_state` and in `jv_parser`. It might seem like `jv_parser` is enough on its own, but many other parts of the code parse values holding only a `jq_state`. It's not clear to me that these aren't mostly/entirely raw parses (which wouldn't need to track depth), but it seemed easiest to track the state thoroughly.

There are a few helper functions at the end of `jv_parse.c` that have no handle at all: neither a `jq_state` nor a `jv_parser`. I had these just use the `DEFAULT_MAX_PARSING_DEPTH`. They are mostly for fuzzing, it seems, but `jv_parse` gets called in `main` during option processing. It wouldn't be hard to have this use the current `parser_maxdepth`.
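To sketch how the option is expected to behave from the command line (the file names are placeholders, and treating `--depth 0` as "no limit" with 256 as the default is an inference from the discussion and benchmark above, not a statement of the final semantics):

```sh
jq . input.json                    # default: maximum parsing depth of 256
jq --depth 0 . very-deep.json      # no depth limit; only for trusted input
jq --depth 4096 . very-deep.json   # explicit limit of 4096 nesting levels
```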