Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for deserializing list-encoded JSON structs [#6558] #6643

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jagill
Copy link

@jagill jagill commented Oct 29, 2024

Which issue does this PR close?

Closes #6558.

Rationale for this change

Currently, a StructArray can only be deserialized from a JSON object (e.g. {a: 1, b: "c"}), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. [1, "c"]) because this is more compact, and the schema is known.
Arrow-json cannot currently deserialize these.

What changes are included in this PR?

This PR adds the ability to parse JSON lists into StructArrays, if the StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior.

Are there any user-facing changes?

Users may set the StructParsingMode enum to ListOnly to parse list-style structs. The associated enum,
variants, and method have been documented. I'm happy to update any other documentation.

Discussion topics

  1. I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.
  2. I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.
  3. I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.
  4. One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 29, 2024
Currently, a StructArray can only be deserialized from a JSON object
(e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino)
encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more
compact, and the schema is known.

This PR adds the ability to parse JSON lists into StructArrays, if the
StructParseMode is set to ListOnly.  In ListOnly mode, object-encoded
structs raise an error.  Setting to ObjectOnly (the default) has the
original parsing behavior.

Some notes/questions/points for discussion:
1. I've made a JsonParseMode struct instead of a bool flag for two
   reasons.  One is that it's self-descriptive (what would `true` be?),
   and the other is that it allows a future Mixed mode that could
   deserialize either.  The latter isn't currently requested by anyone.
2. I kept the error messages as similar to the old messages as possible.
   I considered having more specific error messages (like "Encountered a
   '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or
   similar), but wanted to hear opinions before I went that route.
3. I'm not attached to any name/code-style/etc, so happy to modify to
   fit local conventions.

Fixes apache#6558
@tustvold
Copy link
Contributor

tustvold commented Nov 8, 2024

Sorry this is on my list to look into, but I've struggled to find time to look into it yet... I appreciate your patience and restraint from repeatedly tagging me (unlike some people are wont to do).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow JSON deserialization of StructArray from JSON List
2 participants