Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider lazily deserializing non-scalar fields #713

Open
osa1 opened this issue Jul 22, 2022 · 7 comments
Open

Consider lazily deserializing non-scalar fields #713

osa1 opened this issue Jul 22, 2022 · 7 comments
Labels
perf Related to runtime performance

Comments

@osa1
Copy link
Member

osa1 commented Jul 22, 2022

A colleague asked about this and I think it can be a good idea, and I don't know if this was considered before.

Message, map, packed repeated, and string fields can be lazily deserialized as their encoding have a length prefix so we can efficiently skip over those fields and copy the encodings to deserialize later.

(Instead of copying encodings, we could store a view into the original buffer, though that would keep the entire buffer alive until all of the lazy fields (including those in nested fields) are deserialized.)

This can improve app startup as we delay some of the deserialization costs to field access time.

Since _FieldSet already holds dynamic elements it shouldn't be too much worse to add another value class.

I wonder if this was considered before? @sigurdm @mraleph

@osa1 osa1 added the perf Related to runtime performance label Jul 22, 2022
@rakudrama
Copy link
Contributor

We did this for the web version of parsing JSON. I have mixed feelings about that change.
Some cases were faster, but throughput of round-trip deserialize-reserialize was worse.

Deserializing involves some validation. How would that be handled?
In the web libraries JSON example, validation had already been performed by JavaScript, and it was converting the JavaScript object tree to Dart structures that was deferred.

@osa1
Copy link
Member Author

osa1 commented Jul 29, 2022

Sorry I wasn't clear in my original message, but this is about binary deserialization.

For JSON it's a bit tricky because (1) we currently rely on dart:convert's JSON deserializer (2) even if we avoid intermediate JSON objects (#670) we need to scan the JSON string to find matching brackets. We can't skip over the input bytes as we can do with the binary format.

@osa1
Copy link
Member Author

osa1 commented Aug 4, 2022

I was checking C++ protobuf implementation unit tests and came across this: https://github.com/protocolbuffers/protobuf/blob/8d3a7327606b115dda871be54c8b8cb66d41ff90/src/google/protobuf/unittest.proto#L115-L116

It turns out some of the protobuf implementations support this already and they even added an option for it.

@sigurdm
Copy link
Collaborator

sigurdm commented Aug 4, 2022

If it was done as an explicit option I guess it would be good.

I have considered this in different contexts before - and to me the biggest gripe is that if the encoded message is broken validation errors now can occur on field access, instead of on first .fromBuffer-call leading to error-handling that is hard to get right (do you now need a try-catch around every field access?).

@osa1
Copy link
Member Author

osa1 commented Aug 4, 2022

I have considered this in different contexts before - and to me the biggest gripe is that if the encoded message is broken validation errors now can occur on field access, instead of on first .fromBuffer-call leading to error-handling that is hard to get right (do you now need a try-catch around every field access?).

The reference implementation apparently provides two options: one that eagerly validates, and one that lazily validates: https://github.com/protocolbuffers/protobuf/blob/8d3a7327606b115dda871be54c8b8cb66d41ff90/src/google/protobuf/descriptor.proto#L583-L619

  // Should this field be parsed lazily?  Lazy applies only to message-type
  // fields.  It means that when the outer message is initially parsed, the
  // inner message's contents will not be parsed but instead stored in encoded
  // form.  The inner message will actually be parsed when it is first accessed.
  //
  // This is only a hint.  Implementations are free to choose whether to use
  // eager or lazy parsing regardless of the value of this option.  However,
  // setting this option true suggests that the protocol author believes that
  // using lazy parsing on this field is worth the additional bookkeeping
  // overhead typically needed to implement it.
  //
  // This option does not affect the public interface of any generated code;
  // all method signatures remain the same.  Furthermore, thread-safety of the
  // interface is not affected by this option; const methods remain safe to
  // call from multiple threads concurrently, while non-const methods continue
  // to require exclusive access.
  //
  //
  // Note that implementations may choose not to check required fields within
  // a lazy sub-message.  That is, calling IsInitialized() on the outer message
  // may return true even if the inner message has missing required fields.
  // This is necessary because otherwise the inner message would have to be
  // parsed in order to perform the check, defeating the purpose of lazy
  // parsing.  An implementation which chooses not to check required fields
  // must be consistent about it.  That is, for any particular sub-message, the
  // implementation must either *always* check its required fields, or *never*
  // check its required fields, regardless of whether or not the message has
  // been parsed.
  //
  // As of May 2022, lazy verifies the contents of the byte stream during
  // parsing.  An invalid byte stream will cause the overall parsing to fail.
  optional bool lazy = 5 [default = false];

  // unverified_lazy does no correctness checks on the byte stream. This should
  // only be used where lazy with verification is prohibitive for performance
  // reasons.
  optional bool unverified_lazy = 15 [default = false];

Apparently these options are only supported for message-type fields, but I wonder if they're allowed for other fields? In principle, we can lazily parse any field that is encoded with a length prefix: maps, packed lists, messages, strings. Having these options on strings may be useful in Dart as protobuf strings are encoded in UTF-8 but Dart strings are UTF-16, so we could do the conversion lazily.

@sigurdm
Copy link
Collaborator

sigurdm commented Aug 4, 2022

Ah - so it is a per-field option. I would have thought it was something you decided on when parsing a message...

@osa1
Copy link
Member Author

osa1 commented Aug 4, 2022

I think deciding this when parsing might make sense, but I'm not sure what kind of API to provide for this to allow lazily deserializing e.g. field of a field of a field.

One potential problem with the per-field option is when I have multiple apps that use the same proto definition, like a server app and a few client apps. It's possible that different applications want different fields to be lazy.

I don't know if this is a common problem though. I guess in the worst case you could have a pre-processor to run on the proto definition to tweak lazy properties before generating the code for your application.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf Related to runtime performance
Projects
None yet
Development

No branches or pull requests

3 participants