A few changes to support complex pickle files #22

rvolgers · 2022-08-12T12:07:39Z

I was trying to load some pretty non-Rust-friendly pickle files with this crate (using the Value API) and, while I've mostly come to the conclusion that this crate is not really built for that, I've made some improvements which might be worth upstreaming. Consider this a request for comments on whether any of this is useful. I can split it up or make changes as needed.

This adds options which enable loading pickle files with recursive data structures (through an option to return a None when recursing instead of causing an error) and to support custom subclasses of dict, list, and set.

Custom subclasses of dict, list, and set are detected by looking for dict, list or set operations performed on an object. These objects are returned as a tuple (object_attributes, superclass_data) or, if you set an option, the object attributes will be discarded and only the superclass data will be returned. This option makes it possible to mostly treat (e.g.) subclasses of list as just lists, with the caveat that if the list is empty the object's attributes may be returned in the form of a dict instead, as we have no way to detect that an object is a list if no list operations are performed on it.

Each commit should be mostly self-contained, except that the first commit (Decode collections.defaultdict as dict) is reverted and superseded by the commit after that, since the second commit contains a different, more general solution to the same problem. I kept it around in case the more tricky solution is rejected. (Edit: this was true before a number of followup bugfix commits. Instead of rewriting history I've left them as-is, but I can fix that if wanted.)

The final object should be based on the current instance, what the old code calls the "standin". Using the state to update this object should not touch `.ext`. As the FIXME notes, this probably still wrong as we should be updating the memoized instance by reference, not updating a cloned version of it. However, this is at least "more correct" and I need a testcase for that first.

birkenfeld · 2022-08-16T05:10:40Z

Thanks for the PR! I'll have to delay the complete review for a bit, but I like the idea of compatibility with more complex pickles.

One nitpick from first glance: the version bump should be in the minor version, since new APIs are added.
Also, I've enabled the CI, and there seems to be a failing test now, which means that the default behavior has changed incompatibly?

rvolgers · 2022-08-16T09:39:08Z

Hey, thanks for taking a look!

I don't like that test, it checks that something fails which now succeeds. Should people be relying on failure as part of the API? The code already has a "do the best we can" attitude, so I would say no, but let me know what you think.

The version bump is just a quick hack. I noticed that cargo will fail to use a source override sometimes if the override and the crates.io one have the same version.

There is one thing which I was not quite sure about, so I want to make sure it's clear. The new code tries to detect the type of an object by looking at the operations performed on it. But if it e.g. a subclass of dict but the dict is empty, there are no dict operations performed, and thus the new code will not detect it as a dict subclass. This leads to the object being returned as a "normal" object, meaning its attributes are returned. The user of this library cannot tell the difference if unwrap_subclassed_data_structures is used.

This seems like a bit of a fundamental problem with the unwrap_subclassed_data_structures setting, so maybe it should be removed.

I actually have a followup PR which allows to specify the way to handle various types ahead of time in the options passed to the library. This avoids the ambiguity of auto-detection, but there are more complications and I'm not sure it's worth it.

landaire · 2023-06-01T06:37:53Z

This PR's been open for a while and while I haven't tested these changes, support for recursive structures would be very useful for a data format I'm parsing.

Edit: I actually just found that there must be a bug somewhere with the memo ref count. I took @rvolgers branch and did some experimenting -- I no longer get errors relating to recursive objects when I disable this condition (always keep the item in the memo):

serde-pickle/src/de.rs

Line 603 in 13c0cd6

if count <= 0 {

rvolgers added 8 commits August 11, 2022 20:44

Decode collections.defaultdict as dict

592674c

Collect extended object info

9e84cbf

Add option to replace recursive objects by None

3cd5f20

Add option to unwrap subclassed data structures

eca5b87

Fixup: remove leftover debugging code

fc5c79a

Fixup: add explanatory comments

d419572

Fixup: resolve the FIXME in the last commit

d773cee

landaire mentioned this pull request Jun 3, 2023

Fix (some?) recursive objects by disabling memo ref counting #24

Open

razaqq mentioned this pull request Oct 28, 2023

How to deal with recursive structures #25

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A few changes to support complex pickle files #22

A few changes to support complex pickle files #22

rvolgers commented Aug 12, 2022 •

edited

Loading

birkenfeld commented Aug 16, 2022

rvolgers commented Aug 16, 2022 •

edited

Loading

landaire commented Jun 1, 2023 •

edited

Loading

A few changes to support complex pickle files #22

Are you sure you want to change the base?

A few changes to support complex pickle files #22

Conversation

rvolgers commented Aug 12, 2022 • edited Loading

birkenfeld commented Aug 16, 2022

rvolgers commented Aug 16, 2022 • edited Loading

landaire commented Jun 1, 2023 • edited Loading

rvolgers commented Aug 12, 2022 •

edited

Loading

rvolgers commented Aug 16, 2022 •

edited

Loading

landaire commented Jun 1, 2023 •

edited

Loading