Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A few changes to support complex pickle files #22

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

rvolgers
Copy link

@rvolgers rvolgers commented Aug 12, 2022

I was trying to load some pretty non-Rust-friendly pickle files with this crate (using the Value API) and, while I've mostly come to the conclusion that this crate is not really built for that, I've made some improvements which might be worth upstreaming. Consider this a request for comments on whether any of this is useful. I can split it up or make changes as needed.

This adds options which enable loading pickle files with recursive data structures (through an option to return a None when recursing instead of causing an error) and to support custom subclasses of dict, list, and set.

Custom subclasses of dict, list, and set are detected by looking for dict, list or set operations performed on an object. These objects are returned as a tuple (object_attributes, superclass_data) or, if you set an option, the object attributes will be discarded and only the superclass data will be returned. This option makes it possible to mostly treat (e.g.) subclasses of list as just lists, with the caveat that if the list is empty the object's attributes may be returned in the form of a dict instead, as we have no way to detect that an object is a list if no list operations are performed on it.

Each commit should be mostly self-contained, except that the first commit (Decode collections.defaultdict as dict) is reverted and superseded by the commit after that, since the second commit contains a different, more general solution to the same problem. I kept it around in case the more tricky solution is rejected. (Edit: this was true before a number of followup bugfix commits. Instead of rewriting history I've left them as-is, but I can fix that if wanted.)

The final object should be based on the current instance, what the old
code calls the "standin".

Using the state to update this object should not touch `.ext`.

As the FIXME notes, this probably still wrong as we should be updating
the memoized instance by reference, not updating a cloned version of it.
However, this is at least "more correct" and I need a testcase for that
first.
@birkenfeld
Copy link
Owner

Thanks for the PR! I'll have to delay the complete review for a bit, but I like the idea of compatibility with more complex pickles.

One nitpick from first glance: the version bump should be in the minor version, since new APIs are added.
Also, I've enabled the CI, and there seems to be a failing test now, which means that the default behavior has changed incompatibly?

@rvolgers
Copy link
Author

rvolgers commented Aug 16, 2022

Hey, thanks for taking a look!

I don't like that test, it checks that something fails which now succeeds. Should people be relying on failure as part of the API? The code already has a "do the best we can" attitude, so I would say no, but let me know what you think.

The version bump is just a quick hack. I noticed that cargo will fail to use a source override sometimes if the override and the crates.io one have the same version.

There is one thing which I was not quite sure about, so I want to make sure it's clear. The new code tries to detect the type of an object by looking at the operations performed on it. But if it e.g. a subclass of dict but the dict is empty, there are no dict operations performed, and thus the new code will not detect it as a dict subclass. This leads to the object being returned as a "normal" object, meaning its attributes are returned. The user of this library cannot tell the difference if unwrap_subclassed_data_structures is used.

This seems like a bit of a fundamental problem with the unwrap_subclassed_data_structures setting, so maybe it should be removed.

I actually have a followup PR which allows to specify the way to handle various types ahead of time in the options passed to the library. This avoids the ambiguity of auto-detection, but there are more complications and I'm not sure it's worth it.

@landaire
Copy link

landaire commented Jun 1, 2023

This PR's been open for a while and while I haven't tested these changes, support for recursive structures would be very useful for a data format I'm parsing.

Edit: I actually just found that there must be a bug somewhere with the memo ref count. I took @rvolgers branch and did some experimenting -- I no longer get errors relating to recursive objects when I disable this condition (always keep the item in the memo):

if count <= 0 {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants