Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patsy pickles #104

Open
wants to merge 17 commits into
base: master
Choose a base branch
from
Open

Conversation

thequackdaddy
Copy link
Contributor

Hello,

This is a continuation effort of #86 and #67. My apologies if this is viewed as stepping on someone else's toes... but this is an interesting--and useful topic--for me.

I've basically got this working (with some simple tests on real data), but some things break some of the unit tests.

One problem is that as far as I can tell, doing a pickle.loads(pickle.dumps(obj)) will yield a different object, assigned to different memory than the original obj. Thus, the typical round-trip test obj == pickle.loads(pickle.dumps(obj)) fails, unless you just check that the obj.__dict__ matches. However, that breaks some things like... this line

I've also run across some issues with the stateful transforms like center, standardize, etc. These are functions (and from my rudimentary knowledge of computer science, pickling functions is generally a no-no). I need to set the __qualname__ and __name__ of these functions because when you generally do...

> center = stateful_transform(Center)
> center.__qualname__
'Center'

But when you pickle.dump, it realizes that Center refers to the class, and center is a function. So I have to hard-change the __qualname__ to get this to work. See here. This seems undesireable.

Anyways, let me know if there's any interest in this work. I'm by no means an expert at this... and the test cases could obviously be improved. But I'm not a programmer by trade, so please grant me some forbearance.

Thanks!

@thequackdaddy thequackdaddy changed the title Pickles patsy pickles Apr 6, 2017
@thequackdaddy
Copy link
Contributor Author

thequackdaddy commented Apr 6, 2017

Oh... and if you care to see the test-suite results... I ran it here...
https://travis-ci.org/thequackdaddy/patsy/jobs/219101866

Some small things fail here, but I think it works overall. Namely there is a requirement that 2 generatedEvalEnvironments don't equal each other, but that contradicts the idea that a pickle.loads(pickle.dumps(obj)) == obj

Sorry, no coverage available.


def __setstate__(self, pickle):
version, data, contrast, levels = pickle
check_pickle_version(version, 0, name=self.__class__.__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to check the pickle version before unpacking it – e.g. if we ever added a 5th field to the pickle state here, then this would error out on the unpacking line, before getting to the check_pickle_version line. Basically we have to commit to never change anything that happens before we look at the version, so the first line should be something like

check_pickle_version(pickle[0], 0, ...)   # if we commit to using a (version, ...) tuple forever

or

check_pickle_version(pickle["version"], 0, ...)

if we decide maybe we want to be stuck with something more readable than a tuple ;-)

patsy/desc.py Outdated
def __eq__(self, other):
return self.__dict__ == other.__dict__

# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this class is using the default __getstate__...?

patsy/desc.py Outdated
@@ -209,7 +234,7 @@ def _pretty_repr_(self, p, cycle): # pragma: no cover
[self.intercept, self.intercept_origin,
self.intercept_removed, self.terms])

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does IntermediateExpr need to support pickling? I think it should only exist as a temporary object in memory?

@@ -120,7 +121,17 @@ def __repr__(self):
kwlist.append(("categories", self.categories))
repr_pretty_impl(p, self, [], kwlist)

__getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is also using the default __getstate__?

def __eq__(self, other):
return self.__dict__ == other.__dict__

# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise...

patsy/eval.py Outdated
@@ -10,7 +10,7 @@
# for __future__ flags!

# These are made available in the patsy.* namespace
__all__ = ["EvalEnvironment", "EvalFactor"]
__all__ = ["EvalEnvironment", "EvalFactor", "VarLookupDict"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this add VarLookupDict to patsy's public API?

@@ -44,7 +44,7 @@ def __init__(self, print_as):
def __repr__(self):
return "%s(%r)" % (self.__class__.__name__, self._print_as)

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to support pickling?

@@ -70,7 +70,7 @@ def _repr_pretty_(self, p, cycle):
kwargs = [("extra", self.extra)]
return repr_pretty_impl(p, self, [self.type, self.origin], kwargs)

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to support pickling?

@@ -83,7 +83,7 @@ def __init__(self, type, token, args, origin):
def _repr_pretty_(self, p, cycle):
return repr_pretty_impl(p, self, [self.type, self.token, self.args])

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise...

@@ -95,14 +95,14 @@ def __repr__(self):
return "%s(%r, %r, %r)" % (self.__class__.__name__,
self.token_type, self.arity, self.precedence)

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise...


class _StackOperator(object):
def __init__(self, op, token):
self.op = op
self.token = token

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise...

@@ -115,7 +115,7 @@ def __init__(self, unary_ops, binary_ops, atomic_types, trace):
self.atomic_types = atomic_types
self.trace = trace

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise...

# We reimplement patsy.util.no_pickling, to avoid circular import issues
def __getstate__(self):
raise NotImplementedError
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an actual __getstate__

@@ -73,7 +73,7 @@ def __repr__(self):
suffix = "-"
return "%r%s" % (self.factor, suffix)

__getstate__ = no_pickling
# __getstate__ = no_pickling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to support pickling?

patsy/state.py Outdated

class StatefulTransform(object):
def __getstate__(self):
return (0, self.__dict__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm. This is cheating :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me of a saying from my childhood hero

"evalenv_transform_standardize": EvalEnvironment([{
'standardize': standardize
}]),
"evalenv_transform_catgorical": EvalEnvironment([{'C': C}]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing "e" in "categorical"

from patsy.origin import Origin


PICKE_TESTCASES_ROOTDIR = os.path.join(os.path.dirname(__file__), '..', 'pickle_testcases')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing "L" in "PICKLE"

patsy/util.py Outdated
error_msg += "."

# TODO Use a better exception than ValueError.
raise ValueError(error_msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could use PickleError

@njsmith
Copy link
Member

njsmith commented Apr 6, 2017

General comments:

  • It's very annoying that pickle doesn't accept module objects :-(. The code for working around that seems basically reasonable, though maybe could be a bit cleaner.

  • I don't understand why stateful transform functions like center need to be pickleable? the instantiated transform objects certainly do (and they need real __getstate__ methods :-P), but a fitted model shouldn't be referring to the functions themselves anymore, I would think... maybe we need to refactor a bit to make this true.

  • I'm not a big fan of all the __eq__ changes – this is a globally visible change, and the definitions are just targeted to the idiosyncratic needs of our tests :-(. (And btw, they're all buggy, because in python you have to override __ne__ too... I forget the exact rules, this might be fixed in py3, but definitely in py2.) Maybe a better approach would be to write end-to-end tests, that run some data through an unpickled model and checks that the result matches?

@njsmith
Copy link
Member

njsmith commented Apr 6, 2017

But overall: this isn't ready yet, but it's very exciting to see someone working on it :-)

@thequackdaddy
Copy link
Contributor Author

thequackdaddy commented Apr 6, 2017

Thanks. I'll try and take a look at this tomorrow.

I think you caught a few of the shortcuts I was trying... I'll clean up as appropriate.

As the the __eq__ issue... i was also thinking I could define my own assert statement that would check that the __dict__ and all nested __dict__ were equal. Maybe call it assert_pickle_equal. Some items, like EvalEnvironment, I don't see how to pass data through.

Thanks for your excellent comments. Hopefully I'll have an update tomorrow afternoon-ish.

@thequackdaddy thequackdaddy force-pushed the patsy_pickles branch 3 times, most recently from 5588119 to 14e62a7 Compare April 6, 2017 22:41
@thequackdaddy
Copy link
Contributor Author

Update for today:

https://travis-ci.org/thequackdaddy/patsy/builds/219481856 (all green)

I've gone through/reverted your code as you recommended. Thanks for your pointers.

I haven't figured out why stateful transform function like center are in the EvalEnvironment. I quickly tried to remove it, but got an error because patsy later looked for it. (Not sure if it actually uses it... need to follow the code some more.)

But this seems do-able.

@thequackdaddy
Copy link
Contributor Author

This has been very educational. Thanks for the help.

I took a stab at fixing the issue where the patsy stateful transforms required __qualname__ to be speicified. I think that once the design_info has been built, it should be safe to just remove the functions.... So I added a clean method on EvalEnvironment that removes them.

Here's a gist showing that I think this works.

@njsmith
Copy link
Member

njsmith commented Apr 7, 2017

That looks like it will work, but I think it should be possible to solve in a more fundamental way. Looking at memorize_passes_needed in EvalFactor I think I see the problem: there's some code that pulls out the set of names referenced in the code (using the ast_names helper function), and then reduces the EvalEnvironment to just those names (eval_env.subset). This was added in #66 as preparation for the pickling support -- it's why trying to pickle an EvalEnvironment doesn't try to save your entire Python environment :-).

And then if you look a few lines after we compute the set of variables to keep, then there's some code that rewrites the code to convert stateful transform calls like center(x) into method calls on a newly allocate transform object.

So I think the best solution is to swap the order of these things: if we make the list of variables to keep after eliminating references to stateful transforms, then it should just work.

I know this is getting into somewhat deep wizardry... I'm going to try to look at this more closely within the next few days, and I can fix this part up if you don't get there first :-)

@thequackdaddy
Copy link
Contributor Author

@njsmith Hello again. Just pinging to see if there's interest in this... would be a little helpful for me to cut down on some model sizes.

@codecov-io
Copy link

codecov-io commented Feb 27, 2018

Codecov Report

Merging #104 into master will decrease coverage by 0.65%.
The diff coverage is 89.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
- Coverage   99.08%   98.43%   -0.66%     
==========================================
  Files          30       31       +1     
  Lines        5557     5928     +371     
  Branches      773      822      +49     
==========================================
+ Hits         5506     5835     +329     
- Misses         28       63      +35     
- Partials       23       30       +7
Impacted Files Coverage Δ
patsy/mgcv_cubic_splines.py 99.14% <100%> (+0.03%) ⬆️
patsy/splines.py 100% <100%> (ø) ⬆️
patsy/origin.py 96.72% <100%> (+0.29%) ⬆️
patsy/design_info.py 99.64% <100%> (+0.01%) ⬆️
patsy/constraint.py 100% <100%> (ø) ⬆️
patsy/state.py 100% <100%> (ø) ⬆️
patsy/desc.py 98.46% <100%> (+0.11%) ⬆️
patsy/missing.py 99.09% <100%> (+0.03%) ⬆️
patsy/contrasts.py 100% <100%> (ø) ⬆️
patsy/test_highlevel.py 99.36% <100%> (+0.02%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b7b9c2...7129507. Read the comment docs.

@thequackdaddy
Copy link
Contributor Author

@njsmith - Checking in again to see if this is on the right track/something that could be merged. Let me know either way. Thanks so much!

@dswah
Copy link

dswah commented Apr 8, 2018

@thequackdaddy @njsmith
how is this PR looking?

@thequackdaddy
Copy link
Contributor Author

thequackdaddy commented Apr 8, 2018

@dswah I'm waiting for review/feedback... This is definitely a big project, and part of it--such as the whole assert_pickled_equals is difficult to check.

If you want to read/through, provide feedback, I think that might be helpful.

@louispotok louispotok mentioned this pull request May 22, 2018
chrish42 and others added 17 commits November 3, 2018 19:02
…__ as appropriate to make assert tests work.
…d special tests for transform equivalence, new assert statement that makes the pickle a dict (not the most robust... but it works), added PickleError,
…l transforms which have different names/qualnames from expected.
* Now run subset and clean the eval_env after
removing bare funcalls
* Had some straggling tuples
* Altered test so that stateful_transforms won't
be saved in eval_env
* Removed the clean method on EvalEnvironment
@matthewwardrop
Copy link
Collaborator

Hi @thequackdaddy ,

Thanks for these patches. As one of the new maintainers I'm going through the PRs and figuring out what still needs attention. Even though your patches have been sitting here for along time, the code-base hasn't change much and so it's likely they are still suitable. Is this something you are still interested in hashing out?

Given that the name of the game from a maintenance perspective is to keep things working as is, I'm a little cautious to introduce massive changes that might introduce new bugs. It should also be noted that https://github.com/matthewwardrop/formulaic (the spiritual successor of this project) already has support for pickling, as well as sparse representations and is much more performant. But if you'd like to explore finishing up this work, I'm willing to dig into it with you.

Best,
M

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants