
memoize a bit less #131

Closed
timodonnell opened this issue Dec 8, 2015 · 2 comments

timodonnell (Contributor) commented:

I have a bunch of variants to annotate (calling `variant.effects().top_priority_effect().short_description`), and I noticed that each successive batch of 10,000 variants takes longer than the one before:

343772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 255.955473 sec
Writing result
Wrote: data/derived/mutations.csv
333772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 576.676345 sec
Writing result
Wrote: data/derived/mutations.csv
323772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 931.690826 sec
Writing result
Wrote: data/derived/mutations.csv
313772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1308.041491 sec
Writing result
Wrote: data/derived/mutations.csv
303772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1463.038103 sec
Writing result
Wrote: data/derived/mutations.csv
293772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1691.186535 sec
Writing result
Wrote: data/derived/mutations.csv
283772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1979.267660 sec
Writing result
Wrote: data/derived/mutations.csv
273772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 2166.040197 sec
Writing result
Wrote: data/derived/mutations.csv
263772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 2436.710859 sec

When I kill it, it's always sitting in `Collection.__eq__`, called from the `memoize`-wrapped function:

^CTraceback (most recent call last):
  File "/demeter/users/odonnt02/venvs/standard-2.7/bin/ovarian-paired-analyses-annotate", line 9, in <module>
    load_entry_point('ovarian-paired-analyses', 'console_scripts', 'ovarian-paired-analyses-annotate')()
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 188, in run
    for count in generator:
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 84, in compute_and_update
    df.ix[this_chunk_rows].copy())[columns]
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 91, in annotate_effect
    for v in df["variant"]
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/common.py", line 48, in wrapped_fn
    return memoized_values[key]
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/collection.py", line 97, in __eq__
    all(x == y for (x, y) in zip(self, other)))
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/collection.py", line 82, in __iter__
    def __iter__(self):
KeyboardInterrupt

It turns out the culprit is that we memoize `EffectCollection.top_priority_effect()`. Since this is a method memoized with `memoize` (as opposed to a property memoized with `memoize_property`), the cache is a single giant, ever-growing dict keyed on `EffectCollection` instances. And since `Collection`'s hash implementation is collision-prone (it just returns the collection's length), every cache lookup degenerates into a long chain of equality comparisons.
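
To make the failure mode concrete, here is a minimal sketch (illustrative, not varcode's actual code) of a dict-based `memoize` over a length-hashed collection. Every same-length `EffectCollection` lands in the same hash bucket, so the N-th cache lookup has to run on the order of N full `__eq__` scans:

```python
class Collection(object):
    def __init__(self, elements):
        self.elements = list(elements)

    def __len__(self):
        return len(self.elements)

    def __iter__(self):
        return iter(self.elements)

    def __hash__(self):
        # Collision-prone: every collection of the same size shares a bucket.
        return len(self)

    def __eq__(self, other):
        # Element-by-element comparison -- this is where the traceback above
        # shows all the time going.
        return (len(self) == len(other) and
                all(x == y for (x, y) in zip(self, other)))


def memoize(fn):
    cache = {}

    def wrapped(*args):
        # With the hash above, this membership test compares the key against
        # every previously cached same-length collection, and the cache never
        # shrinks -- so each batch of 10,000 variants is slower than the last.
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]

    return wrapped
```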

Deleting this line: https://github.com/hammerlab/varcode/blob/master/varcode/effect_collection.py#L181 gives much better performance that stays constant:

323772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 74.925459 sec
Writing result
Wrote: data/derived/mutations.csv
313772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 69.525387 sec
Writing result
Wrote: data/derived/mutations.csv
303772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 59.914387 sec
Writing result
Wrote: data/derived/mutations.csv
293772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 51.970314 sec
Writing result
Wrote: data/derived/mutations.csv
283772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 48.633034 sec
Writing result
Wrote: data/derived/mutations.csv
273772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 46.261365 sec
Writing result
Wrote: data/derived/mutations.csv
263772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 48.453480 sec

Possible things to do:

  • Don't memoize `EffectCollection.top_priority_effect` at all
  • "Manually" memoize it by storing the result as an attribute on the object, or change it to a property and memoize it with `memoize_property` (see the sketch below)
  • Improve the `memoize` implementation so the cache doesn't grow unboundedly
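
A minimal sketch of the second option, using a hypothetical `effect_priority` helper as a stand-in for varcode's real priority ordering. The point is that the cached value lives on the instance and is garbage-collected with it, instead of keying a module-level dict on collections whose hash is collision-prone and whose `__eq__` is O(n):

```python
def effect_priority(effect):
    # Hypothetical stand-in for varcode's real effect-priority ordering.
    return getattr(effect, "priority", 0)


class EffectCollection(object):
    def __init__(self, effects):
        self.effects = list(effects)
        self._top_priority_effect = None  # per-instance cache

    def top_priority_effect(self):
        # Computed at most once per instance, then reused.
        if self._top_priority_effect is None:
            self._top_priority_effect = max(
                self.effects, key=effect_priority)
        return self._top_priority_effect
```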
iskandr (Contributor) commented Dec 8, 2015:

I'm OK with solution #1 since the other methods and properties it relies on are still memoized. I wonder if that's just a band-aid solution, though -- it might make sense to make `memoize` use a fixed-size cache with an eviction policy.
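
A rough sketch of what that could look like -- the `memoize_bounded` name here is hypothetical, and on Python 3, `functools.lru_cache(maxsize=...)` provides the same policy out of the box:

```python
import functools
from collections import OrderedDict


def memoize_bounded(maxsize=1024):
    # Hypothetical fixed-size memoizer with LRU eviction.
    def decorator(fn):
        cache = OrderedDict()

        @functools.wraps(fn)
        def wrapped(*args):
            if args in cache:
                cache.move_to_end(args)    # mark as most recently used
                return cache[args]
            result = fn(*args)
            cache[args] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # evict the least recently used
            return result

        return wrapped

    return decorator
```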

timodonnell added a commit that referenced this issue Dec 14, 2015
Closes #131

Also a trivial performance improvement for object equality in Collection and Variant when the objects being compared are the same object.
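
The equality improvement the commit message mentions is the usual identity short-circuit; grafted onto the `Collection` sketch above, it would look roughly like this (illustrative, not the exact diff):

```python
    def __eq__(self, other):
        if self is other:
            return True  # same object: skip the element-by-element scan
        return (len(self) == len(other) and
                all(x == y for (x, y) in zip(self, other)))
```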
timodonnell (Contributor, Author) commented:

I made a PR with the simple fix for now, and a new ticket #133 to track the non-band-aid solutions.

timodonnell added a commit that referenced this issue Dec 18, 2015
Closes #131

Also a trivial performance improvement for object equality in Collection and Variant when the objects being compared are the same object.