
memoize a bit less #131

Closed
timodonnell opened this issue Dec 8, 2015 · 2 comments

timodonnell (Contributor) commented:

I have a bunch of variants to annotate (calling `variant.effects().top_priority_effect().short_description`), and I noticed that each successive batch of 10,000 variants takes longer than the one before:

343772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 255.955473 sec
Writing result
Wrote: data/derived/mutations.csv
333772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 576.676345 sec
Writing result
Wrote: data/derived/mutations.csv
323772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 931.690826 sec
Writing result
Wrote: data/derived/mutations.csv
313772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1308.041491 sec
Writing result
Wrote: data/derived/mutations.csv
303772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1463.038103 sec
Writing result
Wrote: data/derived/mutations.csv
293772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1691.186535 sec
Writing result
Wrote: data/derived/mutations.csv
283772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 1979.267660 sec
Writing result
Wrote: data/derived/mutations.csv
273772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 2166.040197 sec
Writing result
Wrote: data/derived/mutations.csv
263772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 2436.710859 sec

When I kill it, it's always sitting in `Collection.__eq__`, called from the `memoize`-wrapped function:

^CTraceback (most recent call last):
  File "/demeter/users/odonnt02/venvs/standard-2.7/bin/ovarian-paired-analyses-annotate", line 9, in <module>
    load_entry_point('ovarian-paired-analyses', 'console_scripts', 'ovarian-paired-analyses-annotate')()
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 188, in run
    for count in generator:
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 84, in compute_and_update
    df.ix[this_chunk_rows].copy())[columns]
  File "/demeter/users/odonnt02/sinai/git/ovarian-paired-analyses/ovarian_paired_analyses/annotate.py", line 91, in annotate_effect
    for v in df["variant"]
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/common.py", line 48, in wrapped_fn
    return memoized_values[key]
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/collection.py", line 97, in __eq__
    all(x == y for (x, y) in zip(self, other)))
  File "/demeter/users/odonnt02/sinai/git/varcode/varcode/collection.py", line 82, in __iter__
    def __iter__(self):
KeyboardInterrupt

It turns out the culprit is that we memoize `EffectCollection.top_priority_effect()`. Since this is a method memoized with `memoize` (as opposed to a property memoized with `memoize_property`), the cache is a single giant, ever-growing dict keyed on `EffectCollection` instances. And since `Collection`'s hash implementation is collision-prone (it just returns the collection's length), every cache lookup degenerates into a long chain of equality comparisons.
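
To make the failure mode concrete, here is a minimal sketch (illustrative, not varcode's actual code) of a dict-based `memoize` over a length-hashed collection. Every same-length `EffectCollection` lands in the same hash bucket, so the N-th cache lookup has to run on the order of N full `__eq__` scans:

```python
class Collection(object):
    def __init__(self, elements):
        self.elements = list(elements)

    def __len__(self):
        return len(self.elements)

    def __iter__(self):
        return iter(self.elements)

    def __hash__(self):
        # Collision-prone: every collection of the same size shares a bucket.
        return len(self)

    def __eq__(self, other):
        # Element-by-element comparison -- this is where the traceback above
        # shows all the time going.
        return (len(self) == len(other) and
                all(x == y for (x, y) in zip(self, other)))


def memoize(fn):
    cache = {}

    def wrapped(*args):
        # With the hash above, this membership test compares the key against
        # every previously cached same-length collection, and the cache never
        # shrinks -- so each batch of 10,000 variants is slower than the last.
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]

    return wrapped
```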

Deleting this line: https://github.com/hammerlab/varcode/blob/master/varcode/effect_collection.py#L181 gives much better performance that stays constant:

323772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 74.925459 sec
Writing result
Wrote: data/derived/mutations.csv
313772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 69.525387 sec
Writing result
Wrote: data/derived/mutations.csv
303772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 59.914387 sec
Writing result
Wrote: data/derived/mutations.csv
293772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 51.970314 sec
Writing result
Wrote: data/derived/mutations.csv
283772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 48.633034 sec
Writing result
Wrote: data/derived/mutations.csv
273772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 46.261365 sec
Writing result
Wrote: data/derived/mutations.csv
263772 / 343772 rows remaining to annotate. Annotating 10000 rows.
Annotated in 48.453480 sec

Possible things to do:

  • Don't memoize `EffectCollection.top_priority_effect` at all
  • "Manually" memoize it by storing the result as an attribute on the object, or change it to a property and memoize it with `memoize_property` (see the sketch below)
  • Improve the `memoize` implementation so the cache doesn't grow unboundedly
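
A minimal sketch of the second option, using a hypothetical `effect_priority` helper as a stand-in for varcode's real priority ordering. The point is that the cached value lives on the instance and is garbage-collected with it, instead of keying a module-level dict on collections whose hash is collision-prone and whose `__eq__` is O(n):

```python
def effect_priority(effect):
    # Hypothetical stand-in for varcode's real effect-priority ordering.
    return getattr(effect, "priority", 0)


class EffectCollection(object):
    def __init__(self, effects):
        self.effects = list(effects)
        self._top_priority_effect = None  # per-instance cache

    def top_priority_effect(self):
        # Computed at most once per instance, then reused.
        if self._top_priority_effect is None:
            self._top_priority_effect = max(
                self.effects, key=effect_priority)
        return self._top_priority_effect
```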
iskandr (Contributor) commented Dec 8, 2015:

I'm OK with solution #1 since the other methods and properties it relies on are still memoized. I wonder if that's just a band-aid solution, though -- it might make sense to make `memoize` use a fixed-size cache with an eviction policy.
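
A rough sketch of what that could look like -- the `memoize_bounded` name here is hypothetical, and on Python 3, `functools.lru_cache(maxsize=...)` provides the same policy out of the box:

```python
import functools
from collections import OrderedDict


def memoize_bounded(maxsize=1024):
    # Hypothetical fixed-size memoizer with LRU eviction.
    def decorator(fn):
        cache = OrderedDict()

        @functools.wraps(fn)
        def wrapped(*args):
            if args in cache:
                cache.move_to_end(args)    # mark as most recently used
                return cache[args]
            result = fn(*args)
            cache[args] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)  # evict the least recently used
            return result

        return wrapped

    return decorator
```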

timodonnell added a commit that referenced this issue Dec 14, 2015
Closes #131

Also a trivial performance improvement for object equality in Collection and Variant when the objects being compared are the same object.
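
The equality improvement the commit message mentions is the usual identity short-circuit; grafted onto the `Collection` sketch above, it would look roughly like this (illustrative, not the exact diff):

```python
    def __eq__(self, other):
        if self is other:
            return True  # same object: skip the element-by-element scan
        return (len(self) == len(other) and
                all(x == y for (x, y) in zip(self, other)))
```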
timodonnell (Contributor, Author) commented:

I made a PR with the simple fix for now, and a new ticket #133 to track the non-band-aid solutions.

timodonnell added a commit that referenced this issue Dec 18, 2015
Closes #131

Also a trivial performance improvement for object equality in Collection and Variant when the objects being compared are the same object.