Small fixes to improve estimation of expressed neoantigens #249

jburos · 2017-08-22T16:58:50Z

Added a variety of debug messages, and modified the isovar cache so that the cached file name varies according to the input variants provided. Without this, I was seeing the same number of expressed variants irrespective of which filter_fn was applied.

Finally, handled the case documented in #247, which arises when trying to predict neoantigens using cohorts.functions.expressed_neoantigen_count when no variants are expressed.

tavinathanson

Looks great, thanks for fixing these!

One thing I'm not clear on is why this fixes the filter_fn issue. My understanding is that _load_single_patient_neoantigens filters by filter_fn at the very end, after loading from the cache; so what changed here that makes that work better?

tavinathanson · 2017-08-22T17:39:58Z

cohorts/cohort.py

-                df_epitopes[df_column] = df_epitopes.source_sequence_key.apply(
-                    lambda key: dict(key)[variant_column])
-            df_epitopes["patient_id"] = patient.id
+            if len(df_isovar.index) == 0:


Why not just len(df_isovar)?

tavinathanson · 2017-08-22T17:41:09Z

cohorts/cohort.py

-        # different caches
-        isovar_cached_file_name = "%s-isovar.csv" % self.merge_type
+        # different cache depending on epitope length & variant set
+        variant_hash = make_hash(frozenset(variants))


Nice! Are we confident that this hashes to the same value for the same VariantCollection?

It does if:

- you're on the same hardware/OS install
- you export PYTHONHASHSEED=0

Re the PYTHONHASHSEED, I set this in my ~/.bashrc.

One thing I've thought about doing is alerting the user if this env variable hasn't been set. Curious to know what you think - is this overkill?
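The alert floated above could be as small as the following sketch. The function name `warn_if_hash_seed_unset` is hypothetical and not part of cohorts. It only inspects the environment, since changing PYTHONHASHSEED from inside an already-running interpreter does not change that interpreter's hashing.

```python
import os
import warnings


def warn_if_hash_seed_unset(environ=None):
    """Return True (and warn) if PYTHONHASHSEED is not pinned to an integer.

    Hypothetical helper for illustration; not part of cohorts.
    """
    environ = os.environ if environ is None else environ
    seed = environ.get("PYTHONHASHSEED", "")
    # Unset or "random" means str hashes differ between interpreter runs.
    if not seed.isdigit():
        warnings.warn(
            "PYTHONHASHSEED is not set to a fixed integer; hash()-based "
            "cache file names may differ between Python sessions.",
            RuntimeWarning,
        )
        return True
    return False
```

Calling this once at import time would be enough to surface the problem, at the cost of a warning for every user who has not set the variable.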

Why not just use hashlib.md5 to avoid both of these issues?

md5 only works for strings -- not Variants. Unless you pickle.dump them, or do str(x) for each -- which is also kind of hacky.

I agree that it's hacky, but sounds better to me than dealing with the above issues.
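For reference, a minimal sketch of the hashlib.md5 route discussed above. It assumes that str(x) identifies each item uniquely and deterministically, which would need checking for real Variant objects; plain tuples stand in for them here, and `stable_hash` is a hypothetical name.

```python
import hashlib


def stable_hash(items):
    """Hash a collection of items via their string representations.

    Assumes str(item) uniquely and deterministically identifies each item.
    """
    # Sort so the digest does not depend on iteration order (e.g. of a set).
    keys = sorted(str(item) for item in items)
    digest = hashlib.md5()
    for key in keys:
        digest.update(key.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


# Plain tuples stand in for Variant objects in this sketch.
variants = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
print(stable_hash(variants))
```

Unlike the built-in hash(), the digest here does not depend on PYTHONHASHSEED or on the platform, so the same inputs map to the same cache file name across sessions and machines.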

tavinathanson · 2017-08-22T17:41:21Z

cohorts/hash.py

+
+DictProxyType = type(object.__dict__)
+
+def make_hash(o):


Style: lots of unnecessary newlines?

tavinathanson · 2017-08-22T17:43:14Z

cohorts/hash.py

@@ -0,0 +1,42 @@
+import copy
+
+# courtesy of https://stackoverflow.com/a/8714242/3457743


Can you throw this in the docstring and move the function to utils or something? Extra file seems unnecessary to me; but if you disagree I don't feel strongly.

If you like. I don't have a strong preference re: more or fewer files, except when it helps organization.

Cool, that's my preference then

jburos · 2017-08-22T18:10:04Z

Thanks @tavinathanson! This is really helpful.

Re: the isovar change, I wasn't seeing this in the neoantigen case, but I did see it in an example I was working with, which calculated a weighted sum of variants where the weight was based on the level of expression. If you're interested in seeing the specific function, it's been pushed to a development branch here.

I haven't gone back to look at how this might impact the filtering by expression -- I'm not sure it does, since there are "be careful with isovar the cache doesn't depend on parameters" messages throughout the code! -- but it does concern me that the cache may have been initialized with a partially-filtered set of variants, leading to underestimating expressed variants on subsequent calls with a more relaxed filter. If this cache is error-prone, we would probably want to restructure this function to only ever cache using an unfiltered set of variants & filter by variants later.

Definitely open to feedback since I want to make sure we get this right.

tavinathanson · 2017-08-22T18:12:29Z

@jburos what do you mean by "that the cache may have been initialized with a partially-filtered set of variants"? Where does that happen?

jburos · 2017-08-22T18:25:29Z

@tavinathanson I don't think it would happen in our normal workflow, but we don't know what variants are passed into this function the first time it's called - it could be called directly by the user, with only indels. In that case the indels-only would be cached & used on subsequent calls (e.g. for expressed_snv_count).
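The failure mode described here can be reproduced with a toy cache. This is only an illustration: the dict stands in for the on-disk isovar cache, `load_expressed` is a hypothetical function, and the variant strings are made up.

```python
# Toy stand-in for a cache keyed by patient only, ignoring the variants passed in.
_cache = {}


def load_expressed(patient_id, variants):
    """Pretend every variant passed in is expressed; cache by patient only."""
    if patient_id not in _cache:
        _cache[patient_id] = sorted(variants)
    return _cache[patient_id]


indels = ["chr1:100 AT>A"]
snvs = ["chr2:200 G>C", "chr3:300 C>T"]

first = load_expressed("patient-1", indels)  # populates the cache with indels
second = load_expressed("patient-1", snvs)   # silently returns the cached indels
print(second == first)
```

The second call never looks at the SNVs it was given, which is how a count such as expressed_snv_count could end up underestimated.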

Otherwise, the only two places this is used within the code are load_neoantigens & the variant_expressed_filter, neither of which passes a filtered set of variants to load_single_patient_isovar.

tavinathanson · 2017-08-22T18:27:51Z

@jburos gotcha, makes sense! Maybe we should put https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L989 inside this function?

jburos · 2017-08-22T18:31:37Z

Yep, that's what I was thinking as well. Then there's always an option to filter by the variants the user provides after reading from the cache (analogous to the way other cached-items work).

But, would this mean you don't want to cache at all based on the variants or epitope-lengths given? Not really convinced we want to give up on that either (though, epitope-lengths could be done easily since they're numeric).

tavinathanson · 2017-08-22T18:34:13Z

@jburos I think we can have both of those things. Caching on the isovar results is useful in terms of time savings, for sure; and this does it in a better way. And loading in unfiltered variants addresses other potential bugs.
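One possible reading of "both of those things" is sketched below: key the cache on the parameters that change the result, store the unfiltered output, and apply any filter only after reading from the cache. All names here (`load`, `expensive_compute`, `_key`) are hypothetical, and a dict stands in for the cache files.

```python
import hashlib

_cache = {}
calls = []  # records each expensive computation, for illustration only


def _key(patient_id, epitope_lengths):
    # Digest of the sorted parameters, so the key is stable across sessions.
    params = ",".join(str(n) for n in sorted(epitope_lengths))
    digest = hashlib.md5(params.encode("utf-8")).hexdigest()
    return "{}-{}".format(patient_id, digest)


def expensive_compute(patient_id, epitope_lengths):
    """Stand-in for running isovar over ALL of a patient's variants."""
    calls.append(patient_id)
    return [
        {"variant": "chr1:100 AT>A", "is_snv": False},
        {"variant": "chr2:200 G>C", "is_snv": True},
    ]


def load(patient_id, epitope_lengths, filter_fn=None):
    key = _key(patient_id, epitope_lengths)
    if key not in _cache:
        _cache[key] = expensive_compute(patient_id, epitope_lengths)
    rows = _cache[key]
    # Filter only after reading the unfiltered cache entry.
    return [row for row in rows if filter_fn is None or filter_fn(row)]


snvs = load("patient-1", [8, 9], filter_fn=lambda row: row["is_snv"])
everything = load("patient-1", [9, 8])
```

Both calls share one cache entry, so the expensive step runs once, and a restrictive filter on the first call cannot shrink what later calls see.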

coveralls · 2017-08-22T18:41:54Z

Coverage decreased (-0.4%) to 52.164% when pulling d0f8927 on fix-issue-247 into ea31881 on master.

coveralls · 2017-08-22T18:57:29Z

Coverage decreased (-0.4%) to 52.133% when pulling d0f8927 on fix-issue-247 into ea31881 on master.

jburos · 2017-08-22T21:32:19Z

So, I tried to add a filter_by_variants step to the load_single_patient_isovar process (based on a portion of code from isovar.allele_reads here), but on further thought it seems much cleaner to leave this logic within isovar. Instead I now have a phantom _load_single_patient_isovar_unfiltered function which isn't used anywhere. I will likely remove this.

Also ended up tweaking our logging code (per #246) so that I could confirm that the cache-files were indeed being re-used without problems. LMK if you have a strong opinion about this. I can revert it, just needed it temporarily so that I could actually see my debug messages.
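For context, the usual Python convention for library logging looks roughly like the sketch below. The details of #246 are not reproduced in this thread, so this shows the general pattern rather than the actual change; the logger name `cohorts_example` and the function `load_from_cache` are hypothetical.

```python
import logging

# Library side: create a named logger, attach a NullHandler, set no level.
logger = logging.getLogger("cohorts_example")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


def load_from_cache(path):
    # Emitted only if the user has enabled DEBUG for this logger.
    logger.debug("Reusing cache file: %s", path)
    return path


# User side: opt in to debug output for this one logger.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logging.getLogger("cohorts_example").addHandler(handler)
logging.getLogger("cohorts_example").setLevel(logging.DEBUG)

load_from_cache("patient-1-isovar.csv")
```

Leaving the level unset inside the package means debug messages stay silent by default but can be switched on by the caller without editing library code.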

coveralls · 2017-08-22T21:42:54Z

Coverage decreased (-1.04%) to 51.492% when pulling b261796 on fix-issue-247 into ea31881 on master.

coveralls · 2017-08-22T21:58:20Z

Coverage decreased (-1.04%) to 51.492% when pulling b261796 on fix-issue-247 into ea31881 on master.

jburos · 2017-08-22T23:11:35Z

@tavinathanson no rush, just FYI I'm done making updates on this PR for now.

tavinathanson · 2017-08-23T13:45:57Z

cohorts/cohort.py

+        logger.debug("Loading unfiltered isovar data for patient: {}".format(patient.id))
+        epitope_hash = make_hash(frozenset(epitope_lengths))
+        cache_file_name = "{}-isovar.{}.csv".format(self.merge_type, str(epitope_hash))
+        import pdb; pdb.set_trace()


Let's get rid of this PDB statement :)

tavinathanson

I think the two separate isovar functions are fairly confusing. Why not, as you mentioned, add a filter_isovar just like filter_neoantigens, basically copying the code in https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L134 and https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L47, e.g. a new FilterableIsovar class? And then have only one function, like _load_single_patient_neoantigens?

jburos · 2017-08-23T14:08:48Z

Well, I did that at first -- then undid it. It doesn't seem right to me to repeat the logic that's in isovar -- better to keep it in isovar (i.e. if the logic changes in isovar, wouldn't we then have to repeat the change here?).

As it stands, we never have a use case for querying on all variants -- in every case we pull the variants in before calling this function -- so if it's confusing I'd prefer to just remove the _unfiltered function & leave the current one as-is. What do you think -- does that make sense?

coveralls · 2017-08-23T21:34:50Z

Coverage decreased (-0.6%) to 51.889% when pulling d31469a on fix-issue-247 into ea31881 on master.

coveralls · 2017-08-30T12:01:22Z

Coverage decreased (-0.6%) to 52.069% when pulling 18808ca on fix-issue-247 into f9f4af2 on master.

jburos requested a review from tavinathanson August 22, 2017 17:08

tavinathanson reviewed Aug 22, 2017

tavinathanson reviewed Aug 23, 2017

tavinathanson requested changes Aug 23, 2017

jburos added 11 commits August 30, 2017 07:35

add debug messages; handle case with no expressed variants

38760e4

remove branch-specific edit captured in cherry-pick

4d4c239

add make_hash function

5bf534a

fix quotes

58cb202

add import statement

54b8fb3

move make_hash to utils

e196712

refactor load_single_patient_isovar; clean up debug logging

288400f

don't set logging level in package - let the user configure this. c.f. #246

c33d1a3

clean up logging a little more

3d179e2

minimal docstrings

0a1006c

remove not-helpful unfiltered function

18808ca

jburos force-pushed the fix-issue-247 branch from d31469a to 18808ca Compare August 30, 2017 11:43
