Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small fixes to improve estimation of expressed neoantigens #249

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

jburos
Copy link
Member

@jburos jburos commented Aug 22, 2017

Added a variety of debug messages, also modified isovar cache so that the cached file varies according to the input variants provided. In the absence of this, I was seeing the same number of expressed variants irrespective of which filter_fn was applied.

Finally, captured the case documented in #247 when trying to predict neoantigens using cohorts.functions.expressed_neoantigen_count when no variants were expressed.

@jburos jburos requested a review from tavinathanson August 22, 2017 17:08
Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for fixing these!

One thing I'm not clear on is why this fixes the filter_fn issue. My understanding is that _load_single_patient_neoantigens filters by filter_fn at the very end, after loading from the cache; so what changed here that makes that work better?

df_epitopes[df_column] = df_epitopes.source_sequence_key.apply(
lambda key: dict(key)[variant_column])
df_epitopes["patient_id"] = patient.id
if len(df_isovar.index) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just len(df_isovar)?

# different caches
isovar_cached_file_name = "%s-isovar.csv" % self.merge_type
# different cache depending on epitope length & variant set
variant_hash = make_hash(frozenset(variants))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Are we confident that this hashes to the same value for the same VariantCollection?

Copy link
Member Author

@jburos jburos Aug 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does if

  1. you're on the same hardware/OS install
  2. you export PYTHONHASHSEED=0

Re the PYTHONHASHSEED, I set this in my ~/.bashrc.

One thing i've thought about doing is alerting the user if this env variable hasn't been set. Curious to know what you think - is this overkill?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use hashlib.md5 to avoid both of these issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md5 only works for strings -- not Variants. Unless you pickle.dump them, or do str(x) for each -- which is also kind of hacky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's hacky, but sounds better to me than dealing with the above issues.

cohorts/hash.py Outdated

DictProxyType = type(object.__dict__)

def make_hash(o):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: lots of unnecessary newlines?

cohorts/hash.py Outdated
@@ -0,0 +1,42 @@
import copy

# courtesy of https://stackoverflow.com/a/8714242/3457743
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you throw this in the docstring and move the function to utils or something? Extra file seems unnecessary to me; but if you disagree I don't feel strongly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you like. I don't have a strong preference re: more or fewer files, except when it helps organization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, that's my preference then

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

Thanks @tavinathanson! This is really helpful.

Re: the isovar change, I wasn't seeing this with the neoantigen case but I did see it when in an example I was working with - which calculated a weighted sum of variants where the weight was based on the level of expression. If you're interested in seeing the specific function, it's been pushed to a development branch here.

I haven't gone back to look at how this might impact the filtering by expression -- I'm not sure it does, since there are "be careful with isovar the cache doesn't depend on parameters" messages throughout the code! -- but it does concern me that the cache may have been initialized with a partially-filtered set of variants, leading to underestimating expressed variants on subsequent calls with a more relaxed filter. If this cache is error-prone, we would probably want to restructure this function to only ever cache using an unfiltered set of variants & filter by variants later.

Definitely open to feedback since I want to make sure we get this right.

@tavinathanson
Copy link
Member

@jburos what do you mean by "that the cache may have been initialized with a partially-filtered set of variants"? Where does that happen?

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

@tavinathanson I don't think it would happen in our normal workflow, but we don't know what variants are passed into this function the first time it's called - it could be called directly by the user, with only indels. In that case the indels-only would be cached & used on subsequent calls (e.g. for expressed_snv_count).

Otherwise the only two places this is used within the code include load_neoantigens & the variant_expressed_filter, neither of which passes a filtered set of variants to load_single_patient_isovar.

@tavinathanson
Copy link
Member

@jburos gotcha, makes sense! Maybe we should put https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L989 inside this function?

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

Yep, that's what I was thinking as well. Then there's always an option to filter by the variants the user provides after reading from the cache (analogous to the way other cached-items work).

But, would this mean you don't want to cache at all based on the variants or epitope-lengths given? Not really convinced we want to give up on that either (though, epitope-lengths could be done easily since they're numeric).

@tavinathanson
Copy link
Member

@jburos I think we can have both of those things. Caching on the isovar results is useful in terms of time savings, for sure; and this does it in a better way. And loading in unfiltered variants addresses other potential bugs.

@coveralls
Copy link

coveralls commented Aug 22, 2017

Coverage Status

Coverage decreased (-0.4%) to 52.164% when pulling d0f8927 on fix-issue-247 into ea31881 on master.

@coveralls
Copy link

coveralls commented Aug 22, 2017

Coverage Status

Coverage decreased (-0.4%) to 52.133% when pulling d0f8927 on fix-issue-247 into ea31881 on master.

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

So, I tried to add a filter_by_variants step to the load_single_patient_isovar process (based on a portion of code from isovar.allele_reads here), but on further thought it seems much cleaner to leave this logic within isovar. Instead I now have a phantom _load_single_patient_isovar_unfiltered function which isn't used anywhere. I will likely remove this.

Also ended up tweaking our logging code (per #246) so that I could confirm that the cache-files were indeed being re-used without problems. LMK if you have a strong opinion about this. I can revert it, just needed it temporarily so that I could actually see my debug messages.

@coveralls
Copy link

coveralls commented Aug 22, 2017

Coverage Status

Coverage decreased (-1.04%) to 51.492% when pulling b261796 on fix-issue-247 into ea31881 on master.

@coveralls
Copy link

coveralls commented Aug 22, 2017

Coverage Status

Coverage decreased (-1.04%) to 51.492% when pulling b261796 on fix-issue-247 into ea31881 on master.

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

@tavinathanson no rush, just FYI I'm done making updates on this PR for now.

logger.debug("Loading unfiltered isovar data for patient: {}".format(patient.id))
epitope_hash = make_hash(frozenset(epitope_lengths))
cache_file_name = "{}-isovar.{}.csv".format(self.merge_type, str(epitope_hash))
import pdb; pdb.set_trace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of this PDB statement :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops!

Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two separate isovar functions are fairly confusing. Why not, as you mentioned, add a filter_isovar just like filter_neoantigens, basically copying the code in https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L134 and https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L47, e.g. a new FilterableIsovar class? And then only have only function like _load_single_patient_neoantigens?

@jburos
Copy link
Member Author

jburos commented Aug 23, 2017

Well, I did that at first -- then undid it. It doesn't seem right to me to repeat the logic that's in isovar -- better to keep it in isovar (ie if the logic changes in isovar, wouldn't we then just repeat it here?).

As it stands, we never have a use case for querying on all variants -- in every case we pull the variants in before calling this function -- so if it's confusing I'd prefer to just remove the _unfiltered function & leave the current one as-is. What do you think -- does that make sense?

@coveralls
Copy link

coveralls commented Aug 23, 2017

Coverage Status

Coverage decreased (-0.6%) to 51.889% when pulling d31469a on fix-issue-247 into ea31881 on master.

@coveralls
Copy link

coveralls commented Aug 30, 2017

Coverage Status

Coverage decreased (-0.6%) to 52.069% when pulling 18808ca on fix-issue-247 into f9f4af2 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants