-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Small fixes to improve estimation of expressed neoantigens #249
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for fixing these!
One thing I'm not clear on is why this fixes the filter_fn
issue. My understanding is that _load_single_patient_neoantigens
filters by filter_fn
at the very end, after loading from the cache; so what changed here that makes that work better?
cohorts/cohort.py
Outdated
df_epitopes[df_column] = df_epitopes.source_sequence_key.apply( | ||
lambda key: dict(key)[variant_column]) | ||
df_epitopes["patient_id"] = patient.id | ||
if len(df_isovar.index) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just len(df_isovar)
?
cohorts/cohort.py
Outdated
# different caches | ||
isovar_cached_file_name = "%s-isovar.csv" % self.merge_type | ||
# different cache depending on epitope length & variant set | ||
variant_hash = make_hash(frozenset(variants)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Are we confident that this hashes to the same value for the same VariantCollection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does if
- you're on the same hardware/OS install
- you
export PYTHONHASHSEED=0
Re the PYTHONHASHSEED, I set this in my ~/.bashrc
.
One thing i've thought about doing is alerting the user if this env variable hasn't been set. Curious to know what you think - is this overkill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use hashlib.md5
to avoid both of these issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md5
only works for strings -- not Variants. Unless you pickle.dump
them, or do str(x)
for each -- which is also kind of hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's hacky, but sounds better to me than dealing with the above issues.
cohorts/hash.py
Outdated
|
||
DictProxyType = type(object.__dict__) | ||
|
||
def make_hash(o): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: lots of unnecessary newlines?
cohorts/hash.py
Outdated
@@ -0,0 +1,42 @@ | |||
import copy | |||
|
|||
# courtesy of https://stackoverflow.com/a/8714242/3457743 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you throw this in the docstring and move the function to utils or something? Extra file seems unnecessary to me; but if you disagree I don't feel strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you like. I don't have a strong preference re: more or fewer files, except when it helps organization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that's my preference then
Thanks @tavinathanson! This is really helpful. Re: the isovar change, I wasn't seeing this with the neoantigen case but I did see it when in an example I was working with - which calculated a weighted sum of variants where the weight was based on the level of expression. If you're interested in seeing the specific function, it's been pushed to a development branch here. I haven't gone back to look at how this might impact the filtering by expression -- I'm not sure it does, since there are "be careful with isovar the cache doesn't depend on parameters" messages throughout the code! -- but it does concern me that the cache may have been initialized with a partially-filtered set of variants, leading to underestimating expressed variants on subsequent calls with a more relaxed filter. If this cache is error-prone, we would probably want to restructure this function to only ever cache using an unfiltered set of variants & filter by variants later. Definitely open to feedback since I want to make sure we get this right. |
@jburos what do you mean by "that the cache may have been initialized with a partially-filtered set of variants"? Where does that happen? |
@tavinathanson I don't think it would happen in our normal workflow, but we don't know what variants are passed into this function the first time it's called - it could be called directly by the user, with only Otherwise the only two places this is used within the code include load_neoantigens & the |
@jburos gotcha, makes sense! Maybe we should put https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L989 inside this function? |
Yep, that's what I was thinking as well. Then there's always an option to filter by the variants the user provides after reading from the cache (analogous to the way other cached-items work). But, would this mean you don't want to cache at all based on the variants or epitope-lengths given? Not really convinced we want to give up on that either (though, epitope-lengths could be done easily since they're numeric). |
@jburos I think we can have both of those things. Caching on the |
So, I tried to add a Also ended up tweaking our logging code (per #246) so that I could confirm that the cache-files were indeed being re-used without problems. LMK if you have a strong opinion about this. I can revert it, just needed it temporarily so that I could actually see my debug messages. |
@tavinathanson no rush, just FYI I'm done making updates on this PR for now. |
cohorts/cohort.py
Outdated
logger.debug("Loading unfiltered isovar data for patient: {}".format(patient.id)) | ||
epitope_hash = make_hash(frozenset(epitope_lengths)) | ||
cache_file_name = "{}-isovar.{}.csv".format(self.merge_type, str(epitope_hash)) | ||
import pdb; pdb.set_trace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of this PDB statement :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the two separate isovar
functions are fairly confusing. Why not, as you mentioned, add a filter_isovar
just like filter_neoantigens
, basically copying the code in https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L134 and https://github.com/hammerlab/cohorts/blob/master/cohorts/varcode_utils.py#L47, e.g. a new FilterableIsovar
class? And then only have only function like _load_single_patient_neoantigens
?
Well, I did that at first -- then undid it. It doesn't seem right to me to repeat the logic that's in isovar -- better to keep it in isovar (ie if the logic changes in isovar, wouldn't we then just repeat it here?). As it stands, we never have a use case for querying on all variants -- in every case we pull the variants in before calling this function -- so if it's confusing I'd prefer to just remove the |
Added a variety of debug messages, also modified isovar cache so that the cached file varies according to the input variants provided. In the absence of this, I was seeing the same number of expressed variants irrespective of which filter_fn was applied.
Finally, captured the case documented in #247 when trying to predict neoantigens using
cohorts.functions.expressed_neoantigen_count
when no variants were expressed.