feat(l2gprediction): add score explanation based on features #939

Open
wants to merge 12 commits into
base: dev

Contributor

@ireneisdoomed ireneisdoomed commented Dec 3, 2024

✨ Context

This PR closes opentargets/issues#3664

This is what the prioritisation for the 44acafc7985c3180b072394a28d7bad9 locus looks like:

--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000075073
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.08760687195471195, eQtlColocH4Maximum -> -0.14339800289527474, distanceTssMean -> 0.5949956624176115, vepMeanNeighbourhood -> -0.011421102911212407, geneCount500kb -> 0.504088935973282, eQtlColocClppMaximumNeighbourhood -> 3.8670788812320505E-4, credibleSetConfidence -> 0.3579454324922001, distanceTssMeanNeighbourhood -> 2.1287083945895433, distanceSentinelTssNeighbourhood -> 0.06814055239948229, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0014771542785202165, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.5838524396517637, eQtlColocH4MaximumNeighbourhood -> 9.521371211488606, sQtlColocH4Maximum -> 0.08286727373455878, eQtlColocClppMaximum -> 1.8582865064964005, distanceSentinelTss -> 1.2031946462979695, sQtlColocClppMaximum -> -0.19218105019873652, distanceFootprintMean -> -0.21218899079465908, sQtlColocClppMaximumNeighbourhood -> -0.8441121532817452, isProteinCoding -> 1.3144695281881593, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.3102929859324838, vepMaximumNeighbourhood -> 0.047033325222490756, proteinGeneCount500kb -> 0.2529121503720645, vepMean -> 0.28520782840325626, distanceSentinelFootprint -> 0.3116443797012342, vepMaximum -> 0.0} 
--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000156515
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.06818768537940262, eQtlColocH4Maximum -> -0.14336530875137046, distanceTssMean -> 1.5530291409428105, vepMeanNeighbourhood -> 0.27627936610430315, geneCount500kb -> 0.7291870883206015, eQtlColocClppMaximumNeighbourhood -> -0.27041447664625295, credibleSetConfidence -> 0.27395001762237353, distanceTssMeanNeighbourhood -> 1.8170243838499736, distanceSentinelTssNeighbourhood -> 0.06262635310529577, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0015916146826161937, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.3691309949297295, eQtlColocH4MaximumNeighbourhood -> 9.585267225244944, sQtlColocH4Maximum -> 0.5632581225491728, eQtlColocClppMaximum -> 0.37135374806985344, distanceSentinelTss -> -0.12253906908689487, sQtlColocClppMaximum -> -0.14436242678936584, distanceFootprintMean -> -0.12424150830529755, sQtlColocClppMaximumNeighbourhood -> 1.0804192712631597, isProteinCoding -> 1.1216576014723711, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.46839704438965074, vepMaximumNeighbourhood -> -0.034280363722216364, proteinGeneCount500kb -> 0.39402035712660993, vepMean -> 0.7099149349443197, distanceSentinelFootprint -> 0.3021197989635074, vepMaximum -> 0.0} 

All results available at: gs://ot-team/irene/l2g/06122024/locus_to_gene_predictions
All predictions have their corresponding explanations.
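
As a reading aid, the largest contributions in a row can be pulled out by sorting the map by absolute value. The sketch below is not part of the PR: it copies a subset of the `shapleyValues` map for ENSG00000075073 from the output above into a plain Python dict, and `top_contributors` is a hypothetical helper name.

```python
from __future__ import annotations

# Subset of the shapleyValues map printed above for ENSG00000075073.
shapley_values = {
    "eQtlColocH4MaximumNeighbourhood": 9.521371211488606,
    "distanceFootprintMeanNeighbourhood": 3.5838524396517637,
    "distanceTssMeanNeighbourhood": 2.1287083945895433,
    "eQtlColocClppMaximum": 1.8582865064964005,
    "isProteinCoding": 1.3144695281881593,
    "sQtlColocClppMaximumNeighbourhood": -0.8441121532817452,
}


def top_contributors(values: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n features with the largest absolute contribution."""
    return sorted(values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]


top = top_contributors(shapley_values)
for feature, value in top:
    print(f"{feature}: {value:+.3f}")
```

Sorting by absolute value keeps strongly negative contributions visible, which a plain descending sort would push to the bottom.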

🛠 What does this PR implement

  • New shapleyValues field (map type) in the prediction schema
  • New util convert_map_type_to_columns that converts the feature annotation in the locusToGeneFeatures map type into a dataframe that can be passed to the SHAP explainer
  • The model is now an instance attribute of the Predictions dataset
  • New explain method in the predictions dataset. It calculates Shapley values and returns a new object with the additional column
  • Edited the step so that it adds this information
  • Enhancement in Dataset.filter so that the returned instance keeps the attributes of the original object. This was necessary to propagate the model instance attribute after each modification of the predictions dataset.
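
The attribute propagation described in the last bullet can be illustrated with a minimal sketch. It assumes the dataset is a dataclass; the `Predictions` class below is a hypothetical stand-in that holds plain Python rows rather than a Spark dataframe, and the rows and model value are made-up placeholders, so this is not the code in the PR.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass
class Predictions:
    """Hypothetical stand-in for the predictions dataset."""

    rows: list[dict[str, Any]]
    model: Any = None  # extra instance attribute that must survive filtering

    def filter(self, condition: Callable[[dict[str, Any]], bool]) -> Predictions:
        # replace() copies every field and overrides only `rows`,
        # so `model` is carried over to the new instance.
        return replace(self, rows=[r for r in self.rows if condition(r)])


# Toy rows and a placeholder model, for illustration only.
preds = Predictions(
    rows=[{"geneId": "geneA", "score": 0.9}, {"geneId": "geneB", "score": 0.4}],
    model="trained-model-placeholder",
)
high = preds.filter(lambda r: r["score"] >= 0.5)
print(high.model)
```

Building the result with `replace` rather than calling the constructor with only the filtered data means any attribute added to the class later is propagated without touching `filter` again.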

🙈 Missing

  • To run the step properly: I have only tried that it works by running predictions.explain() interactively.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@ireneisdoomed ireneisdoomed marked this pull request as ready for review December 4, 2024 16:32
@ireneisdoomed ireneisdoomed requested a review from d0choa December 4, 2024 16:32
@github-actions github-actions bot added the Method label Dec 6, 2024
@ireneisdoomed
Contributor Author

The new version fixes the bug in the previous one by avoiding dictionary operations: the new map is now built by joining the initial dataframe with the dataframe of contributions.
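
The shape of that approach, sketched in plain Python rather than Spark and Pandas: flatten each feature map into a vector with a fixed column order, compute one contribution per column, then join the contributions back on a row key and rebuild the map. Everything here is an assumption made for illustration: the function names, the toy rows and weights, the choice to fill missing features with 0.0, and the linear-model contribution `w_i * (x_i - mean_i)`, which stands in for the SHAP explainer used in the PR.

```python
from __future__ import annotations

FEATURES = ["distanceTssMean", "isProteinCoding"]  # fixed column order
WEIGHTS = [2.0, 0.5]  # made-up weights of a toy linear model

# Toy rows; the feature maps may list their keys in any order.
rows = [
    {"id": "a", "features": {"distanceTssMean": 0.9, "isProteinCoding": 1.0}},
    {"id": "b", "features": {"isProteinCoding": 0.0, "distanceTssMean": 0.1}},
]


def map_to_matrix(rows, features):
    """Flatten each feature map into a vector with a fixed column order."""
    # Assumption: a feature missing from the map is treated as 0.0.
    return [[row["features"].get(name, 0.0) for name in features] for row in rows]


def toy_contributions(matrix, weights):
    """Per-feature contribution for a linear model: w_i * (x_i - mean_i)."""
    means = [sum(col) / len(matrix) for col in zip(*matrix)]
    return [[w * (x - m) for x, w, m in zip(vec, weights, means)] for vec in matrix]


def attach_explanations(rows, features, contributions):
    """Join the contributions back to the rows on 'id' and rebuild the map."""
    contribution_table = {
        row["id"]: dict(zip(features, values))
        for row, values in zip(rows, contributions)
    }
    return [{**row, "shapleyValues": contribution_table[row["id"]]} for row in rows]


matrix = map_to_matrix(rows, FEATURES)
explained = attach_explanations(rows, FEATURES, toy_contributions(matrix, WEIGHTS))
for row in explained:
    print(row["id"], row["shapleyValues"])
```

Pairing names and values through the fixed `FEATURES` order is what removes the need to manipulate per-row dictionaries while the contributions are being computed.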

Computing the Shapley values takes time, but in my experiments the real bottleneck was creating the Spark dataframe from the Pandas dataframe. The code may run into memory issues when run locally on a very large dataframe.

I tried to avoid this with Pandas UDFs, taking this and this as a guide, but Spark kept crashing due to serialization issues. Prediction time has gone from 6m to 13m (job). All predictions have their explanations built in.

Development

Successfully merging this pull request may close these issues.

Add shapley values to L2G predictions