feat(l2gprediction): add score explanation based on features #939

ireneisdoomed · 2024-12-03T11:40:12Z

✨ Context

This is what the prioritisation for the 44acafc7985c3180b072394a28d7bad9 locus row looks like:

--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000075073
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.08760687195471195, eQtlColocH4Maximum -> -0.14339800289527474, distanceTssMean -> 0.5949956624176115, vepMeanNeighbourhood -> -0.011421102911212407, geneCount500kb -> 0.504088935973282, eQtlColocClppMaximumNeighbourhood -> 3.8670788812320505E-4, credibleSetConfidence -> 0.3579454324922001, distanceTssMeanNeighbourhood -> 2.1287083945895433, distanceSentinelTssNeighbourhood -> 0.06814055239948229, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0014771542785202165, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.5838524396517637, eQtlColocH4MaximumNeighbourhood -> 9.521371211488606, sQtlColocH4Maximum -> 0.08286727373455878, eQtlColocClppMaximum -> 1.8582865064964005, distanceSentinelTss -> 1.2031946462979695, sQtlColocClppMaximum -> -0.19218105019873652, distanceFootprintMean -> -0.21218899079465908, sQtlColocClppMaximumNeighbourhood -> -0.8441121532817452, isProteinCoding -> 1.3144695281881593, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.3102929859324838, vepMaximumNeighbourhood -> 0.047033325222490756, proteinGeneCount500kb -> 0.2529121503720645, vepMean -> 0.28520782840325626, distanceSentinelFootprint -> 0.3116443797012342, vepMaximum -> 0.0} 
--------------------------------------------------------------------------------------------------------------------------
 geneId        | ENSG00000156515
 shapleyValues | {pQtlColocH4MaximumNeighbourhood -> -0.06818768537940262, eQtlColocH4Maximum -> -0.14336530875137046, distanceTssMean -> 1.5530291409428105, vepMeanNeighbourhood -> 0.27627936610430315, geneCount500kb -> 0.7291870883206015, eQtlColocClppMaximumNeighbourhood -> -0.27041447664625295, credibleSetConfidence -> 0.27395001762237353, distanceTssMeanNeighbourhood -> 1.8170243838499736, distanceSentinelTssNeighbourhood -> 0.06262635310529577, pQtlColocH4Maximum -> -4.429795836247109E-5, sQtlColocH4MaximumNeighbourhood -> -0.0015916146826161937, pQtlColocClppMaximum -> 0.0, distanceFootprintMeanNeighbourhood -> 3.3691309949297295, eQtlColocH4MaximumNeighbourhood -> 9.585267225244944, sQtlColocH4Maximum -> 0.5632581225491728, eQtlColocClppMaximum -> 0.37135374806985344, distanceSentinelTss -> -0.12253906908689487, sQtlColocClppMaximum -> -0.14436242678936584, distanceFootprintMean -> -0.12424150830529755, sQtlColocClppMaximumNeighbourhood -> 1.0804192712631597, isProteinCoding -> 1.1216576014723711, pQtlColocClppMaximumNeighbourhood -> -0.019016532039634684, distanceSentinelFootprintNeighbourhood -> 0.46839704438965074, vepMaximumNeighbourhood -> -0.034280363722216364, proteinGeneCount500kb -> 0.39402035712660993, vepMean -> 0.7099149349443197, distanceSentinelFootprint -> 0.3021197989635074, vepMaximum -> 0.0}

All results available at: gs://ot-team/irene/l2g/06122024/locus_to_gene_predictions
All predictions have their corresponding explanations.
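
One way to read a shapleyValues map is to rank its features by absolute contribution. The sketch below does this for a handful of the values printed above for ENSG00000075073, rounded to three decimals. The helper name `top_contributions` is hypothetical and not part of this PR.

```python
from __future__ import annotations


def top_contributions(shapley_values: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n features with the largest absolute contribution."""
    ranked = sorted(shapley_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Subset of the map printed above for ENSG00000075073, rounded to three decimals.
shapley_values = {
    "eQtlColocH4MaximumNeighbourhood": 9.521,
    "distanceFootprintMeanNeighbourhood": 3.584,
    "distanceTssMeanNeighbourhood": 2.129,
    "eQtlColocClppMaximum": 1.858,
    "sQtlColocClppMaximumNeighbourhood": -0.844,
    "pQtlColocClppMaximum": 0.0,
}

for feature, value in top_contributions(shapley_values):
    print(f"{feature}: {value:+.3f}")
```

Sorting by absolute value keeps strongly negative contributions visible alongside positive ones.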

🛠 What does this PR implement

New shapleyValues field (map type) in the prediction schema
New util convert_map_type_to_columns that converts the feature annotation in the locusToGeneFeatures map type into a dataframe that can be passed to the SHAP explainer
model added as an instance attribute of the Predictions dataset
New explain method in the predictions dataset. It calculates Shapley values and returns a new object with the added column.
Updated the L2G step to compute this information
Enhancement in Dataset.filter so that the new instance it returns keeps the attributes of the original. This was necessary to propagate the model instance attribute after each modification of the predictions dataset.
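
The attribute-preserving behaviour of filter can be illustrated with a minimal sketch. It assumes a toy wrapper around a list of rows; `ToyDataset` and `ToyPredictions` are hypothetical names, and the real Dataset wraps a Spark dataframe, so the actual implementation may well differ.

```python
from __future__ import annotations

import copy
from typing import Any, Callable


class ToyDataset:
    """Hypothetical stand-in for a dataset wrapper; not gentropy's Dataset."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows

    def filter(self, condition: Callable[[dict[str, Any]], bool]) -> "ToyDataset":
        # Shallow-copy the instance so every attribute set on it is carried
        # over, then replace only the data with the filtered rows.
        new = copy.copy(self)
        new.rows = [row for row in self.rows if condition(row)]
        return new


class ToyPredictions(ToyDataset):
    """Toy predictions dataset that carries an extra `model` attribute."""

    def __init__(self, rows: list[dict[str, Any]], model: Any = None) -> None:
        super().__init__(rows)
        self.model = model


model = object()  # placeholder for a trained model
predictions = ToyPredictions(
    [{"geneId": "ENSG00000075073"}, {"geneId": "ENSG00000156515"}],
    model=model,
)
filtered = predictions.filter(lambda row: row["geneId"] == "ENSG00000075073")
print(type(filtered).__name__, filtered.model is model, len(filtered.rows))
```

Copying the instance rather than calling the constructor means subclasses do not need to re-pass each extra attribute by hand.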

🙈 Missing

The step has not been run end to end yet: I have only checked that it works by running predictions.explain() interactively.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?


ireneisdoomed · 2024-12-06T17:21:01Z

The new version fixes the bug in the previous one by avoiding dictionary operations: the new map is built by joining the initial dataframe with the dataframe of contributions.
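
The join-based approach can be sketched in plain Python as a stand-in for the Spark join. Everything here is an assumption made for illustration: the `rowId` key, the `attach_shapley_values` helper and the two-feature subset are hypothetical, and the values are rounded from the output shown in the PR description.

```python
from __future__ import annotations

from typing import Any

# Column order of the contribution vectors (a two-feature subset for illustration).
FEATURES = ["distanceTssMean", "eQtlColocH4Maximum"]


def attach_shapley_values(
    predictions: list[dict[str, Any]],
    contributions: list[dict[str, Any]],
    feature_names: list[str],
) -> list[dict[str, Any]]:
    """Inner-join predictions with contributions on rowId and add a feature -> value map."""
    by_id = {c["rowId"]: c["values"] for c in contributions}
    explained = []
    for row in predictions:
        if row["rowId"] not in by_id:
            continue  # inner join: rows without contributions are dropped
        values = by_id[row["rowId"]]
        explained.append({**row, "shapleyValues": dict(zip(feature_names, values))})
    return explained


predictions = [
    {"rowId": 0, "geneId": "ENSG00000075073"},
    {"rowId": 1, "geneId": "ENSG00000156515"},
]
# One contribution vector per row, aligned with FEATURES.
contributions = [
    {"rowId": 0, "values": [0.595, -0.143]},
    {"rowId": 1, "values": [1.553, -0.143]},
]

explained = attach_shapley_values(predictions, contributions, FEATURES)
for row in explained:
    print(row["geneId"], row["shapleyValues"])
```

Keying both sides on the same row identifier keeps each contribution vector attached to the right prediction regardless of row order.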

Computing the Shapley values takes time, but in my experiments creating the Spark dataframe from the Pandas dataframe was the real bottleneck. The code may run into memory issues when run locally on a very large dataframe.

I tried to avoid this by using Pandas UDFs, taking this and this as a guide, but Spark kept crashing due to serialization issues. Prediction has now gone from 6m to 13m (job). All predictions have their explanations built in.

ireneisdoomed added 3 commits December 3, 2024 09:33

feat(prediction): add model as instance attribute

72259fc

feat: added convert_map_type_to_columns spark util

9e8c491

feat(prediction): new method explain returns shapley values

450a937

github-actions bot added size-S Dataset Feature labels Dec 3, 2024

ireneisdoomed added 3 commits December 4, 2024 15:58

feat(prediction): explain returns predictions with shapley values

08ae6bd

chore: compute shapleyValues in the l2g step

9d40e62

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

125425f

…-shapley-predictions

github-actions bot added size-M Step and removed size-S labels Dec 4, 2024

ireneisdoomed marked this pull request as ready for review December 4, 2024 16:32

ireneisdoomed requested a review from d0choa December 4, 2024 16:32

ireneisdoomed added 6 commits December 5, 2024 17:53

refactor: use pandas udf instead

f407512

refactor: forget about udfs and get shaps single threaded

f542395

chore: remove reference to chromatin interaction data in HF card

9403fe6

fix(l2g_prediction): methods that return new instance preserve attribute

1bc6f3a

feat(dataset): filter method preserves all instance attributes

8420933

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

8a85f4f

…-shapley-predictions

github-actions bot added the Method label Dec 6, 2024
