MPNST sample updates AND PDX code addition #251

Merged
merged 6 commits into from
Dec 6, 2024
Conversation

sgosline
Member

This is a pretty large merge because my branch is a bit outdated, so @jjacobson95 please confirm that the dataset build functionality works! (I didn't try build_all.py, only build_dataset.py.)

This PR closes #146, which has been open far too long. The PDX data required a new curve-fitting statistic, which was part of the delay. There was a second, untracked issue about the mpnst samples not matching entirely, but that has also been resolved.

@sgosline sgosline requested a review from jjacobson95 November 22, 2024 00:23
@jjacobson95
Collaborator

Does this branch replace the mpnst build with mpnstPDX or does it intend for both to be run? Trying to figure out how to resolve these conflicts.

@jjacobson95
Collaborator

It looks like it's built to be a separate dataset from mpnst. Is this the plan?

@sgosline
Member Author

sgosline commented Nov 22, 2024

I updated the mpnst dataset (drug response in PDX MT) and added the mpnstpdx (drug response in vivo PDX) dataset.

@jjacobson95
Collaborator

Is the samples file duplicated between the two datasets?

Trying to figure out how build/mpnstPDX/build_samples.sh should work.

@jjacobson95
Collaborator

jjacobson95 commented Nov 22, 2024

Same question with the drugs file. There is no build_drugs.sh; should I just make this a duplicate of the mpnst_drugs.tsv file as well?

@sgosline
Member Author

sgosline commented Nov 22, 2024

The sample file should be copied, not regenerated, because the original samples are the same. I'm not sure what happened to the drug file; it has been added now.

@jjacobson95
Collaborator

The mpnst_copy_number.csv file is missing improve_sample_id values for some rows.

When running the validation code, I'm getting this error:

linkml-validate --schema schema/coderdata.yaml --target-class "Copy Number" local/mpnst_copy_number.csv
...
[ERROR] [local/mpnst_copy_number.csv/345134] 'improve_sample_id' is a required property in /
[ERROR] [local/mpnst_copy_number.csv/345135] 'improve_sample_id' is a required property in /
[ERROR] [local/mpnst_copy_number.csv/345136] 'improve_sample_id' is a required property in /

Tail of mpnst_copy_number in this branch:

57135,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,
57054,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,
57055,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,
9085,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,
253175,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,

Tail of mpnst_copy_number in build 0.1.40:

57135,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270
57054,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270
57055,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270
253175,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270
9085,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270
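
A quick way to size the problem before running the full validator is to count the rows whose improve_sample_id field is empty. The following is a minimal sketch using only the Python standard library. The function name count_missing is hypothetical, and the toy header below is an assumption for illustration (only improve_sample_id is confirmed by the validator output above); the real file's column names may differ.

```python
import csv
import io


def count_missing(handle, column="improve_sample_id"):
    """Return (total_rows, missing_rows) for `column` in a CSV file object."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    total = 0
    missing = 0
    for row in reader:
        total += 1
        value = row[column]
        # A short row yields None; an empty trailing field yields "".
        if value is None or value.strip() == "":
            missing += 1
    return total, missing


# Toy data shaped like the tails above. Every header name except
# improve_sample_id is an assumption, not the real file header.
sample = io.StringIO(
    "entrez_id,copy_number,copy_call,study,source,improve_sample_id\n"
    "57135,0.00732396189266279,deep del,MPNST PDX MT,NF Data Portal,4270\n"
    "57054,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,\n"
    "9085,0.00695501817407717,deep del,MPNST PDX MT,NF Data Portal,\n"
)
total, missing = count_missing(sample)
print(f"{missing} of {total} rows lack improve_sample_id")
```

To run it against a real build output, pass an open file instead, for example `open("local/mpnst_copy_number.csv", newline="")`.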

@sgosline
Member Author

OK, the latest commit should fix this. It's a change to the sample generation that I added.

@jjacobson95
Collaborator

jjacobson95 commented Nov 23, 2024

Running into several issues as this branch is 71 commits behind main. Don't have time today to work on these, but in a merged branch (drop_drugs) I am running into the following issues for MPNST (not pdx).

Drugs file:

Traceback (most recent call last):
  File "/app/build_drug_desc.py", line 97, in <module>
    main()
  File "/app/build_drug_desc.py", line 87, in main
    id_morg = ids.rename({"canSMILES":'smile'},axis=1).merge(morgs)[['improve_drug_id','structural_descriptor','descriptor_value']]
  File "/opt/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 9843, in merge
    return merge(
  File "/opt/venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 148, in merge
    op = _MergeOperation(
  File "/opt/venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 719, in __init__
    self.left_on, self.right_on = self._validate_left_right_on(left_on, right_on)
  File "/opt/venv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1500, in _validate_left_right_on
    raise MergeError(
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
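
When pandas merge is called without `on`, it joins on the columns the two frames have in common, so this error means the renamed ids table and morgs shared no column name at that point (for instance, if one of them was empty or lacked the expected SMILES column). The sketch below checks the headers of the two inputs before merging, using only the standard library. The helper names header_of and shared_columns are hypothetical, and the example headers are inferred from the traceback line above rather than taken from the real files.

```python
import csv
import io


def header_of(handle, delimiter="\t"):
    """Return the column names from the first line of a delimited file object."""
    return next(csv.reader(handle, delimiter=delimiter), [])


def shared_columns(left_cols, right_cols, renames=None):
    """Columns a default merge would join on, after renaming the left table."""
    renames = renames or {}
    left = [renames.get(name, name) for name in left_cols]
    right = set(right_cols)
    return [name for name in left if name in right]


# Assumed headers, guessed from the column names in the traceback.
ids_header = header_of(io.StringIO("improve_drug_id\tchem_name\tcanSMILES\n"))
morgs_header = header_of(
    io.StringIO("smile\tstructural_descriptor\tdescriptor_value\n")
)

# With the rename applied, the tables share 'smile' and the merge can proceed.
print(shared_columns(ids_header, morgs_header, {"canSMILES": "smile"}))

# An empty descriptor table shares nothing, which is the failing case.
print(shared_columns(ids_header, [], {"canSMILES": "smile"}))
```

An empty result from shared_columns is the condition under which the merge in build_drug_desc.py would raise the MergeError shown above.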

Experiments file:

Joining with `by = join_by(chem_name)`
Error in `left_join()`:
! Can't join `x$chem_name` with `y$chem_name` due to incompatible types.
ℹ `x$chem_name` is a <character>.
ℹ `y$chem_name` is a <logical>.
Backtrace:
     ▆
  1. ├─base::subset(...)
  2. ├─dplyr::distinct(...)
  3. ├─dplyr::select(...)
  4. ├─dplyr::mutate(left_join(alldrugs, drug_map), time_unit = "hours")
  5. ├─dplyr::left_join(alldrugs, drug_map)
  6. ├─dplyr:::left_join.data.frame(alldrugs, drug_map)
  7. │ └─dplyr:::join_mutate(...)
  8. │   └─dplyr:::join_cast_common(x_key, y_key, vars, error_call = error_call)
  9. │     ├─rlang::try_fetch(...)
 10. │     │ └─base::withCallingHandlers(...)
 11. │     └─vctrs::vec_ptype2(x, y, x_arg = "", y_arg = "", call = error_call)
 12. ├─vctrs (local) `<fn>`()
 13. │ └─vctrs::vec_default_ptype2(...)
 14. │   ├─base::withRestarts(...)
 15. │   │ └─base (local) withOneRestart(expr, restarts[[1L]])
 16. │   │   └─base (local) doWithOneRestart(return(expr), restart)
 17. │   └─vctrs::stop_incompatible_type(...)
 18. │     └─vctrs:::stop_incompatible(...)
 19. │       └─vctrs:::stop_vctrs(...)
 20. │         └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
 21. │           └─rlang:::signal_abort(cnd, .file)
 22. │             └─base::signalCondition(cnd)
 23. └─rlang (local) `<fn>`(`<vctrs__2>`)
 24.   └─handlers[[1L]](cnd)
 25.     └─dplyr:::rethrow_error_join_incompatible_type(cnd, vars, error_call)
 26.       └─dplyr:::stop_join(...)
 27.         └─dplyr:::stop_dplyr(...)
 28.           └─rlang::abort(...)
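
The message says chem_name is character in one table and logical in the other. R's readers typically type a column as logical when every value in it is missing, so one plausible cause (an assumption; the thread does not confirm it) is a drug map whose chem_name column is entirely empty. The sketch below checks a tab-separated file for that condition. The function name column_is_all_empty and the toy rows are hypothetical.

```python
import csv
import io


def column_is_all_empty(handle, column, delimiter="\t", na_values=("", "NA")):
    """Return True if every value of `column` is empty or an NA marker.

    A file with a header and no data rows also returns True.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    return all((row[column] or "").strip() in na_values for row in reader)


# Toy drug maps; the IDs and names here are made up for illustration.
empty_names = io.StringIO("improve_drug_id\tchem_name\nSMI_1\t\nSMI_2\tNA\n")
filled_names = io.StringIO("improve_drug_id\tchem_name\nSMI_1\tdrugA\n")

print(column_is_all_empty(empty_names, "chem_name"))
print(column_is_all_empty(filled_names, "chem_name"))
```

If the check returns True for the real drug file, the upstream drug step is the place to look rather than the join itself.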

@jjacobson95
Collaborator

jjacobson95 commented Nov 23, 2024

As these are preventing the full build, I'm going to exclude them to get the updated drug and sample numbers for the others for AACR. When these are working, I'll rebuild with them included.

@sgosline
Member Author

So is the bug in the pdx_code branch or the drop_drugs branch? I can work on it today.

@jjacobson95
Collaborator

Sorry, this is a bit complicated. It is currently on the drop_drugs branch, and I think it exists because some files don't match how the build_all / build_dataset process currently works. The updated build process is detailed in the mpnst-readme-update branch.

If you could just get the sample / drug numbers for these (for AACR), I'd be happy to align this to the build process when I return after Thanksgiving.

Just a side note: I am still working on getting the numbers for the others. The GDC-client was updated, which broke HCMI yesterday (undocumented bug), so I'm seeing what it'll take to fix.

@sgosline
Member Author

Yeah, the numbers are unchanged since the paper, so we can just use those.

@jjacobson95
Collaborator

Sounds great

@sgosline
Member Author

I am adding another commit where I now have things working. Hope they work for you.

sgosline added a commit that referenced this pull request Nov 27, 2024
mPnst and mpnstpdx code now build.
@jjacobson95
Collaborator

jjacobson95 commented Dec 2, 2024

Looks like this is pretty close to working. There are no error messages to report during the build; however, the validator is failing for mpnst_transcriptomics, mpnst_proteomics, mpnst_copy_number, and mpnst_mutations.

All of the error messages where it fails are the same:
'improve_sample_id' is a required property.

It indicates that there are thousands of rows with missing improve_sample_id values.


For example, for mpnst_transcriptomics.csv:

Head (improve_sample_ids are present)

entrez_id,transcriptomics,improve_sample_id,source,study
729759,15.555831,22,NF Data Portal,MPNST PDX MT
401934,0.103466,22,NF Data Portal,MPNST PDX MT
388581,3.487198,22,NF Data Portal,MPNST PDX MT
388581,4.795471,22,NF Data Portal,MPNST PDX MT
388581,28.787805,22,NF Data Portal,MPNST PDX MT
388581,7.684035,22,NF Data Portal,MPNST PDX MT
388581,2.763487,22,NF Data Portal,MPNST PDX MT
80772,5.464398,22,NF Data Portal,MPNST PDX MT
80772,1.631856,22,NF Data Portal,MPNST PDX MT

Tail (improve_sample_ids are not present)

4513,766.053199,,NF Data Portal,MPNST PDX MT
4509,113.918447,,NF Data Portal,MPNST PDX MT
4508,134.885977,,NF Data Portal,MPNST PDX MT
4514,1387.07963,,NF Data Portal,MPNST PDX MT
4537,350.462059,,NF Data Portal,MPNST PDX MT
4539,212.951003,,NF Data Portal,MPNST PDX MT
4538,143.134393,,NF Data Portal,MPNST PDX MT
4540,50.31931,,NF Data Portal,MPNST PDX MT
4541,9.996729,,NF Data Portal,MPNST PDX MT
4519,90.065835,,NF Data Portal,MPNST PDX MT
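
Because the head of the file has IDs and the tail does not, it can help to find where the missing values start and whether they form one block or several. The sketch below reports each run of consecutive data rows with an empty improve_sample_id, using only the standard library. The function name missing_runs is hypothetical; the header and rows in the toy input are copied from the excerpts above.

```python
import csv
import io
from itertools import groupby


def missing_runs(handle, column="improve_sample_id"):
    """Return (first_row, last_row) pairs for runs of rows with an empty `column`.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    flags = ((row[column] or "").strip() == "" for row in reader)
    runs = []
    position = 0  # number of data rows consumed so far
    for is_missing, group in groupby(flags):
        length = sum(1 for _ in group)
        if is_missing:
            runs.append((position + 1, position + length))
        position += length
    return runs


sample = io.StringIO(
    "entrez_id,transcriptomics,improve_sample_id,source,study\n"
    "729759,15.555831,22,NF Data Portal,MPNST PDX MT\n"
    "401934,0.103466,22,NF Data Portal,MPNST PDX MT\n"
    "4513,766.053199,,NF Data Portal,MPNST PDX MT\n"
    "4509,113.918447,,NF Data Portal,MPNST PDX MT\n"
    "388581,3.487198,22,NF Data Portal,MPNST PDX MT\n"
    "4508,134.885977,,NF Data Portal,MPNST PDX MT\n"
)
runs = missing_runs(sample)
print(runs)
```

A single long run at the end of the file would suggest that whole samples failed to map to an ID, while many short runs would point to a row-level problem; either reading is a hypothesis to check against the sample file.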

@sgosline
Member Author

sgosline commented Dec 3, 2024

I can't seem to repro this error locally. It all works on my end using this branch.

@jjacobson95
Collaborator

jjacobson95 commented Dec 4, 2024

After clearing the cache and rebuilding, I am still ending up with missing improve_sample_ids.

This is my run command:

# ensure branch is up to date
git pull

# confirm the current branch is drop_drugs
git branch

# remove all docker caches and images
docker stop $(docker ps -q)
docker rm $(docker ps -aq)
docker rmi $(docker images -q) -f
docker builder prune -a --force
docker system prune -a --volumes -f

# ensure nothing is in local to interfere or affect the build
rm -r local

# run build command
python build/build_dataset.py --dataset mpnst --build

I'll go ahead and git clone to test a totally fresh repo as well. Maybe there is some artifact that could be causing this.

Update: On a clean repo, this issue is still present for me. I am using a Mac with an M1 chip.

@jjacobson95 jjacobson95 merged commit df14101 into main Dec 6, 2024
Successfully merging this pull request may close these issues.

Add in MPNST PDX Data