Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Pitch] Partial keep raw in OpenMS #145

Closed
jspaezp opened this issue Aug 6, 2023 · 14 comments · Fixed by #148
Closed

[Pitch] Partial keep raw in OpenMS #145

jspaezp opened this issue Aug 6, 2023 · 14 comments · Fixed by #148
Assignees

Comments

@jspaezp
Copy link
Contributor

jspaezp commented Aug 6, 2023

Hello,

I am working on a project where I would like to convert an sdrf to an openms workflow where I would like to keep the extension as is, only for some files.

(I am also implementing the workflow... nf-core/quantms#64 (comment))

Ideas:

parse_sdrf convert-openms {other options/args} --raw {.d/.raw/all/none)
# --raw .d would keep the raw extenion if it is .d, make it .mzML otherwise
# --raw all would keep all as raw
# --raw none would convert all to mzML
parse_sdrf convert-openms {other options/args} --extension_convert "raw:mzML,mzml:MZML,mzML:mzML,d:d"

# I am not that creative with names ...
# if --extension_alias is set, it will try to replace with the passed options
# and error out if a file does not match

# In this example
# specs.raw -> specs.mzML
# specs.mzml -> specs.mzML
# specs.mzML -> specs.mzML
# specs.d -> specs.d
# specs.docx -> # NOTHINGS, ERROR raised

let me know what you think and if the feature would be something you want me to make a PR for

Places where changes would go:

if not keep_raw:
ext = os.path.splitext(raw)
out = ext[0] + ".mzML"
else:
out = raw

(there is another place in the same file)

https://github.com/bigbio/sdrf-pipelines/blob/adf9279a5c1c03d578575c4a89dfd535af8715c2/sdrf_pipelines/parse_sdrf.py#L37C1-L37C1

@ypriverol
Copy link
Member

@jspaezp I like option 2. What do you think @jpfeuffer @daichengxin @fabianegli ?

@jpfeuffer
Copy link
Collaborator

Yes maybe we should do "2.", just in case we allow mixed designs at some point.

Name of the option could also be "extension_map(ping)"

@ypriverol
Copy link
Member

I have finished the first iteration of this implementation. @jpfeuffer Can I ask you something, will the name only changed in the openms.tsv but in the experimentl_design.tsv they are keep with the original name?

@ypriverol ypriverol mentioned this issue Sep 4, 2023
3 tasks
@jpfeuffer
Copy link
Collaborator

So, to be honest, after thinking about it a bit more, I think the safest way would be to always keep the original filenames everywhere.

BUT this means that you need to pass the design along the workflow for a bit longer, until the actual conversion step.
The actual conversion step will then precisely know what it is converting and which filenames to replace in the design.

By the way, has anyone tried with having the "raw" names in the design first? I thought @timosachsenberg implemented a file basename check only in the affected OpenMS tools.

@jpfeuffer
Copy link
Collaborator

I'm saying this because in n theory, this is all a hack based on the idea "we know that it most likely needs to be converted at some point in the future to mzML", which is not very nice.

@ypriverol
Copy link
Member

@jpfeuffer I guess, this will need some refactoring in quantms and also OpenMS if we don't do conversions?

@jpfeuffer
Copy link
Collaborator

No, as I said this should be implemented.

@ypriverol
Copy link
Member

Can we continue with the following approach and then do the major refactoring? In any case, this PR only include the possibilityin sdrf-pipelines to perform conversions in names. For example, if the keep names are allowed, then conversion will not be performed.

@jpfeuffer
Copy link
Collaborator

I think no changes are necessary but someone has to try.

@ypriverol
Copy link
Member

I actually removed now the option keep raw in the sdrf-pipelines. That would be the default option. If we want to convert between formats we will need to use option --extension_convert including raw -> mzML and others.

@ypriverol ypriverol linked a pull request Sep 4, 2023 that will close this issue
3 tasks
@jspaezp
Copy link
Contributor Author

jspaezp commented Sep 15, 2023

Hello there!

This fix broke how my data was handled, If I had files that ended with .zip it would fail but finishing with exit status 0 and generating an empty experimental design.

I am working over here on fixing this, I also noticed that the testing suite always passed because "ERROR" was looked for in the output, but click errored out with the "Error" word, thus it never failed.

Right now this test fails on me, let me know if that is the intended behavior.

def test_validate_srdf(change_test_dir):
     """
     :return:
     """
     runner = CliRunner()
     result = runner.invoke(cli, ["validate-sdrf", "--sdrf_file", f"{TESTDATA_PATH}/PXD000288.sdrf.tsv", "--check_ms"])
     assert result.exit_code == 0, result.output
     assert "ERROR" not in result.output.upper()
     print("validate sdrf " + result.output)
     """
     # Currently fails with this error
     E       AssertionError: The following columns are mandatory and not present in the SDRF: comment[technical replicate] -- ERROR
     E         The column comment[technical replicate] is not present in the SDRF -- ERROR
     E         The following columns are mandatory and not present in the SDRF: comment[technical replicate] -- ERROR
     E         The column comment[technical replicate] is not present in the SDRF -- ERROR
     E         There were validation errors.
     E         
     E       assert True == 0
     E        +  where True = <Result SystemExit(True)>.exit_code
     """
  • Sebastian

@ypriverol
Copy link
Member

You mean in your experimental design you have data in the way .zip and you want to convert .zip -> .d in the OpenMS experimental design?

@jspaezp
Copy link
Contributor Author

jspaezp commented Sep 15, 2023

Let me go back a bit ... just to make sure we are all in the same page.

If I wanted to set up a workflow with the file https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip; then I would build a .sdrf where that URL goes in the comment[file uri] column; therefore it would make a lot of sense to me that the comment[data file] column contains the filename from that URI (3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip).

nontheless, since we need the name in the experimental design (experimental_design, column Spectra_Filepath) to match the file name that the search engine will recieve, we would need to convert that .zip to .d.

'why do it in this stage?': Since there is now a flag in quantms that allow forcing the conversion of .d files to .mzML, makes more sense to me to allow the user to have the same starting .sdrf file, with any extension supported by the pipeline (d.gz, d.tar.gz, .d.tar, .d ...) and have the pipeline translate the extension, instead of having the user create an .sdrf with the "expected extension" of the output.

HAVING SAID THAT! even if that is not something you want, I will split the work on the testing side of things and submit a PR on it.

@ypriverol
Copy link
Member

Ok, the idea is that we allow users to convert:

  • .d.zip -> .d
  • .raw.zip -> .raw
  • .raw -> .mzML
    ....

I actually in the logic validated that we have only some extensions supported like .d -> .d or .raw -> .mzML, etc. I agree with you that we should move to a more generic appraoch when we allow any type of conversion from any file type to any file type. I will do a PR with the new changes for you to review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants