Should irods check validate the stored data or against the md5 file #185

Open
xiamaz opened this issue Jun 30, 2023 · 4 comments
Labels
question Further information is requested

Comments

@xiamaz
Member

xiamaz commented Jun 30, 2023

Currently all check commands for irods work against the separately stored md5 file. This is similar to what is being done by the sodar server commands. After moving a landing zone, there should be no additional need to manually validate these files.

These commands duplicate logic already contained in irods, as validation of replica checksums against the stored data is already part of irods itself.

Unless there are SODAR-independent workflows that require manual validation of uploaded md5 files, I propose replacing these checks in cubi-tk with native irods checksum checks.

This affects irods/check, sea-snap/check_irods and snappy.
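As a rough illustration of what a check based on irods' own checksums could look like, here is a minimal sketch that compares the checksums recorded for each replica of a data object. It only assumes replica records that expose `number` and `checksum` attributes; the function name `inconsistent_replicas` is hypothetical and not part of cubi-tk. It compares recorded values only and does not re-read the stored data, which is the part irods would do server-side.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def inconsistent_replicas(replicas: Iterable) -> Dict[Optional[str], List[int]]:
    """Group replica numbers by their recorded checksum.

    Each item is assumed (hypothetically) to expose `.number` and `.checksum`.
    Returns an empty dict when all replicas agree, otherwise a mapping of
    checksum -> replica numbers so the disagreement can be reported.
    A replica without a checksum (None) forms its own group.
    """
    by_checksum: Dict[Optional[str], List[int]] = defaultdict(list)
    for replica in replicas:
        by_checksum[replica.checksum].append(replica.number)
    return dict(by_checksum) if len(by_checksum) > 1 else {}
```

With python-irodsclient, the replica records would presumably come from the `replicas` attribute of a data object; that mapping is an assumption here and should be checked against the client documentation.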

@xiamaz xiamaz added the question Further information is requested label Jun 30, 2023
@xiamaz
Member Author

xiamaz commented Jun 30, 2023

@ericblanc20 @holtgrewe Input would be much appreciated

@xiamaz xiamaz mentioned this issue Jun 30, 2023
3 tasks
@ericblanc20
Contributor

I am not sure I understand what you propose to do. I may be mistaken, but I understand that:

  • cubi-tk irods check checks the internal integrity of iRODS, i.e. the consistency of md5 checksums across replicas. It looks like a health check of the iRODS system.
  • cubi-tk sodar/snappy/seasnap check compares the md5 stored in iRODS with the local md5. Its purpose is to verify agreement between the local data and what has been stored in SODAR.

In functional analysis projects, it is often valuable to be able to verify that the local analysis files (on the cluster) are identical to those stored in SODAR, especially when the analysis report has been re-run.
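To make the second kind of check concrete, here is a minimal sketch of comparing a local file against an md5 sidecar file. It assumes the sidecar uses the usual `md5sum` layout (hex digest first, then the file name); the function names are hypothetical and this is not the actual cubi-tk implementation.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_md5_sidecar(md5_path):
    """Return the hex digest from the first line of an md5sum-style file."""
    lines = Path(md5_path).read_text().splitlines()
    if not lines or not lines[0].split():
        raise ValueError(f"no checksum found in {md5_path}")
    return lines[0].split()[0].lower()


def matches_sidecar(data_path, md5_path):
    """True if the local file's md5 equals the digest in the sidecar file."""
    return md5_of_file(data_path) == read_md5_sidecar(md5_path)
```

The sidecar passed in could be the local copy or one downloaded from iRODS; the comparison itself is the same either way.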

@xiamaz
Member Author

xiamaz commented Jun 30, 2023

Thanks. The issue is that the checksum for each file is currently stored twice: in a separate md5 file with the same name, and in the irods metadata itself.

Given your use cases, the md5 file in irods should never be necessary, since it is better to let irods compute and store the checksum for us. For example, irods check could simply use the checksum functionality described at https://github.com/irods/python-irodsclient#computing-and-retrieving-checksums, and pipeline-specific checks should compare the checksum obtained from the irods metadata against a locally computed checksum.
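A minimal sketch of the pipeline-side comparison, assuming the checksum string has already been fetched from irods. It handles the two formats irods stores: a bare hex string for MD5 and a `sha2:` prefix followed by base64 for SHA-256. The function names are hypothetical, and fetching the checksum is left out on purpose.

```python
import base64
import hashlib


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Return the raw digest bytes of a local file for the given algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def matches_catalog_checksum(path, catalog_checksum):
    """Compare a local file against a checksum string taken from irods.

    Assumes "sha2:<base64>" means SHA-256 and anything else is an MD5 hex
    digest; other schemes are not handled in this sketch.
    """
    prefix = "sha2:"
    if catalog_checksum.startswith(prefix):
        expected = base64.b64decode(catalog_checksum[len(prefix):])
        return file_digest(path, "sha256") == expected
    return file_digest(path, "md5") == bytes.fromhex(catalog_checksum)
```

With python-irodsclient, the string would presumably come from the data object's checksum as described in the linked README section, so no md5 file in irods is involved.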

@sellth
Contributor

sellth commented Jul 3, 2023

This is an interesting point and maybe @mikkonie can chime in on this once he's back from vacation. Why do we actually move the .md5 files into the main iRODS storage? They are only needed for landing zone validation and could be discarded afterwards as the hashsums are also stored in the iRODS metadata.

Edit: I guess there is some use in having them readily available for another check after downloading data from SODAR (especially when not using iRODS tools, e.g. when downloading via Davrods), but this then raises the question of why they're not shown in the "List files" web view.
