Fix everything based on recommendations from CFDE #12

Open
icaoberg opened this issue Mar 24, 2021 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@icaoberg
Collaborator

This is the feedback we received from Arthur Brady at CFDE:

REQUIRED CHANGES (these have already been made in the attached datapackage; together, they make the submission a valid one):
* Concatenating id_namespace and local_id must always yield a string which is a valid URI. If URIs aren't already available for your objects, then the easiest, lowest-cost way to do this is to roll your own id_namespace using a "tag URI" scheme (see https://tools.ietf.org/html/rfc4151), which I've done for you in the amended dataset, as an example. (Note that the date in the tag-string is NOT meant to indicate anything about the date of the submission, or the version of the data, or anything like that: it's a time, paired with a domain you control (hubmapconsortium.org), which taken together mean only "this URI was created by whoever controlled this particular domain at this particular time." It won't need updating unless (a) you want to switch to a different URI scheme for identifying these same C2M2 objects in future submissions, or (b) someone totally new takes over the HuBMAP domain, in which case the date part of the tag for any new data coming out of HuBMAP should be updated to reflect the new owners.) I also fixed all local_id values to be URI-compliant (e.g. substituted "%20" for spaces).
* I tweaked the timestamps to comply with the format spec: I substituted a 'T' character for the space between the date and the time, and added '-00:00' to the end to indicate 'no time zone specified.' (You can swap that "no time zone" suffix out for the real value, if you want to and have the info available, or you can just delete the '-00:00' suffixes entirely: our timestamp suffixes are optional (cf. https://github.com/nih-cfde/published-documentation/wiki/TableInfo:-file.tsv). I just put all the possible components in so as to give a fully-qualified example.)
* biosample.tsv had duplicate rows; I removed those
* After dedup, biosample.tsv still had multiple rows per biosample: it seems 8 samples were assigned to multiple university-based projects. Primary project attributions like these need to be unique per sample. As a (surely unsatisfying) hack, I reattached all the biosamples to the root "HuBMAP" project, which will likely not suit you long-term, but is at least a valid configuration. If this is a situation in which one biosample was sub-sampled and then the sub-samples were sent to different places to participate in different projects, then we'll need either (a) new, more specific identifiers for each sub-sample to distinguish them from one another, with each uniquely attached to its own particular (e.g. university-based) project, or (b) to live with what I did for a while (essentially decoupling biosamples from individual projects and just putting them all under the "HuBMAP" purview) until such identifiers can be generated. Part of the complexity here is that we haven't explicitly modeled provenance or subsampling yet, so the approximations we're expecting ("most specific sample that actually got used, ignoring upstream sample-from-sample lineage") won't precisely match a complete, detailed sample inventory (yet). If this is too much to worry about at this point, then don't: in a few months we'll be providing structures to say "we had sample A, then we split it into A1, A2 and A3, and then we sent those off." For now, we only care about A1, A2 and A3 and not A, but if A is all you have info on, then my hack is probably the easiest way to represent that.
* There was a reference in project_in_project to a project labeled "Vanderbilt University," which broke our integrity checks because there was no such record in the main project table. I added one into project.tsv just to be safe, although this project doesn't seem to appear anywhere else in the submission. If you want to delete it, that's fine -- just be sure to delete it from both project.tsv and project_in_project.tsv, or there'll be another key-violation error.
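
The identifier and timestamp fixes described above could be reproduced with something like the following. This is only a sketch under assumptions: the `2021` date and the exact shape of `ID_NAMESPACE` are placeholders (the feedback does not quote the tag string used in the amended datapackage), the function names are made up for illustration, and the original timestamps are assumed to look like `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import datetime
from urllib.parse import quote

# ASSUMPTION: placeholder tag URI prefix in RFC 4151 form ("tag:" authority "," date ":").
# The domain comes from the feedback; the date is hypothetical.
ID_NAMESPACE = "tag:hubmapconsortium.org,2021:"


def make_local_id(raw_id: str) -> str:
    """Percent-encode a raw identifier so it is safe inside a URI (space -> %20)."""
    return quote(raw_id, safe="")


def make_uri(raw_id: str, id_namespace: str = ID_NAMESPACE) -> str:
    """Concatenate id_namespace and the encoded local_id, as the URI rule requires."""
    return id_namespace + make_local_id(raw_id)


def to_c2m2_timestamp(value: str, tz_suffix: str = "-00:00") -> str:
    """Turn 'YYYY-MM-DD HH:MM:SS' into 'YYYY-MM-DDTHH:MM:SS' plus an optional suffix.

    ASSUMPTION: the input format is a guess at what the original tables held.
    Pass tz_suffix="" to drop the suffix, or a real offset if it is known.
    """
    parsed = datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%S") + tz_suffix


if __name__ == "__main__":
    print(make_uri("sample 1"))                      # tag:hubmapconsortium.org,2021:sample%201
    print(to_c2m2_timestamp("2021-03-24 10:15:00"))  # 2021-03-24T10:15:00-00:00
```

Using `strptime` rather than a plain string replace means a malformed timestamp raises an error instead of being passed through silently.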

OTHER NON-URGENT COMMENTS
* I updated the JSON file to match the most recent changes to the C2M2 JSON Schema. The only change was to make the primary_dcc_contact.dcc_abbreviation field required. (You already had a non-null value there, so this schema change won't affect you directly today, but I did want to note and explain the diff between the included .json file and the earlier one.)
* Somewhere around 40 collections are specified, but there's nothing (files, biosamples) actually in any of them. If this was intended, then it's a fully-compliant thing to do and doesn't need to change; I just figured I'd point it out in case it wasn't the expected outcome.
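
Before resubmitting, the table-level problems raised in this feedback (duplicate biosample rows, biosamples attached to more than one project, project_in_project rows pointing at missing projects, and empty collections) could be caught locally with checks along these lines. This is a rough sketch, not the CFDE validator: the function names are hypothetical, and the column names are assumptions that should be verified against the C2M2 table documentation.

```python
import csv
from collections import defaultdict


def read_tsv(path):
    """Load a TSV file as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def dedupe_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def multi_project_biosamples(biosample_rows):
    """Return biosamples that are attached to more than one project."""
    projects = defaultdict(set)
    for row in biosample_rows:
        sample = (row["id_namespace"], row["local_id"])
        projects[sample].add((row["project_id_namespace"], row["project_local_id"]))
    return {sample: found for sample, found in projects.items() if len(found) > 1}


def missing_projects(project_rows, project_in_project_rows):
    """Return projects referenced in project_in_project but absent from project."""
    known = {(row["id_namespace"], row["local_id"]) for row in project_rows}
    referenced = set()
    for row in project_in_project_rows:
        referenced.add((row["parent_project_id_namespace"], row["parent_project_local_id"]))
        referenced.add((row["child_project_id_namespace"], row["child_project_local_id"]))
    return referenced - known


def empty_collections(collection_rows, *membership_tables):
    """Return collections with no rows in any of the given *_in_collection tables."""
    used = {
        (row["collection_id_namespace"], row["collection_local_id"])
        for table in membership_tables
        for row in table
    }
    return [
        (row["id_namespace"], row["local_id"])
        for row in collection_rows
        if (row["id_namespace"], row["local_id"]) not in used
    ]
```

For example, `missing_projects(read_tsv("project.tsv"), read_tsv("project_in_project.tsv"))` should return an empty set once every referenced project has a matching row in project.tsv.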
@icaoberg icaoberg added the bug Something isn't working label Mar 24, 2021
@icaoberg icaoberg self-assigned this Mar 24, 2021
@icaoberg icaoberg transferred this issue from another repository May 24, 2022