RepertoireGroup refinements #578
Comments
No, I would not want the implicit definition, because it would be very useful (and possibly a quite common use case) to group repertoires across multiple repositories. I'm assuming this use case actually with VDJServer. From an ADC perspective, I think all identifiers (which are meant to be FAIR) should be PIDs, except those few like subject_id and sample_id which are local to a study. I don't think that a separate But it is an open question on whether RepertoireGroup is a top-level object that has its own ADC entry point, and thus needs a PID? It's maybe possible that RepertoireGroup is embedded within the objects that use it, e.g. DataProcessing. |
Yes, but that's purposefully not in the scope of RepertoireGroup. An analogy would be wanting only the productive rearrangements within a Repertoire; this is not defined with Repertoire but in the DataProcessing that acts upon that repertoire. The same would be true with RepertoireGroup. With the redesign of DataProcessing, it points to Repertoire and RepertoireGroup, instead of vice versa. |
Yes so RepertoireGroup can be re-used for both: (1) a set of repertoires, and (2) a sequence of repertoires for a time course. It's optional, so leave it out if it doesn't apply. |
Yes, I agree, same here. When you do a download on the Gateway, you are essentially creating a |
It just seems to be promoting time point as a "special case" when to me really it is one of many criteria/fields from the Repertoire metadata that you may use to group repertoires. Why is grouping via time point a more important grouping criterion than disease state or tissue type? |
I don't think our repositories would store |
Well, you are the one that created this issue and wanted "resolvability" before RepertoireGroups are usable ;-D You are welcome to add a IMO, to use RepertoireGroup on the Gateway today, we don't need agreement on the details of PIDs or resolvable IDs. All we need is agreement that the |
The difference is that I consider that time course is not a grouping criterion like disease state or tissue type. In those examples, you are referring to a property of Repertoire (like Do you want to eliminate the time point from RepertoireGroup? If so, then we still need a mechanism to describe a sequence of repertoires which can have user-defined labels, that IMO can be used to specify generic ordering but also time course data. What is your suggestion for how to specify that in RepertoireGroup? |
So then the results of the analysis on that In theory, this should be cheap, as it will contain neither |
VDJServer will store
Right now, none of the studies in the ADC have cohorts defined as RepertoireGroups. That would be nice to have though. |
I just revisited the current definition and the discussion above. Based on all of that, I think the This manifest:
Could have a
I could also see having a
|
I could also see how one could create two |
@scharch @schristley I think we are done 8-) |
Tongue in cheek comment, I didn't really mean that I think we are done... I am re-opening this issue - sorry for the confusion. |
@scharch in all seriousness, this meets my expectations/thoughts on what one might want from a I am not so sure on your use case and what you are thinking about this... |
Brainstorming a pretty complex use case that combines Manifest and RepertoireGroup as we might use it on the iReceptor Gateway in an Analysis App. A file that describes two
Combine that with a manifest from a download like this:
Assume an analysis tool designed to do comparative analyses on N Repertoire Groups. Expects a RepertoireGroup.json, a Manifest.json file, and all of the relevant files described in the Manifest.json. Processing would:
|
Here is another interesting one - real study, with real data - Go to the Gateway and search for TRA and PRJCA002413:
With manifest as below (this is essentially what you get from the Gateway if you were to download this data):
The RepertoireGroup file says there are three cohorts being considered. The Manifest file says that there is one source DataSet to consider. Assuming the analysis is the same as the previous scenario, the analysis then splits covdi19-1.tsv into three separate files, one each for ERS, LRS, and HC, and then runs a comparative analysis across those three RepertoireGroups. Exactly the same data but a RepertoireGroup file that only contains LRS and ERS gives you a late recovery and early recovery comparison (ignores healthy controls). Exactly the same data, but with five RepertoireGroups in the file, each with one repertoire per group for each of ERS1, ERS2, ERS3, ERS4, ERS5, gives you a comparative analysis across the five different early recovery subjects (ignores all of the LRS and HC repertoires). This seems pretty powerful to me... We can basically use the RepertoireGroup file to slice a given data set in any way that we want (at the repertoire_id level at least). |
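The slicing described above can be sketched roughly as follows. This is only an illustration, not the Gateway's implementation: it assumes each group carries a `repertoire_group_id` and a `repertoires` list of objects with a `repertoire_id`, and the group contents, repertoire IDs, and TSV rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def split_by_group(rows, groups):
    """Partition rearrangement rows by RepertoireGroup membership.

    A repertoire may appear in more than one group, and rows whose
    repertoire_id is in no group are simply ignored.
    """
    # Map each repertoire_id to the IDs of the groups that list it.
    membership = defaultdict(list)
    for group in groups:
        for rep in group["repertoires"]:
            membership[rep["repertoire_id"]].append(group["repertoire_group_id"])

    subsets = {group["repertoire_group_id"]: [] for group in groups}
    for row in rows:
        for group_id in membership.get(row["repertoire_id"], []):
            subsets[group_id].append(row)
    return subsets


# Hypothetical groups and data, for illustration only.
groups = [
    {"repertoire_group_id": "ERS",
     "repertoires": [{"repertoire_id": "r1"}, {"repertoire_id": "r2"}]},
    {"repertoire_group_id": "LRS",
     "repertoires": [{"repertoire_id": "r3"}]},
]
tsv = (
    "sequence_id\trepertoire_id\tv_call\n"
    "s1\tr1\tTRAV1\n"
    "s2\tr3\tTRAV2\n"
    "s3\tr9\tTRAV3\n"
    "s4\tr2\tTRAV1\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
subsets = split_by_group(rows, groups)
for group_id, members in subsets.items():
    print(group_id, [r["sequence_id"] for r in members])
```

Changing only the `groups` list (two groups, five groups, one repertoire per group) changes the comparison without touching the data file, which is the point made above.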
The big thing that's missing is the ability to extract and combine parts of I think this may be related to differences in our conceptions of what |
Should |
To do that, would it suffice to have a
|
No, because I might want an arbitrary filter criteria that doesn't align with the original researcher's sample/data processing.
|
OK, now that is complicated... and I was afraid that was going to be your answer 8-) Any ideas how to describe that??? |
And are we "simply" describing what is in the data set? Or are we describing what one needs to extract from the data set? That is the TSV file contains all rearrangements and we say:
to describe what needs to be extracted? |
We could probably use an ADC query to describe what is in the data set??? |
I was thinking along those lines, yes, but ultimately it probably needs to be free text.
Again, ideally the latter. But practically, probably the former. |
@scharch Yes, and I agree with your advocation! The first pass of Now I say, "as a whole", but I'm lying. I'm already doing some filtering because I almost always want just productive rearrangements, so I filter out the nonproductives. I've always had it in my mind to extend that filtering capability, and have delayed doing anything, but I increasingly need it. My anecdote? In a collaboration, I'm asked, can you give them those stats (gene usage, etc) for just IGHV4? No problem, I'll write a one-off script to pull out the rearrangements and voilà. Oh, if it isn't hard, give it to me for all VH families. Then two days later, can you give me IGHV4 but also separated by J family? You see where this is going... Before you know it, my one-off script is multiple scripts with multiple parameters. I have a multi-TB analysis tree with dozens of directories where the rearrangements have been duplicated numerous times... Exactly what I didn't want!
That's exactly what I was thinking. When I started thinking about how to generalize filtering, what makes sense is to use the ADC query language to describe the filter instead of inventing something new. This isn't too hard to implement so long as you are not that concerned about efficiency: you compute the expression tree implied by the ADC query on each rearrangement record, and it comes out true or false. But, and this is a big butt (sorry ;-), I've only been thinking in the context of filtering Rearrangements. If we want a more general filtering which works on Clones and Cells and whatever, that starts getting more complicated. Also, the filtering is only on fields of that one type, i.e. fields in the rearrangement file. If you want to express things that cross types (e.g. filter on both Rearrangement and Cell fields) then it gets even more complicated. |
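A minimal sketch of that per-record evaluation, assuming the usual ADC filter shape of nested `{"op": ..., "content": ...}` nodes. It covers only a subset of operators, treats `contains` as a case-insensitive substring match (an assumption here), and the function name `matches` and the sample records are hypothetical.

```python
import operator

# Comparison operators handled by this sketch (a subset of the ADC operators).
COMPARATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(node, record):
    """Evaluate one filter node against one record; returns True or False."""
    op, content = node["op"], node["content"]
    # Logical operators hold a list of child nodes.
    if op == "and":
        return all(matches(child, record) for child in content)
    if op == "or":
        return any(matches(child, record) for child in content)
    # Everything else is a leaf that names a single field.
    value = record.get(content["field"])
    if op == "is missing":
        return value is None
    if op == "is not missing":
        return value is not None
    if value is None:
        return False  # a missing field never satisfies a comparison
    if op == "contains":
        return str(content["value"]).lower() in str(value).lower()
    if op == "in":
        return value in content["value"]
    if op in COMPARATORS:
        return COMPARATORS[op](value, content["value"])
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical filter: productive IGHV4 rearrangements.
query = {"op": "and", "content": [
    {"op": "=", "content": {"field": "productive", "value": True}},
    {"op": "contains", "content": {"field": "v_call", "value": "IGHV4"}},
]}
records = [
    {"sequence_id": "a", "productive": True, "v_call": "IGHV4-34*01"},
    {"sequence_id": "b", "productive": False, "v_call": "IGHV4-59*01"},
    {"sequence_id": "c", "productive": True, "v_call": "IGHV3-23*01"},
]
kept = [r["sequence_id"] for r in records if matches(query, r)]
print(kept)
```

Because the filter is plain data, the same rearrangement file can be reused for every V/J combination instead of being duplicated per one-off script. As noted above, this only handles fields of a single record type.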
That's almost worthless honestly. If that is all you want, stick it in One idea is try not to do everything in one chunk. We have the basic |
No, but actually we need this anyways. The simple |
I agree with the principle, of course. But I was more thinking about
I think that the "right" answer is probably a |
If we don't try to do everything but think that allowing filters on object types would be useful, then a simple enhancement to
and with the type-key, that allows us to specify more, so here's the same example but also provide a filter on clones.
At least for my test case, this would be highly useful. I could create groups for each of the combinations of V and J families that I need, and yes it would be a lot, but then I could run |
Doing something like, give me the |
Yeah, this is where I was coming from, but there's definitely plenty of utility in simpler cases. I want to sleep on it a bit, but what you've proposed above seems pretty good to me at first blush. |
@schristley I'm thinking maybe the filters should be per |
Sure, just put the
Not easily. We don't try with the ADC API and instead just describe in the docs. |
As @schristley says, doing this as a single query in the ADC is probably unlikely. With that said this is very possible with multiple queries across the ADC endpoints. In fact, the above query is exactly what the iReceptor Gateway Cell page does currently. This is a search for Expression of TRBV4-1 > 10 across all T-cell repertoires in a specific study. If you click Download you get all the Rearrangements, Cells, and Expression data. So the ADC queries are powerful, you "just" can't do complicated joins across the collections. The iReceptor Gateway does 1 repertoire, 12 expression, 10 cell, and 10 rearrangement queries to gather the data presented in the page below 8-) |
With that said, finding all the Cells with that expression level is one query:
where query2.json is:
|
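To illustrate the "multiple queries, then join on the client" pattern described above, here is a rough in-memory sketch. It does not call any ADC endpoint and is not how the Gateway is implemented: the record layouts are simplified assumptions (for example, `property` is a plain string here), and the helper names and sample values are hypothetical.

```python
def cells_with_expression(expression_records, label, threshold):
    """Return the set of cell_ids whose expression of `label` exceeds `threshold`."""
    return {
        e["cell_id"]
        for e in expression_records
        if e["property"] == label and e["value"] > threshold
    }


def join_on_cells(cell_ids, records):
    """Keep only the records (cells or rearrangements) that belong to those cells."""
    return [r for r in records if r.get("cell_id") in cell_ids]


# Hypothetical results of three separate queries (expression, cell, rearrangement).
expression = [
    {"cell_id": "c1", "property": "TRBV4-1", "value": 25},
    {"cell_id": "c2", "property": "TRBV4-1", "value": 3},
    {"cell_id": "c3", "property": "CD8A", "value": 40},
]
cells = [{"cell_id": "c1"}, {"cell_id": "c2"}, {"cell_id": "c3"}]
rearrangements = [
    {"sequence_id": "s1", "cell_id": "c1"},
    {"sequence_id": "s2", "cell_id": "c1"},
    {"sequence_id": "s3", "cell_id": "c2"},
]

selected = cells_with_expression(expression, "TRBV4-1", 10)
selected_cells = join_on_cells(selected, cells)
selected_rearrangements = join_on_cells(selected, rearrangements)
print(sorted(selected), [r["sequence_id"] for r in selected_rearrangements])
```

The join across collections happens in the client, keyed on `cell_id`, which is why several queries are needed rather than one.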
A couple of questions that came up in an iReceptor Plus meeting today:
repertoire_group_url to the fields of RepertoireGroup to uniquely identify the source of the repertoire and where it was found. Or do we consider this implicit in the sense that one would have a separate RepertoireGroup for each repository in an analysis? This implicit definition is challenging if you want to have a RepertoireGroup that is grouping repertoires from multiple repositories.
repertoire_id is sufficient to completely describe the groups of repertoires that we might want to group together. For example, if we wanted a RepertoireGroup that only contained IGH and I have a different SampleProcessing or DataProcessing for these rearrangements, I would need to provide a sample_processing_id or a data_processing_id to accurately describe the set of rearrangements that are being considered as part of this RepertoireGroup.