Standardize htsget:// scheme on spec #581
Comments
Hi, this was discussed today in the FASP call - and we thought we'd put some notes for further discussion: The call agreed there was probably a need for a If DRS was to serve up some form of
Is there other info the url would need to contain? Does the (there was then a discussion about the naming of the /reads and /variants endpoints - and if these are custom then how could they be discovered - at the moment even the service-info endpoints are not at a 'known' spot if custom paths are used) |
Looping in @mlin @jmarshall @daviesrob . Please add in anyone else who should be on this thread |
I'm not really in this loop, and am a bit confused on the status of htsget:// URIs. These are already used in the 2.24.1 (and probably earlier) release of the htsjdk, but I am inferring from the existence of this thread that they aren't standardized yet. The htsjdk recognizes the following as an endpoint to an alignment file, by recognizes I mean accepts
|
This was discussed in the 2020-10-27 htsget meeting. Clients like samtools given an URL like OTOH clients like IGV given a sample URL like I believe communicating that an URL will respond to htsget-style requests is the fundamental motivation for the proposed URL scheme. Otherwise each tool could have its own command line option or preferences, but an approach that was the same for all tools and didn't require a separate option would be preferable. (Another approach that wouldn't require defining a new scheme would be to alter the htsget protocol so that all htsget requests were required to contain an htsget query parameter. Then an htsget server could respond to a request for plain If such an URL scheme was going to be made a convention, obviously the htsget spec would be a place to do it. As to the format and contents of the URL, what I think we were envisaging when this was discussed in the htsget meeting was that the
(This is consistent with @jrobinso's description of what HTSJDK already supports.) If it did not contain all the parts of that URL (i.e., all of the hostname, port, and path), it would not in general be possible to reconstruct the complete URL — e.g., there is no reason a host couldn't serve multiple htsget datasets on different paths if it wanted to, so you can't drop components of the path as doing so would be ambiguous as to which dataset you were querying. Plain (Discovery and the exact path of the endpoints is a separate question. It was e.g. discussed on the htsget mailing list in February 2020, subject “htsget endpoint path is partially hardcoded” et al. In the past IIRC I have been told that Service Registry is the GA4GH solution for discovery.) |
At the moment I'm doing the following in IGV (and igv.js) to determine if an https:// URL is htsget. This is done after all other possibilities for interpreting the URL are exhausted. It's working, but feels maybe a bit fragile. I would still need to do this for an htsget:// URL in order to discover the type of service (variants or reads), so at the moment the htsget:// hint, if you will, doesn't really save any calls; however, step 2 could be skipped. (1) query the server with class=header I use the class=header parameter in the initial query, even though I don't really want or need the header, to avoid accidentally requesting the entire file and having it returned in the ticket as data URIs. This would be a pathological case, and unlikely. Re http: currently, given an htsget:// protocol, I first try https://, catch errors, and if it fails try http://. I think the htsjdk does something similar, based on comments there. |
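For illustration, the "try https://, then http://" fallback described above could be sketched roughly as follows. This is a minimal sketch, not IGV's or htsjdk's actual code: the function name `htsget_candidates` and the example host and path are hypothetical, and it assumes a straight swap of the scheme with no other change to the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def htsget_candidates(url: str) -> list[str]:
    """Return the URLs a client might try, in order, for a given input URL.

    Assumption: an htsget:// URL maps to https:// first and http:// second,
    with host, port, path and query left untouched. Any other URL is
    returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme.lower() != "htsget":
        return [url]
    # Swap only the scheme; keep every other component as given.
    return [urlunsplit(parts._replace(scheme=s)) for s in ("https", "http")]


# Hypothetical host and dataset path, for illustration only.
for candidate in htsget_candidates("htsget://example.org:8443/reads/sample1?class=header"):
    print(candidate)
```

A client would then request each candidate in turn and stop at the first one that responds without error.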
@jmarshall With respect to the meeting notes you referenced, it looks like an IGV use case is at least partly responsible for the "htsget" idea. IGV doesn't need it, both desktop and web versions are fully implemented now, bugs and issue reports notwithstanding. The htsjdk is using it, perhaps @lbergelson can expand on its usefulness there, but the IGV team (speaking) is neutral on this. If it is implemented a straight swap of protocols (htsget -> https/http) with no other change in the URL is preferred, as is apparently currently the case. |
The proposal to use an An alternative would be to formalise what IGV is doing. If a client makes an ordinary non-htsget HTTP(S) request for a region from a BAM/CRAM/VCF web resource (e.g.
in some order, and hope to make Range requests on the main file based on the contents of the downloaded index (from whichever index extension succeeded). An htsget server could signal the client to retry this as an htsget request in the usual HTTP way by returning a 426 Upgrade Required status code:
This would be the client's signal to retry the request using htsget-style An htsget server certainly should return 426 for requests for index filenames (e.g. URLs with spurious IMHO this is a better approach than a bespoke URL scheme because it keeps the details of the shift to htsget between the client and server. Having an |
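On the client side, the 426 idea might look something like the sketch below. It assumes only the status code itself; which headers such a response would carry is deliberately left out, and the function name and the returned labels are hypothetical.

```python
from http import HTTPStatus


def next_step(status_code: int) -> str:
    """Decide what a client does after a plain (non-htsget) request.

    Assumption: the server answers 426 Upgrade Required when the resource
    should be fetched with htsget-style requests instead.
    """
    if status_code == HTTPStatus.UPGRADE_REQUIRED:
        # Server signals that this resource is only available via htsget.
        return "retry-as-htsget"
    if status_code in (HTTPStatus.OK, HTTPStatus.PARTIAL_CONTENT):
        # Ordinary file or Range response: carry on with plain HTTP access.
        return "use-plain-http"
    return "give-up"
```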
#581 (comment) by @andrewpatto:
Why does DRS need to expose the richness of the htsget protocol? What issue would there be with serving up an |
Not sure if that's accurate. IIRC htsjdk had that |
That's why I wrote “seems”; moreover this was a reference to this previous comment: #581 (comment). HTSJDK introduced this in its client implementation in samtools/htsjdk#1494, which didn't explain whether or in what way the scheme was necessary in their implementation. Perhaps @andersleung or @lbergelson can shed some light on this. Anyway: I withdraw any comment about the genesis of this scheme proposal, and it's immaterial in any case. I would be interested in your and others' views on the substance of the counterproposal. |
I like the 426 response proposal; it would clean up the client code. IGV, and I imagine other clients, reads the index first, and there are 3 patterns that need to be tried: ".bam.bai", ".bai", and ".bam.csi". The first 2 because the htsjdk and samtools use different naming conventions for indexes. To be honest I don't recall if ".csi" is also tried, but the point is that any of the common patterns for indexes would ideally return the 426, so the first one succeeds and we don't cycle through all of them. |
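To make the three index patterns concrete, here is a rough sketch of how a client might derive the candidate index URLs from a BAM URL. It is not IGV's actual code: the function name is hypothetical, the example URL is made up, and it assumes the URL ends in ".bam" with no query string.

```python
def index_candidates(bam_url: str) -> list[str]:
    """List the index URLs a client might probe for a .bam URL, in order.

    Assumption: the URL ends with ".bam" and has no query string.
    """
    if not bam_url.endswith(".bam"):
        raise ValueError("expected a URL ending in .bam")
    stem = bam_url[: -len(".bam")]
    return [
        bam_url + ".bai",  # ".bam.bai": index name appended to the file name
        stem + ".bai",     # ".bai": extension replaced
        bam_url + ".csi",  # ".bam.csi": CSI index appended to the file name
    ]


# Hypothetical URL, for illustration only.
print(index_candidates("https://example.org/data/sample1.bam"))
```

If the server returned 426 for the first of these, the client would not need to cycle through the rest.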
The origin of the We're willing to adapt to an alternative mechanism like what @jmarshall proposed if that's the consensus. |
@jmarshall, on reflection, due to the necessity of knowing what is being served, I'm not sure the 426 will in the end save anything. The client (e.g. IGV) just has a URL to start; it's not known if it's a URL to a "bam" source until the initial server poke. By the time the data type is known it's also known that it's an htsget server (or not), so I can't imagine a situation where an attempt is made to fetch an index. |
From @jmarshall
The |
I think it's useful for a client to be able to identify what type of object it's accessing without making any web requests. The |
This is a foible of IGV's implementation. For htslib for example, detecting that the response is an htsget ticket is done in the file access layer, before any decisions have been made that need to know whether the stream is going to be used as a For any URL, you don't know for sure if it's going to be a “bam” or a “vcf” or another kind of source until the initial server poke. So I guess I'm a little surprised that in general you don't do the initial server poke (to get a Content-Type or to sniff the first few bytes to detect formats) before you instantiate different kinds of reader objects. I assume for most direct URLs, you do some heuristics based on the extension at the end of the URL? For htsget you could check for the URL path containing
This is quite an extraordinary statement for DRS to make. The nature of the web is that you can make likely inferences from the structure of URLs, but you don't know for sure what you're going to get until you make the request and receive a response. In particular, an URL may result in a redirect and the client is expected to make a request to another URL as specified. Are you saying that An htsget ticket is really a form of redirection.* What I hear you saying is that DRS would like the (* ISTR we briefly mused about using a 3xx status code for the ticket response, but considered that this could make the protocol harder or impossible to implement using some HTTP client libraries — which might be expecting to handle all 3xx responses themselves.)
My problem with If people want an optional indication that an URL is likely to be an htsget URL, an optional distinctive query parameter is also a hack but a lesser one and provides a useful optional heuristic. e.g.,
could be used by DRS and others if they wished, is distinctive, could be validated by htsget servers, and solves @jrobinso's problem. |
@jmarshall Maybe it's a "foible", but IGV supports ~45 file formats plus a few web services, and in most cases it would not be possible to detect the file format even after reading the entire file. So yes, it insists on some conventions if you want to view your data there, in most cases the file extension. I'm able to make an exception for htsget because it returns a defined JSON container (the ticket) with a format specifier, so we can know what it serves. If anything is missing here, and this is minor, it's a call that says "yes, I'm an htsget server and this is what I'm serving"; I am using "class=header" for that now, which works well enough. |
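As an illustration of reading the format specifier from the ticket, a client could do something like the following. This is a sketch rather than IGV's implementation: the function name and the "reads"/"variants"/"unknown" labels are hypothetical, and it assumes the ticket is a JSON object with a top-level "htsget" member carrying a "format" string (BAM or CRAM for reads, VCF or BCF for variants).

```python
import json

READS_FORMATS = {"BAM", "CRAM"}
VARIANTS_FORMATS = {"VCF", "BCF"}


def ticket_kind(ticket_text: str) -> str:
    """Classify an htsget ticket as "reads", "variants" or "unknown".

    Assumption: the ticket looks like {"htsget": {"format": ..., "urls": [...]}}.
    A missing or unrecognised format yields "unknown" rather than a guess.
    """
    ticket = json.loads(ticket_text)
    fmt = str(ticket.get("htsget", {}).get("format", "")).upper()
    if fmt in READS_FORMATS:
        return "reads"
    if fmt in VARIANTS_FORMATS:
        return "variants"
    return "unknown"


print(ticket_kind('{"htsget": {"format": "BAM", "urls": []}}'))
```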
The intention would be that Given an URL like However in htsget the
or perhaps use some heuristic around trimming back to |
@andrewpatto wrote
Some thoughts about how DRS and htsget interact: It's not tractable for DRS to make provision or account for the type/protocol of everything it may serve as a payload. This is another variant of the issues we have discussed with DICOM and whether DRS can/should reflect the structure (model) of a specific data type. Besides htsget's ability to retrieve specific regions, the ability to access /reads or /variants requires knowledge of the specific datatype being handled. DRS is not the protocol to provide reach-in to the specifics of the objects it carries. Improvement of how type is indicated in DRS would deal with that. It is also likely that the reads and variants should be specific objects with their own DRS ids. That said, as an example we have demonstrations that the URL provided by DRS can be passed to samtools. That could be used to provide very similar functionality to htsget. The difference as I understand it is that, for htsget, the slicing of the file would be done on the htsget server. Using the DRS URL, the slicing would be on the WES server where you are running samtools. In theory samtools would have to retrieve the whole file. However, the widely expected behaviour is that the compute (samtools) would be run in the same cloud region as the file, so no download occurs. Properly organized, there should be minimal net difference in performance. The difference is really one of convenience for the user. DRS and WES provide generic capability for you to roll your own solution to many problems. htsget provides specific capability for given datatypes, more simply packaged. There's also a mismatch on ids. A DRS id will always give you the same set of bytes. That's a fundamental of DRS intended to address reproducibility etc. The accessions used in the htsget examples (e.g. NAxxxxxx) wouldn't consistently give the same set of bytes. That's not to say it wouldn't be useful to be able to use DRS ids with htsget to refer to the same binary data. 
That could be separated from the use of the DRS protocol to access the file. |
Sorry for the very late reply to this. I don't fundamentally disagree with anything in the thread - but I guess can make some comments where my thinking from a DRS perspective might provide some different arguments.
So I totally understand that it might not be seen as a big enough issue to warrant an htsget URI format (following the above thread, I'm not even convinced myself). Just putting some of the arguments out there. |
I would also add though whilst I totally agree that htsget can in some way be viewed as being just http - I also think that it has a clearly defined 'pattern of use' of http that is unique to it. That is, the custom 'known' parameters names like So is it a protocol layered on top of another protocol?
(I'm sure there is some RFC that says this is a bad idea..) |
I think that the bottom line here should be to either include this scheme in the standard or explicitly discourage it, since implementations are starting to diverge and might cause different types of (integration) trouble in the (near) future. |
As agreed in today's htsget APAC-friendly meeting, I've been tasked to summarise this thread in the following table and then I also added some more pros/cons to the mix. Full disclosure: While I was quite partial to the Please do help/comment in improving this analysis if you see things that are unclear/malformed/biased in any way.
|
Closing as discussion seems stalled and @jmarshall is going for HTTP 426 response (PR #665) anyway. |
It is true that discussion has stalled, partly because until yesterday we had not had an htsget meeting in quite a while. This issue took up most of the time of yesterday's meeting and I think there was general agreement that this is htsget's main open question at the moment and we will endeavour to get discussion rolling again. The proposed HTTP 426 response can provide an alternative identification mechanism but (as explained on the PR and in yesterday's meeting) it is really orthogonal to the question of whether to use or specify a bespoke URL scheme or distinctive query parameter, as is being considered on this issue. Even if htsget does bless self-identifying URLs via a scheme or query parameter, the defined 426 response for servers to use for index file requests may be a useful thing to have in the spec as well. So considering PR #665 does not mean that this issue's question has been decided. |
Briefly mentioned on igvteam/igv#850 (comment), discussed in some of the htsget meetings, on GA4GH's Slack #fasp channel (as of today), and possibly in other issues in this repo (please refer to them if so): it is unclear whether the htsget:// scheme should officially appear in the spec. From a client perspective it would be advantageous to discern the protocol right away, but might this have other (unforeseen?) side effects?
/cc @jb-adams @ohofmann @jrobinso
/cc @CastilloDel @mmalenic