The htsget protocol was selected for the Data Reception part of the GDI starter kit. Htsget enables retrieval of files from the storage-and-interfaces archive. The specification of the protocol can be found here. The implementation can be found here.
This repository contains the implementation for the GDI starter kit, packaged in the docker-compose.yml file.
For a demo setup, you also need to set up the GDI starter-kit storage-and-interfaces.
More details regarding that and the way to run the services can be found in the sections below.
The configuration for the htsget server is located in the config-htsget-rs folder. A detailed description of the configuration options can be found in the reference implementation repository. The most important settings are:
send_encrypted_to_client defines whether crypt4gh is enabled, that is, whether htsget assumes that the files retrieved from the sda-download service are encrypted and calculates the byte ranges accordingly.
private_key, public_key allow predefining the keys that htsget uses for communication with sda-download. If the keys are not defined, htsget creates a new key pair for every request (the preferred default).
response_url defines the URL that is used in the response from htsget to the client.
forward_headers defines whether headers from the client are forwarded to sda-download (should be set to true in order to use the authentication mechanism).
index defines the location of the index file in sda-download. This file is assumed to be unencrypted in all cases, as it does not contain sensitive information.
file defines the location of the header and file that htsget requests from sda-download in order to calculate the ranges for a request.
The values in the config file config-htsget-rs/download-config.toml allow for requesting encrypted partial or full files.
The htsget product of the starter kit depends on the storage-and-interfaces product. Specifically, the data served has to be ingested and stored in the archive included in the storage-and-interfaces repository. This is achieved by using docker-compose-demo.yml, as described below.
To start the services, start the individual docker compose environments from their respective root directories:
docker compose -f docker-compose-demo.yml up -d # in the folder starter-kit-storage-and-interfaces
docker compose up -d # in the folder starter-kit-htsget
The logs of the two docker compose environments can be followed with the commands below, for storage-and-interfaces and htsget respectively:
docker compose -f docker-compose-demo.yml logs -f # in the folder starter-kit-storage-and-interfaces
docker compose -f docker-compose.yml logs -f # in the folder starter-kit-htsget
In order to test the htsget implementation, some data needs to be ingested into the archive. The demo setup of storage-and-interfaces provides one dataset, DATASET0001, containing the file htsnexus_test_NA12878.bam.
Get an authentication token from the auth service of storage-and-interfaces using:
token=$(curl -s -k https://localhost:8080/tokens | jq -r '.[0]')
You also need a crypt4gh key pair. This will be used for (re-)encrypting the file before it's sent to you.
crypt4gh generate -n demokey
pubkey=$(base64 -w0 demokey.pub.pem)
Now you should be able to make requests to the htsget server. To request the byte ranges for chromosome 11 of the file htsnexus_test_NA12878.bam, run:
curl -v -k -H "Client-Public-Key: $pubkey" -H "Authorization: Bearer $token" "http://localhost:8088/reads/DATASET0001/htsnexus_test_NA12878?referenceName=11"
The request returns a ticket describing how to download the requested part of the file:
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {
        "url": "data:;base64,Y3J5cHQ0Z2gBAAAAAgAAAA=="
      },
      {
        "url": "http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh",
        "headers": {
          "Range": "bytes=16-123",
          ...
        }
      },
      {
        "url": "data:;base64,ZAAAAAAAAACxHxjMhagEVY+4bVEZYuqYGK5Ph3jrffrMhXpc3wYWenp2ofohEUwSBOuZF3kH6TEiQsjSPGaE1bvdMQ2uUuuHLWicplUneE77G079sTW8rJIJJ1VgZecPi9cTfQ=="
      },
      {
        "url": "http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh",
        "headers": {
          "Range": "bytes=124-1049147",
          ...
        }
      },
      {
        "url": "http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh",
        "headers": {
          "Range": "bytes=2557120-2598042",
          "accept": "*/*",
          ...
        }
      }
    ]
  }
}
This response contains byte ranges (e.g. "Range": "bytes=124-1049147") as part of the URL requests.
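As a sketch of how the ticket can be read programmatically, the jq filter below separates the inline data: URLs from the entries that have to be fetched with a Range header. The file name ticket.json and the trimmed two-entry ticket are assumptions made for illustration; in practice the file would hold the full htsget response.

```shell
# Hypothetical file name; holds a trimmed copy of the ticket shown above.
cat > ticket.json <<'EOF'
{
  "htsget": {
    "format": "BAM",
    "urls": [
      { "url": "data:;base64,Y3J5cHQ0Z2gBAAAAAgAAAA==" },
      {
        "url": "http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam.c4gh",
        "headers": { "Range": "bytes=16-123" }
      }
    ]
  }
}
EOF

# Print one tab-separated line per ticket entry:
#   inline <base64 payload>        for data: URLs
#   fetch  <url> <Range header>    for entries that must be downloaded
parts=$(jq -r '
  .htsget.urls[]
  | if (.url | startswith("data:"))
    then "inline\t" + (.url | ltrimstr("data:;base64,"))
    else "fetch\t" + .url + "\t" + .headers.Range
    end
' ticket.json)
printf '%s\n' "$parts"
```

The order of the printed lines is the order in which the segments have to be concatenated later.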
Use these ranges to make requests to http://localhost:8443/s3-encrypted (which is sda-download from storage-and-interfaces) and retrieve the data for chromosome 11 from the file:
curl 'http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam' -H "Authorization: Bearer $token" -H "Client-Public-Key: $pubkey" -H "Range: bytes=16-123" -o p11-00.bam.c4gh
curl 'http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam' -H "Authorization: Bearer $token" -H "Client-Public-Key: $pubkey" -H "Range: bytes=124-1049147" -o p11-01.bam.c4gh
curl 'http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam' -H "Authorization: Bearer $token" -H "Client-Public-Key: $pubkey" -H "Range: bytes=2557120-2598042" -o p11-02.bam.c4gh
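The three requests differ only in the Range header and the output name, so they can be generated in a loop. The following is a minimal sketch, not part of the starter kit: the helper name range_len and the DRY_RUN switch are made up for illustration, and the ranges are copied from the example ticket. Since HTTP byte ranges are inclusive, a range a-b should yield b - a + 1 bytes, which gives a quick check on the size of each downloaded part.

```shell
# URL and ranges as used in the curl commands above.
url='http://localhost:8443/s3-encrypted/DATASET0001/htsnexus_test_NA12878.bam'
ranges='16-123 124-1049147 2557120-2598042'

# Number of bytes covered by an inclusive range "start-end".
range_len() {
  echo $(( ${1#*-} - ${1%-*} + 1 ))
}

# DRY_RUN=1 (the default in this sketch) only prints what would be fetched;
# set DRY_RUN=0 to run the downloads against the demo setup.
i=0
for r in $ranges; do
  out=$(printf 'p11-%02d.bam.c4gh' "$i")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would fetch bytes=$r ($(range_len "$r") bytes) into $out"
  else
    curl "$url" \
      -H "Authorization: Bearer $token" \
      -H "Client-Public-Key: $pubkey" \
      -H "Range: bytes=$r" \
      -o "$out"
  fi
  i=$((i + 1))
done
```

For the example ticket the expected part sizes are 108, 1049024 and 40923 bytes; comparing them with the output of ls -l is a cheap way to spot a truncated download.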
The response from htsget also lists two data sections:
"url": "data:;base64,Y3J5cHQ0Z2gBAAAAAgAAAA=="
and
"url": "data:;base64,ZAAAAAAAAACxHxjMhagEVY+4bVEZYuqYGK5Ph3jrffrMhXpc3wYWenp2ofohEUwSBOuZF3kH6TEiQsjSPGaE1bvdMQ2uUuuHLWicplUneE77G079sTW8rJIJJ1VgZecPi9cTfQ=="
These segments are part of the requested data. Decode each of them (e.g. Y3J5cHQ0Z2gBAAAAAgAAAA==) and save the result to a file:
echo Y3J5cHQ0Z2gBAAAAAgAAAA== | base64 -d > start.b64
echo ... | base64 -d > mid.b64
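The first inline segment can be inspected to see what it contains. In the Crypt4GH file format, a file starts with the 8-byte magic string crypt4gh, followed by a 4-byte little-endian version number and a 4-byte little-endian header packet count. The sketch below decodes the start segment from the example ticket and prints those fields; for simplicity it reads only the lowest byte of each number, which is an assumption that holds for small values such as these.

```shell
# Base64 payload of the first data: URL in the example ticket.
start_b64='Y3J5cHQ0Z2gBAAAAAgAAAA=='
printf '%s' "$start_b64" | base64 -d > start.b64

# Bytes 0-7: magic string; byte 8: low byte of the version;
# byte 12: low byte of the header packet count.
magic=$(head -c 8 start.b64)
size=$(wc -c < start.b64 | tr -d ' ')
version=$(od -An -tu1 -j8 -N1 start.b64 | tr -d ' ')
packets=$(od -An -tu1 -j12 -N1 start.b64 | tr -d ' ')

echo "magic=$magic size=$size version=$version packets=$packets"
```

For the example ticket this reports a 16-byte segment with magic crypt4gh, version 1 and 2 header packets, which is consistent with the first remote range starting at byte 16.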
Then concatenate all segments:
cat start.b64 p11-00.bam.c4gh mid.b64 p11-01.bam.c4gh p11-02.bam.c4gh > htsnexus_11.bam.c4gh
Make sure that the file can be decrypted with your private key:
crypt4gh decrypt -s demokey.sec.pem -f htsnexus_11.bam.c4gh
Finally, check that samtools can open the new file:
samtools view htsnexus_11.bam
or, if you don't have samtools installed, start a container that provides it:
docker run -it --rm -v $(pwd):/tmp staphb/samtools /bin/bash