From 83545c6a9d009e98e3a35241a647807b79577f9b Mon Sep 17 00:00:00 2001 From: Marko Malenic Date: Fri, 29 Nov 2024 10:51:17 +1100 Subject: [PATCH 1/3] docs: re-word and simplify, add quick starts where applicable --- CONTRIBUTING.md | 6 +- README.md | 76 +++++----- SECURITY.md | 2 +- htsget-actix/README.md | 31 ++-- htsget-axum/README.md | 49 ++++--- htsget-config/README.md | 133 +++++++++--------- .../examples/config-files/basic.toml | 18 +++ htsget-http/README.md | 7 +- htsget-lambda/README.md | 2 - htsget-search/README.md | 39 ++--- htsget-storage/README.md | 14 +- htsget-test/README.md | 4 - 12 files changed, 193 insertions(+), 188 deletions(-) create mode 100644 htsget-config/examples/config-files/basic.toml diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 66c04fdbc..619e77e01 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,9 +17,9 @@ Have a look at existing [issues] to see if an issue has already been discussed. ## Pull Requests -We welcome you to open up a pull request -to suggest a change, even if it's a small one line change. If the change is large, it is a good idea to first open an -issue to discuss the change in order gain feedback and guidance. +We welcome you to open up a pull request to suggest a change, even if it's a small one-line change. +If the change is large, it is a good idea to first open an issue to discuss the change in order to gain feedback and +guidance. ### Tests and formatting diff --git a/README.md b/README.md index 69101b74f..6eca98c69 100644 --- a/README.md +++ b/README.md @@ -12,18 +12,35 @@ A **server** implementation of the [htsget protocol][htsget-protocol] for bioinformatics in Rust. It is: * **Fully-featured**: supports BAM and CRAM for reads, and VCF and BCF for variants, as well as other aspects of the protocol such as TLS, and CORS. * **Serverless**: supports local server instances using [Axum][axum] and [Actix Web][actix-web], and serverless instances using [AWS Lambda Rust Runtime][aws-lambda-rust-runtime]. 
-* **Storage interchangeable**: supports local filesystem storage as well as objects via [Minio][minio] and AWS S3. +* **Storage interchangeable**: supports local filesystem storage as well as objects via [Minio][minio] and [AWS S3][aws-s3]. * **Thoroughly tested and benchmarked**: tested using a purpose-built [test suite][htsget-test] and benchmarked using [criterion-rs]. -To get started, see [Usage]. - -**Note**: htsget-rs is still experimental, and subject to change. - [actix-web]: https://github.com/actix/actix-web [criterion-rs]: https://github.com/bheisler/criterion.rs -[Usage]: #usage -## Overview +## Quick start + +To run a local instance of htsget-rs, run [htsget-axum]: + +```sh +cargo run -p htsget-axum +``` + +And fetch tickets from `127.0.0.1:8080`, which serves data from [data]: + +```sh +curl 'http://127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer' +``` + +### Configuration + +Htsget-rs is configured using environment variables or config files; see [htsget-config] for details. + +### Cloud + +Cloud-based htsget-rs uses [htsget-lambda]. For an example deployment of this crate see [deploy]. + +## Protocol Htsget-rs implements the [htsget protocol][htsget-protocol], which is an HTTP-based protocol for querying bioinformatics files. The htsget protocol outlines how a htsget server should behave, and it is an effective way to fetch regions of large bioinformatics files. @@ -51,22 +68,6 @@ htsget-rs implements the following components of the protocol: [htsget-diagram-png]: https://samtools.github.io/hts-specs/pub/htsget-ticket.png [tokio]: https://github.com/tokio-rs/tokio -## Usage - -Htsget-rs is configured using environment variables, for details on how to set them, see [htsget-config]. 
- -### Local -To run a local instance htsget-rs, run [htsget-axum] by executing the following: -```sh -cargo run -p htsget-axum -``` -Using the default configuration, this will start a ticket server on `127.0.0.1:8080` and a data block server on `127.0.0.1:8081` -with data accessible from the [data] directory. See [htsget-axum] for more information. - -### Cloud -Cloud based htsget-rs uses [htsget-lambda]. For more information and an example deployment of this crate see -[deploy]. - ### Tests Tests can be run tests by executing: @@ -77,33 +78,26 @@ cargo test --all-features To run benchmarks, see the benchmark sections of [htsget-actix][htsget-actix-benches] and [htsget-search][htsget-search-benches]. -[htsget-actix-benches]: htsget-actix/README.md#Benchmarks -[htsget-search-benches]: htsget-search/README.md#Benchmarks +[htsget-actix-benches]: htsget-actix/README.md#benchmarks +[htsget-search-benches]: htsget-search/README.md#benchmarks ## Project Layout -This repository consists of a workspace composed of the following crates: +This repository is a workspace of crates: - [htsget-config]: Configuration of the server. - [htsget-actix]: Local instance of the htsget server. Contains framework dependent code using [Actix Web][actix-web]. - [htsget-axum]: Local instance of the htsget server. Contains framework dependent code using [Axum][axum]. - [htsget-http]: Handling of htsget HTTP requests. Framework independent code. -- [htsget-lambda]: Cloud based instance of the htsget server. Contains framework dependent +- [htsget-lambda]: Cloud-based instance of the htsget server. Contains framework dependent code using the [Rust Runtime for AWS Lambda][aws-lambda-rust-runtime]. - [htsget-search]: Core logic needed to search bioinformatics files based on htsget queries. +- [htsget-storage]: Storage interfaces for local and cloud-based files. - [htsget-test]: Test suite used by other crates in the project. 
Other directories contain further applications or data: -- [data]: Contains example data files which can be used by htsget-rs, in folders denoting the file type. -This directory also contains example events used by a cloud instance of htsget-rs in the [`events`][data-events] subdirectory. -- [deploy]: An example deployment of [htsget-lambda]. - -In htsget-rs the ticket server handled by [htsget-axum], [htsget-actix] or [htsget-lambda], and the data -block server is handled by the [storage backend][storage-backend], either [locally][local-storage], or using [AWS S3][s3-storage]. -This project layout is structured to allow for extensibility and modularity. For example, a new ticket server and data server could -be implemented using Cloudflare Workers in a `htsget-http-workers` crate and Cloudflare R2 in [htsget-search]. - -See the [htsget-search overview][htsget-search-overview] for more information on the storage backend. +- [data]: Contains example data files used by htsget-rs and in tests. +- [deploy]: Deployments for htsget-rs. [axum]: https://github.com/tokio-rs/axum [htsget-config]: htsget-config @@ -111,19 +105,14 @@ See the [htsget-search overview][htsget-search-overview] for more information on [htsget-http]: htsget-http [htsget-lambda]: htsget-lambda [htsget-search]: htsget-search -[htsget-search-overview]: htsget-search/README.md#Overview +[htsget-storage]: htsget-storage [htsget-test]: htsget-test -[storage-backend]: htsget-search/src/storage -[local-storage]: htsget-search/src/storage/local.rs -[s3-storage]: htsget-search/src/storage/s3.rs - [data]: data [deploy]: deploy [actix-web]: https://actix.rs/ [aws-lambda-rust-runtime]: https://github.com/awslabs/aws-lambda-rust-runtime -[data-events]: data/events ## Contributing @@ -140,4 +129,5 @@ This project is licensed under the [MIT license][license]. 
[htsget-lambda]: htsget-lambda [license]: LICENSE [aws-lambda-rust-runtime]: https://github.com/awslabs/aws-lambda-rust-runtime +[aws-s3]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html [minio]: https://min.io/ \ No newline at end of file diff --git a/SECURITY.md b/SECURITY.md index 1b4973d32..f55748b8d 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -2,4 +2,4 @@ ## Reporting a Vulnerability -Please report vulnerabilities by opening an issue or sending an email to info@umccr.org +Please report vulnerabilities by opening an issue or sending an email to info@umccr.org. diff --git a/htsget-actix/README.md b/htsget-actix/README.md index 3dc87b924..867a1f314 100644 --- a/htsget-actix/README.md +++ b/htsget-actix/README.md @@ -29,16 +29,21 @@ This crate is used for running a local instance of htsget-rs. It is based on: [htsget-http]: ../htsget-http -## Usage +## Quick start -This application has the same functionality as [htsget-axum]. To use it, following the [htsget-axum][htsget-axum] instructions, and -replace any calls to `htsget-axum` with `htsget-actix`. +Launch a server instance: -It is recommended to use [htsget-axum] because it better fits with the rest of [htsget-rs]. For example [htsget-actix] -uses the actix-web framework for the ticket server, however it depends on [htsget-axum] for the data server. Also, components -in [htsget-lambda] use Axum dependencies. +```sh +cargo run -p htsget-actix +``` + +And fetch tickets from `localhost:8080`: -[htsget-lambda]: ../htsget-lambda +```sh +curl 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer' +``` + +This crate uses [htsget-config] for configuration. All options supported in [htsget-axum] are also supported here. ### As a library @@ -53,12 +58,13 @@ This crate has the following features: * `experimental`: used to enable experimental features that aren't necessarily part of the htsget spec, such as Crypt4GH support through `C4GHStorage`. 
## Benchmarks + Benchmarks for this crate written using [Criterion.rs][criterion-rs], and aim to compare the performance of this crate with the [htsget Reference Server][htsget-refserver]. -There are a set of light benchmarks, and one heavy benchmark. Light benchmarks can be performed by executing: +There are a set of light benchmarks, and one heavy benchmark. For light benchmarks run: ``` -cargo bench -p htsget-axum -- LIGHT +cargo bench -p htsget-actix -- LIGHT ``` To run the heavy benchmark, an additional vcf file needs to be downloaded, and placed in the [`data/vcf`][data-vcf] directory: @@ -67,16 +73,17 @@ To run the heavy benchmark, an additional vcf file needs to be downloaded, and p curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz > data/vcf/internationalgenomesample.vcf.gz ``` -Then to run the heavy benchmark: +Then run the heavy benchmark: ``` -cargo bench -p htsget-axum -- HEAVY +cargo bench -p htsget-actix -- HEAVY ``` [criterion-rs]: https://github.com/bheisler/criterion.rs [htsget-refserver]: https://github.com/ga4gh/htsget-refserver [data-vcf]: ../data/vcf -[htsget-axum]: ../htsget-axum/README.md#usage +[htsget-axum]: ../htsget-axum/README.md +[htsget-config]: ../htsget-config/README.md ## License diff --git a/htsget-axum/README.md b/htsget-axum/README.md index 9e3703944..3678d75ec 100644 --- a/htsget-axum/README.md +++ b/htsget-axum/README.md @@ -21,16 +21,24 @@ This crate is used for running a server instance of htsget-rs. It is based on: [htsget-http]: ../htsget-http -## Usage +## Quick start -### For running htsget-rs as an application +Launch a server instance: -This crate uses [htsget-config] for configuration. See [htsget-config] for details on how to configure this crate. 
- -To run an instance of this crate, execute the following command: ```sh cargo run -p htsget-axum ``` + +And fetch tickets from `localhost:8080`: + +```sh +curl 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer' +``` + +This crate uses [htsget-config] for configuration. + +### Storage backends + Using the default configuration, this will start a ticket server on `127.0.0.1:8080` and a data block server on `127.0.0.1:8081` with data accessible from the [`data`][data] directory. This application supports storage backends defined in [htsget-storage]. @@ -38,6 +46,7 @@ To use `S3Storage`, compile with the `s3-storage` feature: ```sh cargo run -p htsget-axum --features s3-storage ``` + This will start a ticket server with `S3Storage` using a bucket called `"data"`. To use `UrlStorage`, compile with the `url-storage` feature. @@ -51,19 +60,18 @@ See [htsget-search] for details on how to structure files. #### Using TLS -There two server instances that are launched when running this crate. The ticket server, which returns a list of ticket URLs that a client must fetch. -And the data block server, which responds to the URLs in the tickets. By default, the data block server runs without TLS. -To run the data block server with TLS, pem formatted X.509 certificates are required. +By default, htsget-rs runs without TLS. To use TLS, pem formatted X.509 certificates are required. -For development and testing purposes, self-signed certificates can be used. -For example, to generate self-signed certificates run: +For development and testing purposes, self-signed certificates can be used. For example, to generate self-signed certificates run: ```sh openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes -subj '/CN=localhost' ``` -It is not recommended to use self-signed certificates in a production environment -as this is considered insecure. 
+It is not recommended to use self-signed certificates in a production environment as this is considered insecure. + +There are two server instances that are launched when running this crate, the ticket server and the data block server. TLS +is specified separately for each server. #### Example requests @@ -73,39 +81,39 @@ Some example requests using `curl` are shown below: * GET ```sh -curl '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer' +curl 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer' ``` * POST ```sh -curl --header "Content-Type: application/json" -d '{}' '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer' +curl --header "Content-Type: application/json" -d '{}' 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer' ``` * Parametrised GET ```sh -curl '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer?format=VCF&class=header' +curl 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer?format=VCF&class=header' ``` * Parametrised POST ```sh -curl --header "Content-Type: application/json" -d '{"format": "VCF", "regions": [{"referenceName": "chrM"}]}' '127.0.0.1:8080/variants/data/vcf/sample1-bcbio-cancer' +curl --header "Content-Type: application/json" -d '{"format": "VCF", "regions": [{"referenceName": "chrM"}]}' 'http://localhost:8080/variants/data/vcf/sample1-bcbio-cancer' ``` * Service info ```sh -curl '127.0.0.1:8080/variants/service-info' +curl 'http://localhost:8080/variants/service-info' ``` ### Crypt4GH -The htsget-rs server experimentally supports serving [Crypt4GH][c4gh] encrypted files to clients. See the [Crypt4GH section][config-c4gh] in the configuration -for more details on how to configure this. +The htsget-rs server experimentally supports serving [Crypt4GH][c4gh] encrypted files to clients. See the [Crypt4GH section][config-c4gh] +in the configuration for more details on how to configure this. 
-Run the server with the following to enable Crypt4GH support using the [example config][example-config]: +To use Crypt4GH run the server using the [example config][example-config] and the `experimental` flag: ```sh cargo run -p htsget-axum --features experimental -- --config htsget-config/examples/config-files/c4gh.toml @@ -119,6 +127,7 @@ curl 'http://localhost:8080/reads/data/c4gh/htsnexus_test_NA12878?referenceName= The output consists of the Crypt4GH header, which includes the original header, the edit lists, and the re-encrypted header that the recipient can use to decrypt bytes: + ```json { "htsget": { diff --git a/htsget-config/README.md b/htsget-config/README.md index 4e3be2d8f..72b1ce1e8 100644 --- a/htsget-config/README.md +++ b/htsget-config/README.md @@ -8,25 +8,25 @@ [actions-badge]: https://github.com/umccr/htsget-rs/actions/workflows/action.yml/badge.svg [actions-url]: https://github.com/umccr/htsget-rs/actions?query=workflow%3Atests+branch%3Amain -Configuration for [htsget-rs] and relevant crates. +Configuration for [htsget-rs]. [htsget-rs]: https://github.com/umccr/htsget-rs ## Overview -This crate is used to configure htsget-rs by using a config file or reading environment variables. +This crate is used to configure htsget-rs using a config file or environment variables. ## Usage -### For running htsget-rs as an application +To configure htsget-rs, a TOML config file can be defined. There is also support for reading config from environment variables. +Any config options set by environment variables override values in the config file. -To configure htsget-rs, a TOML config file can be used. It also supports reading config from environment variables. -Any config options set by environment variables override values in the config file. For some of -the more deeply nested config options, it may be more ergonomic to use a config file rather than environment variables. 
+The configuration consists of TOML tables, such as config for the ticket server, data server, service-info, or resolvers. -The configuration consists of multiple parts, config for the ticket server, config for the data server, service-info config, and config for the resolvers. +As a starting point, see the [basic TOML][basic] example file, which should work for many use cases. #### Ticket server config + The ticket server responds to htsget requests by returning a set of URL tickets that the client must fetch and concatenate. To configure the ticket server, set the following options: @@ -52,7 +52,8 @@ ticket_server_cors_max_age = 86400 ticket_server_cors_expose_headers = [] ``` -#### Local data server config +#### Data server config + The local data server responds to tickets produced by the ticket server by serving local filesystem data. To configure the data server, set the following options: @@ -126,9 +127,21 @@ environment = 'dev' #### Resolvers -The resolvers component of htsget-rs is used to map query IDs to the location of the resource. Each query that htsget-rs receives is -'resolved' to a location, which a data server can respond with. A query ID is matched with a regex, and is then mapped with a substitution string that -has access to the regex capture groups. Resolvers are configured in an array, where the first matching resolver is resolver used to map the ID. +The resolvers component of htsget-rs is used to map query IDs to the location of the resource. This is the component of the +code that takes the [`id`][id], which is everything after `reads/` or `variants/` in the HTTP path, and maps it to a data location. + +For example, if the request to htsget-rs is: + +```sh +curl 'http://localhost:8080/reads/some_id/file' +``` + +Then the resolvers control how the server finds `some_id/file`, which may be stored locally, in the cloud, or at an arbitrary URL location. +The resolvers map `some_id/file` to a location using regexes and substitution strings.
The location of the file does not +need to have the same name as the id. + +A query ID is matched with a regex, and is then mapped with a substitution string that has access to the regex capture groups. +Resolvers are configured in an array, where the first matching resolver is the resolver used to map the ID. To create a resolver, add a `[[resolvers]]` array of tables, and set the following options: @@ -146,6 +159,8 @@ regex = '(?P<group1>.*?)/(?P<group2>.*)' substitution_string = '$group1/data/$group2' ``` +This would mean that a request to `http://localhost:8080/reads/some_id/file` would search for files at `some_id/data/file.bam` and `some_id/data/file.bam.bai`. + For more information about regex options see the [regex crate](https://docs.rs/regex/). Each resolver also maps to a certain storage backend. This storage backend can be used to set query IDs which are served from local storage, from S3-style bucket storage, or from HTTP URLs. @@ -161,7 +176,19 @@ To use `LocalStorage`, set `backend = 'Local'` under `[resolvers.storage]`, and | `path_prefix` | The path prefix which the URL tickets will have. This should likely match the `data_server_serve_at` path. | URL path | `''` | | `use_data_server_config` | Whether to use the data server config to fill in the above values. This overrides any other options specified from this table. | Boolean | `false` | -To use `S3Storage`, build htsget-rs with the `s3-storage` feature enabled, set `backend = 'S3'` under `[resolvers.storage]`, and specify any additional options from below: +By default, if the above options are left unspecified, they inherit values from the [`data_server`][data-server] config. +For example, the following sets the `scheme`, `authority`, `local_path` and `path_prefix` to values used by the `data_server`. 
+ +```toml +[[resolvers]] +regex = '.*' +substitution_string = '$0' + +[resolvers.storage] +backend = 'Local' +``` + +To use `S3Storage`, build htsget-rs with the `s3-storage` feature enabled, set `backend = 'S3'` under `[resolvers.storage]`, and specify: | Option | Description | Type | Default | |--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------| @@ -169,25 +196,6 @@ To use `S3Storage`, build htsget-rs with the `s3-storage` feature enabled, set ` | `endpoint` | A custom endpoint to override the default S3 service address. This is useful for using S3 locally or with storage backends such as MinIO. See [MinIO](#minio). | String | Not set, uses regular AWS S3 services. | | `path_style` | The S3 path style to request from the storage backend. If `true`, "path style" is used, e.g. `host.com/bucket/object.bam`, otherwise `bucket.host.com/object` style is used. | Boolean | `false` | -`UrlStorage` is another storage backend which can be used to serve data from a remote HTTP URL. When using this storage backend, htsget-rs will fetch data from a `url` which is set in the config. It will also forward any headers received with the initial query, which is useful for authentication. 
-To use `UrlStorage`, build htsget-rs with the `url-storage` feature enabled, set `backend = 'Url'` under `[resolvers.storage]`, and specify any additional options from below: - -| Option | Description | Type | Default | -|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|--------------------------|-----------------------------------------------------------------------------------------------------------------| -| `url` | The URL to fetch data from. | HTTP URL | `"https://127.0.0.1:8081/"` | -| `response_url` | The URL to return to the client for fetching tickets. | HTTP URL | `"https://127.0.0.1:8081/"` | -| `forward_headers` | When constructing the URL tickets, copy HTTP headers received in the initial query. | Boolean | `true` | -| `header_blacklist` | List of headers that should not be forwarded. | Array of headers | `[]` | -| `tls` | Additionally enables client authentication, or sets non-native root certificates for TLS. See [TLS](#tls) for more details. | TOML table | TLS is always allowed, however the default performs no client authentication and uses native root certificates. | - -When using `UrlStorage`, the following requests will be made to the `url`. -* `GET` request to fetch only the headers of the data file (e.g. `GET /data.bam`, with `Range: bytes=0-`). -* `GET` request to fetch the entire index file (e.g. `GET /data.bam.bai`). -* `HEAD` request on the data file to get its length (e.g. `HEAD /data.bam`). - -By default, all headers received in the initial query will be included when making these requests. To exclude certain headers from being forwarded, set the `header_blacklist` option. Note that the blacklisted headers are removed from the requests made to `url` and from the URL tickets as well. 
- - For example, a `resolvers` value of: ```toml [[resolvers]] @@ -198,39 +206,30 @@ substitution_string = '$key' backend = 'S3' # Uses the first capture group in the regex as the bucket. ``` + Will use "example_bucket" as the S3 bucket if that resolver matches, because this is the first capture group in the `regex`. Note, to use this feature, at least one capture group must be defined in the `regex`. -Note, all the values for `S3Storage` or `LocalStorage` can be also be set manually by adding a -`[resolvers.storage]` table. For example, to manually set the config for `LocalStorage`: - -```toml -[[resolvers]] -regex = '.*' -substitution_string = '$0' - -[resolvers.storage] -backend = 'Local' -scheme = 'Http' -authority = '127.0.0.1:8081' -local_path = './' -path_prefix = '' -``` +`UrlStorage` is a storage backend which can be used to serve data from a remote HTTP URL. When using this storage backend, htsget-rs will fetch data from a `url` which is set in the config. It will also forward any headers received with the initial query, which is useful for authentication. +To use `UrlStorage`, build htsget-rs with the `url-storage` feature enabled, set `backend = 'Url'` under `[resolvers.storage]`, and specify any additional options from below: -or, to manually set the config for `S3Storage`: +| Option | Description | Type | Default | +|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|--------------------------|-----------------------------------------------------------------------------------------------------------------| +| `url` | The URL to fetch data from. | HTTP URL | `"https://127.0.0.1:8081/"` | +| `response_url` | The URL to return to the client for fetching tickets. | HTTP URL | `"https://127.0.0.1:8081/"` | +| `forward_headers` | When constructing the URL tickets, copy HTTP headers received in the initial query. 
| Boolean | `true` | +| `header_blacklist` | List of headers that should not be forwarded. | Array of headers | `[]` | +| `tls` | Additionally enables client authentication, or sets non-native root certificates for TLS. See [TLS](#tls) for more details. | TOML table | TLS is always allowed, however the default performs no client authentication and uses native root certificates. | -```toml -[[resolvers]] -regex = '.*' -substitution_string = '$0' +When using `UrlStorage`, the following requests will be made to the `url`. +* `GET` request to fetch only the headers of the data file (e.g. `GET /data.bam`, with `Range: bytes=0-`). +* `GET` request to fetch the entire index file (e.g. `GET /data.bam.bai`). +* `HEAD` request on the data file to get its length (e.g. `HEAD /data.bam`). -[resolvers.storage] -backend = 'S3' -bucket = 'bucket' -``` +By default, all headers received in the initial query will be included when making these requests. To exclude certain headers from being forwarded, set the `header_blacklist` option. Note that the blacklisted headers are removed from the requests made to `url` and from the URL tickets as well. -`UrlStorage` can only be specified manually. Example of a resolver with `UrlStorage`: + ```toml [[resolvers]] regex = ".*" @@ -246,14 +245,9 @@ header_blacklist = ["Host"] There are additional examples of config files located under [`examples/config-files`][examples-config-files]. -#### Note -By default, when htsget-rs is compiled with the `s3-storage` feature flag, `storage = 'S3'` is used when no `storage` options -are specified. Otherwise, `storage = 'Local'` is used when no storage options are specified. Compilation includes the `s3-storage` -feature flag by default, so in order to have `storage = 'Local'` as the default, `--no-default-features` can be passed to `cargo`. - #### Allow guard Additionally, the resolver component has a feature, which allows resolving IDs based on the other fields present in a query. 
-This is useful as allows the resolver to match an ID, if a particular set of query parameters are also present. For example, +This is useful as it allows the resolver to match an ID only if a particular set of query parameters are also present. For example, a resolver can be set to only resolve IDs if the format is also BAM. This component can be configured by setting the `[resolver.allow_guard]` table with. The following options are available to restrict which queries are resolved by a resolver: @@ -314,8 +308,7 @@ ticket_server_tls.key = "key.pem" ``` This project uses [rustls] for all TLS logic, and it does not depend on OpenSSL. The rustls library can be more -strict when accepting certificates and keys. For example, it does not accept self-signed certificates that have -a CA used as an end-entity. If generating certificates for `root_store` using OpenSSL, the correct extensions, +strict when accepting certificates and keys. If generating certificates for `root_store` using OpenSSL, the correct extensions, such as `subjectAltName` should be included. An example of generating a custom root CA and certificates for a `UrlStorage` backend: @@ -363,6 +356,7 @@ The config can also be read from an environment variable: ```shell export HTSGET_CONFIG="config.toml" ``` + If no config file is specified, the default configuration is used. Further, the default configuration file can be printed to stdout by passing the `--print-default-config` flag: @@ -378,7 +372,7 @@ Use the `--help` flag to see more details on command line options. #### Log formatting -The [Tracing][tracing] crate is used extensively by htsget-rs is for logging functionality. The `RUST_LOG` variable is +The [Tracing][tracing] crate is used by htsget-rs is for logging functionality. The `RUST_LOG` variable is read to configure the level that trace logs are emitted. 
For example, the following indicates trace level for all htsget crates, and info level for all other crates: @@ -401,9 +395,9 @@ See [here][formatting-style] for more information on how these values look. [rust-log]: https://rust-lang-nursery.github.io/rust-cookbook/development_tools/debugging/config_log.html [formatting-style]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/index.html#formatters -#### Configuring htsget-rs with environment variables +#### Environment variables -All the htsget-rs config options can be set by environment variables, which is convenient for runtimes such as AWS Lambda. +All the htsget-rs config options can be set using environment variables, which is convenient for runtimes such as AWS Lambda. The ticket server, data server and service info options are flattened and can be set directly using environment variable. It is not recommended to set the resolvers using environment variables, however it can be done by setting a single environment variable which contains a list of structures, where a key name and value pair is used to set the nested options. @@ -575,4 +569,7 @@ This project is licensed under the [MIT license][license]. [minio]: https://min.io/ [c4gh]: https://samtools.github.io/hts-specs/crypt4gh.pdf [data-c4gh]: ../data/c4gh -[secrets-manager]: https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html \ No newline at end of file +[secrets-manager]: https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html +[id]: https://samtools.github.io/hts-specs/htsget.html#url-parameters +[basic]: examples/config-files/basic.toml +[data-server]: README.md#data-server-config \ No newline at end of file diff --git a/htsget-config/examples/config-files/basic.toml b/htsget-config/examples/config-files/basic.toml new file mode 100644 index 000000000..49457147c --- /dev/null +++ b/htsget-config/examples/config-files/basic.toml @@ -0,0 +1,18 @@ +# An example of running htsget-rs. 
+# Run with `cargo run --all-features -- --config htsget-config/examples/config-files/basic.toml` + +ticket_server_addr = "127.0.0.1:8080" +data_server_addr = "127.0.0.1:8081" + +# Serve data locally from the `data` directory. +[[resolvers]] +regex = '.*' +substitution_string = '$0' +storage.backend = 'Local' + +# Serve data from S3 if the id is prefixed with `example_bucket`. +[[resolvers]] +regex = '^(example_bucket)/(?P<key>.*)$' +substitution_string = '$key' +storage.backend = 'S3' +# Uses the first capture group in the regex as the bucket. diff --git a/htsget-http/README.md b/htsget-http/README.md index 0eae7c78c..b31a5f441 100644 --- a/htsget-http/README.md +++ b/htsget-http/README.md @@ -15,6 +15,7 @@ Framework independent code for handling HTTP in [htsget-rs]. ## Overview This crate handles all the framework independent code for htsget-rs, it: + * Produces htsget-specific HTTP responses. * Converts query results to JSON HTTP responses. * Handles htsget client error reporting. @@ -22,16 +23,14 @@ This crate handles all the framework independent code for htsget-rs, it: ## Usage -### For running htsget-rs as an application - There is no need to interact with this crate for running htsget-rs. ### As a library This crate is useful for implementing additional framework dependent versions of the htsget-rs server. For example, htsget-rs could be written using another framework such as [warp]. This crate provides functions -like `get`, `post` and `get_service_info_json` for this purpose. -These functions take query and endpoint information, and process it using [htsget-search] to return JSON HTTP responses. +like `get`, `post` and `get_service_info_json` for this purpose. These functions take query and endpoint information, +and process it using [htsget-search] to return JSON HTTP responses. 
#### Feature flags diff --git a/htsget-lambda/README.md b/htsget-lambda/README.md index e24d93952..d10587634 100644 --- a/htsget-lambda/README.md +++ b/htsget-lambda/README.md @@ -23,8 +23,6 @@ This crate is used for running a cloud-based instance of htsget-rs. It: ## Usage -### For running htsget-rs as an application - This crate can be deployed to AWS as a Lambda function, or interacted with locally using [cargo-lambda]. See [deploy] for more details. Note, this crate does not use any configuration relating to the local data server. CORS configuration uses values from the ticket server config. See [htsget-config] for more information about configuration. diff --git a/htsget-search/README.md b/htsget-search/README.md index 7623f8c1b..51c8f8d37 100644 --- a/htsget-search/README.md +++ b/htsget-search/README.md @@ -19,13 +19,11 @@ Creates URL tickets for [htsget-rs] by processing bioinformatics files. It: This crate is the primary mechanism by which htsget-rs interacts with, and processes bioinformatics files. It does this by using [noodles] to query files and their indices. This crate contains abstractions that remove commonalities between file formats. Together with file format -specific code, this defines an interface that handles the core logic of a htsget request. ht +specific code, this defines an interface that handles the core logic of a htsget request. [noodles]: https://github.com/zaeleus/noodles -## Usage - -### For running htsget-rs as an application +## File structure This crate is responsible for handling bioinformatics file data. It supports BAM, CRAM, VCF and BCF files. For htsget-rs to function, files need to be organised in the following way: @@ -40,8 +38,6 @@ For htsget-rs to function, files need to be organised in the following way: * GZI files must end with `.gzi`. * See [minimising byte ranges][minimising-byte-ranges] for more details on GZI. 
-This is quite inflexible, and is likely to change in the future to allow arbitrary mappings of files and indices. - [gzi]: http://www.htslib.org/doc/bgzip.html#GZI_FORMAT [minimising-byte-ranges]: #minimising-byte-ranges @@ -53,7 +49,7 @@ This crate has the following features: The htsget trait comes with a basic model to represent components needed to perform a search: `Query`, `Format`, `Class`, `Tags`, `Headers`, `Url`, `Response`. `HtsGetFromStorage` is the struct which is used to process requests. -* + #### Feature flags This crate has the following features: @@ -63,16 +59,20 @@ This crate has the following features: ## Minimising Byte Ranges -One challenge involved with implementing htsget is meaningfully minimising the size of byte ranges returned in response +One challenge involved with implementing htsget is minimising the size of byte ranges returned in response tickets. Since htsget is used to reduce the amount of data a client needs to fetch by querying specific parts of a file, the data returned by htsget should ideally be as minimal as possible. This is done by reading the index file or -the underlying target file, to determine the required byte ranges. However, this is complicated when considering -BGZF compressed files. +the underlying target file, to determine the required byte ranges. + +For BGZF files, [GZI][gzi] files are supported, which enable the smallest possible byte ranges. + +### BGZF file example For BGZF compressed files, htsget-rs needs to return compressed byte positions. Also, after concatenating data from URL tickets, the resulting file must be valid. This means that byte ranges must start and finish on BGZF blocks, otherwise the concatenation -would not result in a valid file. However, index files (BAI, TBI, CSI) do not contain all the information required to +would not result in a valid file. Index files (BAI, TBI, CSI) do not contain all the information required to produce minimal byte ranges. 
For example, consider this [file][example-file]: + * There are 14 BGZF blocks positions using all available data in the corresponding [index file][example-index] (chunk start positions, chunk end positions, linear index positions, and metadata positions): * `4668`, `256721`, `499249`, `555224`, `627987`, `824361`, `977196`, `1065952`, `1350270`, `1454565`, `1590681`, `1912645`, `2060795` and `2112141`. * Using just this data, the following query with: @@ -86,25 +86,16 @@ produce minimal byte ranges. For example, consider this [file][example-file]: * `bytes=824361-842100` * `bytes=977196-996014` -To produce the smallest byte ranges, htsget-rs needs to find this data somewhere else. There are two ways to accomplish this: -* Get the data from the underlying target file, by seeking to the start of a BGZF, and reading until the end of the block is found. -* Get the data from an auxiliary index file, such as GZI. - -Currently, htsget-rs takes the latter approach, and uses GZI files, which contain information on all BGZF start and -end positions. However, this is not ideal, as GZI contains more information than required by htsget-rs. The former -approach also has issues when considering cloud-based storage, which in the case of S3, does not have seek operations. - -The way htsget-rs finds the information needed for minimal byte ranges is very likely to change in the future, as more efficient -approaches are implemented. For example, a database could be used to further index files. Queries to a database could be -as targeted as possible, retrieving only the required information. +To produce the smallest byte ranges, htsget-rs can search through GZI files and regular index files. It does not +read data from the underlying target file.
[example-file]: ../data/bam/htsnexus_test_NA12878.bam [example-index]: ../data/bam/htsnexus_test_NA12878.bam.bai -## Benchmarks +## Benchmarks Since this crate is used to query file data, it is the most performance critical component of htsget-rs. Benchmarks, using -[Criterion.rs][criterion-rs], are therefore written to test performance. Run benchmarks by executing: +[Criterion.rs][criterion-rs], are written to test performance. Run benchmarks by executing: ```sh cargo bench -p htsget-search --all-features diff --git a/htsget-storage/README.md b/htsget-storage/README.md index 75923e8c5..1e9a8c537 100644 --- a/htsget-storage/README.md +++ b/htsget-storage/README.md @@ -16,23 +16,22 @@ Contains storage interfaces and abstractions for [htsget-rs]. It: ## Overview -This crate is the mechanism by which htsget-rs interacts fetches data from bioinformatics files which it needs to -process requrests. It also allows htsget-rs to create and format URL tickets correctly. It does this by providing storage -layer abstractions which other crates can use to interact with data. It defines three kinds of storage which can fetch data: +This crate is the mechanism htsget-rs uses to fetch data from the bioinformatics files it needs to +process requests. It also allows htsget-rs to create and format URL tickets correctly. It does this by providing storage +layer abstractions which other crates can use to interact with data. It defines the following storage layers: * [local]: Access files on the local filesystem. * [s3]: Access files on [AWS S3][s3-docs]. * [url]: Access files on any server which can respond to requests. +* [c4gh]: Access and process Crypt4GH-encrypted files. [s3-docs]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html -This crate is responsible for allowing the user to fetch the URL tickets returned by the ticket server. In the case of -`LocalStorage`, this entails a separate `data_server` that can serve files using HTTP.
`S3Storage` simply returns +This crate is responsible for allowing the user to fetch the URL tickets returned by the ticket server. With +`LocalStorage` a separate `data_server` is used to serve files using HTTP. `S3Storage` returns presigned S3 URLs. ## Usage -### For running htsget-rs as an application - In order to use a particular storage backend for URL tickets, the proper backend should be configured using [htsget-config]. [htsget-config]: ../htsget-config @@ -54,6 +53,7 @@ This crate has the following features: [local]: src/local.rs [s3]: src/s3.rs [url]: src/url.rs +[c4gh]: src/c4gh/mod.rs ## License diff --git a/htsget-test/README.md b/htsget-test/README.md index 93af34539..fd8e2f6c8 100644 --- a/htsget-test/README.md +++ b/htsget-test/README.md @@ -17,12 +17,8 @@ Common test functions and utilities used by [htsget-rs]. This crate contains shared code used for testing by other htsget-rs crates. It has common server tests, as well as other utility functions. -[noodles]: https://github.com/zaeleus/noodles - ## Usage -### For running htsget-rs as an application - There is no need to interact with this crate for running htsget-rs. ### As a library From 383f8256a1aa9ba7e2860725dd56fa065c67d901 Mon Sep 17 00:00:00 2001 From: Marko Malenic Date: Fri, 29 Nov 2024 11:16:56 +1100 Subject: [PATCH 2/3] style: fix new clippy warnings --- htsget-config/src/config/parser.rs | 2 +- htsget-config/src/types.rs | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/htsget-config/src/config/parser.rs b/htsget-config/src/config/parser.rs index fb4bb3cf1..3734b0a4c 100644 --- a/htsget-config/src/config/parser.rs +++ b/htsget-config/src/config/parser.rs @@ -17,7 +17,7 @@ pub enum Parser<'a> { Path(&'a Path), } -impl<'a> Parser<'a> { +impl Parser<'_> { /// Deserialize a string or path into a config value using Figment. 
#[instrument] pub fn deserialize_config_into(&self) -> io::Result diff --git a/htsget-config/src/types.rs b/htsget-config/src/types.rs index 8e6a637cf..872f5352c 100644 --- a/htsget-config/src/types.rs +++ b/htsget-config/src/types.rs @@ -121,12 +121,12 @@ pub struct Interval { impl Interval { /// Check if this interval contains the value. pub fn contains(&self, value: u32) -> bool { - return match (self.start.as_ref(), self.end.as_ref()) { + match (self.start.as_ref(), self.end.as_ref()) { (None, None) => true, (None, Some(end)) => value < *end, (Some(start), None) => value >= *start, (Some(start), Some(end)) => value >= *start && value < *end, - }; + } } /// Convert this interval into a one-based noodles `Interval`. From 7b7c4d8388d47847e0cc54c567b76baf2b3efce3 Mon Sep 17 00:00:00 2001 From: Marko Malenic Date: Fri, 29 Nov 2024 12:33:38 +1100 Subject: [PATCH 3/3] docs: grammar and typos --- CONTRIBUTING.md | 4 ++-- README.md | 2 +- htsget-actix/README.md | 4 ++-- htsget-axum/README.md | 2 +- htsget-config/README.md | 6 +++--- htsget-search/README.md | 2 +- 6 files changed, 10 insertions(+), 10 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 619e77e01..d33bcd335 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -2,7 +2,7 @@ Thank you for your interest in contributing. We greatly value feedback and contributions, whether that's an issue, bug fix, new feature or document change. All contributions are welcome, and no change is too small, even if -its just a typo fix. +it's just a typo fix. To get familiar with the project, have a look at the READMEs of each crate. @@ -23,7 +23,7 @@ guidance. ### Tests and formatting -If the proposed change alters the code, tests should updated to ensure that no regressions are made. Any new features +If the proposed change alters the code, tests should be updated to ensure that no regressions are made. Any new features need to have thorough testing before they are merged.
We also use [clippy] and [rustfmt] for code style, linting and formatting. diff --git a/README.md b/README.md index 6eca98c69..ba3bc122c 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ A **server** implementation of the [htsget protocol][htsget-protocol] for bioinf [actix-web]: https://github.com/actix/actix-web [criterion-rs]: https://github.com/bheisler/criterion.rs -## Quick start +## Quickstart To run a local instance htsget-rs, run [htsget-axum]: diff --git a/htsget-actix/README.md b/htsget-actix/README.md index 867a1f314..c68665fb3 100644 --- a/htsget-actix/README.md +++ b/htsget-actix/README.md @@ -9,7 +9,7 @@ [actions-url]: https://github.com/umccr/htsget-rs/actions?query=workflow%3Atests+branch%3Amain > [!IMPORTANT] -> The functionality of [htsget-axum] is identical to this crate and it is recommended for all +> The functionality of [htsget-axum] is identical to this crate, and it is recommended for all > projects to use [htsget-axum] instead. > > This crate will be maintained to preserve backwards compatibility, however [htsget-axum] is @@ -29,7 +29,7 @@ This crate is used for running a local instance of htsget-rs. It is based on: [htsget-http]: ../htsget-http -## Quick start +## Quickstart Launch a server instance: diff --git a/htsget-axum/README.md b/htsget-axum/README.md index 3678d75ec..0311e9a7d 100644 --- a/htsget-axum/README.md +++ b/htsget-axum/README.md @@ -21,7 +21,7 @@ This crate is used for running a server instance of htsget-rs. It is based on: [htsget-http]: ../htsget-http -## Quick start +## Quickstart Launch a server instance: diff --git a/htsget-config/README.md b/htsget-config/README.md index 72b1ce1e8..4f9ec7d27 100644 --- a/htsget-config/README.md +++ b/htsget-config/README.md @@ -23,7 +23,7 @@ Any config options set by environment variables override values in the config fi The configuration consists of TOML tables, such as config for the ticket server, data server, service-info, or resolvers. 
-As a starting point, see the [basic TOML][basic] example file which should work for many use cases. +As a starting point, see the [basic TOML][basic] example file which should work for many use-cases. #### Ticket server config @@ -151,7 +151,7 @@ To create a resolver, add a `[[resolvers]]` array of tables, and set the followi | `substitution_string` | The replacement expression used to map the matched query ID. This has access to the match groups in the `regex` option. | String with access to capture groups | `'$0'` | For example, below is a `regex` option which matches a `/` between two groups, and inserts an additional `data` -inbetween the groups with the `substitution_string`. +in between the groups with the `substitution_string`. ```toml [[resolvers]] @@ -498,7 +498,7 @@ There is experimental support for serving [Crypt4GH][c4gh] encrypted files. This `experimental` feature flag. This allows htsget-rs to read Crypt4GH files and serve them encrypted, directly to the client. In the process of -serving the data, htsget-rs will decrypt the headers of the Crypt4GH files and reencrypt them so that the client can read +serving the data, htsget-rs will decrypt the headers of the Crypt4GH files and re-encrypt them so that the client can read them. When the client receives byte ranges from htsget-rs and concatenates them, the output bytes will be Crypt4GH encrypted, and will need to be decrypted before they can be read. All file formats (BAM, CRAM, VCF, and BCF) are supported using Crypt4GH. 
diff --git a/htsget-search/README.md b/htsget-search/README.md index 51c8f8d37..a849a3d42 100644 --- a/htsget-search/README.md +++ b/htsget-search/README.md @@ -101,7 +101,7 @@ Since this crate is used to query file data, it is the most performance critical cargo bench -p htsget-search --all-features ``` -Alternatively if you are using `cargo-criterion` and want a machine readable JSON output, run: +Alternatively if you are using `cargo-criterion` and want a machine-readable JSON output, run: ```sh cargo criterion --bench search-benchmarks --message-format=json -- LIGHT 1> search-benchmarks.json