ukbREST on sample data

meliao edited this page Sep 12, 2019 · 41 revisions

These instructions walk you through loading sample data into the database and running some queries. We show both approaches: using the ukbREST Docker image, or running the ukbREST code natively.

You need to clone/download this repository.

Run PostgreSQL using Docker

If you followed the installation instructions, you already have Docker, the PostgreSQL Docker image, and either the ukbREST code or its Docker image.

If PostgreSQL is not running yet, you can start a test instance using Docker with this command (from the root of the repository folder):

$ bash utils/scripts/run_postgresql.sh

When you stop it (Ctrl-C) all data will be wiped out. This is fine since this is just a test.

Load sample CSV data into PostgreSQL

Here we load the sample data in tests/data/pheno2sql/example14/ into our PostgreSQL database.

Using Python scripts

First, we'll load the data using the Python scripts directly. In a terminal, move to the repository folder and run:

$ export PYTHONPATH=.
$ conda activate ukbrest
$ python ukbrest/load_data.py \
  --pheno-dir tests/data/pheno2sql/example14/ \
  --db-uri postgresql://test:test@localhost:5432/ukb \
  --bgen-sample-file tests/data/pheno2sql/example14/impv2.sample
[...]
2018-07-18 10:28:43,772 - ukbrest - INFO - Loading finished!

The sample data in tests/data/pheno2sql/example14/ (.csv and .html files) has now been loaded into the PostgreSQL database. We explain the schema of this database below.

Using the ukbREST Docker image

You can do exactly the same using the ukbREST Docker image. The command in this case is (you need to change the paths to match your system):

$ docker run --rm --net ukb \
  -v /Users/miltondp/projects/ukbrest/tests/data/pheno2sql/example14/:/var/lib/genotype \
  -v /Users/miltondp/projects/ukbrest/tests/data/pheno2sql/example14/:/var/lib/phenotype \
  -e UKBREST_GENOTYPE_BGEN_SAMPLE_FILE="impv2.sample" \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load
[...]
2018-07-20 22:50:34,962 - ukbrest - INFO - Loading finished!

This creates a container, loads the data into PostgreSQL, and quits.

Note that here we need to specify the full path of our directory. Using the Docker image seems more complicated, but you don't have to install any dependencies. Here is a brief explanation of the parameters:

  • --rm: the container is removed once it finishes.
  • --net ukb: specifies the network name you created in the installation instructions. Both PostgreSQL and ukbREST must run inside the same network.
  • -v /my/local/path/with/genotypes:/var/lib/genotype: with -v you mount a local path inside the container. Here you are saying that your local path /my/local/path/with/genotypes has the BGEN files and their sample file.
  • -v /my/local/path/with/phenotypes:/var/lib/phenotype: here you specify the local path where the phenotype data resides (.csv and .html files). Here we use simulated data (and the phenotype and genotype directories are the same); with real data, you generate these files using the ukbconv utility provided by UK Biobank.
  • -e UKBREST_GENOTYPE_BGEN_SAMPLE_FILE="impv2.sample": the -e parameter in Docker sets an environment variable. ukbREST supports many of these variables, which let you change its behavior. In this case, UKBREST_GENOTYPE_BGEN_SAMPLE_FILE specifies the name of the BGEN sample file, relative to /var/lib/genotype.
  • -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb": here we set another environment variable, UKBREST_DB_URI, which is the connection string to PostgreSQL. test/test is our user/password, pg is the name of the PostgreSQL container and ukb is the database name (you specified both of these when running the container).
  • --load: indicates to load the main data (csv and html).
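
To make the parts of the UKBREST_DB_URI connection string concrete, here is a minimal Python sketch that splits it using only the standard library. The helper name describe_db_uri is hypothetical (it is not part of ukbREST); it only illustrates how the URI is structured.

```python
from urllib.parse import urlsplit


def describe_db_uri(uri):
    """Split a PostgreSQL connection URI into its named parts."""
    parts = urlsplit(uri)
    return {
        "user": parts.username,
        "password": parts.password,
        # Inside the Docker network this is the PostgreSQL container name.
        "host": parts.hostname,
        "port": parts.port,
        # The path is "/<database>", so drop the leading slash.
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_db_uri("postgresql://test:test@pg:5432/ukb"))
```

When you run the Python scripts natively, the only part that changes is the host (localhost instead of pg), as in the --db-uri value used earlier.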

You can set or change many environment variables with -e. You can take a look at the full list on this page.

PostgreSQL data schema

That's it: your data is now loaded into PostgreSQL. The rest of this section gives some technical details about the schema; reading it is useful but not required, so you can skip to the next section on this page.

The previous steps created the necessary tables in your PostgreSQL instance. In the folder tests/data/pheno2sql/example14/ you'll find two CSV files with their corresponding HTML metadata files (which describe the data types, etc.); this simulated data mimics the files you get from UK Biobank. The data itself is loaded into tables ukb_pheno_0_00 and ukb_pheno_1_00 (there could be many more, depending on the number of data fields in your application). You'll also find an impv2.sample file: this is the sample file of your BGENs, and the order of its sample IDs is loaded into table bgen_samples. Table fields holds all the metadata present in the HTML files, that is, the information for every data field: its description, the table where it is stored, field ID, data type, coding, instances and arrays.
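
If you want to check that these tables exist and contain rows, a small sketch like the following may help. It works with any Python DB-API 2.0 connection. Against the test instance you would pass something like psycopg2.connect("postgresql://test:test@localhost:5432/ukb"), which assumes the psycopg2 package is installed. The helper name count_rows is hypothetical, and the demo at the bottom uses an in-memory SQLite database with invented columns purely so that the sketch runs without PostgreSQL.

```python
import sqlite3

# Table names taken from the schema description above.
UKBREST_TABLES = ("ukb_pheno_0_00", "ukb_pheno_1_00", "bgen_samples", "fields")


def count_rows(conn, tables=UKBREST_TABLES):
    """Return {table: row count} using any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


if __name__ == "__main__":
    # Stand-in database: the single-column layout here is invented and
    # does not mirror the real ukbREST schema.
    demo = sqlite3.connect(":memory:")
    for name in UKBREST_TABLES:
        demo.execute(f"CREATE TABLE {name} (eid INTEGER)")
    demo.execute("INSERT INTO bgen_samples VALUES (1000010)")
    print(count_rows(demo))
```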

Load additional files

ukbREST allows you to load any extra data about samples. For example, you can load the Sample-QC or relatedness files, which in previous releases were delivered separately (not in CSV files). We'll cover this later.

Start ukbREST server

Now all your data is inside the PostgreSQL database, and you just need to start the server. There are several ways to do this; pick the one that best fits your use case. The easiest, as already mentioned, is to use the ukbREST Docker image. If you choose to run the server natively, you can run Flask for testing purposes or Gunicorn for production.

Using the ukbREST Docker image

This is the command to run the ukbREST server using the Docker image:

$ docker run --rm --net ukb -p 127.0.0.1:5000:5000 \
  -v /Users/miltondp/projects/ukbrest/tests/data/example01:/var/lib/genotype \
  -e UKBREST_SQL_CHUNKSIZE="10000" \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest

[2018-07-23 23:01:11 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2018-07-23 23:01:11 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2018-07-23 23:01:11 +0000] [1] [INFO] Using worker: eventlet
[2018-07-23 23:01:11 +0000] [10] [INFO] Booting worker with pid: 10
[2018-07-23 23:01:11 +0000] [11] [INFO] Booting worker with pid: 11
[2018-07-23 23:01:11 +0000] [12] [INFO] Booting worker with pid: 12
[2018-07-23 23:01:11 +0000] [13] [INFO] Booting worker with pid: 13
2018-07-23 23:01:11,947 - ukbrest - WARNING - UKBREST_SQL_CHUNKSIZE was not set, no chunksize for SQL queries, what can lead to memory problems.
[...]

COMPLETE THIS GENOTYPE FILE FORMAT NEEDED: This will run ukbREST using our Docker image. For this example, we are reading

Behind the scenes it runs Gunicorn. Take a look at the Dockerfile to see how it is set up, in particular the environment variables WEB_CONCURRENCY and GUNICORN_CMD_ARGS, which change the number of workers (4 by default) and other Gunicorn options, respectively.

Using other methods

There may be cases where you cannot run Docker. In those cases, you need to run the code directly in your Python environment. Please follow this other page, where you can find instructions to run the ukbREST server using Flask (for testing purposes) or Gunicorn (for production) without Docker. That page also covers how to activate HTTP Basic Authentication and SSL.

Query examples

Phenotype data

You can use curl to send a simple query to the ukbREST server and get the data:

$ curl -G \
  -H "Accept: text/csv" \
  "http://127.0.0.1:5000/ukbrest/api/v1.0/phenotype" \
  --data-urlencode "columns=c101_0_0 as variable_name"

eid,variable_name
1000010,NA
1000021,0.0401
1000030,NA
1000041,0.5632
1000050,0.4852
1000061,0.1192

In this case you asked for data field 101 (instance 0 and array 0) in CSV format. You also asked to rename that column to variable_name in the CSV output.
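
The same request can be made from Python. The sketch below uses only the standard library and assumes the server from the previous section is listening on 127.0.0.1:5000. The helper names (column_name, phenotype_url, parse_csv, fetch_phenotype) are hypothetical and not part of ukbREST. Mapping the NA marker to None is a choice made here, based on the sample output above.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5000/ukbrest/api/v1.0"


def column_name(field_id, instance=0, array=0):
    """Build a column name such as c101_0_0 (field 101, instance 0, array 0)."""
    return f"c{field_id}_{instance}_{array}"


def phenotype_url(column_spec, base_url=BASE_URL):
    """Build the GET URL for one column specification."""
    return f"{base_url}/phenotype?{urlencode({'columns': column_spec})}"


def parse_csv(text):
    """Parse a CSV response into dicts, mapping the NA marker to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == "NA" else v) for k, v in row.items()})
    return rows


def fetch_phenotype(column_spec):
    """Send the request to a running server and return the parsed rows."""
    request = Request(phenotype_url(column_spec), headers={"Accept": "text/csv"})
    with urlopen(request) as response:
        return parse_csv(response.read().decode("utf-8"))


if __name__ == "__main__":
    spec = f"{column_name(101)} as variable_name"
    print(phenotype_url(spec))
```

Calling fetch_phenotype("c101_0_0 as variable_name") against the running test server should return one dict per sample, with keys eid and variable_name.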

If you have a more complex data request (for instance, you need to filter samples out, use different data specifications and so on), you can use a YAML file to specify your data needs:

$ cat query.yml
COMPLETE

$ curl -X POST \
  -H "Accept: text/csv" \
  -F file=@query.yml \
  -F section=data \
  http://localhost:5000/ukbrest/api/v1.0/query

Genotype data

In this example you query for all variants between positions 0 and 1000 in chromosome 1 (the chromosome and the start and stop positions all appear in the URL path):

$ curl -G \
  -H "Accept: application/octet-stream" \
  "http://localhost:5000/ukbrest/api/v1.0/genotype/1/positions/0/1000" \
  > test.bgen
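
To show how the URL is put together, here is a minimal Python sketch of the same request using only the standard library. It assumes the server is listening on localhost:5000. The helper names genotype_positions_url and download_bgen are hypothetical and not part of ukbREST.

```python
import shutil
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5000/ukbrest/api/v1.0"


def genotype_positions_url(chromosome, start, stop, base_url=BASE_URL):
    """Build the URL for all variants of `chromosome` between `start` and `stop`."""
    return f"{base_url}/genotype/{chromosome}/positions/{start}/{stop}"


def download_bgen(chromosome, start, stop, out_path):
    """Stream the binary response from a running server into `out_path`."""
    request = Request(
        genotype_positions_url(chromosome, start, stop),
        headers={"Accept": "application/octet-stream"},
    )
    with urlopen(request) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)


if __name__ == "__main__":
    print(genotype_positions_url(1, 0, 1000))
```

Calling download_bgen(1, 0, 1000, "test.bgen") mirrors the curl command above, writing the response to test.bgen.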

Check out the genotype query page for more examples.