- Clone the source code repository from https://github.com/iossifovlab/gpf/ and switch to the `gpf` directory.
- Set up the `gpf` Python environment using `conda`:

      conda env create --name gpf --file ./environment.yml
      conda env update --name gpf --file ./dev-environment.yml

- Activate the `gpf` environment:

      conda activate gpf

- Clone the `data-hg19-empty` instance repository from https://github.com/iossifovlab/data-hg19-empty and switch to the `data-hg19-empty` directory.
- Review the content of the `gpf_instance.yaml` file. It specifies the minimal configuration needed for a GPF instance:
  - reference genome
  - gene models
- Consider caching the Genomic Resources Repository. By default, GPF uses the Genomic Resources Repository (GRR) located at https://www.iossifovlab.com/distribution/public/genomic-resources-repository/ without caching resources locally. You can change this by adding the following snippet to your `gpf_instance.yaml` file:

      grr:
        id: "%(instance_id)s"
        type: "url"
        url: "https://www.iossifovlab.com/distribution/public/genomic-resources-repository/"
        cache_dir: "%($HOME)s/grrCache"

  The `cache_dir` should point to a folder on your local filesystem that is suitable for storing large files.
- Consider using the `impala` genotype storage. By default, GPF uses a filesystem genotype storage that has limited capabilities. You can switch to the `impala` genotype storage by adding the following snippet to your `gpf_instance.yaml` file and adapting it to your setup:

      genotype_storage:
        default: genotype_impala
        storage:
          genotype_impala:
            dir: "%($DAE_DB_DIR)s/work/"
            hdfs:
              base_dir: /user/%(instance_id)s/studies
              host: localhost
              port: 8020
              replication: 1
            impala:
              db: "%(instance_id)s"
              hosts:
              - localhost
              pool_size: 3
              port: 21050
            storage_type: impala

- When you have finished adjusting the GPF instance configuration, define the environment variable `DAE_DB_DIR` to point to the `data-hg19-empty` directory:

      export DAE_DB_DIR=<path to data-hg19-empty directory>
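
Before running any GPF tools, it can help to confirm that `DAE_DB_DIR` is set as intended. The following is a minimal sketch and not part of GPF: `check_gpf_instance` is a hypothetical helper name, and it assumes that `gpf_instance.yaml` sits at the top level of the directory `DAE_DB_DIR` points to, as the steps above suggest for `data-hg19-empty`.

```shell
# Hypothetical helper (not part of GPF): verify that DAE_DB_DIR points to a
# directory that contains a gpf_instance.yaml file.
check_gpf_instance() {
    # The variable must be defined and non-empty.
    if [ -z "${DAE_DB_DIR:-}" ]; then
        echo "DAE_DB_DIR is not set" >&2
        return 1
    fi
    # It must name an existing directory.
    if [ ! -d "$DAE_DB_DIR" ]; then
        echo "DAE_DB_DIR is not a directory: $DAE_DB_DIR" >&2
        return 1
    fi
    # Assumption: the instance configuration lives at the top level.
    if [ ! -f "$DAE_DB_DIR/gpf_instance.yaml" ]; then
        echo "gpf_instance.yaml not found in $DAE_DB_DIR" >&2
        return 1
    fi
    echo "GPF instance configuration: $DAE_DB_DIR/gpf_instance.yaml"
}
```

After exporting `DAE_DB_DIR`, calling `check_gpf_instance` in the same shell either prints the path of the configuration file or reports which check failed and returns a non-zero status.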