From 39f74a7687e8a43272b734239130d15ccaca8a16 Mon Sep 17 00:00:00 2001
From: Cristina Yenyxe Gonzalez Garcia
Date: Fri, 27 Jan 2017 14:17:50 +0000
Subject: [PATCH 1/3] Updated project description and build instructions

---
 README.md | 54 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/README.md b/README.md
index 110d5776a..b1f1ca501 100644
--- a/README.md
+++ b/README.md
@@ -1,25 +1,44 @@
 # European Variation Archive (EVA) Pipeline v2 [![Build Status](https://travis-ci.org/EBIvariation/eva-pipeline.svg)](https://travis-ci.org/EBIvariation/eva-pipeline)
 
-This repository contains work in progress for the next version of the European Variation Archive pipeline. If you are
-looking for the production source code, please check https://github.com/EBIvariation/eva-ws for the REST web services
-API, and the `master` branch of this same repository, https://github.com/EBIvariation/eva-pipeline.
+The European Variation Archive pipeline processes VCF files and stores the variants contained within them in a database, in a format that supports efficient searching. The EVA pipeline produces and stores the following information:
 
-The core of the new EVA is a pipeline for VCF file processing, implemented purely in Java and based on the Spring Batch
-framework, instead of combining the Luigi workflow manager with Java binaries. The reason for using Spring Batch is
-tracking job statuses and avoiding waste of computation, as result of repeating just the needed steps when something
-fails, in the more automated way possible.
+* A normalized representation of the variants contained within a VCF file
+* Variant annotation: consequence type, SIFT and Polyphen scores, etc
+* Statistics: allele and genotype counts and frequencies
 
-## Features
+You can find a more detailed description of these operations in the [project wiki](https://github.com/EBIvariation/eva-pipeline/wiki/Jobs). Please visit the [EVA website](http://www.ebi.ac.uk/eva/?Variant Browser) to see a public service depending on this pipeline, or the [EVA web services repository](https://github.com/EBIvariation/eva-ws) for more information on the API.
 
-The current goal is to allow indexing VCF files into MongoDB.
+The pipeline automatically tracks the job status, and avoids waste of computation by resuming a job in the exact point where it failed; successful steps already executed are automatically skipped.
 
-The approach is to have two different jobs: one for genotyped files, and another for aggregated files.
+## Dependencies
 
-Both jobs will have four (logical) steps: transformation, loading, statistics and annotation.
+The pipeline is implemented purely in Java and uses the Maven build system.
 
-## Using this tool
+In order to run, the pipeline needs access to a MongoDB database instance. The easiest way to set one up in a local machine is [using Docker](https://hub.docker.com/_/mongo/).
 
-You may compile the project with `mvn package` and call the produced jar directly, as `$ java -jar eva-pipeline/target/eva-pipeline-0.1.jar`.
+If you want to generate and store variant annotations you will also need to [download Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html). Please note this software requires Perl.
+
+Finally, before compiling the pipeline itself, you will need to clone and build its dependencies running these commands from a folder of your choice:
+
+```
+git clone https://github.com/EBIvariation/biodata.git
+cd biodata && mvn clean install -DskipTests
+cd ..
+git clone https://github.com/EBIvariation/opencga.git
+cd opencga && mvn clean install -DskipTests
+```
+
+### Build
+
+The latest stable version can be found in the [master](https://github.com/EBIvariation/eva-pipeline/tree/master) branch. [develop](https://github.com/EBIvariation/eva-pipeline/tree/develop) contains work in progress, which is fully tested but could be more unstable.
+
+If a MongoDB instance is available in the machine where you are running the build, you can test and build the application with `mvn test package`.
+
+If a MongoDB instance is not available, please run `mvn package -DskipTests`.
+
+## Run
+
+Once successfully built, you can simply run the produced JAR file with `java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar`.
 
 We did not implement a custom command line, we are using the
 `org.springframework.boot.autoconfigure.batch.JobLauncherCommandLineRunner` class to obtain all the parameters from
@@ -29,24 +48,23 @@ command line. Almost all the parameters you can use are showed in the `example-l
 
 Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
 
-`application.properties` is used to configure database connections and applications the pipeline depends on:
-[OpenCGA](https://github.com/opencb/opencga/tree/hotfix/0.5) and [Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/index.html).
+`application.properties` is used to configure database connections and applications the pipeline depends on (OpenCGA and Ensembl VEP, see _Dependencies_ section).
 
 `load-genotyped-vcf.properties`, `load-aggregated-vcf.properties` and `initialize-database.properties` are job-specific configurations.
 
-If more convenient for your use case, the global and job configuration files can be merged into one.
+If more convenient for your use case, the global configuration and job parameters files can be merged into one.
 
 It is likely that you will need to change some parameters to fit your installation and/or or configure your job. For instance,
 the location of your MongoDB databases, your OpenCGA/VEP installation directory, the folder were your files are, the type of job to run, etc.
 
 By using these properties files, a job can be launched with a single command like:
 
-    java -jar target/eva-pipeline-0.1.jar \
+    java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.config.location=file:examples/application.properties,file:examples/load-genotyped-vcf.properties
 
 The contents from the configuration files can be provided directly as command-line arguments, like the following:
 
-    java -jar target/eva-pipeline-0.1.jar \
+    java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.batch.job.names=load-genotyped-vcf \
         input.vcf=/path/to/file.vcf \
         input.study.name=My sample study \

From 944e03ae4dcf534364ae7a1625c8d5ee4ed633b3 Mon Sep 17 00:00:00 2001
From: Cristina Yenyxe Gonzalez Garcia
Date: Fri, 27 Jan 2017 16:44:50 +0000
Subject: [PATCH 2/3] Small wording changes

---
 README.md | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index b1f1ca501..1957dc77a 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,22 @@
 # European Variation Archive (EVA) Pipeline v2 [![Build Status](https://travis-ci.org/EBIvariation/eva-pipeline.svg)](https://travis-ci.org/EBIvariation/eva-pipeline)
 
-The European Variation Archive pipeline processes VCF files and stores the variants contained within them in a database, in a format that supports efficient searching. The EVA pipeline produces and stores the following information:
+The European Variation Archive pipeline processes Variant Call Format (VCF) files and stores the variants contained within them in a database, in a format that supports efficient searching. The EVA pipeline produces and stores the following information:
 
 * A normalized representation of the variants contained within a VCF file
 * Variant annotation: consequence type, SIFT and Polyphen scores, etc
 * Statistics: allele and genotype counts and frequencies
 
-You can find a more detailed description of these operations in the [project wiki](https://github.com/EBIvariation/eva-pipeline/wiki/Jobs). Please visit the [EVA website](http://www.ebi.ac.uk/eva/?Variant Browser) to see a public service depending on this pipeline, or the [EVA web services repository](https://github.com/EBIvariation/eva-ws) for more information on the API.
+You can find a more detailed description of these operations in the [project wiki](https://github.com/EBIvariation/eva-pipeline/wiki/Jobs). Please visit the [EVA website](http://www.ebi.ac.uk/eva/?Variant Browser) to see a public service depending on this pipeline, and the [EVA web services repository](https://github.com/EBIvariation/eva-ws) for more information on the API.
 
 The pipeline automatically tracks the job status, and avoids waste of computation by resuming a job in the exact point where it failed; successful steps already executed are automatically skipped.
 
 ## Dependencies
 
-The pipeline is implemented purely in Java and uses the Maven build system.
+The pipeline has been implemented in Java and uses the Maven build system.
 
 In order to run, the pipeline needs access to a MongoDB database instance. The easiest way to set one up in a local machine is [using Docker](https://hub.docker.com/_/mongo/).
 
-If you want to generate and store variant annotations you will also need to [download Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html). Please note this software requires Perl.
+If you want to generate and store variant annotations you will also need to [download Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html). Please note this software requires Perl to be installed.
 
 Finally, before compiling the pipeline itself, you will need to clone and build its dependencies running these commands from a folder of your choice:
 
@@ -32,21 +32,13 @@ cd opencga && mvn clean install -DskipTests
 
 The latest stable version can be found in the [master](https://github.com/EBIvariation/eva-pipeline/tree/master) branch. [develop](https://github.com/EBIvariation/eva-pipeline/tree/develop) contains work in progress, which is fully tested but could be more unstable.
 
-If a MongoDB instance is available in the machine where you are running the build, you can test and build the application with `mvn test package`.
-
-If a MongoDB instance is not available, please run `mvn package -DskipTests`.
+If a MongoDB instance is available in the machine where you are running the build, you can test and build the application with `mvn test package`, otherwise please run `mvn package -DskipTests`.
 
 ## Run
 
 Once successfully built, you can simply run the produced JAR file with `java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar`.
 
-We did not implement a custom command line, we are using the
-`org.springframework.boot.autoconfigure.batch.JobLauncherCommandLineRunner` class to obtain all the parameters from
-command line. Almost all the parameters you can use are showed in the `example-launch.sh` script.
-
-### Examples
-
-Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
+Arguments to run the pipeline can be provided either using the command line or a properties file. Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
 
 `application.properties` is used to configure database connections and applications the pipeline depends on (OpenCGA and Ensembl VEP, see _Dependencies_ section).
 

From 1345d4b22e175ac2594fa7176772b50794ebd152 Mon Sep 17 00:00:00 2001
From: Cristina Yenyxe Gonzalez Garcia
Date: Tue, 31 Jan 2017 09:51:50 +0000
Subject: [PATCH 3/3] MongoDB version and OpenCGA configuration file

---
 README.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 1957dc77a..58d777328 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ The pipeline automatically tracks the job status, and avoids waste of computatio
 
 The pipeline has been implemented in Java and uses the Maven build system.
 
-In order to run, the pipeline needs access to a MongoDB database instance. The easiest way to set one up in a local machine is [using Docker](https://hub.docker.com/_/mongo/).
+In order to run, the pipeline needs access to a MongoDB 3.x database instance. The easiest way to set one up in a local machine is [using Docker](https://hub.docker.com/_/mongo/).
 
 If you want to generate and store variant annotations you will also need to [download Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html). Please note this software requires Perl to be installed.
 
@@ -28,7 +28,7 @@ git clone https://github.com/EBIvariation/opencga.git
 cd opencga && mvn clean install -DskipTests
 ```
 
-### Build
+## Build
 
 The latest stable version can be found in the [master](https://github.com/EBIvariation/eva-pipeline/tree/master) branch. [develop](https://github.com/EBIvariation/eva-pipeline/tree/develop) contains work in progress, which is fully tested but could be more unstable.
 
@@ -36,8 +36,6 @@ If a MongoDB instance is available in the machine where you are running the buil
 
 ## Run
 
-Once successfully built, you can simply run the produced JAR file with `java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar`.
-
 Arguments to run the pipeline can be provided either using the command line or a properties file. Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
 
 `application.properties` is used to configure database connections and applications the pipeline depends on (OpenCGA and Ensembl VEP, see _Dependencies_ section).
@@ -46,15 +44,16 @@ Arguments to run the pipeline can be provided either using the command line or a
 
 If more convenient for your use case, the global configuration and job parameters files can be merged into one.
 
-It is likely that you will need to change some parameters to fit your installation and/or or configure your job. For instance,
-the location of your MongoDB databases, your OpenCGA/VEP installation directory, the folder were your files are, the type of job to run, etc.
+It is likely that you will need to edit some parameters to match your environment and/or configure your job. For instance, connection details to MongoDB databases, OpenCGA/VEP installation directories, the folder containing the input files, the type of job to run, etc.
+
+**Note:** Most of the environment configuration can be provided directly to the application, but MongoDB connection details also need to be filled in the OpenCGA configuration file. The installation folder is by default located in `/opencga-app/build`, but can be moved to any destination of your choice. The configuration is located in `/conf/storage-mongodb.properties`.
 
 By using these properties files, a job can be launched with a single command like:
 
     java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.config.location=file:examples/application.properties,file:examples/load-genotyped-vcf.properties
 
-The contents from the configuration files can be provided directly as command-line arguments, like the following:
+The contents from the configuration files can be also provided directly as command-line arguments, like the following:
 
     java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.batch.job.names=load-genotyped-vcf \