diff --git a/README.md b/README.md
index 110d5776a..58d777328 100644
--- a/README.md
+++ b/README.md
@@ -1,52 +1,61 @@
 # European Variation Archive (EVA) Pipeline v2 [![Build Status](https://travis-ci.org/EBIvariation/eva-pipeline.svg)](https://travis-ci.org/EBIvariation/eva-pipeline)
-This repository contains work in progress for the next version of the European Variation Archive pipeline. If you are
-looking for the production source code, please check https://github.com/EBIvariation/eva-ws for the REST web services
-API, and the `master` branch of this same repository, https://github.com/EBIvariation/eva-pipeline.
+The European Variation Archive pipeline processes Variant Call Format (VCF) files and stores the variants contained within them in a database, in a format that supports efficient searching. The EVA pipeline produces and stores the following information:
-The core of the new EVA is a pipeline for VCF file processing, implemented purely in Java and based on the Spring Batch
-framework, instead of combining the Luigi workflow manager with Java binaries. The reason for using Spring Batch is
-tracking job statuses and avoiding waste of computation, as result of repeating just the needed steps when something
-fails, in the more automated way possible.
+* A normalized representation of the variants contained within a VCF file
+* Variant annotation: consequence type, SIFT and Polyphen scores, etc
+* Statistics: allele and genotype counts and frequencies
-## Features
+You can find a more detailed description of these operations in the [project wiki](https://github.com/EBIvariation/eva-pipeline/wiki/Jobs). Please visit the [EVA website](http://www.ebi.ac.uk/eva/?Variant Browser) to see a public service depending on this pipeline, and the [EVA web services repository](https://github.com/EBIvariation/eva-ws) for more information on the API.
-The current goal is to allow indexing VCF files into MongoDB.
+The pipeline automatically tracks the job status, and avoids waste of computation by resuming a job in the exact point where it failed; successful steps already executed are automatically skipped.
-The approach is to have two different jobs: one for genotyped files, and another for aggregated files.
+## Dependencies
-Both jobs will have four (logical) steps: transformation, loading, statistics and annotation.
+The pipeline has been implemented in Java and uses the Maven build system.
-## Using this tool
+In order to run, the pipeline needs access to a MongoDB 3.x database instance. The easiest way to set one up in a local machine is [using Docker](https://hub.docker.com/_/mongo/).
-You may compile the project with `mvn package` and call the produced jar directly, as `$ java -jar eva-pipeline/target/eva-pipeline-0.1.jar`.
+If you want to generate and store variant annotations you will also need to [download Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html). Please note this software requires Perl to be installed.
-We did not implement a custom command line, we are using the
-`org.springframework.boot.autoconfigure.batch.JobLauncherCommandLineRunner` class to obtain all the parameters from
-command line. Almost all the parameters you can use are showed in the `example-launch.sh` script.
+Finally, before compiling the pipeline itself, you will need to clone and build its dependencies running these commands from a folder of your choice:
-### Examples
+```
+git clone https://github.com/EBIvariation/biodata.git
+cd biodata && mvn clean install -DskipTests
+cd ..
+git clone https://github.com/EBIvariation/opencga.git
+cd opencga && mvn clean install -DskipTests
+```
-Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
+## Build
-`application.properties` is used to configure database connections and applications the pipeline depends on:
-[OpenCGA](https://github.com/opencb/opencga/tree/hotfix/0.5) and [Ensembl VEP](http://www.ensembl.org/info/docs/tools/vep/index.html).
+The latest stable version can be found in the [master](https://github.com/EBIvariation/eva-pipeline/tree/master) branch. [develop](https://github.com/EBIvariation/eva-pipeline/tree/develop) contains work in progress, which is fully tested but could be more unstable.
+
+If a MongoDB instance is available in the machine where you are running the build, you can test and build the application with `mvn test package`, otherwise please run `mvn package -DskipTests`.
+
+## Run
+
+Arguments to run the pipeline can be provided either using the command line or a properties file. Skeletons to load genotyped and aggregated VCF files are provided in the `examples` folder.
+
+`application.properties` is used to configure database connections and applications the pipeline depends on (OpenCGA and Ensembl VEP, see _Dependencies_ section).
 `load-genotyped-vcf.properties`, `load-aggregated-vcf.properties` and `initialize-database.properties` are job-specific configurations.
-If more convenient for your use case, the global and job configuration files can be merged into one.
+If more convenient for your use case, the global configuration and job parameters files can be merged into one.
+
+It is likely that you will need to edit some parameters to match your environment and/or configure your job. For instance, connection details to MongoDB databases, OpenCGA/VEP installation directories, the folder containing the input files, the type of job to run, etc.
-It is likely that you will need to change some parameters to fit your installation and/or or configure your job. For instance,
-the location of your MongoDB databases, your OpenCGA/VEP installation directory, the folder were your files are, the type of job to run, etc.
+**Note:** Most of the environment configuration can be provided directly to the application, but MongoDB connection details also need to be filled in the OpenCGA configuration file. The installation folder is by default located in `/opencga-app/build`, but can be moved to any destination of your choice. The configuration is located in `/conf/storage-mongodb.properties`.
 By using these properties files, a job can be launched with a single command like:
-    java -jar target/eva-pipeline-0.1.jar \
+    java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.config.location=file:examples/application.properties,file:examples/load-genotyped-vcf.properties
-The contents from the configuration files can be provided directly as command-line arguments, like the following:
+The contents from the configuration files can be also provided directly as command-line arguments, like the following:
-    java -jar target/eva-pipeline-0.1.jar \
+    java -jar target/eva-pipeline-2.0-beta2-SNAPSHOT.jar \
         --spring.batch.job.names=load-genotyped-vcf \
         input.vcf=/path/to/file.vcf \
         input.study.name=My sample study \