GSOC 2016 project summary and guide

#1. Project Description

Knowledge of the three-dimensional structure of a protein often gives a basic understanding of its function. Variations in protein structure that result from nonsynonymous mutations may be related to cancer. With proper alignment and visualization tools, researchers can gain insight by seeing where mutations fall in the three-dimensional structure of a protein; the distribution and location of those mutations may reveal details that help in understanding cancer at the molecular level.

cBioPortal (http://www.cbioportal.org), a cancer genomics research web site, provides a variety of large-scale cancer genomics analysis and visualization tools. However, the protein 3D structure visualization tool currently used in cBioPortal is out of date: it cannot automatically update the mutation mapping when PDB releases new or revised structures each week. Building a pipeline that maps mutations in protein sequences to 3D models of protein structures will help cBioPortal incorporate the updated structures and make them available for visualization by researchers and physicians.

The goals of this project are:

Build a software pipeline that automates the alignment of human protein sequences (including all isoforms) to PDB structures. The pipeline must be able to align against the complete set of PDB structures, and also update previous results with alignments from incremental updates to the set of available PDB structures.
Build a web API that provides access to these alignments.

This project has three major components:

Pipeline Initiation: Downloaded Ensembl human protein sequences are aligned against PDB protein sequences with blastp, and all alignments between Ensembl sequences and target PDB sequences are stored in a purpose-built database. To move to a newer release of the Ensembl protein sequences, all the alignments in the database are rebuilt.
Pipeline Update: The pipeline periodically fetches new protein structure sequences from PDB's weekly update and processes the modified and obsolete sequences. blastp is run on the new sequences and the alignments in the database are updated.
API: Two JSON-based API functions provide query access to the alignment database.

#2. Installation Guide

##Prerequisites:

OS: Linux 64bit
java: jdk_1.8.0 (https://java.com/)
maven: 3.3.9 (https://maven.apache.org/)
mysql: Ver 15.1 Distrib 10.0.21-MariaDB (https://www.mysql.com/)
blast: 2.4.0+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.4.0/)
At least 25 GB of free disk space is needed. To download and parse a local clone of PDB, set usePdbSeqLocalTag to true in ${workdir}/pdb-annotation/pdb-alignment-pipeline/src/main/resources/application.properties; this requires roughly another 22 GB of disk space.

*Please make sure java, mvn, mysql, and blastp are all on your PATH.
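
The two prerequisites that are easy to get wrong (tools on the PATH, free disk space) can be checked with a short script before starting. The following is a minimal Python sketch and is not part of pdb-annotation: the function names are hypothetical, the command names come from the list above, and the 25 GB threshold is the figure quoted above.

```python
import shutil

# Command names and the 25 GB figure come from the prerequisite list;
# the helper functions are hypothetical and not part of pdb-annotation.
REQUIRED_TOOLS = ["java", "mvn", "mysql", "blastp"]
REQUIRED_FREE_GB = 25


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be located on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


def free_space_gb(path, disk_usage=shutil.disk_usage):
    """Return the free space, in GB (2**30 bytes), on the file system holding path."""
    return disk_usage(path).free / 2**30


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    print("Missing tools:", ", ".join(missing) if missing else "none")
    free_gb = free_space_gb(".")
    print(f"Free space: {free_gb:.1f} GB (need at least {REQUIRED_FREE_GB} GB)")
```

Run it from the directory you intend to use as the workspace, since free space is measured on the file system that contains the current directory.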

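##Step 1. Initialize the pipeline and the database

Create the database in MySQL:
Create an empty database named "pdb", with user "your-username" and password "your-password". Note that the commands below spell the database name as PDB; database names are case-sensitive on most Linux installations, so use the same spelling wherever the database name is configured. At the mysql prompt, type: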
CREATE DATABASE PDB;
CREATE USER 'your-username'@'localhost' IDENTIFIED BY 'your-password';
GRANT SELECT, INSERT, DELETE, ALTER, CREATE, DROP on PDB.* TO 'your-username'@'localhost';
FLUSH PRIVILEGES;
Clone the project from GitHub:
Select an appropriate location in your file system as your code workspace, referred to as ${workdir}. Type:
cd ${workdir}
git clone https://github.com/cBioPortal/pdb-annotation.git
Change the project settings:
Edit the variables in the settings file: ${workdir}/pdb-annotation/pdb-alignment-pipeline/src/main/resources/application.properties
(i) Change "workspace" to an appropriate location; all the sequences and essential material will be stored there.
(ii) Change "resource_dir" to ${workdir}/pdb-annotation/pdb/src/main/resources/
(iii) * To use other Ensembl test sequences, change both ensembl_download_file and ensembl_fasta_file in your workspace
Compile the project with Maven:
Type:
cd ${workdir}/pdb-annotation/
mvn package
Initialize the pipeline:
Type:
cd ${workdir}/pdb-annotation/pdb-alignment-pipeline/target/
java -jar -Xmx7000m pdb-0.1.0.jar init

##Step 2. Check the API

In ${workdir}/pdb-annotation/pdb-alignment-api/src/main/resources/application.properties, set spring.datasource.username to your-username and spring.datasource.password to your-password.
Start the project's built-in web service. Type:
cd ${workdir}/pdb-annotation/pdb-alignment-api/
mvn spring-boot:run
Check Swagger-UI in your web browser:
http://localhost:8080/swagger-ui.html
You can view the documentation for the two API functions by clicking "alignment-controller". The Swagger UI should look like this screenshot: https://drive.google.com/file/d/0B2cS3fM07DZNanN5MG1vOXlES3c/view?usp=sharing
Check the API in your web browser with StructureMappingQuery?ensemblId={id} and ProteinIdentifierRecognitionQuery?ensemblId={id}, e.g.
http://localhost:8080/pdb_annotation/StructureMappingQuery?ensemblId=ENSP00000483207.2
http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000483207.2
The results are returned as JSON.
You now have a web service that provides Ensembl-to-PDB alignment information. You can also deploy the jar to another web service component.
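
The same two checks can be made from a script instead of a browser. The following is a minimal Python sketch using only the standard library; the host, port, and paths are the ones from the example URLs above, while the function names are hypothetical and the shape of the returned JSON is not assumed beyond it being valid JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example URLs above; change it if the service
# runs on another host or port.
BASE_URL = "http://localhost:8080/pdb_annotation"


def build_query_url(endpoint, ensembl_id, base_url=BASE_URL):
    """Build the URL for one of the two API functions."""
    return f"{base_url}/{endpoint}?{urlencode({'ensemblId': ensembl_id})}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the service from this step running, `fetch_json(build_query_url("StructureMappingQuery", "ENSP00000483207.2"))` should return the decoded JSON for the first example URL.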

##Step 3. Weekly update

The pipeline can automatically update its alignments from the PDB weekly update.

Choice 1: Update Weekly: Use the pipeline's internal weekly scheduler. Type:
cd ${workdir}/pdb-annotation/pdb-alignment-pipeline/target/
java -jar -Xmx7000m pdb-0.1.0.jar weeklyupdate
The project will now update automatically every week. If needed, change the settings in application.properties.
Choice 2: Update Immediately: Run a single update right away. On Linux, CRON can be used to invoke this command periodically (for example, weekly). Type:
cd ${workdir}/pdb-annotation/pdb-alignment-pipeline/target/
java -jar pdb-0.1.0.jar update

The project will now update immediately. If the command is registered with CRON, CRON will invoke the update periodically. See https://en.wikipedia.org/wiki/Cron for more details.
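
For Choice 2, CRON needs a single command that both changes into the target directory and starts the update. One way to package that is a small wrapper script, sketched below in Python. The script name, the function names, and the crontab entry in the comment are hypothetical; only the java command itself comes from this guide.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical crontab entry (03:00 every Wednesday); adjust both paths:
#   0 3 * * 3 python3 /path/to/run_update.py /path/to/pdb-annotation/pdb-alignment-pipeline/target

JAR_NAME = "pdb-0.1.0.jar"


def build_update_command(jar_name=JAR_NAME):
    """Return the single-update command from Choice 2 as an argument list."""
    return ["java", "-jar", jar_name, "update"]


def run_update(target_dir):
    """Run one update from target_dir and return the exit code of java."""
    completed = subprocess.run(build_update_command(), cwd=Path(target_dir))
    return completed.returncode


# Only runs when called as: python3 run_update.py <target-dir>
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(run_update(sys.argv[1]))
```

Passing the command as a list avoids shell quoting issues, and returning the exit code lets CRON's mail or logging show whether the update failed.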

##Running Time on the Development Platform:

Development Platform: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 8 cores, 8 GB memory; Linux version 3.18.7-200.fc21.x86_64 (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC) ) #1 SMP; OpenJDK version "1.8.0_65" 64-Bit Server VM (build 25.65-b01, mixed mode); mysql Ver 15.1 Distrib 10.0.21-MariaDB, for Linux (x86_64) using EditLine wrapper

Typical running time for pipeline init: 80219.905 seconds (around 22 hours)
Typical running time for pipeline update: 1062.796 seconds (around 18 minutes)

#3. Project Features

Pdb-Annotation is a Java project based on the Spring Boot (http://projects.spring.io/spring-boot/) framework
All the code is freely accessible on GitHub
The runtime execution log is written to pipeline.log in {$workspace}
Users can change settings in application.properties and log4j.properties
Users can either download and parse a full local clone of PDB to get accurate PDB sequences, or quickly download curated PDB sequences
Users can run the weekly update either with CRON on Linux or with the built-in scheduler
Each weekly update creates a new folder named YYYYMMDD to store its essential files
Swagger (http://swagger.io/) is enabled to organize the API and its documentation
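
Because the weekly update folders are named YYYYMMDD, they sort chronologically and are easy to pick out from other files in the workspace. The following Python sketch shows one way to do that; the function names are hypothetical, and it assumes the folder names are plain calendar dates with no prefix or suffix.

```python
from datetime import date, datetime


def update_folder_name(day):
    """Format a date the way the weekly update folders are named (YYYYMMDD)."""
    return day.strftime("%Y%m%d")


def parse_update_folder(name):
    """Return the date encoded in a YYYYMMDD name, or None if it is not one."""
    if len(name) != 8 or not name.isdigit():
        return None
    try:
        return datetime.strptime(name, "%Y%m%d").date()
    except ValueError:  # eight digits, but not a real calendar date
        return None


def latest_update_folder(names):
    """Return the most recent YYYYMMDD name in names, or None if there is none."""
    dated = [(parse_update_folder(name), name) for name in names]
    dated = [(day, name) for day, name in dated if day is not None]
    return max(dated)[1] if dated else None
```

For example, given the names in a workspace directory listing, `latest_update_folder` ignores entries such as pipeline.log and returns the folder of the most recent update.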

#4. API Documentation

##StructureMappingQuery
pdb_annotation/StructureMappingQuery?ensemblId={$ensemblId}

Inputs:

"ensemblId": (string) ensembl ID

Outputs:
"results": (List), a list of result objects, one per alignment that meets the criteria. Each result contains:

"alignmentid": (string), ID of the alignment
"bitscore": (int), bit score reported by blastp
"chain": (string), PDB chain name
"ensemblAlign": (string), aligned Ensembl sequence string
"ensemblfrom": (int), start residue of the alignment in the Ensembl sequence
"ensemblid": (string), Ensembl ID in the alignment
"ensemblto": (int), end residue of the alignment in the Ensembl sequence
"evalue": (string), E-value reported by blastp
"identity": (int), identity reported by blastp
"identp": (int), positive identity reported by blastp
"midlineAlign": (string), midline string between the aligned Ensembl and PDB sequences
"pdbAlign": (string), aligned PDB sequence string
"pdbfrom": (int), start residue of the alignment in the PDB sequence
"pdbid": (string), PDB ID
"pdbno": (string), PDB ID and PDB chain joined by "_"
"pdbto": (int), end residue of the alignment in the PDB sequence
"updateDate": (string), date when the alignment was added to the system

Example:
http://localhost:8080/pdb_annotation/StructureMappingQuery?ensemblId=ENSP00000483207.2

Functional Details:
If the queried ensemblId did not align to any sequence in the PDB structures, no result is reported for it. Otherwise, a list of result objects for the PDB alignments is returned in JSON format.
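
A client will usually want to filter and rank the returned result objects. The following Python sketch does this using the "evalue" and "bitscore" fields described above. The function name, the E-value cutoff of 1e-10, and the sample records (made-up values, not real API output) are assumptions for illustration. The function takes the list of result objects; how that list is extracted from the decoded response is left to the caller.

```python
def rank_alignments(results, max_evalue=1e-10):
    """Keep results at or below max_evalue, ordered by bit score, highest first.

    "evalue" is documented as a string, so it is converted with float().
    The default cutoff is an arbitrary illustration, not a project setting.
    """
    kept = [r for r in results if float(r["evalue"]) <= max_evalue]
    return sorted(kept, key=lambda r: r["bitscore"], reverse=True)


# Hypothetical records with made-up values, showing only the fields used here.
sample = [
    {"pdbno": "1abc_A", "bitscore": 250, "evalue": "1e-80"},
    {"pdbno": "2xyz_B", "bitscore": 40, "evalue": "0.002"},
    {"pdbno": "3def_A", "bitscore": 310, "evalue": "3e-95"},
]
print([r["pdbno"] for r in rank_alignments(sample)])  # ['3def_A', '1abc_A']
```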

##ProteinIdentifierRecognitionQuery
pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId={$ensemblId}

Inputs:

"ensemblId" : (string) ensembl ID

Outputs:

"isRecognized" : (boolean), whether the ensemblId is valid in the system

Example:
http://localhost:8080/pdb_annotation/ProteinIdentifierRecognitionQuery?ensemblId=ENSP00000483207.2

Functional Details:
If the ensemblId is part of the sequence database that was queried against the PDB sequences, the output will be true (regardless of whether any acceptable alignments were generated). This allows the client to distinguish between “not examined” protein identifiers and cases where alignment generation was attempted but no acceptable alignments were found.
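
The distinction described above can be expressed as a small client-side helper that combines the answers of the two API functions. This is a Python sketch only; the function name and the three status labels are hypothetical, and the caller is assumed to have already extracted "isRecognized" and the list of result objects from the two responses.

```python
def classify_protein(is_recognized, alignments):
    """Combine the answers of the two API functions into one status.

    is_recognized: the "isRecognized" value for the identifier
    alignments:    the list of result objects from StructureMappingQuery
    """
    if not is_recognized:
        # The identifier was never part of the sequences run against PDB.
        return "not examined"
    if not alignments:
        # Alignment was attempted, but nothing acceptable was found.
        return "no acceptable alignment"
    return "aligned"
```

For example, `classify_protein(True, [])` returns "no acceptable alignment", which a client could not tell apart from "not examined" using StructureMappingQuery alone.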

#5. Code Organization
UML Picture of the pipeline:
https://drive.google.com/file/d/0B2cS3fM07DZNaElDYl83cnZqU2M/view?usp=sharing

#6. Further Steps and Challenges

Most of the requirements assigned to the project were completed during the three months of coding in 2016, but several pieces of work remain to meet all the needs of the cBioPortal project. For example, another API query function should be built to find the PDB coordinate for a specific residue in the protein sequence. Some challenges also still need to be dealt with, such as the segmentation problem and gaps in the PDB sequence.
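
To make the residue lookup concrete, the following Python sketch walks one alignment column by column and maps a position in the Ensembl sequence to the matching position in the PDB sequence. It is not the project's implementation. It assumes that "ensemblfrom" and "pdbfrom" are 1-based positions of the first aligned residues, that gaps are written as "-" in "ensemblAlign" and "pdbAlign", and that the result is an index into the PDB sequence rather than the residue numbering used in the PDB coordinate file. The function name is hypothetical.

```python
def map_residue(alignment, ensembl_pos):
    """Map a position in the Ensembl sequence to a position in the PDB sequence.

    Returns None if the position is outside the alignment or is aligned to a gap.
    """
    e_pos = alignment["ensemblfrom"]  # Ensembl position of the current column
    p_pos = alignment["pdbfrom"]      # PDB position of the current column
    for e_char, p_char in zip(alignment["ensemblAlign"], alignment["pdbAlign"]):
        if e_char != "-" and e_pos == ensembl_pos:
            return p_pos if p_char != "-" else None
        # Advance each counter only when its sequence has a residue here.
        if e_char != "-":
            e_pos += 1
        if p_char != "-":
            p_pos += 1
    return None
```

A gap in the PDB string at the queried column is exactly the "gaps in the PDB sequence" case mentioned above: there is no structural position to report, so the sketch returns None.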

#7. Juexin Wang's Contribution
All of Juexin Wang's contributions to the project can be found at https://github.com/cBioPortal/pdb-annotation/commits/master?author=juexinwang

#8. Acknowledgements

This project was funded by Google Summer of Code 2016. The project is detailed in https://summerofcode.withgoogle.com/projects/#5319347803258880. Thanks to Google, cBioPortal at Memorial Sloan Kettering Cancer Center (http://www.cbioportal.org/), and my mentors Robert Sheridan, Pieter Lukasse, Selcuk Onur Sumer and Jianjiong Gao.

#9. Contact

If you have any questions, please contact Juexin Wang (wang.juexin (at) gmail.com)
