
\chapter{Creation of Neo4J database} % Insert appendix links to the scripts!
\section{Data preparation}
We created a Neo4J database and used a slightly modified Cytoscape app to
communicate with it. The input to the database is a processed version of a
protein-protein interaction (PPI) file from StringDB. The file was 3.9\,GB in
size, and each line described an interaction between two proteins: two Ensembl
Protein IDs (ENSP), the type of interaction, the direction of the interaction
and a score for the interaction.

\begin{verbatim}
item_id_a	item_id_b	mode	action	a_is_acting	score
9606.ENSP00000000233	9606.ENSP00000000233	binding		0	800
\end{verbatim}

We then stripped this file down to a file containing only the ENSP IDs, with one
ID per line.

\begin{verbatim}
ENSP00000000233
ENSP00000005340
\end{verbatim}
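
The stripping step can be sketched in Python as follows. This is a minimal illustration and not the script we used: the function name \textit{extract\_ensp\_ids} is hypothetical, and removing duplicate IDs is an assumption, since the text only states that the result has one ID per line.

```python
import csv
import io


def extract_ensp_ids(lines):
    """Return the ENSP IDs from item_id_a and item_id_b in first-seen order."""
    reader = csv.DictReader(lines, delimiter="\t")
    seen = {}  # dict used as an ordered set (assumption: duplicates are dropped)
    for row in reader:
        for column in ("item_id_a", "item_id_b"):
            # Drop the taxon prefix: "9606.ENSP00000000233" -> "ENSP00000000233".
            ensp = row[column].split(".", 1)[-1]
            seen.setdefault(ensp, None)
    return list(seen)


# The header and data line shown in the StringDB excerpt above.
sample = io.StringIO(
    "item_id_a\titem_id_b\tmode\taction\ta_is_acting\tscore\n"
    "9606.ENSP00000000233\t9606.ENSP00000000233\tbinding\t\t0\t800\n"
)
ensp_ids = extract_ensp_ids(sample)
```

For a file of this size, the function can be given an open file handle instead of the in-memory sample, so that the file is read line by line.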

The next step is the conversion from protein IDs to gene names, each coupled
with an Entrez ID. We retrieved this information from UniProt using the stripped
ENSP IDs. The resulting file is formatted like this:

\begin{verbatim}
Entry	EnsembleID	    Gene names
P84085	ENSP00000000233	ARF5
\end{verbatim}

We then match the \textit{EnsembleID} from UniProt, which is an ENSP ID, to the
\textit{item\_id\_a} and \textit{item\_id\_b} fields from StringDB (without the
\textit{9606.} prefix). The Entrez ID, which is the \textit{Entry} from UniProt,
is combined with \textit{Gene names}, using a hash sign (\textit{\#}) between
them as a separator to make them easier to split when creating Cypher and
GraphML import queries for Neo4J. This combination makes up a single gene, and
each line in the file we write this information to consists of two genes and the
type of binding between them. The type of binding is provided by the StringDB
file mentioned first in this section. % label and refer instead of mentioning it!
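
The matching and combination can be sketched as below. This is an illustration under stated assumptions, not our actual script: the function name \textit{build\_interaction\_lines} is hypothetical, the tab-separated output layout is assumed (the exact layout of the output file is not specified here), and interactions with an unmapped protein are assumed to be skipped.

```python
def build_interaction_lines(uniprot_rows, string_rows):
    """Turn StringDB interactions into 'gene_a<TAB>gene_b<TAB>mode' lines."""
    # ENSP ID -> "Entry#Gene names", e.g. "ENSP00000000233" -> "P84085#ARF5".
    gene_by_ensp = {ensp: f"{entry}#{gene}" for entry, ensp, gene in uniprot_rows}
    lines = []
    for item_id_a, item_id_b, mode in string_rows:
        # Remove the "9606." prefix before looking up the UniProt mapping.
        ensp_a = item_id_a.split(".", 1)[-1]
        ensp_b = item_id_b.split(".", 1)[-1]
        if ensp_a not in gene_by_ensp or ensp_b not in gene_by_ensp:
            continue  # assumption: interactions without a mapping are skipped
        lines.append("\t".join((gene_by_ensp[ensp_a], gene_by_ensp[ensp_b], mode)))
    return lines


# Rows taken from the UniProt and StringDB excerpts in this section.
uniprot_rows = [("P84085", "ENSP00000000233", "ARF5")]
string_rows = [("9606.ENSP00000000233", "9606.ENSP00000000233", "binding")]
interaction_lines = build_interaction_lines(uniprot_rows, string_rows)
```

Building the lookup dictionary once keeps the matching to a single pass over the StringDB file.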

\section{Neo4J setup and data import}

\subsection{Setup}
Little setup is needed to get the database up and running. Every setting that
we changed while testing and using the database was set in the
\textit{conf/neo4j-server.properties} file, whose path is relative to the Neo4J
installation directory.

The database location does not need to be changed, but knowing where the data
of a Neo4J database is stored is important when working with several database
instances.

\begin{verbatim}
org.neo4j.server.database.location=data/string_mini.db
org.neo4j.server.webserver.port=7474
\end{verbatim}

Note that this path is relative to the location the Neo4J database is run from.
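
When several instances are in use, it can help to resolve the configured location programmatically. The sketch below is an illustration only: the helper names and the directory \textit{/opt/neo4j} are hypothetical, and the parser handles only simple \textit{key=value} lines and comments.

```python
from pathlib import PurePosixPath


def read_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def database_path(settings, run_dir):
    """Resolve the database location against the directory Neo4J is run from."""
    return PurePosixPath(run_dir) / settings["org.neo4j.server.database.location"]


properties_text = (
    "org.neo4j.server.database.location=data/string_mini.db\n"
    "org.neo4j.server.webserver.port=7474\n"
)
settings = read_properties(properties_text)
# "/opt/neo4j" is a hypothetical run directory used for illustration.
resolved = database_path(settings, "/opt/neo4j")
```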

Being able to turn authentication on and off is also useful for testing
purposes. Communicating with a Neo4J database instance that requires
authentication also requires our modified version of the Cytoscape app
\textit{CyNeo4J}, because the original app supports everything we need except
authenticating with a database. On the database side, authentication is
controlled by a single boolean setting (\textit{true}/\textit{false}).

\begin{verbatim}
dbms.security.auth_enabled=true
\end{verbatim}
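
To illustrate what a client has to add once authentication is enabled, the sketch below builds a request that carries credentials. It assumes that the server accepts HTTP Basic authentication on its REST endpoint \textit{/db/data/}. The function name and the credentials are hypothetical, and this is not the code of the modified app. The request is only constructed, not sent.

```python
import base64
import urllib.request


def build_authenticated_request(base_url, user, password):
    """Build (but do not send) a request with an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url + "/db/data/")
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Accept", "application/json")
    return request


# Hypothetical credentials; the port matches the webserver port configured above.
request = build_authenticated_request("http://localhost:7474", "neo4j", "secret")
```

With authentication disabled, the same request without the \textit{Authorization} header would be sufficient.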

\subsection{Import}
The tooling for importing data into Neo4J is not very mature at this point. One
option is to use Neo4J's query language, \textit{Cypher}. However, importing the
complete gene data from StringDB and UniProt combined takes a long time in
Cypher, even though it is Neo4J's native language. The alternative we used was
\textit{neo4j-shell-tools} \cite{neo4j-tools}. It is easy to install and
supports import from the \textit{CSV}, \textit{Geoff} and \textit{GraphML} file
types. A GraphML import took under 1 minute on an old and underperforming laptop
with 4\,GB of memory, a 2\,GHz CPU and a 128\,GB SSD \cite{laptop}. In
comparison, the Cypher import of the same amount of data took several hours. We
did not test CSV import, because Cytoscape does not support exporting a whole
network to a CSV file, only partial information. Cytoscape does, however, have
good support for exporting a whole network to a GraphML file.
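
As an illustration of the Cypher route, the sketch below turns one interaction into one Cypher statement. It is not the script we used: the function name, the node label \textit{Gene}, the property names \textit{id} and \textit{name}, and the use of the upper-cased interaction type as relationship type are all assumptions.

```python
import json
import re


def cypher_for_interaction(gene_a, gene_b, mode):
    """Build one Cypher statement for two 'EntrezID#GeneName' genes."""
    # Relationship types cannot be passed as parameters, so validate the name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", mode):
        raise ValueError(f"unsupported interaction type: {mode!r}")
    id_a, name_a = gene_a.split("#", 1)
    id_b, name_b = gene_b.split("#", 1)
    # json.dumps yields double-quoted, backslash-escaped string literals.
    return (
        f"MERGE (a:Gene {{id: {json.dumps(id_a)}}}) "
        f"ON CREATE SET a.name = {json.dumps(name_a)} "
        f"MERGE (b:Gene {{id: {json.dumps(id_b)}}}) "
        f"ON CREATE SET b.name = {json.dumps(name_b)} "
        f"MERGE (a)-[:{mode.upper()}]->(b);"
    )


statement = cypher_for_interaction("P84085#ARF5", "P84085#ARF5", "binding")
```

Using \textit{MERGE} rather than \textit{CREATE} avoids duplicate nodes when a gene occurs in many interactions.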

The Python scripts initialized the Neo4J database with data formatted as either
GraphML or Cypher. Importing either of them results in the same data in the
database, but because GraphML import is so much faster than Cypher import, we
used GraphML exclusively to initialize the database. The nodes are created with
the aforementioned \textit{Entrez ID} as primary key and \textit{Gene name} as
the displayed name in the Neo4J GUI.
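
A GraphML file of this kind can be generated with Python's standard library, as sketched below. This is a minimal illustration, not our actual script: the function name \textit{build\_graphml}, the attribute keys \textit{name} and \textit{mode}, and the graph ID are assumptions. Any additional attributes that \textit{neo4j-shell-tools} may expect are not covered.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def _tag(name):
    """Qualify a tag name with the GraphML namespace."""
    return f"{{{GRAPHML_NS}}}{name}"


def build_graphml(interactions):
    """Serialize (gene_a, gene_b, mode) triples of 'EntrezID#GeneName' genes."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(_tag("graphml"))
    for key_id, target in (("name", "node"), ("mode", "edge")):
        ET.SubElement(root, _tag("key"), {
            "id": key_id, "for": target,
            "attr.name": key_id, "attr.type": "string",
        })
    graph = ET.SubElement(root, _tag("graph"), {"id": "G", "edgedefault": "directed"})
    seen, edges = set(), []
    for gene_a, gene_b, mode in interactions:
        ids = []
        for gene in (gene_a, gene_b):
            node_id, name = gene.split("#", 1)
            ids.append(node_id)
            if node_id not in seen:  # emit each gene node only once
                seen.add(node_id)
                node = ET.SubElement(graph, _tag("node"), {"id": node_id})
                ET.SubElement(node, _tag("data"), {"key": "name"}).text = name
        edges.append((ids[0], ids[1], mode))
    for source, target, mode in edges:  # edges are written after all nodes
        edge = ET.SubElement(graph, _tag("edge"), {"source": source, "target": target})
        ET.SubElement(edge, _tag("data"), {"key": "mode"}).text = mode
    return ET.tostring(root, encoding="unicode")


graphml_text = build_graphml([("P84085#ARF5", "P84085#ARF5", "binding")])
```

Here the part before the \textit{\#} becomes the node ID and the part after it becomes the \textit{name} attribute, mirroring the primary key and displayed name described above.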