In this section, we will set up a Dataflow job to extract and transform data from CSV files stored as objects in Google Cloud Storage and load them into a Neo4j graph database instance.
Documentation for this section can be found here
For this part of the lab you will need a Google Cloud Platform account with permission to deploy the following services:
- Neo4j Aura: https://console.cloud.google.com/marketplace/product/endpoints/prod.n4gcp.neo4j.io
- Cloud Storage: https://console.cloud.google.com/storage/
- Dataflow: https://console.cloud.google.com/dataflow/
In this example we will use the London public transport network as our test dataset.
The template files we will use for this example are located here
The data sources for this template need to be CSV files stored in Google Cloud Storage buckets.
CSV files must fulfill some constraints in order to be used as data sources for the Google Cloud to Neo4j template:
- They must not contain a header row. Column names are specified in the ordered_field_names attribute, so the files should contain data rows only.
- They must not contain empty rows.
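If your exported CSV files still contain a header row or blank lines, a small preprocessing step can clean them up before upload. The following is a minimal sketch; the file names are placeholders for your own data.

```python
import csv

# Hypothetical file names -- replace with your own paths.
SOURCE = "London_tube_lines.csv"
CLEANED = "London_tube_lines_no_headers.csv"

with open(SOURCE, newline="") as src, open(CLEANED, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    next(reader, None)  # drop the header row
    for row in reader:
        if any(cell.strip() for cell in row):  # skip empty rows
            writer.writerow(row)
```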
In order to deploy a Dataflow job for Neo4j you will need two JSON templates:
- A Dataflow job specification template, which specifies the URI of the source files you've uploaded to Google Cloud Storage. This template defines where to extract the data from and how to transform and load it into your graph data model in Neo4j. For this demo we will use this job spec template.
The template should refer to the Google Cloud Storage URI in the sources section using the following format:
"sources": [
{
"type": "text",
"name": "tube_lines",
"uri": "gs://neo4j-datasets/dataflow-london-transport/gcs-to-neo4j/source-data/London_tube_lines_no_headers.csv",
"format": "EXCEL",
"delimiter": ",",
"ordered_field_names": "Tube_Line,From_Station,To_Station"
}
You can use our demo template as a reference, or see our online documentation for more details.
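Before uploading, it is worth checking that each source's ordered_field_names has the same number of entries as the columns in its CSV file. The snippet below is a minimal sketch of such a check, assuming the job spec and CSV files are available locally under hypothetical names.

```python
import csv
import json

# Hypothetical local paths -- adjust to your own files.
JOB_SPEC = "job-spec.json"
CSV_FILES = {"tube_lines": "London_tube_lines_no_headers.csv"}

with open(JOB_SPEC) as f:
    spec = json.load(f)

for source in spec.get("sources", []):
    fields = source["ordered_field_names"].split(",")
    with open(CSV_FILES[source["name"]], newline="") as f:
        first_row = next(csv.reader(f))
    if len(first_row) != len(fields):
        print(f"{source['name']}: CSV has {len(first_row)} columns, "
              f"but {len(fields)} field names are declared")
    else:
        print(f"{source['name']}: OK ({len(fields)} columns)")
```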
This notebook will guide you through the steps of setting up a Google Cloud Storage bucket and uploading the necessary template files to it. You can also do this step manually.
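If you prefer to script the upload instead of using the notebook or the console, a minimal sketch with the google-cloud-storage client library is shown below. The bucket name and object paths are placeholders, and the bucket is assumed to exist already.

```python
from google.cloud import storage  # pip install google-cloud-storage

# Hypothetical bucket and file names -- substitute your own.
BUCKET = "my-neo4j-dataflow-demo"
FILES = {
    "source-data/London_tube_lines_no_headers.csv": "London_tube_lines_no_headers.csv",
    "templates/job-spec.json": "job-spec.json",
    "templates/neo4j-connection.json": "neo4j-connection.json",
}

client = storage.Client()
bucket = client.bucket(BUCKET)

for blob_name, local_path in FILES.items():
    bucket.blob(blob_name).upload_from_filename(local_path)
    print(f"Uploaded gs://{BUCKET}/{blob_name}")
```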
- A Neo4j connection template. This template contains the login credentials for your Neo4j instance. There is a sample connection template available here, but in general the format should look like this:
```json
{
  "server_url": "neo4j+s://<instance-id>.databases.neo4j.io",
  "database": "neo4j",
  "auth_type": "basic",
  "username": "neo4j",
  "pwd": "<password>"
}
```
There is also a helper Python script available which can convert a Neo4j Aura credentials file into the correct JSON format.
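A minimal version of such a script could look like the sketch below. It assumes the credentials file downloaded from the Aura console contains KEY=VALUE lines (NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD); the file names are placeholders.

```python
import json

# Hypothetical file names -- replace with your own paths.
CREDENTIALS_FILE = "Neo4j-credentials.txt"
CONNECTION_FILE = "neo4j-connection.json"

creds = {}
with open(CREDENTIALS_FILE) as f:
    for line in f:
        if "=" in line:
            key, value = line.strip().split("=", 1)
            creds[key] = value

connection = {
    "server_url": creds["NEO4J_URI"],
    "database": "neo4j",
    "auth_type": "basic",
    "username": creds["NEO4J_USERNAME"],
    "pwd": creds["NEO4J_PASSWORD"],
}

with open(CONNECTION_FILE, "w") as f:
    json.dump(connection, f, indent=2)
```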
Once your template files are uploaded to the storage buckets you can continue on to configure and set up your Dataflow job.
- Go to the Dataflow console
- Select "Create New Job"
- Give your job a name and select the region
- Click on the dropdown menu and type "neo4j"
- Select the "Google Cloud to Neo4j" template
- Configure the Dataflow job specification template
- Browse to your storage bucket and select your job spec template
- Configure the Neo4j connection template
- Scroll down and open up the "Optional Parameters" section
- Fill in the location of the Neo4j connection template file, or, if you are using Google Secret Manager, enter the Secret ID.
- NOTE: Although these two fields are marked "Optional", you must fill in one of them.
- Scroll down to the bottom of the page, click "Run Job", and wait for the job to finish (about 5-10 minutes for this demo).
- The job is complete once all of the stages turn green and the job status field says "Succeeded".
- Now you can log into the Neo4j instance and the graph is ready to explore!
- You can also explore the graph using Neo4j Bloom.
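If you want to verify the load from a script rather than the browser, a quick check with the official Neo4j Python driver could look like the following sketch, reusing the values from your connection template.

```python
from neo4j import GraphDatabase  # pip install neo4j

# Use the same values as in your Neo4j connection template.
URI = "neo4j+s://<instance-id>.databases.neo4j.io"
AUTH = ("neo4j", "<password>")

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        result = session.run(
            "MATCH (n) RETURN labels(n) AS labels, count(n) AS count"
        )
        for record in result:
            print(record["labels"], record["count"])
```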