email: [email protected]
This repository contains the information needed for the NODES 2021 workshop entitled "Creating a Knowledge Graph with Neo4j: A Simple Machine Learning Approach." There have been a couple of runnings of this course, and it has evolved a bit each time. In the original running (2021-06-11), it was based on a Docker container with Jupyter connecting to a Neo4j database on your local machine. In the most recent running (2021-09-17) we will be a bit more flexible: we will be using free Neo4j Sandbox instances, and you can use either a Jupyter notebook set up on your own machine or Google Colab.
For the sake of backwards compatibility, the Docker container used in the original running is still available in this repository. There are instructions below on how to use it.
- Google Colab
- Note that this means you will need a Google account to use this!
- Neo4j Sandbox
- Neo4j Blank Graph Data Science Sandbox
Below you will find information on how to create both an API key for the Google Knowledge Graph and an API token for Wikidata. If you would like to create your own graph (rather than using the graphs supplied in this repo), you will need to create these before the workshop.
Additionally, note that this repository will be updated regularly between now and the workshop. You are encouraged to pull the latest version of it just before the workshop to make sure you have the most up-to-date code.
The two methods of creating a knowledge graph covered in this workshop are as follows:
- A version based on natural language processing (NLP) that uses spaCy to extract (subject, verb, object) triples from Wikipedia and the Google Knowledge Graph via their API (a minimal sketch of the triple extraction follows this list).
- A version that queries Wikidata given a series of items (based on the Wikidata Q-values) and their claims (using the Wikidata P-values). The Q-values are used to create the subjects and objects of the triples while the P-values are used to create the verbs.
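To make the first approach concrete, here is a minimal sketch (not the workshop code itself) of pulling rough (subject, verb, object) triples out of text with spaCy's dependency parse. It assumes the `en_core_web_sm` model is installed, and the example sentence is purely illustrative:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Return rough (subject, verb, object) triples found in the text."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            # Look among the verb's children for a nominal subject and an object.
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(extract_triples("Marie Curie discovered polonium. She later won the Nobel Prize."))
```

A real pipeline also has to merge multi-token entities (for example using `doc.noun_chunks` or the NER spans) and deal with passives and pronouns, which is exactly the kind of tuning discussed below.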
While either of these methods works, the benefit of the first approach is that you have a limitless number of verbs (since they are detected more or less automatically from the text), but you will have a problem with entity disambiguation. The benefit of the second approach is that Wikidata handles the entity disambiguation for you, but you have to supply the list of verbs (claims) that you care about.
Personally, I prefer the second approach. The reason is that you don't have to do much NLP on the unstructured text. You will still use named entity recognition on the input text, but spaCy handles that pretty easily. The first approach, on the other hand, relies on the ability to accurately detect the verbs and attribute them to subjects and objects, which is very complicated. The second approach is much cleaner. Further, complicated NLP approaches like the first require much more tuning. NLP is not a so-called "silver bullet": it requires a lot of tuning and is very specific to the language and vocabulary. If the vocabulary is particularly technical, you will likely find that Wikidata provides superior results.
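For reference, the named entity step with spaCy is roughly this small a sketch (again assuming `en_core_web_sm` is installed; the sentence is just an example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Neo4j is a graph database company headquartered in San Mateo, California.")

# Print each detected entity span and its predicted label (ORG, GPE, etc.).
for ent in doc.ents:
    print(ent.text, ent.label_)
```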
(This is only used for the first approach above, and just for demonstration purposes. You can easily substitute any other data source, including Wikidata.)
We will be working with the Google Knowledge Graph REST API in this example. Users are permitted 100,000 free API calls per day, but an API key is required. You can read more on how to get this API key here. Once the key is created, it is recommended that you store it in a file named `.api_key` in the `notebooks/` subdirectory of this repo.
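As a rough sketch of how that key might be used, the following calls the Knowledge Graph Search API's `entities:search` endpoint with Python's standard library. The query string and file path are placeholders you should adapt to your own setup:

```python
import json
import urllib.parse
import urllib.request

# Read the API key saved as described above (adjust the path if needed).
with open("notebooks/.api_key") as f:
    api_key = f.read().strip()

params = urllib.parse.urlencode({
    "query": "Neo4j",   # placeholder search term
    "key": api_key,
    "limit": 5,
})
url = "https://kgsearch.googleapis.com/v1/entities:search?" + params

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each element of itemListElement holds a "result" with name/description fields.
for element in data.get("itemListElement", []):
    result = element.get("result", {})
    print(result.get("name"), "-", result.get("description"))
```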
We will be using Pywikibot to scrape entries from Wikidata. In order to do this, you will need to create a token for this bot. Directions on how to do so can be found here. Once you have that token, save it into a file named `.wiki_api_token` in the `notebooks/` subdirectory.
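Once Pywikibot is configured, fetching an item and its claims (the Q- and P-values that become the subjects, verbs, and objects of the second approach) looks roughly like this sketch. Q42 and P31 are just illustrative choices:

```python
import pywikibot

# Connect to Wikidata and its data repository.
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

item = pywikibot.ItemPage(repo, "Q42")  # Q42 = Douglas Adams
item_data = item.get()                  # labels, descriptions, claims, ...

# P31 = "instance of"; each claim's target supplies the object of a triple.
for claim in item_data["claims"].get("P31", []):
    target = claim.getTarget()
    print("Q42", "P31", target.id)
```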
With Docker and docker-compose installed, from the CLI:
docker-compose build
docker-compose up
(Note: if you have already built the container and run it once, Neo4j assumes ownership of `data/` as group and user `7474:7474`, which means the only way to access it is via `sudo`. Therefore, you might need to run the above two commands via `sudo`.)
Take the link for Jupyter Lab from the terminal (it has the notebook token with it) and copy and paste it into your web browser. To open the Neo4j browser, navigate to `localhost:7474`. The login is `neo4j` and the password is `kgDemo`. These are set on line 15 of `docker-compose.yml`, and you can change them to anything you like.
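If you want to query the same database from a notebook, a minimal connection sketch with the official Python driver, assuming the default Bolt port (7687) and the credentials above, would be:

```python
from neo4j import GraphDatabase

# Credentials match the docker-compose.yml defaults noted above.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "kgDemo"))

with driver.session() as session:
    result = session.run("MATCH (n) RETURN count(n) AS node_count")
    print(result.single()["node_count"])

driver.close()
```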
When you are done, you can shut down the container by hitting `CTRL-C` in the terminal and then, at the prompt:
docker-compose down
There are some variables in the file, such as `NEO4J_dbms_memory_pagecache_size`, that have been set to somewhat arbitrary values. These should be set to values appropriate for the available memory on your computer. If they exceed the available memory on your machine, you can either edit them or comment them out entirely.
You might find it convenient to have two different databases, one for each method. In order to achieve this, edit lines 8 and 9 in `docker-compose.yml` to reflect that (i.e. make a different directory for each graph). You might find this helpful if, like me, you screw up one and don't want to recreate the other. :)
The first time you run this container from the repo, the permissions on `data/` will be changed to root. This means that all subsequent runs of `docker-compose` will need to be executed via `sudo`. However, this will also change the directory forwarding in the `.yml` file at line 8, since `$HOME` will change from your personal login to `root`. To adjust for this, you can explicitly change `$HOME` to a hard-coded path, or you can leave it and find your database backups at `/root/graph_data/`.
- Bite-Sized Neo4j for Data Scientists
- Google Colab
- Project Jupyter
- Docker for Data Science -- A Step by Step Guide
- Google Knowledge Graph Search API
- Neo4j
- spaCy Documentation
- Wikipedia package docs
This workshop has been run a couple of times and I will add to this list as the videos become available.