- Python 3.12
- pyenv (optional)
- PyCharm (optional)
- Neo4j Aura DB (optional)
  - create an account at the Neo4j Console and create a new free tier instance there
- Neo4j Desktop
  - create a new project
  - connect to the newly created remote Neo4j Aura instance or
  - create a local Neo4j database management system (DBMS)
It is recommended to use a virtual environment in order to isolate libraries used in this project from the environment of your operating system. To do so, run the following in the project directory:
# create the virtual environment in the project directory; do this once
python3 -m venv venv
# activate the environment; do this before working with the scraper
source venv/bin/activate
# install the required libraries
pip3 install -r requirements.txt
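To confirm that the virtual environment is actually active before installing or running anything, a small check like the following can help. It is a minimal sketch using only the standard library; the function name `in_virtualenv` is made up for illustration and is not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    state = "active" if in_virtualenv() else "not active"
    print(f"virtual environment is {state}: {sys.prefix}")
```

If the script reports "not active", run the `source venv/bin/activate` step again in the current shell.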
- menu File -> Settings... -> search for "Python Interpreter" -> Add Interpreter
- keep the selection on Virtualenv Environment
- either create a New venv (if not done before) or choose an Existing one
- Run configuration
  - Script: <your local path to the cloned Git repo>/app/scraping.py
  - Working directory: <your local path to the cloned Git repo>/app
  - Environment variables: PYTHONUNBUFFERED=1
- copy config_example to config and adapt the connection values there to match either your local or your remote database
[Neo4j]
NEO4J_URI=neo4j+s://<instance_id>.databases.neo4j.io:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=<password>
- Download mirror.db linked in this release and copy it to app/mirror.db
- execute the scraper either with the run configuration in PyCharm or by calling python ./scraping.py from a console in the folder app; this imports the contents of mirror.db (as configured in app/oparl/fakerequest.py) into the Neo4j database (as configured in config)
- connect to your database and open it in Neo4j Browser
- now you can query the data
- find paper by reference id:
MATCH(p:Paper) WHERE p.reference = 'VII-P-08704-VSP-01' RETURN p
- find entities by modified date:
MATCH(p) WHERE p.modified > datetime('2023-07-17') RETURN p
- find entities of type NamedEntity by modified date:
MATCH(p:NamedEntity) WHERE p.modified > datetime('2023-07-17') RETURN p
- find persons that issued more than 6 papers:
- only the persons:
MATCH (pe:Person)-[r]-(pa:Paper) WITH pe, count(r) AS cnt WHERE cnt > 6 RETURN pe, cnt
- persons and the papers that belong to them:
MATCH(pa:Paper)<-[r]-(pe:Person) WITH count(pa) as cnt, pe, collect(pa) as papers WHERE cnt > 6 RETURN pe, papers, cnt
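The queries above can also be run from Python instead of Neo4j Browser. The following is a minimal sketch, assuming the official `neo4j` driver package is installed and that the config file uses the INI layout shown earlier. The helper names `load_neo4j_settings`, `paper_by_reference` and `run_query` are hypothetical and not part of the scraper, and the way the scraper itself reads its config may differ. Passing the reference id as a query parameter avoids building Cypher strings by hand.

```python
import configparser
from pathlib import Path


def load_neo4j_settings(path: str = "config") -> dict:
    """Read the [Neo4j] section and return uri, username and password."""
    # interpolation=None keeps '%' characters in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(Path(path).read_text(encoding="utf-8"))
    section = parser["Neo4j"]
    # configparser matches option names case-insensitively,
    # so the upper-case keys from the file can be used as written.
    return {
        "uri": section["NEO4J_URI"],
        "username": section["NEO4J_USERNAME"],
        "password": section["NEO4J_PASSWORD"],
    }


def paper_by_reference(reference: str) -> tuple[str, dict]:
    """Build the 'find paper by reference id' query with a parameter."""
    query = "MATCH (p:Paper) WHERE p.reference = $reference RETURN p"
    return query, {"reference": reference}


def run_query(settings: dict, query: str, params: dict) -> list[dict]:
    """Run a Cypher query and return each record as a plain dict."""
    # Imported here so the helpers above work without the driver installed.
    from neo4j import GraphDatabase

    auth = (settings["username"], settings["password"])
    with GraphDatabase.driver(settings["uri"], auth=auth) as driver:
        with driver.session() as session:
            return [record.data() for record in session.run(query, params)]
```

With a filled-in config, `run_query(load_neo4j_settings("config"), *paper_by_reference("VII-P-08704-VSP-01"))` would return the matching records as a list of dicts.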