Skip to content

Latest commit

 

History

History
executable file
·
121 lines (88 loc) · 5.08 KB

index.adoc

File metadata and controls

executable file
·
121 lines (88 loc) · 5.08 KB

Neo4j ETL

Overview

Introduction

The Neo4j ETL, especially the neo4j-etl command-line tool, can be used to import well modeled (i.e. normalized) relational data into Neo4j. It applies some simple rules for transforming the relational model.

The process as outlined below:

  1. Read database metadata and generate mapping.json

  2. Optionally edit mapping.json with the neo4j-etl-ui in Neo4j Desktop

  3. Export relational data to CSV

  4. Generate Mapping Headers

  5. Import into Neo4j using

    1. the neo4j-import tool for initial offline bulk load

    2. the neo4j-shell tool for incremental offline bulk load

    3. the cypher-shell tool for incremental online single-transaction load

    4. the java bolt driver for incremental online batch load

Architecture Diagram

neo4j etl architecture

What it is

  • Command-Line tools

  • Java API/library

  • Infer Schema and save in mapping file

  • Filter and merge strategies

  • Read mapping file to export data from other databases then

  • Import into Neo via different tools (neo4j-import, neo4j-shell, cypher-shell, java bolt driver)

  • Work in offline and online mode

  • Import in both an empty (initial load) and not-empty graph (incremental)

  • Build indexes and constraints

  • Support on Unix-like and Microsoft Operating Systems

  • Support for most popular relational databases like MySQL, PostgreSQL, Oracle and Microsoft SQL

  • Support user specified JDBC drivers

  • UI tool to visually modify mappings

Plans for the Future

  • Custom Mapping Rules + Transformations for names, data, links

  • Exemplary integration into a 3rd party ETL pipeline

  • More data types (binary, datetime, geo)

Who is it for

  • Developer learning to work with Neo4j for initial data import

  • Partners providing data integration with Neo4j

  • Enterprise developers building applications based on well modeled relational data

Open Questions

  • Date and binary datatypes

  • Security (secure connections, handling of passwords, encrypting data)

Capabilities

Inferring Schema with Mapping Rules (generate-metadata-mapping)

  • Generic relational database mapping based on the following rules

  • A table with a foreign key is treated as a Join and imported as a node with a relationship

  • Ex: Person -> Address is imported as (Person)-[:ADDRESS_ID]->(Address)

  • A table that has two foreign keys is imported as a JoinTable and imported as a relationship

  • Ex: Student <- Student_Course -> Course is imported as
    (Student) -[:STUDENT_COURSE]-> (Course)

  • A table that has more than two foreign keys is treated as an intermediate node and imported as node with multiple relationships

  • Ex: Order_Detail -> Shipping_Address, Order_Detail -> Payment_Information, Order_Detail -> Shipment_Instructions is imported as

(Shipping_Address) -[:SHIPPING]-> (Order_Detail)
(Payment_Information) -[:PAYMENT]-> (Order_Detail)
(Shipment_Instructions) -[:SHIPMENT]-> (Order_Detail)
  • Resolve relationships through composite keys.

  • Support most of the data types.

    • TinyInt can be imported as either Byte or as a Boolean (This is to support boolean values being saved in mysql as TinyInt)

    • Dates are imported as String

    • Blobs are skipped while importing until the import-tool supports binary array data.

    • Decimal to be confirmed.

  • Relationship names can either take column name or the table that is being referred to

    • --relationship-name=table then a Person->Address will become (Person)-[:ADDRESS]->(Address)

    • --relationship-name=column will become (Person)-[:ADDRESS_ID]->(Address)

  • Filter tables that you want to include or exclude using --include and --exclude

  • TODO: Filter columns that you want to include or exclude using --include and --exclude

  • TODO: Retaining natural keys(marked as PrimaryKeys and ForeignKeys) as needed using <TBA> flag

    • A Foreign Key is usually used to create a relationship between 2 nodes without being saved as a property.

    • With this flag, the node would keep that value as a property.

    • Ex: A loan has the SSN of the loan applicant which would normally be used to connect the Loan and Person nodes.

    • With this flag the Loan node will also keep the SSN as a property.

Edit Mapping via UI

A Neo4j-ETL graph application can be added Neo4j Desktop which allows visual editing of the mapping and interactive import.

The UI allows you to change and set you preferred label names, property names and types, relationship types, with a preview of the resulting graph.

Exporting Data (export)

  • Generate CSV files from relational source as outlined by mappings.json

    • TODO: Use a streaming api that is more performant

  • Import CSV providing the correct labels and rel-types and headers headers

    • TODO: Missing ability to pass options to neo4j-import tool as a file instead of command line arguments

    • TODO: Create indexes and constraints related to not-primary key columns