FaunaDB Importer is a command line utility to help you import static data into FaunaDB. It can import data into FaunaDB Cloud or an on-premises FaunaDB Enterprise cluster.
Supported input file formats:
- JSON
- CSV
- TSV
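For instance, a minimal, hypothetical `data/users.csv` matching the examples below could look like this (a TSV file would be identical, with tabs instead of commas):

```
1,alice,true
2,bob,false
```

Field names for text files come from the `--format` option described below; the columns only need to appear in the declared order.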
Requirements:
- Java 8
Download the latest version and extract the zip file. Inside the extracted folder, run:
```
./bin/faunadb-importer \
  import-file \
  --secret <keys-secret> \
  --class <class-name> \
  <file-to-import>
```
NOTE: The command-line arguments are the same on Windows, but you must use a different startup script. For example:

```
.\bin\faunadb-importer.bat import-file --secret <keys-secret> --class <class-name> <file-to-import>
```
For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  data/users.json
```
The importer will load all data into the specified class, preserving the field names and types as described in the import file.
You can also type `./bin/faunadb-importer --help` for more detailed information.
The importer is a stateful process separated into two phases: ID generation and data import.
First, the importer will parse all records and generate unique IDs by calling the `next_id` function for each record. Pre-generating IDs allows us to import schemas containing relational data while keeping foreign keys consistent. It also ensures that the process can safely be re-run without the risk of duplicating information.
In order to map legacy IDs to newly generated Fauna IDs, the importer will:

- Check if there is a field configured with the `ref` type. That field's value will be used as the lookup term for the new Fauna ID.
- If no field is configured with the `ref` type, assign a sequential number to each record as the lookup term for the new Fauna ID.
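Conceptually, the output of this phase is a lookup table from legacy lookup terms to pre-generated Fauna IDs. A purely illustrative sketch (the actual on-disk cache format is an internal detail of the importer):

```
# lookup term (legacy id, or sequence number)  ->  pre-generated Fauna ID
"1"  ->  181388642581742080
"2"  ->  181388642581742081
```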
Once this phase completes, the pre-generated IDs will be stored in the `cache` directory. In case of a re-run, the importer will load the IDs from disk and skip this phase.
Second, the importer will insert all records into FaunaDB, using the pre-generated IDs from the first step as their `ref` field.

During this phase, if the import fails due to data inconsistency, it is:

- SAFE to fix data inconsistencies in any field except fields configured with the `ref` type.
- NOT SAFE to change fields configured with the `ref` type, as they are used as the lookup term for the pre-generated ID from the first phase.
- NOT SAFE to remove entries from the import file if you don't have a field configured as a `ref` field; doing so would alter the sequential number assigned to each record.
As long as you keep the `cache` directory intact, it is safe to re-run the process until the import completes. If you want to use the importer again with a different input file, you must empty the `cache` directory first.
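For instance, on Unix-like systems, one way to clear the cache before importing a different file is:

```
rm -rf cache/*
```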
```
.
├── README.md                    # This file
├── bin
│   ├── faunadb-importer         # Unix startup script
│   └── faunadb-importer.bat     # Windows startup script
├── cache                        # Where the importer saves its cache
├── data                         # Where you should copy the files you wish to import
├── lib
│   └── faunadb-importer-1.0.jar # The importer library
└── logs                         # Logs for each execution
```
When importing JSON files, field names and types are optional; when importing text files, you must specify each field's name and type, in order, using the `--format` option:
```
./bin/faunadb-importer \
  import-file \
  --secret "<your-keys-secret-here>" \
  --class <your-class-name> \
  --format "<field-name>:<field-type>,..." \
  <file-to-import>
```
For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username:string, vip:bool" \
  data/users.csv
```
| Name | Description |
|---|---|
| `string` | A string value. |
| `long` | A numeric value. |
| `double` | A double-precision numeric value. |
| `bool` | A boolean value. |
| `ref` | A ref value. It can be used to mark the field as a primary key or to reference another class when importing multiple files. For example: `city:ref(cities)`. |
| `ts` | A numeric value representing the number of milliseconds elapsed since 1970-01-01 00:00:00. You can also specify your own format as a parameter. For example: `ts("dd/MM/yyyyTHH:mm:ss.000Z")`. |
| `date` | A date value formatted as `yyyy-MM-dd`. You can also specify your own format as a parameter. For example: `date("dd/MM/yyyy")`. |
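For instance, a hypothetical `data/events.csv` with custom timestamp and date formats could be imported like this (the class and file names are illustrative; note the single quotes protecting the inner double quotes from the shell):

```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class events \
  --format 'id:ref, happened_at:ts("dd/MM/yyyyTHH:mm:ss.000Z"), day:date("dd/MM/yyyy")' \
  data/events.csv
```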
You can rename fields from the input file as they are inserted into FaunaDB with the following syntax:

```
<field-name>-><new-field-name>:<field-type>
```
For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username->userName:string, vip->VIP:bool" \
  data/users.csv
```
When importing a JSON file where the root element of the file is an array, or when importing a text file where the first line is the file header, you can skip the root element with the `--skip-root` option. For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --skip-root true \
  data/users.csv
```
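For reference, these are the kinds of inputs `--skip-root` is meant for (both samples are hypothetical): a JSON file whose root element is an array, and a text file whose first line is a header:

```json
[
  { "id": 1, "username": "alice", "vip": true },
  { "id": 2, "username": "bob", "vip": false }
]
```

```
id,username,vip
1,alice,true
2,bob,false
```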
You can ignore fields with the `--ignore-fields` option. For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --format "id:ref, username->userName:string, vip->VIP:bool" \
  --ignore-fields "id" \
  data/users.csv
```
NOTE: In the above example, we omit the `id` field when importing the data into FaunaDB, but we still use the `id` field as the `ref` type so that the importer tool will properly map the newly-generated Fauna ID for that specific user.
You can maintain chronological order when importing data by using the `--ts-field` option. For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --ts-field "created_at" \
  data/users.csv
```
The value of the field configured in the `--ts-field` option will be used as the `ts` field for the imported instance.
By default, the importer will load your data into FaunaDB Cloud. If you wish to import the data to your own cluster, you can use the `--endpoints` option. For example:
```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --endpoints "http://10.0.0.120:8443, http://10.0.0.121:8443" \
  data/users.csv
```
NOTE: The importer will load balance requests across all configured endpoints.
In order to import multiple files, you must run the importer with a schema definition file. For example:
```
./bin/faunadb-importer \
  import-schema \
  --secret "abc" \
  data/my-schema.yaml
```
```yaml
<file-address>:
  class: <class-name>
  skipRoot: <boolean>
  tsField: <field-name>
  fields:
    - name: <field-name>
      type: <field-type>
      rename: <new-field-name>
  ignoredFields:
    - <field-name>
```
For example:
```yaml
data/users.json:
  class: users
  fields:
    - name: id
      type: ref
    - name: name
      type: string
  ignoredFields:
    - id

data/tweets.csv:
  class: tweets
  tsField: created_at
  fields:
    - name: id
      type: ref
    - name: user_id
      type: ref(users)
      rename: user_ref
    - name: text
      type: string
      rename: tweet
  ignoredFields:
    - id
    - created_at
```
The importer's default settings should provide good performance in most cases. Still, a few settings are worth mentioning:
You can set the maximum amount of memory available to the import tool with `-J-Xmx`. For example:
```
./bin/faunadb-importer \
  -J-Xmx10G \
  import-schema \
  --secret "abc" \
  data/my-schema.yaml
```
NOTE: Parameters prefixed with `-J` must be placed as the first parameters for the import tool.
The size of each individual batch is controlled by the `--batch-size` parameter.
In general, individual requests will have higher latency with a larger batch size; however, the overall throughput of the import process may increase by inserting more records in a single request. Large batches can also exceed the maximum size of an HTTP request, forcing the import tool to split the batch into smaller requests and thereby degrading overall performance.

Default: 50 records per batch.
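For example, to use larger batches (the value here is illustrative):

```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --batch-size 100 \
  data/users.csv
```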
Concurrency is configured using the `--concurrent-streams` parameter.
A large number of concurrent streams can cause timeouts. When timeouts happen, the import tool will retry failing requests, applying exponential backoff to each request.

Default: the number of available processors * 2.
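For example, to reduce concurrency if you are seeing timeouts (the value here is illustrative):

```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --concurrent-streams 4 \
  data/users.csv
```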
Exponential backoff is controlled by a combination of the following parameters:
- `network-errors-backoff-time`: The number of seconds to delay new requests when the network is unstable. Default: 1 second.
- `network-errors-backoff-factor`: The number to multiply `network-errors-backoff-time` by per network issue detected; not to exceed `max-network-errors-backoff-time`. Default: 2.
- `max-network-errors-backoff-time`: The maximum number of seconds to delay new requests when applying exponential backoff. Default: 60 seconds.
- `max-network-errors`: The maximum number of network errors tolerated within the configured timeframe. Default: 50 errors.
- `reset-network-errors-period`: The number of seconds the import tool will wait for a new network error before resetting the error count. Default: 120 seconds.
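Assuming these parameters are passed as `--`-prefixed options like the other flags shown above (the values here are illustrative), a more patient backoff configuration might look like:

```
./bin/faunadb-importer \
  import-file \
  --secret "abc" \
  --class users \
  --network-errors-backoff-time 2 \
  --max-network-errors-backoff-time 120 \
  data/users.csv
```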
All projects in this repository are licensed under the Mozilla Public License.