Ingestion and Access Patterns
In developing tagbase-server we have uncovered some small but important gotchas, particularly around data ingestion. They are documented below and are valuable reading for data producers who wish to ingest data into tagbase-server.
Before we discuss best practices related to data ingestion, let's first have a look at tooling.
- rsync: a fast and extraordinarily versatile file copying tool. Further to ISSUE-189, we can now use rsync to copy data to a staging location and automate ingestion into TagbaseDB via tagbase-server's REST API.
- curl: an extremely popular and pervasive command line tool and library for transferring data with URLs. Available on most Linux operating systems by default.
- tagbase-server UI: available as part of the tagbase-server Docker composition. The OpenAPI specification is self-documenting and the UI also provides examples of how to interact with the REST API via curl and via a browser URL bar.
A huge benefit of leveraging the OpenAPI specification is the larger ecosystem of tooling. In particular, the openapi-generator project facilitates the generation of a wide variety of clients in many different languages. If you would like to see a new client, say in Python, Java, Rust or some other supported programming language, simply open a ticket and we can generate one for you and publish it in your packaging ecosystem.
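As an illustration, a Python client could be generated locally with the openapi-generator CLI; the spec path and output directory below are placeholders rather than the project's actual layout:
# Hypothetical invocation: generate a Python client from the tagbase-server OpenAPI spec
# (adjust the -i spec path and -o output directory to your checkout).
openapi-generator-cli generate \
  -i openapi.yaml \
  -g python \
  -o ./tagbase-client-python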
The following emerging best practices can be used to drive throughput in the ingestion process.
Although tagbase-server is capable of ingesting plain text (.txt) UTF-8 encoded eTUFF data via both POST and GET requests, we suggest first grouping and compressing multiple eTUFF files into a single archive, for example a .zip. tagbase-server will decompress and unpack the binary container and then ingest the files in parallel.
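For example, a batch of eTUFF files could be archived and posted in a single request. The host, credentials, query parameters and filenames below are placeholders modelled on the curl example further down, so adjust them to your deployment:
# Group multiple eTUFF files into one archive (filenames are illustrative).
zip etuff_batch.zip 159903_2012_117464_eTUFF.txt another_tag_eTUFF.txt
# Post the archive to the ingest endpoint; tagbase-server unpacks it and
# ingests the contained files in parallel.
curl -X 'POST' \
  'https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/ingest?type=etuff&filename=etuff_batch.zip' \
  -H 'accept: application/json' \
  -u ...:... --insecure -T etuff_batch.zip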
tagbase-server uses the powerful patool library to unpack a wide variety of files. See supported formats for more information.
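If you want to sanity-check an archive locally before uploading it, patool itself ships a command line tool; this is an optional, illustrative step and the archive name is a placeholder:
# List the contents of the archive to be uploaded.
patool list etuff_batch.zip
# Test that the archive unpacks cleanly.
patool test etuff_batch.zip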
As mentioned above, tagbase-server is capable of ingesting multiple files in parallel. It does this by using the powerful and lightweight parmap library, which utilizes as many processor cores as possible to perform parallel ingestion.
ISSUE-189 offers the ability to use rsync to copy data to a staging location. The data is then automatically ingested into TagbaseDB via tagbase-server's REST API, for example:
rsync -e "ssh -i ~/.ssh/etags_tagbase.txt" -a ./staging_data/* [email protected]:/home/tagbase/tagbase-server/staging_data/
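Individual eTUFF files can also be posted directly to the ingest endpoint with curl, for example: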
curl -X 'POST' \
'https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/ingest?notes=New%20notes&type=etuff&version=1&filename=159903_2012_117464_eTUFF.txt' \
-H 'accept: application/json' \
-H 'Content-Type: text/plain' \
-u ...:... --insecure -T 159903_2012_117464_eTUFF.txt
N.B. Ensure that you have the correct IP/DNS and username/password.