
Ingestion and Access Patterns


Introduction

In developing tagbase-server we have uncovered some small but important gotchas, in particular around data ingestion. They are documented below and should make valuable reading for data producers who wish to ingest data into tagbase-server.

Tooling

Before we discuss best practices related to data ingestion, let's first have a look at tooling.

  • rsync: a fast and extraordinarily versatile file copying tool. Further to ISSUE-189, we can now use rsync to copy data to a staging location and automate ingestion into TagbaseDB via tagbase-server's REST API.
  • curl: an extremely popular and pervasive command-line tool and library for transferring data with URLs. Available by default on most Linux operating systems.
  • tagbase-server UI: available as part of the tagbase-server docker composition. The OpenAPI specification is self-documenting and the UI also provides examples of how to interact with the REST API via curl and via a browser URL bar.

Future work on tooling

A huge benefit of leveraging the OpenAPI specification is the larger ecosystem of tooling. In particular, the openapi-generator project facilitates the generation of a wide variety of clients in many different languages. If you would like to see a new client, say for Python, Java, Rust, or some other supported programming language, simply open a ticket and we can generate one for you and publish it in your packaging ecosystem.
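For example, a Python client can be generated locally with the openapi-generator CLI. The sketch below assumes the OpenAPI document is exposed at an openapi.json path under the API root; substitute whatever path and host your deployment actually serves the specification from.

# Generate a Python client from the tagbase-server OpenAPI specification
# (the spec URL below is an assumption; adjust it for your deployment)
openapi-generator-cli generate \
  -i https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/openapi.json \
  -g python \
  -o ./tagbase-client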

Ingestion Recommendations

The following emerging best practices can be used to improve throughput in the ingestion process.

Use compressed binary containers when submitting etuff data to tagbase-server

Although tagbase-server is capable of ingesting plain-text (.txt), UTF-8 encoded etuff data via both POST and GET requests, we suggest first grouping and compressing multiple etuff files into a single archive, for example a .zip file. tagbase-server will decompress and unpack the binary container and then ingest the files in parallel.
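As a rough sketch, assuming the same ingest endpoint and query parameters used in the example POST further down this page (the host, credentials, and filenames are placeholders), a batch submission might look like the following. The Content-Type shown for the archive is an assumption; adjust it to whatever your deployment expects.

# Bundle several eTUFF files into one archive
zip etuff_batch.zip ./*_eTUFF.txt

# Submit the archive in a single request
curl -X 'POST' \
  'https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/ingest?notes=Batch%20upload&type=etuff&version=1&filename=etuff_batch.zip' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/octet-stream' \
  -u ...:... --insecure -T etuff_batch.zip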

tagbase-server uses the powerful patool library to unpack a wide variety of archive formats. See supported formats for more information.
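If you would like to check locally that an archive you have produced can be handled before uploading it, the patool command line (installed alongside the library) can test and list it. A minimal sketch, using the placeholder archive name from above:

# Confirm the archive is readable and inspect its contents before uploading
patool test etuff_batch.zip
patool list etuff_batch.zip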

Submit multiple files at once

As mentioned above, tagbase-server is capable of ingesting multiple files in parallel. It does this by using the powerful and lightweight parmap library, which utilizes as many processor cores as possible to perform parallel ingestion.
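If your files are not bundled into an archive, one way to submit several of them in quick succession from the client side is to parallelize the uploads, for example with xargs. This is only a sketch modelled on the example POST further down this page; the host, credentials, query parameters, and degree of parallelism are placeholders to adapt to your deployment.

# Upload every eTUFF file in the current directory, four uploads at a time
ls *_eTUFF.txt | xargs -P 4 -I {} \
  curl -X 'POST' \
    "https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/ingest?type=etuff&version=1&filename={}" \
    -H 'accept: application/json' \
    -H 'Content-Type: text/plain' \
    -u ...:... --insecure -T {}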

Use rsync

ISSUE-189 offers the ability to use rsync to copy data to a staging location. The data is then automatically ingested into TagbaseDB via tagbase-server's REST API.

rsync -e "ssh -i ~/.ssh/etags_tagbase.txt" -a ./staging_data/* [email protected]:/home/tagbase/tagbase-server/staging_data/

Example POST

curl -X 'POST' \
  'https://XXX.XXX.XXX.XXX/tagbase/api/v0.7.0/ingest?notes=New%20notes&type=etuff&version=1&filename=159903_2012_117464_eTUFF.txt' \
  -H 'accept: application/json' \
  -H 'Content-Type: text/plain' \
  -u ...:... --insecure -T 159903_2012_117464_eTUFF.txt

N.B. Ensure that you have the correct IP/DNS and username/password.