1.1 Generate L0 data in Pachyderm

covesturtevant edited this page Nov 4, 2022 · 12 revisions

Generating L0 data in Pachyderm

All data is file-based in Pachyderm, and data files are in Parquet format, which is a self-describing format with an embedded schema.

NEON's Engineering team is working on streaming L0 data from sites directly into files, but for now L0 data already in the NEON Trino database must be converted into files. A file schema for each source type specifies the field names and column order of the data streams that the sensor outputs and that are stored in the data file. L0 file schemas are created by Engineering and stored in the Engineering/avro-schemas Git repo.

Follow these steps to create L0 test data in Pachyderm:

  1. Create the mapping between existing L0 data product IDs and the new L0 schema for the source type. The mappings are located here, and an example mapping for the hmp155 source type (relative humidity sensor) is shown below.
---
# Stream mappings for the hmp155
temperature:
- DP0.00098.001.01309
- DP0.20271.001.01309
relative_humidity:
- DP0.00098.001.01357
- DP0.20271.001.01357
dew_point:
- DP0.00098.001.01358
- DP0.20271.001.01358
error_state:
- DP0.00098.001.01359
- DP0.20271.001.01359

The stream/term/field names (e.g. temperature) in the mapping come directly from the L0 schema for the source type, defined here in the .avsc file for each sensor. Listed under each field name are the existing L0 data product + term IDs that map to that field in the schema. You can look up existing data product and term IDs in the NEON Terms Database under the Look-Ups section. Note that multiple L0 IDs may map to the same field. For example, data products DP0.00098 (relative humidity) and DP0.20271 (relative humidity on buoy) both use the hmp155 sensor, so terms for each of these products will map to the same field in the hmp155 schema.
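
As a rough sanity check on a finished mapping, the sketch below lists every field/ID pair in a mapping file and reports any ID that appears under more than one field. It assumes the flat layout shown above (a field line ending in a colon, followed by "- ID" lines). The function names are hypothetical, and the expectation that an ID belongs under only one field is an assumption, not a rule stated by the schema repo.

```shell
# Print "field<TAB>id" for every L0 ID in a flat stream-mapping YAML file.
# Assumes "field:" lines followed by "- DP0..." lines, as in the hmp155 example.
list_mapping_pairs() {
  awk '
    /^[A-Za-z_][A-Za-z0-9_]*:[ \t]*$/ { field = $1; sub(/:$/, "", field); next }
    /^[ \t]*-[ \t]+DP0\./             { print field "\t" $2 }
  ' "$1"
}

# Print any ID listed under more than one field (prints nothing if there are none).
find_duplicate_ids() {
  list_mapping_pairs "$1" | cut -f2 | sort | uniq -d
}
```

Calling list_mapping_pairs on the hmp155 mapping above would print eight lines, one per ID, each prefixed by its field name.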

  2. Once the mapping is complete, commit your changes and create a new tag in the Engineering/avro-schemas Git repo. If you are unsure of how to create a tag, here is a great place to start. Once the tag is pushed, a Docker image with the same tag is generated automatically at quay.io/battelleecology/genavro:TAG, where TAG is the tag you created.
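
Tagging and pushing usually comes down to two Git commands. The sketch below wraps them in a hypothetical helper, tag_mapping_release, that prints the commands instead of running them when DRY_RUN is set. It assumes the remote is named origin and that an annotated tag is acceptable; the tag message is only a placeholder.

```shell
# Usage: tag_mapping_release TAG
# Set DRY_RUN=1 to print the Git commands instead of running them.
tag_mapping_release() {
  tag=$1
  # When DRY_RUN is set, prefix each command with echo so it is printed, not run.
  run=${DRY_RUN:+echo}
  $run git tag -a "$tag" -m "Add stream mapping" &&
    $run git push origin "$tag"   # assumes the remote is called "origin"
}
```

Run it from inside your clone of the repo after committing the mapping.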

  3. Create a repository in Pachyderm called SOURCE_TYPE_import_trigger, where SOURCE_TYPE is the name of your source type (e.g. hmp155_import_trigger). This repository controls which data are converted from the NEON database into Pachyderm. Populate the repository with a YEAR/MONTH/DAY folder structure covering each day you want data from (see example below). In the folder for each day, create a zero-byte file for each site you want data from; the full file name is the site code, with no extension. Sensor data from those sites and days will be converted into Parquet format in Pachyderm in step 4 below. In Pachyderm, your SOURCE_TYPE_import_trigger repo should look similar to this:

$ pachctl glob file SOURCE_TYPE_import_trigger@master:/**
NAME             TYPE SIZE
/2019/01/01/ARIK file 0B
/2019/01/01/BARC file 0B
/2019/01/01/CPER file 0B
/2019/01/02/ARIK file 0B
/2019/01/02/BARC file 0B
/2019/01/02/CPER file 0B
/2019/01/03/ARIK file 0B
/2019/01/03/BARC file 0B
/2019/01/03/CPER file 0B
/2019/01/04/ARIK file 0B
/2019/01/04/BARC file 0B
/2019/01/04/CPER file 0B
/2019/01/05/ARIK file 0B
/2019/01/05/BARC file 0B
/2019/01/05/CPER file 0B

An easy way to complete this step is to create the SOURCE_TYPE_import_trigger folder structure in your local work environment, then put the whole thing in Pachyderm using the following sequence of commands (replacing SOURCE_TYPE with your source type, and entering the correct local path):

pachctl create repo SOURCE_TYPE_import_trigger
pachctl put file -r SOURCE_TYPE_import_trigger@master:/ -f /PATH/TO/LOCAL/FOLDER/SOURCE_TYPE_import_trigger
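
The local folder structure itself can be generated with a short loop rather than by hand. The sketch below is one way to do it; make_trigger_tree is a hypothetical helper, and the days and site codes are passed as space-separated lists.

```shell
# Create ROOT/YYYY/MM/DD/SITE as zero-byte files.
# Usage: make_trigger_tree ROOT "DAY1 DAY2 ..." "SITE1 SITE2 ..."
# Days are given as YYYY-MM-DD; sites are site codes such as ARIK.
make_trigger_tree() {
  root=$1
  for day in $2; do
    # Turn 2019-01-01 into 2019/01/01
    day_path=$(printf '%s' "$day" | sed 's|-|/|g')
    mkdir -p "$root/$day_path"
    for site in $3; do
      # Zero-byte file named for the site code, no extension
      : > "$root/$day_path/$site"
    done
  done
}
```

For example, make_trigger_tree ./hmp155_import_trigger "2019-01-01 2019-01-02" "ARIK BARC CPER" creates six empty files under ./hmp155_import_trigger, ready for the pachctl put file command above.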

  4. Create and deploy the SOURCE_TYPE_data_source_Trino Pachyderm pipeline spec, which uses the Docker image containing the mapping you created in step 1 and the SOURCE_TYPE_import_trigger repo you created in step 3 to convert data from the NEON database into Pachyderm. Find a pipeline spec for another source type in this Git repo to use as a template, and replace the old source type with the new one throughout the spec. Also update the tag in quay.io/battelleecology/genavro:TAG to the tag you just created. Make sure that the input portion of the pipeline spec looks like this (replacing SOURCE_TYPE with your source type):
  "input": {
    "pfs": {
      "name": "import_trigger",
      "repo": "SOURCE_TYPE_import_trigger",
      "glob": "/*/*/*"
    }
  },

Note that the repo field in the example above is the actual name of the input repo. The name field sets the name under which Pachyderm presents that input inside the container where your code runs. This lets your code always expect the same name, no matter what the actual input repository is called.
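
To make the role of the name field concrete: Pachyderm mounts each input under /pfs/NAME inside the container, so with the spec above the trigger files appear under /pfs/import_trigger whichever repo they came from. The sketch below is not the actual import code. It is a hypothetical helper, list_trigger_requests, showing how code in the container could read the day and site from each trigger file it is given.

```shell
# Print "YYYY-MM-DD SITE" for each trigger file under the mounted input.
# The directory defaults to /pfs/import_trigger, i.e. /pfs/<name> from the spec.
list_trigger_requests() {
  in_dir=${1:-/pfs/import_trigger}
  find "$in_dir" -type f | sort | while IFS= read -r f; do
    rel=${f#"$in_dir"/}                                # e.g. 2019/01/01/ARIK
    site=${rel##*/}                                    # ARIK
    day=$(printf '%s' "${rel%/*}" | sed 's|/|-|g')     # 2019-01-01
    printf '%s %s\n' "$day" "$site"
  done
}
```

Because the function only depends on the mounted name, the same code would work unchanged for any SOURCE_TYPE_import_trigger repo.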

  5. After you've stood up the SOURCE_TYPE_data_source_Trino pipeline (recall the command: pachctl create pipeline -f /PATH/TO/PIPELINE.json), check that the output looks something like this (but for your source type). For future reference, the number after the date in the folder structure and before the date in the file name is the source ID for the sensor.
$ pachctl glob file prt_data_source_trino@master:/**
NAME                                               TYPE SIZE
/prt                                               dir  29.78MiB
/prt/2019                                          dir  29.78MiB
/prt/2019/01                                       dir  29.78MiB
/prt/2019/01/01                                    dir  5.787MiB
/prt/2019/01/01/13417                              dir  392.8KiB
/prt/2019/01/01/13417/prt_13417_2019-01-01.parquet file 392.8KiB
/prt/2019/01/01/14491                              dir  357.8KiB
/prt/2019/01/01/14491/prt_14491_2019-01-01.parquet file 357.8KiB
/prt/2019/01/01/16299                              dir  363.2KiB
/prt/2019/01/01/16299/prt_16299_2019-01-01.parquet file 363.2KiB
/prt/2019/01/01/17478                              dir  42.68KiB
/prt/2019/01/01/17478/prt_17478_2019-01-01.parquet file 42.68KiB
/prt/2019/01/01/17596                              dir  44.41KiB
/prt/2019/01/01/17596/prt_17596_2019-01-01.parquet file 44.41KiB
...
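
The path layout in the listing above (source type, year, month, day, source ID, file) can be pulled apart with plain shell splitting. The sketch below uses a hypothetical helper, parse_l0_path, and assumes paths follow exactly the pattern shown and contain no wildcard characters.

```shell
# Split an output path such as
#   /prt/2019/01/01/13417/prt_13417_2019-01-01.parquet
# into source type, date and source ID.
parse_l0_path() {
  p=${1#/}            # drop the leading slash
  old_ifs=$IFS
  IFS=/
  set -- $p           # split on "/" into $1..$6
  IFS=$old_ifs
  printf 'source_type=%s date=%s-%s-%s source_id=%s\n' "$1" "$2" "$3" "$4" "$5"
}
```

For the first file in the listing, parse_l0_path prints source_type=prt date=2019-01-01 source_id=13417.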

  6. BONUS: Check out the output! Copy or download one of these data files into your local environment and open it in RStudio using the NEONprocIS.base::def.read.parq function from this Git repository. See Wiki section 9.1 Useful Pachyderm Commands for the Pachyderm commands, or 9.2 Useful R functions for handy R functions, to transfer files from Pachyderm to your local environment.
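
As a starting point for the download, the sketch below builds a pachctl get file command for a single data file. fetch_l0_file is a hypothetical helper; it assumes the master branch and the -o output option of pachctl get file, so check section 9.1 for the commands that match your pachctl version. Set DRY_RUN=1 to print the command rather than run it.

```shell
# Usage: fetch_l0_file REPO PFS_PATH
# Downloads one file from REPO@master into the current directory.
fetch_l0_file() {
  repo=$1        # e.g. prt_data_source_trino
  pfs_path=$2    # e.g. /prt/2019/01/01/13417/prt_13417_2019-01-01.parquet
  # When DRY_RUN is set, print the command instead of running it.
  run=${DRY_RUN:+echo}
  $run pachctl get file "$repo@master:$pfs_path" -o "./${pfs_path##*/}"
}
```

The downloaded Parquet file can then be opened in RStudio as described in the step above.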