
1.1 Generate L0 data in Pachyderm

Kevin edited this page Mar 11, 2021 · 12 revisions

Generating L0 data in Pachyderm

All data is file-based in Pachyderm, and data files are in Parquet format. NEON's Engineering team is working on streaming L0 data from sites directly into files, but for now we need to convert L0 data already in the NEON database to the new file format. A file schema for each source type (a.k.a. sensor type) specifies the field names and column order of the data streams output by the sensor and stored in the data file. L0 file schemas are created by Engineering, and stored in the Engineering/avro-schemas Git repo. Follow these steps to create L0 test data in Pachyderm:

  1. Create the mapping between existing L0 data product IDs and the new L0 schema for the source type. The mappings are located here, and an example mapping for the hmp155 source type (relative humidity sensor) is shown below. The stream/term/field names (e.g. temperature) in the mapping come directly from the L0 schema for the source type, found here in the .avsc file for each sensor. Listed under each field name are the existing L0 data product ID + term ID combinations that map to that field in the schema. You can look up existing data product and term IDs in the NEON Terms Database under the Look-Ups section. Note that multiple L0 IDs may map to the same schema field. For example, data products DP0.00098 (relative humidity) and DP0.20271 (relative humidity on buoy) both use the hmp155 sensor, so the corresponding terms for each of these products map to the same field in the hmp155 schema.
---
# Stream mappings for the hmp155
temperature:
- DP0.00098.001.01309
- DP0.20271.001.01309
relative_humidity:
- DP0.00098.001.01357
- DP0.20271.001.01357
dew_point:
- DP0.00098.001.01358
- DP0.20271.001.01358
error_state:
- DP0.00098.001.01359
- DP0.20271.001.01359
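
A mapping like the one above can be sanity-checked before committing it. The following is a minimal Python sketch, not part of the NEON tooling: it copies the hmp155 example into a dictionary and inverts it so each L0 ID points to its schema field. The names `HMP155_MAPPING` and `invert_mapping` are hypothetical, and the rule that an L0 ID may appear under only one field is an assumption made for illustration.

```python
# Stream mappings for the hmp155, copied from the example above.
HMP155_MAPPING = {
    "temperature": ["DP0.00098.001.01309", "DP0.20271.001.01309"],
    "relative_humidity": ["DP0.00098.001.01357", "DP0.20271.001.01357"],
    "dew_point": ["DP0.00098.001.01358", "DP0.20271.001.01358"],
    "error_state": ["DP0.00098.001.01359", "DP0.20271.001.01359"],
}


def invert_mapping(mapping):
    """Return {L0 ID: schema field}.

    Assumption (not stated in the wiki): an L0 ID should be listed under
    exactly one schema field, so a repeated ID raises ValueError.
    """
    field_by_id = {}
    for field, l0_ids in mapping.items():
        for l0_id in l0_ids:
            if l0_id in field_by_id:
                raise ValueError(
                    f"{l0_id} is listed under both {field_by_id[l0_id]} and {field}"
                )
            field_by_id[l0_id] = field
    return field_by_id


FIELD_BY_L0_ID = invert_mapping(HMP155_MAPPING)
```

Looking up `FIELD_BY_L0_ID["DP0.20271.001.01357"]` returns `"relative_humidity"`, confirming that the buoy product maps to the same field as DP0.00098.
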
  2. Once the mapping is complete, commit your changes and create a new tag for the Engineering/avro-schemas Git repo. If you are unsure how to create a tag, here is a great place to start. Once pushed, the tag will automatically generate a Docker image with the same tag at quay.io/battelleecology/genavro:TAG, where TAG is the tag you created.

  3. Create a repository in Pachyderm called import_trigger_SOURCE_TYPE, where SOURCE_TYPE is the name of your source type (e.g. import_trigger_hmp155). This repository controls which data is converted from the NEON database into Pachyderm. Populate the repository with a year/month/day folder structure covering each of the days you want data from (see example below). In the folder for each day, create a zero-byte file for each site code you want data from (the full file name is the site code, with no extension). Sensor data from those sites and days will be converted into Parquet format in Pachyderm in step 4 below. In Pachyderm, your import_trigger_SOURCE_TYPE repo should look similar to this:

$ pachctl list file import_trigger_SOURCE_TYPE@master:/**
NAME             TYPE SIZE
/2019/01/01/ARIK file 0B
/2019/01/01/BARC file 0B
/2019/01/01/CPER file 0B
/2019/01/02/ARIK file 0B
/2019/01/02/BARC file 0B
/2019/01/02/CPER file 0B
/2019/01/03/ARIK file 0B
/2019/01/03/BARC file 0B
/2019/01/03/CPER file 0B
/2019/01/04/ARIK file 0B
/2019/01/04/BARC file 0B
/2019/01/04/CPER file 0B
/2019/01/05/ARIK file 0B
/2019/01/05/BARC file 0B
/2019/01/05/CPER file 0B

An easy way to complete this step is to create the import_trigger_SOURCE_TYPE folder structure in your local work environment, then put the whole thing in Pachyderm using the following sequence of commands (replacing SOURCE_TYPE with your source type, and entering the correct local path):

pachctl create repo import_trigger_SOURCE_TYPE
pachctl put file -r import_trigger_SOURCE_TYPE@master:/ -f /PATH/TO/LOCAL/FOLDER/import_trigger_SOURCE_TYPE

Note: It's also OK to skip this step and use the import_trigger repo from another source type, but be aware that you will get data for the same sites and time period specified in that repo.
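
The local folder structure described above can be generated with a short script rather than by hand. This is a minimal Python sketch under stated assumptions: the function name `build_import_trigger` and its parameters are hypothetical, and it only creates local files; uploading them still uses the pachctl commands shown above.

```python
from datetime import date, timedelta
from pathlib import Path


def build_import_trigger(root, start, end, sites):
    """Create root/YYYY/MM/DD/SITE as a zero-byte file for every site
    and every day from start to end (inclusive). Returns the file paths."""
    root = Path(root)
    created = []
    day = start
    while day <= end:
        day_dir = root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
        day_dir.mkdir(parents=True, exist_ok=True)
        for site in sites:
            trigger = day_dir / site  # file name is the site code, no extension
            trigger.write_bytes(b"")  # (re)write as an empty, zero-byte file
            created.append(trigger)
        day += timedelta(days=1)
    return created
```

For example, `build_import_trigger("import_trigger_hmp155", date(2019, 1, 1), date(2019, 1, 5), ["ARIK", "BARC", "CPER"])` creates the same 15 files as the listing above, under a local import_trigger_hmp155 folder.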

  4. Create and deploy the pair of Pachyderm pipeline specs that will use the Docker image containing the mapping you created in step 1 and the import_trigger_SOURCE_TYPE repo you created in step 3 to convert data from the NEON database into Pachyderm. These pipeline specs are housed in this Git repo in the /pipe/presto_data_source folder. Use the data_source_prt... files as templates, using 'Save As...' to save copies for your source type. The first is data_source_SOURCE_TYPE_site. Find and replace all instances of the previous source type (e.g. prt) with the source type you're working with. Also update the tag in quay.io/battelleecology/genavro:TAG to the one you just created. Make sure that the input portion of the pipeline spec looks like this (replacing SOURCE_TYPE with your source type):
  "input": {
    "pfs": {
      "name": "import_trigger",
      "repo": "import_trigger_SOURCE_TYPE",
      "glob": "/*/*/*"
    }
  },

The second spec is data_source_SOURCE_TYPE_linkmerge. All you should need to do is find and replace all the instances of the previous source type (e.g. prt) with the source type you're working with.
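
Because a find-and-replace can miss an occurrence, it may help to check the input block of the _site spec programmatically before deploying. The sketch below is illustrative only: `expected_input` and `check_site_spec_input` are hypothetical helpers, and the check covers just the three pfs keys shown above, not the rest of the pipeline spec.

```python
import json


def expected_input(source_type):
    """The input block from the wiki, with SOURCE_TYPE filled in."""
    return {
        "pfs": {
            "name": "import_trigger",
            "repo": f"import_trigger_{source_type}",
            "glob": "/*/*/*",
        }
    }


def check_site_spec_input(spec_text, source_type):
    """Return a list of human-readable problems found in the spec's
    "input" block (an empty list means it matches the expected block)."""
    spec = json.loads(spec_text)
    pfs = spec.get("input", {}).get("pfs", {})
    problems = []
    for key, want in expected_input(source_type)["pfs"].items():
        got = pfs.get(key)
        if got != want:
            problems.append(f"input.pfs.{key}: expected {want!r}, found {got!r}")
    return problems


# Example: a spec where the repo name was not updated from the prt template.
stale_spec = json.dumps(
    {"input": {"pfs": {"name": "import_trigger",
                       "repo": "import_trigger_prt",
                       "glob": "/*/*/*"}}}
)
for problem in check_site_spec_input(stale_spec, "hmp155"):
    print(problem)
```

Here the check reports the leftover `import_trigger_prt` repo name, which is the kind of slip a manual find-and-replace can leave behind.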

  5. After you've stood up these pipelines in Pachyderm, check that the output looks something like this (but for your source type). For future reference, the number after the date in the folder structure and before the date in the file name is the SOURCE_ID for the sensor.
$ pachctl list file data_source_prt_linkmerge@master:/**
NAME                                               TYPE SIZE
/prt                                               dir  29.78MiB
/prt/2019                                          dir  29.78MiB
/prt/2019/01                                       dir  29.78MiB
/prt/2019/01/01                                    dir  5.787MiB
/prt/2019/01/01/13417                              dir  392.8KiB
/prt/2019/01/01/13417/prt_13417_2019-01-01.parquet file 392.8KiB
/prt/2019/01/01/14491                              dir  357.8KiB
/prt/2019/01/01/14491/prt_14491_2019-01-01.parquet file 357.8KiB
/prt/2019/01/01/16299                              dir  363.2KiB
/prt/2019/01/01/16299/prt_16299_2019-01-01.parquet file 363.2KiB
/prt/2019/01/01/17478                              dir  42.68KiB
/prt/2019/01/01/17478/prt_17478_2019-01-01.parquet file 42.68KiB
/prt/2019/01/01/17596                              dir  44.41KiB
/prt/2019/01/01/17596/prt_17596_2019-01-01.parquet file 44.41KiB
...
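
The path layout in the listing above can be decoded in a few lines. This is a minimal Python sketch, assuming every data file follows the SOURCE_TYPE/YYYY/MM/DD/SOURCE_ID/SOURCE_TYPE_SOURCE_ID_YYYY-MM-DD.parquet pattern seen in the example; `PATH_RE` and `parse_data_path` are hypothetical names, not part of NEON's code.

```python
import re
from datetime import date

# Pattern inferred from the example listing; an assumption, not a NEON spec.
PATH_RE = re.compile(
    r"^/(?P<source_type>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"
    r"/(?P<source_id>[^/]+)/(?P<file>[^/]+\.parquet)$"
)


def parse_data_path(path):
    """Split a data file path into source type, date, and SOURCE_ID.

    Returns None for directories or paths that do not match the pattern.
    """
    m = PATH_RE.match(path)
    if m is None:
        return None
    day = date(int(m["year"]), int(m["month"]), int(m["day"]))
    expected_file = f"{m['source_type']}_{m['source_id']}_{day.isoformat()}.parquet"
    return {
        "source_type": m["source_type"],
        "date": day,
        "source_id": m["source_id"],
        # True when the file name agrees with the folders it sits in.
        "file_name_consistent": m["file"] == expected_file,
    }


info = parse_data_path("/prt/2019/01/01/13417/prt_13417_2019-01-01.parquet")
```

For the first file in the listing, `info["source_id"]` is `"13417"` and `info["date"]` is `date(2019, 1, 1)`; directory entries such as `/prt/2019/01` return `None`.
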
  6. BONUS: Check out the output! Copy/download one of these data files into your local environment and open it in RStudio using the NEONprocIS.base::def.read.parq function from this Git repository. See Wiki section 9.2 Useful R functions for handy functions to transfer files from Pachyderm to your local environment.