1.1 Generate L0 data in Pachyderm

Cove Sturtevant edited this page Feb 9, 2024 · 12 revisions

Generating L0 data in Pachyderm

All data in Pachyderm are file-based, and data files are in Parquet format, a self-describing format with an embedded schema.

Recent upgrades to the Location Controller at NEON sites change the way NEON data stream into HQ. The new streaming method uses Kafka. Within a particular latency period (probably 30 days), data are ingested into Pachyderm from Kafka, while older data are ingested from the Trino database. Thus, in normal operation there will be two data ingest pipelines that gather data from Kafka and/or Trino as needed. Both methods rely on a file schema for each source type that specifies the field names and column order of the data streams output by the sensor and stored in the data file. L0 file schemas are created by Engineering and stored in the BattelleEcology/neon-avro-schemas Git repo.

Follow these steps to create L0 test data in Pachyderm:

  1. Develop the mapping between the old L0 data product IDs and the new L0 schema for the source type. This is required for the Trino loader. The mappings are located here, and an example mapping for the hmp155 source type (relative humidity sensor) is shown below.
---
# Stream mappings for the hmp155
temperature:
- DP0.00098.001.01309
- DP0.20271.001.01309
relative_humidity:
- DP0.00098.001.01357
- DP0.20271.001.01357
dew_point:
- DP0.00098.001.01358
- DP0.20271.001.01358
error_state:
- DP0.00098.001.01359
- DP0.20271.001.01359

The stream/term/field names (e.g. temperature) in the mapping come directly from the L0 schema for the source type (here) in the .avsc file for each sensor. Underneath each field name are the existing L0 data product and term IDs that map to that field in the schema. You can look up existing data product and term IDs in the NEON Terms Database under the Look-Ups section. Note that multiple L0 IDs may map to the same schema field. For example, data products DP0.00098 (relative humidity) and DP0.20271 (relative humidity on buoy) both use the hmp155 sensor, so terms for each of these products map to the same field in the hmp155 schema.
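
As a quick sanity check on a mapping, the lookup can be inverted so that each L0 ID points to its schema field. The sketch below is only illustrative: it copies the hmp155 example into a Python dict rather than reading the mapping file, and the helper name `invert_mapping` and the rule that an ID may be listed only once are assumptions, not part of the NEON tooling.

```python
# Stream mappings for the hmp155, copied from the example above.
# Keys are schema field names; values are existing L0 data product + term IDs.
HMP155_MAPPING = {
    "temperature": ["DP0.00098.001.01309", "DP0.20271.001.01309"],
    "relative_humidity": ["DP0.00098.001.01357", "DP0.20271.001.01357"],
    "dew_point": ["DP0.00098.001.01358", "DP0.20271.001.01358"],
    "error_state": ["DP0.00098.001.01359", "DP0.20271.001.01359"],
}


def invert_mapping(mapping):
    """Return {L0 ID: field name}; raise if an ID is listed more than once.

    Hypothetical helper: the uniqueness rule is an assumption for this sketch.
    """
    inverted = {}
    for field, l0_ids in mapping.items():
        for l0_id in l0_ids:
            if l0_id in inverted:
                raise ValueError(
                    f"{l0_id} is listed under both {inverted[l0_id]!r} and {field!r}"
                )
            inverted[l0_id] = field
    return inverted


L0_TO_FIELD = invert_mapping(HMP155_MAPPING)

# Terms from both data products resolve to the same schema field.
print(L0_TO_FIELD["DP0.00098.001.01357"])  # relative_humidity
print(L0_TO_FIELD["DP0.20271.001.01357"])  # relative_humidity
```

Inverting the mapping makes a term that was accidentally pasted under two fields fail loudly before the mapping is committed.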

  2. Once the mapping is complete, commit your changes and create a new tag for the BattelleEcology/neon-avro-schemas Git repo. If you are unsure of how to create a tag, here is a great place to start. Once the tag is pushed, a Docker image with the same tag is automatically generated in the Google image registry, to be used as the base image for the data_source_trino module. You'll need to update the Dockerfile for the data_source_trino module with the new base image tag and create a new Docker image for data_source_trino. Follow the instructions in section 3.0 Updating modules or creating new modules of this wiki.

  3. Stand up the cron_daily_and_date_control pipeline for your source type (recall the command: pachctl create pipeline -f /PATH/TO/PIPELINE.json). Use the pipeline spec of another source type as a template and make the adjustments described below. This pipeline controls the dates and sites of the data that are converted from the NEON database into Pachyderm. The only prerequisite for running this module is the site_list.json, which you can also copy from another source type and edit for the sites you want to pull data from. In the pipeline spec for cron_daily_and_date_control, adjust the START_DATE and END_DATE to a small date range for which you know there is data. For testing, choose a date range that is more than 1 month before the date you stand up the pipeline. In normal operations this pipeline will run on a daily interval and add the most recent day of data to the system from the streaming ingest (Kafka). For testing, however, it's best to start with a small, fixed range of data pulled from NEON's Trino database. As its name suggests, this is a cron pipeline that is triggered daily (even though we are configuring it with a fixed date range for testing). If you do nothing after standing it up, it will execute its first run on the next daily interval. To run it immediately, use the pachctl run cron <pipeline> command.
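
When picking a test date range, it can help to confirm that the whole range is older than the Kafka latency period, so the data come from Trino. The sketch below is a rough check only: the 30-day window is the "probably 30 days" figure from the introduction, the function name `is_trino_only_range` is hypothetical, and the dates are plain Python `date` objects rather than whatever format the pipeline spec uses for START_DATE and END_DATE.

```python
from datetime import date, timedelta

# Assumption: the latency period is 30 days ("probably 30 days" per this page).
KAFKA_LATENCY_DAYS = 30


def is_trino_only_range(start_date, end_date, today, latency_days=KAFKA_LATENCY_DAYS):
    """Return True if the whole range is older than the assumed latency window."""
    if start_date > end_date:
        raise ValueError("START_DATE must not be after END_DATE")
    cutoff = today - timedelta(days=latency_days)
    return end_date < cutoff


# A range from 2019 is well outside the window when checked on 2024-02-09.
print(is_trino_only_range(date(2019, 1, 1), date(2019, 1, 3), today=date(2024, 2, 9)))  # True

# A range ending 15 days before the check date is still inside the window.
print(is_trino_only_range(date(2024, 1, 20), date(2024, 1, 25), today=date(2024, 2, 9)))  # False
```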

  4. Create and deploy the SOURCE_TYPE_data_source_trino Pachyderm pipeline, which uses the Docker image with the mapping you created in steps 1-2 and the SOURCE_TYPE_cron_daily_and_date_control pipeline you created in step 3 to convert data from the NEON database into Pachyderm. As usual, find a data_source_trino pipeline spec from another source type in this Git repo to use as a template, making sure to find and replace the old source type with the new one throughout the pipeline spec. Also update the quay.io/battelleecology/data_source_trino:TAG image to the one you just created. Make sure that the input portion of the pipeline spec looks like this (replacing SOURCE_TYPE with your source type):

  5. After you've stood up the SOURCE_TYPE_data_source_trino pipeline, check that the output looks something like this (but for your source type). For future reference, the number after the date in the folder structure and before the date in the file name is the source ID for the sensor.

$ pachctl glob file prt_data_source_trino@master:/**
NAME                                               TYPE SIZE
/prt                                               dir  29.78MiB
/prt/2019                                          dir  29.78MiB
/prt/2019/01                                       dir  29.78MiB
/prt/2019/01/01                                    dir  5.787MiB
/prt/2019/01/01/13417                              dir  392.8KiB
/prt/2019/01/01/13417/prt_13417_2019-01-01.parquet file 392.8KiB
/prt/2019/01/01/14491                              dir  357.8KiB
/prt/2019/01/01/14491/prt_14491_2019-01-01.parquet file 357.8KiB
/prt/2019/01/01/16299                              dir  363.2KiB
/prt/2019/01/01/16299/prt_16299_2019-01-01.parquet file 363.2KiB
/prt/2019/01/01/17478                              dir  42.68KiB
/prt/2019/01/01/17478/prt_17478_2019-01-01.parquet file 42.68KiB
/prt/2019/01/01/17596                              dir  44.41KiB
/prt/2019/01/01/17596/prt_17596_2019-01-01.parquet file 44.41KiB
...
  6. BONUS: Check out the output! Copy/download one of these data files into your local environment and open it in RStudio using the NEONprocIS.base::def.read.parq function from this Git repository. See Wiki section 9.1 Useful Pachyderm Commands for the Pachyderm commands.
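
The folder and file naming shown in the pachctl glob file listing above (source ID after the date in the folder path and before the date in the file name) can be checked programmatically. The following is a minimal sketch, assuming the layout is exactly what the example listing suggests; the pattern and the helper name `parse_l0_path` are inferred from that listing and are not part of NEONprocIS.

```python
import re
from datetime import date

# Layout inferred from the example listing only:
# /SOURCE_TYPE/YYYY/MM/DD/SOURCE_ID/SOURCE_TYPE_SOURCE_ID_YYYY-MM-DD.parquet
PATH_PATTERN = re.compile(
    r"^/(?P<source_type>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"
    r"/(?P<source_id>[^/]+)/(?P<file_name>[^/]+\.parquet)$"
)


def parse_l0_path(path):
    """Split a data file path into (source type, date, source ID).

    Hypothetical helper; raises ValueError if the path does not fit the layout
    or if the file name disagrees with its parent folders.
    """
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f"Unexpected path layout: {path}")
    parts = match.groupdict()
    file_date = date(int(parts["year"]), int(parts["month"]), int(parts["day"]))
    expected_name = (
        f"{parts['source_type']}_{parts['source_id']}_{file_date.isoformat()}.parquet"
    )
    if parts["file_name"] != expected_name:
        raise ValueError(f"File name {parts['file_name']} does not match its folders")
    return parts["source_type"], file_date, parts["source_id"]


print(parse_l0_path("/prt/2019/01/01/13417/prt_13417_2019-01-01.parquet"))
# ('prt', datetime.date(2019, 1, 1), '13417')
```

Directory entries in the listing (for example /prt/2019/01/01/13417) do not match the pattern and raise ValueError, so filter the listing down to file entries first.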