1.3 Groups and Context
A group brings together specific named locations that are required to be processed together in the creation of a data product. A good example of a group is the instance of the Single Aspirated Air Temperature product on measurement level 3 on the tower at CPER (see diagram below, labeled temp-air-single_CPER000030). The processing algorithm for this product instance requires data from 4 different sensors:
- prt temperature sensor at ML3 at CPER
- fan + tachometer (dualfan) sensor in the aspirated shield at ML3 at CPER
- heater in the aspirated shield at ML3 at CPER
- 2D wind sensor (windobserverii) at ML3 at CPER
All of these sensors are installed at different named locations. Placing them in a group allows the data collected at these locations to be brought together for processing.
A group name is unique and consists of a prefix and a location descriptor in the general format:
Group name = PREFIX_LOCATION
where...
PREFIX = Data product short name
LOCATION = SITEHORVER
The prefix relates a group to similar groups, such as the three temp-air-single groups in the diagram above, which are all instances of the Single Aspirated Air Temperature data product. The prefix should match the short name for the data product, which can be found in the data product manager in the SOM portal. The location descriptor indicates where the product instance is located. Generally, the location descriptor will match the format SITEHORVER, which combines the 4-letter NEON site code and the horizontal and vertical location indices.
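The naming convention above can be illustrated with a small parser. This is only a sketch: the function `parse_group_name` and the `GroupName` type are hypothetical names, not part of the pipeline code. It assumes the site code is 4 uppercase letters and that HOR and VER are 3 digits each, as in the examples on this page. Because the text says the location descriptor only "generally" follows SITEHORVER, descriptors that do not match are returned unparsed rather than rejected.

```python
import re
from typing import NamedTuple, Optional


class GroupName(NamedTuple):
    prefix: str            # data product short name, e.g. "par"
    location: str          # full location descriptor, e.g. "CPER000040"
    site: Optional[str]    # 4-letter site code, or None if not SITEHORVER
    hor: Optional[str]     # horizontal index, or None if not SITEHORVER
    ver: Optional[str]     # vertical index, or None if not SITEHORVER


# Assumed shape of SITEHORVER: 4 uppercase letters + 3-digit HOR + 3-digit VER.
_SITEHORVER = re.compile(r"^([A-Z]{4})(\d{3})(\d{3})$")


def parse_group_name(name: str) -> GroupName:
    """Split PREFIX_LOCATION at the last underscore and decode the location."""
    prefix, sep, location = name.rpartition("_")
    if not sep or not prefix or not location:
        raise ValueError(f"not in PREFIX_LOCATION format: {name!r}")
    match = _SITEHORVER.match(location)
    if match is None:
        # Descriptor does not follow SITEHORVER; keep it as an opaque string.
        return GroupName(prefix, location, None, None, None)
    site, hor, ver = match.groups()
    return GroupName(prefix, location, site, hor, ver)


print(parse_group_name("temp-air-single_CPER000030"))
```

Splitting at the last underscore keeps any hyphens in the prefix intact, which matters for short names like temp-air-single.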
Note that:
- A location can be included in more than one group
- Groups may be members of other groups
An example of the latter is the Barometric Pressure data product, for which an example group is shown in the diagram above (pressure-air_CPER000035). Producing this product requires data from the barometric pressure sensor (ptb330a) on the tower along with the L1 output for the Relative Humidity data product on the tower at the same site. Thus, the group for the instance of Barometric Pressure at CPER will include the named location for the ptb330a at CPER and the group for the Relative Humidity product on the tower at CPER (rel-humidity_CPER000040).
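To make the idea of groups containing other groups concrete, here is a minimal sketch that walks a membership table and collects every named location that ultimately feeds a group. The table layout, the function `resolve_locations`, and the named-location IDs (`LOC_PTB330A`, `LOC_RH_SENSOR`) are all placeholders invented for illustration. The real membership data lives in the PDR database and is not structured this way.

```python
from typing import Dict, List, Optional, Set


def resolve_locations(
    group: str,
    members: Dict[str, List[str]],
    _seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Return all named locations reachable from `group`, following nested groups."""
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()  # already visited; also guards against accidental cycles
    seen.add(group)

    locations: Set[str] = set()
    for member in members.get(group, []):
        if member in members:
            # The member is itself a group: recurse into it.
            locations |= resolve_locations(member, members, seen)
        else:
            # The member is a named location.
            locations.add(member)
    return locations


# Placeholder membership table (location IDs are made up).
membership = {
    "pressure-air_CPER000035": ["LOC_PTB330A", "rel-humidity_CPER000040"],
    "rel-humidity_CPER000040": ["LOC_RH_SENSOR"],
}
print(sorted(resolve_locations("pressure-air_CPER000035", membership)))
```

Note that in the Barometric Pressure example the nested group contributes its L1 product output rather than raw sensor data, so this traversal only shows which locations are involved, not which data files are joined.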
A group is always required in order to publish data from Pachyderm, even if the group contains a single named location as a member. This is to create consistency in the product pipelines and because groups have properties attached to them that are used in the publication process.
Context is a free-form string that can be attached as a property to named locations and/or QC thresholds.
When used for named locations, context typically describes the environment or application in which the measurement is used and that is not otherwise described by its source type or group. Most often, context will enable differentiating among multiple locations of the same source type within the same group. For example, the Photosynthetically Active Radiation (PAR) product instance at the top of the tower at CPER (and every other tower) includes both incoming and reflected radiation measured by two sensors of the same source type (pqs1), one that faces up and one that faces down. The group for this product instance (e.g. par_CPER000040 in the diagram above) will include both named locations. Why do we need context here? Take a look at the repository structure for data in this group collected on January 1, 2020.
/2020   <-- year
  /01   <-- month
    /01   <-- day
      /par_CPER000040   <-- group for the PAR data product instance at CPER ML4
        /pqs1   <-- source type of the PAR sensor
          /CFGLOC101563   <-- named location of the upward-facing PAR sensor
            /data   <-- subdirectory for sensor data
              pqs1_CFGLOC101563_2020_01_01.parquet   <-- sensor data file
            /location   <-- subdirectory for location metadata
              CFGLOC101563.json   <-- location metadata file
          /CFGLOC101564   <-- named location of the downward-facing PAR sensor
            /data   <-- subdirectory for sensor data
              pqs1_CFGLOC101564_2020_01_01.parquet   <-- sensor data file
            /location   <-- subdirectory for location metadata
              CFGLOC101564.json   <-- location metadata file
Both named locations are of the same source type, so they cannot be differentiated that way. If one were to open the location file nested under each named location, one would see that they also have the same HOR and VER location indices because they are both on the tower (HOR=000) at measurement level 4 (VER=040). In this case, we need context to tell them apart. Context is also stored in each location file (see next section). The context 'upward-facing' is assigned to the upward-facing location and the context 'downward-facing' is assigned to the downward-facing location.
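The repository layout shown above can be decoded from a file path alone, which also makes it clear why the path cannot distinguish the two PAR sensors by orientation. The sketch below uses a hypothetical helper, `parse_data_path`, and assumes the path starts at the year directory. Real Pachyderm paths usually carry a repo prefix in front, which would need to be stripped first.

```python
from pathlib import PurePosixPath


def parse_data_path(path: str) -> dict:
    """Split /YEAR/MONTH/DAY/GROUP/SOURCE_TYPE/NAMED_LOCATION/... into named parts."""
    parts = PurePosixPath(path).parts
    if parts and parts[0] == "/":
        parts = parts[1:]  # drop the root marker of an absolute path
    if len(parts) < 6:
        raise ValueError(f"path too short for the expected layout: {path!r}")
    year, month, day, group, source_type, named_location = parts[:6]
    return {
        "year": year,
        "month": month,
        "day": day,
        "group": group,
        "source_type": source_type,
        "named_location": named_location,
        "rest": "/".join(parts[6:]),  # e.g. "data/<file>" or "location/<file>"
    }


up = parse_data_path(
    "/2020/01/01/par_CPER000040/pqs1/CFGLOC101563/data/"
    "pqs1_CFGLOC101563_2020_01_01.parquet"
)
down = parse_data_path(
    "/2020/01/01/par_CPER000040/pqs1/CFGLOC101564/data/"
    "pqs1_CFGLOC101564_2020_01_01.parquet"
)
# Group and source type are identical; only the named location differs.
print(up["group"] == down["group"], up["source_type"] == down["source_type"])
print(up["named_location"], down["named_location"])
```

Nothing in the parsed fields says which sensor faces up, so that information has to come from the context stored in the location file.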
If necessary, multiple contexts can be assigned to the same named location. Avoid assigning a context that overlaps things already described by the group (i.e. data product) or other properties on the named location, such as location indices (HOR & VER).
When used in QC thresholds, context is used to differentiate QC thresholds for data products that share the same term name - more on this in Thresholds.
Note that the contexts used for named locations do not need to be the same as the contexts used for thresholds.
The source of truth for groups and contexts is the PDR database, and they may be viewed and edited in the online SOM portal. These are loaded or updated in Pachyderm on a daily basis.
Group information is stored in a JSON file for each group member, loaded via the [GROUP_PREFIX]_groups_loader pipeline and applied to existing pipelines using the [SHORT-NAME]_group_path module. After these modules execute, group information accompanies data in downstream pipelines in the group folder. A file in this folder lists the groups that the member belongs to, along with associated metadata, in an entry that looks something like:
...
"name": "CFGLOC101563",
"group": "par_CPER000040",
"active_periods": [
{
"start_date": "2013-09-12T00:00:00Z",
"end_date": "2016-01-01T00:00:00Z"
}
],
"HOR": "000",
"VER": "040"
...
In the example above, name is the name of the member (either a named location or a group) and group is the name of the group that it is a member of. The metadata below these fields are specific to the group (and can differ from similar properties on named locations).
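As an illustration of how the active_periods field in such an entry might be used, the sketch below checks whether a membership is active at a given time. The function `is_active` is hypothetical. It assumes that a null start or end date means the period is open-ended on that side and that the end date is exclusive. Neither assumption is confirmed by this page, so check the actual pipeline modules before relying on them.

```python
from datetime import datetime, timezone
from typing import Optional


def _parse_utc(timestamp: Optional[str]) -> Optional[datetime]:
    """Parse an ISO timestamp such as '2013-09-12T00:00:00Z'; None stays None."""
    if timestamp is None:
        return None
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def is_active(entry: dict, when: datetime) -> bool:
    """True if `when` (timezone-aware) falls inside any active period of the entry."""
    for period in entry.get("active_periods", []):
        start = _parse_utc(period.get("start_date"))
        end = _parse_utc(period.get("end_date"))
        after_start = start is None or start <= when  # assumed: null = unbounded
        before_end = end is None or when < end        # assumed: end is exclusive
        if after_start and before_end:
            return True
    return False


entry = {
    "name": "CFGLOC101563",
    "group": "par_CPER000040",
    "active_periods": [
        {"start_date": "2013-09-12T00:00:00Z", "end_date": "2016-01-01T00:00:00Z"}
    ],
    "HOR": "000",
    "VER": "040",
}
print(is_active(entry, datetime(2015, 6, 1, tzinfo=timezone.utc)))
print(is_active(entry, datetime(2020, 1, 1, tzinfo=timezone.utc)))
```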
The [SHORT-NAME]_group_path module inserts the group name into the path structure, which makes it easy to apply the filter_joiner module to bring together the data from all the members of each group for further processing. Read how groups change the repository structure in Section 1.0 Pipeline & repo structure, pipeline naming, terms. Also see the Wiki section on the filter-joiner module to see how data from the group members are brought together after the group name is inserted into the path.
Context is a property of named locations and can be found in two different spots:
- The location JSON file for a sensor, as found in the [SOURCE_TYPE]_location_asset repo and accompanying data in downstream pipelines in the location folder.
- The location JSON file for a particular named location, as found in the [SOURCE_TYPE]_location_loader repo and accompanying data in downstream pipelines in the location folder.
These files list the context(s) for the associated named location with an entry that looks something like:
...
"name": "CFGLOC101563",
"site": "CPER",
"context":[
"upward-facing"
],
"active_periods": [
{
"start_date": null,
"end_date": null
}
],
"HOR": "000",
"VER": "040",
...
In the example above, name is the name of the named location and context lists the contexts associated with the named location. The metadata below these fields are specific to the named location.
Unlike groups, context information remains only in the location file and is not inserted into the path structure. If it is necessary to split a repository based on context, use the context_filter module. Otherwise, your code may access the location file directly and determine the context(s).
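For code that reads the location file directly, determining the context might look like the following sketch. The helpers `has_context` and `names_with_context` are hypothetical, and the sketch assumes the relevant entry has already been extracted from the location file as a dictionary with the name and context keys shown above. The surrounding file structure is elided in the example, so locating that entry is left out here. This is not the implementation of the context_filter module.

```python
import json
from typing import Iterable, List


def has_context(location: dict, wanted: str) -> bool:
    """True if `wanted` is among the contexts of a location entry."""
    # A missing or null "context" field is treated as "no contexts".
    return wanted in (location.get("context") or [])


def names_with_context(locations: Iterable[dict], wanted: str) -> List[str]:
    """Names of all location entries that carry the requested context."""
    return [loc["name"] for loc in locations if has_context(loc, wanted)]


# Entries shaped like the example above (other fields omitted for brevity).
upward = json.loads(
    '{"name": "CFGLOC101563", "site": "CPER", "context": ["upward-facing"]}'
)
downward = json.loads(
    '{"name": "CFGLOC101564", "site": "CPER", "context": ["downward-facing"]}'
)

print(names_with_context([upward, downward], "upward-facing"))
```

Because a named location can carry multiple contexts, the check tests membership in the list rather than equality with a single string.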