The data preprocessor checks the raw Batumi Raptor Count data coming straight from the Trektellen database. It flags records containing possibly erroneous or suspicious information, but does not delete any data. It is up to coordinators and data technicians to decide what to do with the flagged records.
Author: Bart Hoekstra | Mail: [email protected]
The preprocessor runs on AWS Lambda and regularly checks the Trektellen site for newly uploaded BRC counts. Once both stations have uploaded data for the day, the fetcher downloads the data and stores a raw version in Dropbox (e.g. in `2019/data/raw`). The preprocessor then checks a copy of the raw data for possible errors and flags them by adding a description of the potential problem to a `check` column in the file stored in `2019/data/inprogress`. It is then up to coordinators to use their experience and knowledge of the migration on a given day to determine the validity of the flags added by the preprocessor and act accordingly. Once they have dealt with these issues and emptied the `check` column of flags, the file can be moved to `2019/data/clean`. A copy of the checked file is stored in `2019/data/inprogress-backup`, so data technicians can check how the data have been changed.
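The flag-without-deleting idea can be illustrated with a minimal Python sketch. It is not the preprocessor's actual code: only the `check` column comes from the description above, while the helper names `add_flag` and `ready_for_clean`, the `"; "` separator and the example record are assumptions made for illustration.

```python
from typing import Dict, List

FLAG_SEPARATOR = "; "  # assumption: how several flags on one record are joined


def add_flag(record: Dict[str, str], message: str) -> Dict[str, str]:
    """Return a copy of the record with `message` appended to its `check` column."""
    flagged = dict(record)  # work on a copy so the raw record stays untouched
    existing = flagged.get("check", "").strip()
    flagged["check"] = f"{existing}{FLAG_SEPARATOR}{message}" if existing else message
    return flagged


def ready_for_clean(records: List[Dict[str, str]]) -> bool:
    """True once no record carries a flag in its `check` column."""
    return all(not r.get("check", "").strip() for r in records)


# Hypothetical record, for illustration only.
raw = {"speciesname": "HB", "count": "12", "check": ""}
flagged = add_flag(raw, "doublecount not within 10 minutes")

print(flagged["check"])
print(ready_for_clean([raw]), ready_for_clean([flagged]))
```

Returning a copy mirrors the workflow described above, in which the raw file is kept as downloaded and only the in-progress copy receives flags.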
The following records will be flagged by the preprocessor:
- Records with invalid doublecount entries (e.g. not within 10 minutes or with the wrong distance code).
- Records containing more than one injured and/or killed bird (rare occurrence).
- Records lacking critical information in the `datetime`, `telpost`, `speciesname`, `count` or `location` columns (very unlikely, but the possible result of a bug).
- Records of birds in >E3 (rare occurrence).
- Records with registered morphs for any species other than Booted Eagles (and Eleonora's Falcons).
- Records of `HB_NONJUV`, `HB_JUV`, `BK_NONJUV` and `BK_JUV` if the number of aged birds is higher than the number of counted birds (`HB` and `BK`) within a 10-minute window around the age record.
- Records of Honey Buzzards that should probably be single-counted (at Station 2 during the HB focus period).
- Records of aged Honey Buzzards and Black Kites outside of expected distance codes (i.e. outside of W1-O-E1).
- Records containing unexpected combinations of sex and/or age information.
- Records with no timestamps, which are set to 00:00:00 during processing.
- Records containing non-protocol species.
- Records with age details in `W3`, `E3` and `>E3`, excluding non-juvenile harriers with a sex, juvenile `MonPalHen` and juvenile/non-juvenile eagles.
- Records of female Pallid Harriers with `I` or `A` age (legal per protocol, though very difficult to age in the field).
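
Two of the checks listed above can be sketched in Python to show the kind of logic involved. This is an illustration under stated assumptions, not the preprocessor's implementation: the column names and species codes come from the list, but the function names, the timestamp format, the reading of the 10-minute window as 10 minutes on either side of the age record, the restriction to records from the same `telpost`, and the example rows are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

CRITICAL_COLUMNS = ("datetime", "telpost", "speciesname", "count", "location")
# Age codes mapped to the species code that holds the counted birds.
AGE_CODES = {"HB_NONJUV": "HB", "HB_JUV": "HB", "BK_NONJUV": "BK", "BK_JUV": "BK"}
WINDOW = timedelta(minutes=10)  # assumption: 10 minutes either side of the age record


def parse(timestamp: str) -> datetime:
    # Assumption: timestamps look like "2019-09-01 10:05:00".
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def missing_critical(record: Dict[str, str]) -> List[str]:
    """Names of critical columns that are absent or empty in this record."""
    return [c for c in CRITICAL_COLUMNS
            if record.get(c) is None or str(record[c]).strip() == ""]


def aged_exceeds_counted(records: List[Dict[str, str]]) -> List[int]:
    """Indices of age records where aged birds outnumber counted birds nearby."""
    flagged = []
    for i, rec in enumerate(records):
        base = AGE_CODES.get(rec["speciesname"])
        if base is None:
            continue  # not an age record
        moment = parse(rec["datetime"])
        aged = counted = 0
        for other in records:
            if other["telpost"] != rec["telpost"]:
                continue  # assumption: compare within one station only
            if abs(parse(other["datetime"]) - moment) > WINDOW:
                continue
            if AGE_CODES.get(other["speciesname"]) == base:
                aged += int(other["count"])
            elif other["speciesname"] == base:
                counted += int(other["count"])
        if aged > counted:
            flagged.append(i)
    return flagged


# Hypothetical rows, for illustration only.
records = [
    {"datetime": "2019-09-01 10:00:00", "telpost": "1", "speciesname": "HB", "count": "20", "location": "W1"},
    {"datetime": "2019-09-01 10:05:00", "telpost": "1", "speciesname": "HB_JUV", "count": "5", "location": "W1"},
    {"datetime": "2019-09-01 12:00:00", "telpost": "1", "speciesname": "HB_NONJUV", "count": "3", "location": "E1"},
    {"datetime": "2019-09-01 12:02:00", "telpost": "1", "speciesname": "BK", "count": "", "location": "O"},
]

print(missing_critical(records[3]))   # ['count']
print(aged_exceeds_counted(records))  # [2]
```

In this example the third row is flagged because no `HB` count falls near its timestamp, and the fourth row is reported as lacking a `count`. The sketch assumes counts of the relevant species are present, so a missing-information check would need to run first.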

Development roadmap:
- Implement automatic download of the data, flagging of suspicious records and storing of the data in Dropbox using AWS Lambda.
- Automatically add `START` and `END` records to fetched data based on count start and end times.
- Implement checks for possibly erroneous records based on some statistical rules, e.g. the expected (daily) phenology of a species.
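
The last two items could take roughly the following shape in Python. This is a speculative sketch of features not described in detail here: the layout of the `START` and `END` rows, the function names, and the passage window in `EXPECTED_PASSAGE` are placeholders chosen for illustration, and the dates are not derived from BRC data.

```python
from datetime import date
from typing import Dict, List, Tuple

# Placeholder window for illustration only; not derived from BRC data.
EXPECTED_PASSAGE: Dict[str, Tuple[date, date]] = {
    "HB": (date(2019, 8, 20), date(2019, 9, 15)),
}


def add_start_end(records: List[Dict[str, str]], telpost: str,
                  start: str, end: str) -> List[Dict[str, str]]:
    """Return the records wrapped in hypothetical START and END rows."""
    def marker(name: str, timestamp: str) -> Dict[str, str]:
        # Assumption: marker rows reuse the normal columns with an empty count.
        return {"datetime": timestamp, "telpost": telpost,
                "speciesname": name, "count": "", "check": ""}
    return [marker("START", start), *records, marker("END", end)]


def outside_expected_window(species: str, day: date,
                            windows: Dict[str, Tuple[date, date]] = EXPECTED_PASSAGE) -> bool:
    """True if a species is recorded outside its expected passage window."""
    window = windows.get(species)
    if window is None:
        return False  # no rule for this species, so nothing to flag
    first, last = window
    return not (first <= day <= last)


rows = add_start_end([], "1", "2019-09-01 06:30:00", "2019-09-01 18:00:00")
print([row["speciesname"] for row in rows])               # ['START', 'END']
print(outside_expected_window("HB", date(2019, 10, 15)))  # True
```

A real phenology rule would more likely be estimated from historical counts than hard-coded, and could work at the level of time of day as well as date.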

To deploy the function to AWS Lambda:
- Clone this repository.
- `cd` into this directory.
- Build the Docker image to generate a deployment image for the function.
docker build --platform linux/amd64 -t brc-data-preprocessor-docker:v1 .
- Tag the Docker image. Replace XXXXXX with your account ID.
docker tag brc-data-preprocessor-docker:v1 XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
- Push the Docker image to the Amazon Elastic Container Registry (ECR). Replace XXXXXX with your account ID.
docker push XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
- Update the Lambda function. Replace XXXXXX with your account ID.
aws lambda update-function-code --function-name brc-data-preprocessor-docker \
    --image-uri XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
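
Because the account ID has to be substituted in three of the commands above, it can help to generate them from a single value. The Python sketch below only assembles and prints the commands and does not run anything; the function name `deployment_commands` is hypothetical, while the image name, region and CLI arguments are copied from the steps above.

```python
import shlex
from typing import List

FUNCTION_NAME = "brc-data-preprocessor-docker"
REGION = "eu-central-1"


def deployment_commands(account_id: str) -> List[List[str]]:
    """Build the build, tag, push and update commands for one account ID."""
    local_tag = f"{FUNCTION_NAME}:v1"
    image_uri = f"{account_id}.dkr.ecr.{REGION}.amazonaws.com/{FUNCTION_NAME}:latest"
    return [
        ["docker", "build", "--platform", "linux/amd64", "-t", local_tag, "."],
        ["docker", "tag", local_tag, image_uri],
        ["docker", "push", image_uri],
        ["aws", "lambda", "update-function-code",
         "--function-name", FUNCTION_NAME, "--image-uri", image_uri],
    ]


# Print the commands for review; replace XXXXXX with your account ID.
for command in deployment_commands("XXXXXX"):
    print(shlex.join(command))
```

The printed lines can be pasted into a shell. As with the manual steps, this assumes Docker is already authenticated against the registry.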