AODN_WAVE harvester design #607

Closed

bpasquer opened this issue May 2, 2018 · 3 comments
bpasquer commented May 2, 2018

Harvest of QLD delayed-mode wave data using the CKAN data API:
The QLD government database is accessed through the CKAN API, which allows datasets and (some) metadata to be searched and downloaded by querying in SQL, Python or JSONP.
The JSONP solution has been chosen for implementation in a TALEND harvester.

In the catalog, collections can be identified either by their names or by their 'package_id'. A package consists of all the data and some metadata from one site.
The data is subdivided into resources, usually corresponding to one year of data. Resources are identified by their resource_id.
I have listed the target datasets in a 'metadata' CSV file read by the harvester (QLD_buoys_metadata.csv). The information in this file is the same as in the near-real-time metadata CSV file, with the addition of package_name, longitude, latitude and package_id.

Ex:
Cairns, 55028, Department of Environment and Science, QLD, directional waverider buoy, 30minutes, coastal-data-system-waves-cairns, 145.7, -16.73, a34ae7af-1561-443c-ad58-c19766d8354c
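
For reference, a minimal sketch of reading this metadata file with pandas; the column names below are assumptions inferred from the example row above, not the file's actual header:

    # Read QLD_buoys_metadata.csv and iterate over the target packages.
    # Column names are assumed from the example row; the real file may differ.
    import pandas as pd

    columns = ["site_name", "site_code", "data_owner", "state", "instrument",
               "interval", "package_name", "longitude", "latitude", "package_id"]
    metadata = pd.read_csv("QLD_buoys_metadata.csv", names=columns,
                           skipinitialspace=True)

    for _, row in metadata.iterrows():
        print(row["package_name"], row["package_id"])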

In order to reduce the level of maintenance of the data collection, we need to design the harvester so that:

  1. the existing data is updated if needed
  2. new data is automatically added to the data collection

The metadata returned when querying a dataset provides information we can use to design the update process.
For example, a query to the API like the following will return these parameters (amongst a long list of others):
'https://data.qld.gov.au/api/3/action/package_search?q=coastal-data-system-waves-cairns'
id: faff952f-b91d-4ffe-811f-998e04f9e576
name: "Wave data - 2018"
description: "2018 wave data from the Cairns site (January 1 - March 31)"
last_modified: "2018-04-18T01:04:45.521941"
revision_id: 14aca928-abec-4909-a2b7-aaad1f97c4ab

Information is also provided for resources, for example:
https://data.qld.gov.au/api/3/action/resource_show?id=9385fa18-eaf3-41cb-bf80-5fc2822fd786
package_id: "b2282253-e761-4d75-89ff-8f77cf43d501"
last_modified: "2018-04-18T01:06:54.791621",
name: "Wave data - 2018",
revision_id: "cf438bea-35b0-4fcc-8cf7-1dc62a8801fe",
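
As an illustration, here is a minimal Python sketch of these two queries using the requests library; the field names follow the CKAN action API (package_search, resource_show) and the resource id is the one from the example URL above:

    # Query the QLD CKAN action API for package and resource metadata.
    import requests

    BASE = "https://data.qld.gov.au/api/3/action"

    # list the resources (and their last_modified dates) of matching packages
    resp = requests.get(BASE + "/package_search",
                        params={"q": "coastal-data-system-waves-cairns"})
    resp.raise_for_status()
    for package in resp.json()["result"]["results"]:
        for resource in package["resources"]:
            print(resource["id"], resource["name"], resource["last_modified"])

    # metadata for a single resource
    resp = requests.get(BASE + "/resource_show",
                        params={"id": "9385fa18-eaf3-41cb-bf80-5fc2822fd786"})
    resource = resp.json()["result"]
    print(resource["package_id"], resource["name"], resource["last_modified"])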

Updates:
In the example above, we can see that the 2018 dataset comprises data up to March 31, suggesting that datasets are updated regularly.

What kind of approach do we want to adopt for dataset maintenance/updates regarding:

- Frequency of harvest: monthly, every 3 months, every 6 months?

- Harvesting based on specific criteria, or regular re-harvest of the whole dataset (which implies deleting the previous data)?

Data from previous years (historical data) could be considered static and not be re-harvested. Current-year data could be regularly re-harvested in full, or re-harvested only if it has been updated. In the latter case, which parameter is the best diagnostic for whether a dataset has been updated: the "last_modified" parameter, or a change in "revision_id" (assuming it changes at each update, a question I should actually ask)?
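
A minimal sketch of the last_modified check, assuming the previously harvested values are available (e.g. from the deployments table); here they are mocked as a plain dict keyed by resource id:

    # Decide whether a CKAN resource needs re-harvesting, based on last_modified.
    from datetime import datetime

    stored_last_modified = {
        # resource_id -> last_modified recorded at the previous harvest (mocked)
        "9385fa18-eaf3-41cb-bf80-5fc2822fd786": datetime(2018, 3, 1),
    }

    def needs_reharvest(resource):
        """Return True if the resource is new or has changed since the last harvest."""
        current = datetime.strptime(resource["last_modified"], "%Y-%m-%dT%H:%M:%S.%f")
        previous = stored_last_modified.get(resource["id"])
        return previous is None or current > previous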

The following solution has been implemented so far:

  • Data update is conditional: a test on the deployments table runs the harvest only for the deployment_ids whose last_modified date has changed. A Boolean variable "modified" is set to true or false according to the test result.

Issues with the datasets:

  • Time zone: I contacted the Coastal Wave Unit to confirm that time is given in AEST (UTC+10), since no clear statement is made anywhere in the metadata, the data or the website.
  • Location: The location is provided neither in the dataset nor in the metadata. It is provided in the metadata CSV file, but we have no way to automatically update the location if it changes. We are also missing information on deployment location history => need to contact the Coastal Wave Unit.

The harvester is not finalized. The outstanding issues are:

  • Timezone: conversion from AEST to UTC is done by subtracting 10 hours, which is not the proper way. I have found a Java code snippet to do it properly but have gone for the easy solution so far (see the sketch after this list).
  • The comparison of the current and stored last_modified dates is not working properly. Possibly (probably?) a design issue.
  • SST and wave direction are stored as text instead of numeric in the older datasets (which can be 30 years old or more). Need to check the data type and convert from text to numeric before harvesting (not done yet).
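
For the timezone issue, a minimal sketch of a proper AEST -> UTC conversion in Python using a named timezone (pytz) rather than subtracting a fixed 10 hours; the timestamp format is an assumption based on the examples above:

    # Convert a naive AEST timestamp string to UTC using a named timezone.
    # Queensland does not observe daylight saving, so the offset is constant,
    # but a named zone keeps the intent explicit.
    from datetime import datetime
    import pytz

    QLD_TZ = pytz.timezone("Australia/Queensland")

    def aest_to_utc(timestamp):
        naive = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
        return QLD_TZ.localize(naive).astimezone(pytz.utc)

    print(aest_to_utc("2018-04-18T01:06:54"))  # -> 2018-04-17 15:06:54+00:00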
jonescc commented May 17, 2018

Some alternate approaches:

  • detect new/modified/deleted packages/resources using cron/pipeline and only use the harvester to delete, add or modify information, as per other harvesters
  • use the old harvesting components instead of manifest components to keep track of new/modified/deleted packages/resources, rather than trying to keep track of it yourself

We should probably be using the pipeline approach these days, for consistency with other collections.

Comments on your current approach:

harvest_csv_metadata

  • doesn't currently handle deleted packages, but perhaps that isn't required
  • should "frequence" be "frequency"? There is no such English word as "frequence"
  • had an error loading the metadata due to the encoding setting on the tFileInputDelimited component. I set it to 'UTF-8', but if you want this passed in as a parameter you'll need to add it to the possible context parameters
  • perhaps consider skipping empty rows/trimming columns. I had to play around with it before I could get some valid data in, but then that's because I didn't know what format was required

harvest_metadata

  • not fault tolerant at the moment: if harvest_metadata runs successfully but the harvester fails after it, then the next time the harvester runs the modified flag will be reset to false and the modifications won't be processed. The commit should wrap the full transaction, the old harvesting components (which handle this) should be used, or some other approach taken
  • won't handle deleted resources, if that's possible
  • need to put single quotes around string values in tpostgresqlrow_3 (although see below re this needing to be a tpostgresqlinput)
  • need to use tpostgresqlinput instead of tpostgresqlrow for the lookup: an input component will actually return rows, which is why your last_modified date comparison isn't working
  • timezone handling is a bit horrible in Java 6/7, but a couple of ways of setting the correct timezone are: pass a time offset when parsing the string, e.g.

    // truncate microseconds to milliseconds and append the AEST offset before parsing
    TalendDate.parseDate(inputDate.replaceFirst("(\\d{3})\\d*$", "$1+1000"), "yyyy-MM-dd'T'HH:mm:ss.SSSz")

       which doesn't allow for daylight savings adjustments but is probably OK for Queensland, or

    // parse the timestamp in the Queensland timezone (no daylight saving)
    TimeZone qldTZ = TimeZone.getTimeZone("Australia/Queensland");
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    format.setTimeZone(qldTZ);   // apply the Queensland timezone (otherwise the JVM default is used)
    output_row.last_modified = format.parse(input_row.last_modified);

       which could be pulled out into a Talend code routine so you could just use it in a tMap.

Data_Harvest

  • modified is never set to false anywhere
  • assuming you're intending to remove the limit parameter
  • could just set the max and min values directly rather than using a WITH clause
lbesnard commented Sep 20, 2018

Current issues with the Queensland Wave dataset

Without taking into consideration which way is best, here is a list of the various issues
I came across with the Queensland wave dataset (minus the ones already found by Bene). There might be more.

issues in letting Talend do everything

Talend harvesters quickly become overly complicated when dealing with external web services.
The idea that a Talend harvester, running as a cron job, will run, download the data, clean it,
update it... assumes that the external web service and its data are perfect.

This is unfortunately/fortunately not the case.

I'm of the opinion that, for ALL external web services we retrieve data from and
host, the Talend harvester is not the suitable tool to handle the whole process.
It should only be used to process physical files.

All our systems and tools are based around physical files:

  • If we want to add a new file, or update it, this goes via a pipeline,
    a rather powerful tool. Talend will easily update the data on the database side.
  • If we want to delete some data, again, we just run a po_s3_del command to remove
    any file from the database.

As POs, we don't have other tools, or even the credentials, to clean any data
from the database (which is great in my opinion). If we used Talend for this, we would be pretty much locked in.

Using Talend to do everything means POs actually can't do anything with the data. It
becomes almost impossible to update or remove any data. A full reprocess of the data becomes
really complicated without any physical files to re-harvest. We sometimes deal with
NRT web services which remove their old ongoing data that we have decided to keep.
In that case, a data reprocess would be challenging. When dealing with physical files,
everything becomes easier.

Finally, it is also much harder to review a Talend harvester than a Python script.

recommended design for retrieving data from all external web services

harvest the data from an external web-service -> PYTHON

Python has all the toolboxes (pandas, numpy...) needed to quickly write code to download and read data
from:

  • wfs
  • http(s) request
  • ...

in various formats:

  • json
  • xml
  • text
  • ...

Many web services fail when they are hit too many times too quickly. This is easily
handled in Python by adding a retry decorator to a download function so that the download
is retried a defined number of times, as in the sketch below.
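
A minimal sketch of such a retry decorator; the attempt count, delay and download function are illustrative:

    # Retry a download function a fixed number of times, pausing between attempts.
    import time
    from functools import wraps

    import requests

    def retry(attempts=3, delay=10):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == attempts:
                            raise
                        time.sleep(delay)
            return wrapper
        return decorator

    @retry(attempts=5, delay=30)
    def download(url):
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        return resp.content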

cleaning the dataset -> PYTHON

The data we collect from external contributors is never perfect. It is full of
inconsistencies, and this will always be the case:

  • variable name/case changing
  • varying fill values
  • empty variables
  • bad values
  • dealing with various timezones
  • inconsistent time format
  • ...

Writing logic in Java to handle the various cases listed above becomes complicated and extremely time consuming for us POs;
it's a matter of days versus 10 minutes. It will also, most likely, be poorly written.
Debugging is also rather complicated in Talend.
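
A minimal sketch of the kind of cleaning pandas makes cheap; the column names and fill values are illustrative, not the real ones:

    # Normalise column names, coerce text columns to numeric, replace fill values.
    import numpy as np
    import pandas as pd

    def clean(df):
        # variable name/case changes across years
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        # older files store SST / wave direction as text
        for column in ("sst", "wave_direction"):
            if column in df.columns:
                df[column] = pd.to_numeric(df[column], errors="coerce")
        # varying fill values -> NaN
        return df.replace([-9999, -99.9], np.nan)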

creating physical files -> PYTHON

We are currently in the process of writing a common NetCDF generator from JSON files. That will make
this step even easier.
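
Purely as an illustration (this is not the common generator mentioned above), turning a cleaned record set into a NetCDF file takes only a few lines with xarray; the variable names are hypothetical:

    # Write a small time-indexed dataset to NetCDF with xarray.
    import pandas as pd
    import xarray as xr

    df = pd.DataFrame({
        "TIME": pd.to_datetime(["2018-01-01T00:00:00", "2018-01-01T00:30:00"]),
        "WAVE_HEIGHT": [1.2, 1.4],  # hypothetical variable name
    }).set_index("TIME")

    ds = xr.Dataset.from_dataframe(df)
    ds.attrs["title"] = "QLD wave data - Cairns (example)"
    ds.to_netcdf("cairns_wave_example.nc")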

Harvesting the data to our database -> Pipeline v2 -> Talend

Another benefit of using the pipeline is that we can use the IOOS checker to find more issues
with the data instead of blindly committing it to the database.
We have also talked in the past about creating a checker on the actual data values, as well as an
indexing database.
If we want to reduce the number of harvesters, this is also the better way to go.

@bpasquer
Contributor Author

The solution proposed by Laurent has been implemented.
