AODN_WAVE harvester design #607
Some alternate approaches:
- we should probably be using the pipeline approach these days, for consistency with other collections.

Comments on your current approach:
- harvest_csv_metadata
- harvest_metadata
  - doesn't allow for daylight savings adjustments (but probably OK for Queensland), or it could be pulled out into a Talend code routine so you could just use it in a tMap.
- Data_Harvest
Current issues with the Queensland Wave dataset
Without taking into consideration which way is the best, here is a list of various issues.

Issues in letting Talend do everything
Talend harvesters quickly become overly complicated when dealing with external web-services. This is unfortunately/fortunately not the case. I'm of the opinion that, for ALL external web-services we retrieve data from, we should follow the design below. All our systems and tools are based around physical files.
As POs, we don't have other tools or even credentials to deal with cleaning any data, so using Talend to do everything means POs actually can't do anything with the data. Finally, it is also much harder to review a Talend harvester than a Python script.

Recommended design for all external web-services data retrieval
- Harvest the data from the external web-service -> PYTHON. Python has all the toolboxes (pandas, numpy, ...) needed to quickly write some code to download and read data in various formats. Many web-services fail when they are triggered too many times too quickly; this is easily handled in Python.
- Clean the dataset -> PYTHON. The data we collect from external contributors is never perfect; it is full of issues, and writing any logic in Java to handle those various cases becomes complicated and extremely time-consuming for us POs.
- Create the physical files -> PYTHON. We are currently in the process of writing a common NetCDF generator from JSON files.
- Harvest the data into our database -> Pipeline v2 -> Talend. Another benefit of using the pipeline is that we can use the IOOS checker to find more issues.

A minimal Python sketch of the first steps follows this list.
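The sketch below illustrates the Python side of the design above (not the actual harvester): a download helper with a simple retry/back-off for services that reject rapid repeated calls, plus a pandas cleaning step. The column name `DateTime`, the variable `Hsig` and the -99 fill value are assumptions for the example only.

```python
import time

import pandas as pd
import requests


def fetch_json(url, params=None, retries=3, backoff=5):
    """GET a JSON document, backing off when the service rejects rapid calls."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # wait longer after each failure


def clean_wave_records(records):
    """Basic cleaning: drop duplicate timestamps and obvious fill values."""
    df = pd.DataFrame(records)
    df = df.drop_duplicates(subset="DateTime")  # assumed timestamp column
    return df[df["Hsig"] > -99]                 # -99 assumed numeric fill value
```

The resulting dataframe would then be passed to the common NetCDF generator, and the physical files handed to pipeline v2 for harvesting into the database.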
The solution proposed by Laurent has been implemented.
Harvest of QLD delayed mode wave data using the CKAN data API:
Access to the QLD government database is via the CKAN API, which allows searching for and downloading datasets and (some) metadata by querying the data in SQL, Python or JSONP.
The JSONP approach has been chosen and implemented in a Talend harvester.
In the catalog, collections can be identified either by their names or their 'package_id'. A package consists of all data and some metadata from one site.
The data is subdivided into resources usually corresponding to one year of data. Resources are identified by their resource_id.
I have listed the target datasets in a 'metadata' CSV file read by the harvester (QLD_buoys_metadata.csv). The information in the file is the same as for the near-real-time metadata CSV file, with the addition of package_name, longitude, latitude and package_id (see the example and the reading sketch below).
Ex:
Cairns, 55028, Department of Environment and Science, QLD, directional waverider buoy, 30minutes, coastal-data-system-waves-cairns, 145.7, -16.73, a34ae7af-1561-443c-ad58-c19766d8354c
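A small sketch of reading that metadata file, assuming the column order of the example row above (the first six column names are guesses; only package_name, longitude, latitude and package_id are named in the text):

```python
import csv

# Assumed column order, following the example row above.
COLUMNS = ["site_name", "site_code", "owner", "state", "instrument",
           "interval", "package_name", "longitude", "latitude", "package_id"]


def load_buoy_metadata(path="QLD_buoys_metadata.csv"):
    """Return one dict per buoy/site listed in the metadata CSV."""
    with open(path, newline="") as f:
        return [dict(zip(COLUMNS, (value.strip() for value in row)))
                for row in csv.reader(f) if row]
```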
In order to reduce the level of maintenance of the data collection, we need to design the harvester accordingly. The metadata provided when querying a dataset gives us information we can use to design the update.
For example, a query to the API like the following will return these parameters (amongst a long list of others):
'https://data.qld.gov.au/api/3/action/package_search?q=coastal-data-system-waves-cairns'
id: faff952f-b91d-4ffe-811f-998e04f9e576
name: "Wave data - 2018"
description: "2018 wave data from the Cairns site (January 1 - March 31)"
last_modified: "2018-04-18T01:04:45.521941"
revision_id: 14aca928-abec-4909-a2b7-aaad1f97c4ab
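A hedged Python example of that package_search call, assuming the standard CKAN Action API JSON envelope (a top-level "result" object whose "results" list contains the packages and their resources):

```python
import requests

API = "https://data.qld.gov.au/api/3/action"


def search_packages(query):
    """Return the packages matching a package_search query."""
    resp = requests.get(f"{API}/package_search", params={"q": query}, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]["results"]


# List each resource's id, name and last_modified for the Cairns package.
for package in search_packages("coastal-data-system-waves-cairns"):
    for resource in package.get("resources", []):
        print(resource["id"], resource.get("name"), resource.get("last_modified"))
```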
Information is also provided for resources, for example:
'https://data.qld.gov.au/api/3/action/resource_show?id=9385fa18-eaf3-41cb-bf80-5fc2822fd786'
package_id: "b2282253-e761-4d75-89ff-8f77cf43d501"
last_modified: "2018-04-18T01:06:54.791621",
name: "Wave data - 2018",
revision_id: "cf438bea-35b0-4fcc-8cf7-1dc62a8801fe",
Updates:
In the example above we can see that the 2018 dataset comprises data up to March 31, suggesting that datasets are updated regularly.
What kind of approach do we want to adopt for dataset maintenance/updates?
Data from previous years (historical data) could be considered static and not be re-harvested. Current-year data could be regularly re-harvested in full, or re-harvested only if it has been updated. In the latter case, which parameter is the best diagnostic to check whether a dataset has been updated or not?
Check the "last_modified" parameter, or check for a change in "revision_id" (assuming it changes at each update; a question I should actually ask).
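A sketch of the update check under the first option (treat historical years as static, re-harvest the current-year resource only when "last_modified" has moved forward). The stored timestamp of the previous harvest is assumed to come from elsewhere, and the year test relies on the resource name ending with the year, as in "Wave data - 2018":

```python
from datetime import datetime, timezone


def needs_reharvest(resource, previously_seen_modified):
    """Decide whether a resource (as returned by resource_show) should be re-harvested."""
    current_year = str(datetime.now(timezone.utc).year)
    if not resource["name"].endswith(current_year):
        return False  # historical year: treated as static, never re-harvested
    if previously_seen_modified is None:
        return True   # never harvested before
    # ISO-8601 timestamps compare correctly as strings
    return resource["last_modified"] > previously_seen_modified
```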
The following solution has been implemented so far:
Issues with datasets:
The harvester is not finalized. The outstanding issues are: