This repository holds per-facility Python pipeline code, for example handlers, destination path functions and any other code specific to a given facility and/or pipeline.
The code extends the aodncore project, which is treated as an upstream library. Its main contribution is the HandlerBase parent class, which provides all of the common handler capabilities and essentially implements the entire "generic handler".
Facility/pipeline specific handlers are then created as subclasses of HandlerBase, and act to configure, extend and/or modify the behaviour as appropriate for the specific pipeline.
For example, a child class may do one or more of the following:
- define how the destination path is determined for the file(s) being handled (i.e. a dest_path function)
- restrict which file extensions/types are allowed to be handled
- determine which compliance checks are performed against files (if applicable)
- in case of a "multiple file" handler (e.g. a ZIP or manifest file), determine which files are included/excluded, e.g. process all NC files in a ZIP but not TXT files, or define on an individual file basis which ones are harvested and/or uploaded and/or archived
- determine who is notified in case of handler error (or success)
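As a rough illustration of the "multiple file" case in the list above, the sketch below selects members of a ZIP file using include and exclude patterns. It relies only on the standard library and does not use aodncore; the function name `eligible_members`, the file names, and the "match any include pattern" rule are assumptions made for the example, not the behaviour of HandlerBase.

```python
import io
import re
import zipfile


def eligible_members(zip_bytes, include_regexes, exclude_regexes=()):
    """Return ZIP member names that match any include regex and no exclude regex.

    Hypothetical helper for illustration only; not part of aodncore.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
    return [
        name for name in names
        if any(re.search(pattern, name) for pattern in include_regexes)
        and not any(re.search(pattern, name) for pattern in exclude_regexes)
    ]


# build a small in-memory ZIP to demonstrate (file names are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    for name in ('a.nc', 'b.nc', 'readme.txt', 'b_draft.nc'):
        zf.writestr(name, b'')

# keep NC files, skip TXT files and anything marked as a draft
selected = eligible_members(buffer.getvalue(), [r'\.nc$'], [r'_draft'])
print(selected)  # ['a.nc', 'b.nc']
```

In a real handler this kind of selection would be driven by handler parameters rather than hand-written code, so treat the sketch as a picture of the idea only.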
It is highly recommended to use the PyCharm IDE for developing on this code base, as it provides many useful features out of the box, such as good unit test integration, real time code quality checking, automatic completion, and the usual basic IDE debugging capabilities such as setting breakpoints and syntax highlighting.
This project is licensed under the terms of the GNU GPLv3 license.
- Ensure the aodn/chef repository is checked out (the directory will be referred to as ${CHEF_REPO}), and that the usual prerequisites for running the PO box are met.
- Ensure that your GitHub keys are in place so that you have write access to the python-aodndata repository on GitHub.
- Provision the PO box using the bin/pipeline-box.sh script, so that the required repositories are checked out and the PO box is ready for use.
- Install the boto3 and virtualenv packages into your system Python environment:
$ sudo pip install boto3 virtualenv
Developing in a Python virtual environment is the best choice to isolate the project from external Python dependencies (e.g. those installed with the operating system).
- In a terminal, browse to the ${CHEF_REPO}/src/python-aodndata directory:
$ cd ${CHEF_REPO}/src/python-aodndata
- Execute the setup_virtualenv.sh script:
    $ scripts/setup_virtualenv.sh
    Downloading dependencies...
    Creating virtual environment...
    Installing dependencies into virtual environment...
    Virtual environment successfully created at: python-aodndata-virtualenv

    To use:
      * Configure PyCharm project interpreter as: /home/me/github/chef/src/python-aodndata/python-aodndata-virtualenv/bin/python
      * Activate in shell environment: $ source /home/me/github/chef/src/python-aodndata/python-aodndata-virtualenv/bin/activate
Note: by default, the virtual environment will source AODN dependencies (e.g. aodntools, aodncore) from the production repository. If you want to test a package which is at a different promotion stage, you can override this by setting the STAGE environment variable to either rc or build:
```bash
$ export STAGE=rc
$ scripts/setup_virtualenv.sh
```
- Make note of the path to the Python interpreter output by the script (this is needed for the next step).
Alternatively, if you have some issues with this step, you can try using anaconda/miniconda and the following conda env file
- Open PyCharm
- Click 'Open' and browse to the python-aodndata repository checked out by the PO box script (i.e. ${CHEF_REPO}/src/python-aodndata)
- Click 'File' -> 'Settings' and browse to 'Project: python-aodndata' -> 'Project Interpreter'
- Click the 'cog' icon in the top right of the window, select 'Add Local...', 'Existing Environment', browse to or paste the path to the Python interpreter from the virtual environment setup step, and press 'OK' and 'OK' to save the configuration
- Confirm that the unit tests for the project run correctly by right-clicking on the test_aodndata directory and selecting 'Run Unittests in test_aodndata'
The best way to get started writing a handler is to create the handler class itself along with an associated unit test class in order to easily run the handler with arbitrary inputs. This makes it possible to make small changes and immediately run the handler in the IDE to observe the results, long before trying to deploy the code to a running pipeline.
Each facility is given its own module namespace, aodndata.facility_name, in which to define objects relating specifically to that facility (for example, the handler classes themselves, destination path functions or any other miscellaneous supporting code).
For example, the moorings facility occupies the aodndata.moorings namespace, and defines the following objects:
# a handler class; a sub-class of HandlerBase extended to specifically support moorings input files
from aodndata.moorings.handlers import MooringsHandler
# helper class used in determining the destination path
from aodndata.moorings.classifiers import MooringsFileClassifier
# a 'dest_path' function, which given the path to a file, returns the _destination path_, i.e. the path to which the file will be published on S3
from aodndata.moorings.classifiers import dest_path_anmn_nrs_realtime
dest_path = dest_path_anmn_nrs_realtime('test_aodndata/common/IMOS_ANMN-NRS_MT_20161109T231108Z_NRSMAI_FV00_NRSMAI-Surface-21-2016-11-MET-realtime.nc')
print(dest_path)
IMOS/ANMN/NRS/REAL_TIME/NRSMAI/Meteorology/IMOS_ANMN-NRS_MT_20161109T231108Z_NRSMAI_FV00_NRSMAI-Surface-21-2016-11-MET-realtime.nc
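To give a feel for what a dest_path function does, here is a minimal standalone sketch that derives the same destination path purely from the file name. It is not the implementation of `dest_path_anmn_nrs_realtime`: the function name `sketch_dest_path`, the `DATA_CODE_CATEGORY` lookup table and the assumption that every field can be read from the underscore-separated file name are all hypothetical, and the real classifier may use other information.

```python
import os

# hypothetical mapping from the file name's data code to a directory name
DATA_CODE_CATEGORY = {'MT': 'Meteorology'}


def sketch_dest_path(filepath):
    """Build a destination path from an IMOS-style file name (illustration only)."""
    basename = os.path.basename(filepath)
    fields = basename.split('_')
    if len(fields) < 6 or fields[0] != 'IMOS':
        raise ValueError("unexpected file name: {name}".format(name=basename))

    # e.g. 'ANMN-NRS' -> facility 'ANMN', sub-facility 'NRS'
    facility, sub_facility = fields[1].split('-', 1)
    try:
        category = DATA_CODE_CATEGORY[fields[2]]
    except KeyError:
        raise ValueError("unknown data code: {code}".format(code=fields[2]))
    site_code = fields[4]

    # 'REAL_TIME' is hard-coded because this sketch only covers real-time files
    return '/'.join(['IMOS', facility, sub_facility, 'REAL_TIME', site_code, category, basename])


print(sketch_dest_path(
    'test_aodndata/common/'
    'IMOS_ANMN-NRS_MT_20161109T231108Z_NRSMAI_FV00_NRSMAI-Surface-21-2016-11-MET-realtime.nc'
))
```

The key property shown is that the function takes a path to a local file and returns a relative path under which the file would be published; how the path components are actually derived is up to each facility.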
- If there isn't already a relevant facility subpackage under aodndata, create one (in accordance with the preferred naming conventions). A package is a directory containing an (often empty) __init__.py file, and allows clean arrangement of code into a namespace. The structure is arbitrary, however if in doubt, you may wish to create a handlers.py module in the directory to contain the handler code as a starting point:
    ${CHEF_REPO}/src/python-aodndata/aodndata$ find myfacility/
    myfacility/
    myfacility/__init__.py  # empty file
    myfacility/handlers.py
- Create a handler class in handlers.py:
    import os

    from aodncore.pipeline import HandlerBase


    class MyFacilityHandler(HandlerBase):
        @staticmethod
        def dest_path(filepath):
            basename = os.path.basename(filepath)
            return "IMOS/parent/path/that/is/always/the/same/{basename}".format(basename=basename)
- Add the handler to the ENTRY_POINTS['pipeline.handlers'] list in setup.py. This is not required for unit testing, but is required to "advertise" the class as an available handler under the pipeline.handlers entry point group once the aodndata package has been deployed:
    ENTRY_POINTS = {
        'pipeline.handlers': [
            ...
            'MyFacilityHandler = aodndata.myfacility.handlers:MyFacilityHandler',
            ...
        ],
        ...
    }
- Create a unit test subpackage under the test_aodndata directory. Prefix the module you create with 'test_'. You may wish to add some example data files for use in testing, for example, a good file which should always pass the tests and a bad file which allows testing a failure scenario, e.g.
    ${CHEF_REPO}/src/python-aodndata/test_aodndata$ find myfacility/
    myfacility/
    myfacility/__init__.py  # empty file
    myfacility/test_handlers.py
    myfacility/myfacility_good.nc  # file expected to succeed
    myfacility/myfacility_bad.nc  # file expected to fail, e.g. non-compliant, invalid format etc.
- Create a handler test case in test_handlers.py:
    import os

    from aodncore.pipeline.exceptions import ComplianceCheckFailedError
    from aodncore.testlib import HandlerTestCase

    from aodndata.myfacility.handlers import MyFacilityHandler

    TEST_ROOT = os.path.join(os.path.dirname(__file__))
    GOOD_NC = os.path.join(TEST_ROOT, 'myfacility_good.nc')
    NOT_A_NETCDF = os.path.join(TEST_ROOT, 'not_a_netcdf_file.nc')


    class TestMyFacilityHandler(HandlerTestCase):
        """It is recommended to inherit from the HandlerTestCase class (which is itself a subclass of the
        standard unittest.TestCase class). This provides some useful methods and properties to shortcut
        some common test scenarios.
        """

        # This is a "boilerplate" method that must appear in each test case in order to correctly
        # inherit from the HandlerTestCase class
        def setUp(self):
            # set the handler_class attribute to your handler (as imported above)
            self.handler_class = MyFacilityHandler
            super(TestMyFacilityHandler, self).setUp()

        def test_good_file(self):
            # we expect this to succeed, so if the handler experiences an error, it is considered a
            # "failed test"
            handler = self.run_handler(GOOD_NC)

        def test_good_file_with_compliance_check(self):
            # we also expect this to succeed, since the test file is known to be CF compliant
            handler = self.run_handler(GOOD_NC, check_params={'checks': ['cf']})

        def test_bad_file(self):
            # we expect this to fail with a 'ComplianceCheckFailedError' exception, since it's not
            # actually a NetCDF file.
            # Since we expect this to be a failure, we use run_handler_with_exception to invert the
            # expected result, so that it treats a success as an undesired outcome, and therefore
            # a "failed test"
            handler = self.run_handler_with_exception(ComplianceCheckFailedError, NOT_A_NETCDF)
- You can now test your handler by simply running the unit tests. There are several ways to run them in the IDE, but you can get fine-grained control over which tests are run by opening the test module, right-clicking on the test class (or even an individual test method) and clicking "Run 'Unittests for test_handlers.TestMyFacilityHandler'". This enables the handler class itself to be largely configured and tested before leaving the IDE and proceeding on to integration testing, where the handler is run in a "deployed" context.
- For further documentation relating to available handler parameters, and how a handler class works, refer to the upstream aodncore documentation. Handler parameters consist of a single positional parameter, which is always 'input_file', a 'config' object (set automatically in both unit tests and when deployed), an optional Celery task parameter, set when run under a Celery task, and a series of keyword arguments to control the handler behaviour. For example, as at time of writing, the user configurable handler parameters are as follows:
    :param allowed_extensions: list of allowed extensions for the input file
    :param archive_input_file: flag to determine whether the original input file is archived
    :param archive_path_function: function reference or entry point used to determine archive_path for a file
    :param check_params: list of parameters to be passed through to the compliance checker library
    :param dest_path_function: function reference or entry point used to determine dest_path for a file
    :param exclude_regexes: list of regexes that files matching include_regexes must *not* match to be 'eligible'
    :param harvest_params: keyword parameters passed to the publish step to control harvest runner parameters
    :param harvest_type: determine which harvest type will be used (supported types in harvest module)
    :param include_regexes: list of regexes that files must match to be 'eligible'
    :param notify_params: keyword parameters passed to the notify step to control notification behaviour
    :param upload_path: original path of file (for information only, e.g. notifications)
    :param resolve_params: keyword parameters passed to the publish step to control harvest runner parameters
    :param kwargs: allow additional keyword arguments to allow potential for child handler to use custom arguments
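The entry point string added to setup.py in the steps above follows the standard `name = package.module:attribute` form. The sketch below only illustrates how such a string maps to a Python object; in a deployed environment the lookup is done by the packaging machinery, not by code like this. The helper name `load_from_spec` is hypothetical, and a standard library function stands in for a handler class so the example runs without aodndata installed.

```python
import importlib
import os


def load_from_spec(spec):
    """Split 'name = package.module:attribute' and import the object it names.

    Hypothetical helper for illustration; real entry points are resolved by
    the packaging tools from installed package metadata.
    """
    name, _, target = (part.strip() for part in spec.partition('='))
    module_name, _, attribute = target.partition(':')
    module = importlib.import_module(module_name)
    return name, getattr(module, attribute)


# 'join = os.path:join' has the same shape as
# 'MyFacilityHandler = aodndata.myfacility.handlers:MyFacilityHandler'
name, obj = load_from_spec('join = os.path:join')
print(name, obj is os.path.join)
```

The left-hand name is the string that is used to refer to the handler elsewhere, which is why it has to be registered before the handler can be found by name once deployed.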
It is perfectly possible to perform this setup and IDE testing without the PO box and chef repo. However, the key reason to use them is to shortcut the integration testing by leveraging Vagrant/VirtualBox shared folders to source the Python libraries directly from the same directory that you are editing in the IDE.
This allows for a much more rapid turnaround in getting your "work in progress" code to actually run in a live development environment (in this case, the PO box).
Note: the following assumes that you are running the PO box from the same ${CHEF_REPO} as referred to above in the IDE setup
- Add a watch configuration to the imos_po_watches databag in Chef. A watch configuration defines an individual pipeline, and consists of the following JSON keys:
  - path (types: JSON=array, Python=list): list of incoming directory paths to be watched, and have incoming files routed to this pipeline
  - handler (types: JSON=string, Python=str): name of the handler class to use (the handler class is resolved by looking for this string in the pipeline.handlers entry point group and retrieving the corresponding handler object)
  - params (types: JSON=object, Python=dict): parameters passed directly through to the handler class __init__ method as keyword arguments
${CHEF_REPO}/private-sample/data_bags/imos_po_watches/MYFACILITY.json
    {
        "id": "MYFACILITY",
        "path": [
            "myfacility"
        ],
        "handler": "MyFacilityHandler",
        "params": {
            "allowed_extensions": [
                ".nc"
            ],
            "check_params": {
                "checks": ["cf"]
            }
        }
    }
- Edit the ${CHEF_REPO}/private-sample/nodes/po.json file and add the watch in the data_services -> pipeline_2_watches array:
    ...
    "data_services": {
        ...
        "pipeline_2_watches": [
            ...
            "MYFACILITY",
            ...
        ],
    ...
- Provision the PO box:
    cd ${CHEF_REPO}
    bin/po-box.sh
- When you make a change to the aodndata code, you only need to restart the individual pipeline for the change to be applied to the PO box environment:
$ sudo supervisorctl restart pipeline_worker_MYFACILITY
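Before provisioning, it can be handy to sanity-check a watch configuration file against the key types described in the first step above. The following is a minimal sketch that does so with the standard library only. The function name `validate_watch` is hypothetical, and treating all three keys as mandatory is an assumption of this example rather than something Chef or aodncore is known to enforce.

```python
import json

# key -> expected Python type after JSON decoding, as described in the watch configuration step
EXPECTED_TYPES = {'path': list, 'handler': str, 'params': dict}


def validate_watch(raw_json):
    """Decode a watch configuration and check the type of each expected key.

    Hypothetical helper; assumes 'path', 'handler' and 'params' are all present.
    """
    watch = json.loads(raw_json)
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(watch.get(key), expected):
            raise TypeError("watch key '{key}' must be of type {type}".format(
                key=key, type=expected.__name__))
    return watch


example = """
{
    "id": "MYFACILITY",
    "path": ["myfacility"],
    "handler": "MyFacilityHandler",
    "params": {
        "allowed_extensions": [".nc"],
        "check_params": {"checks": ["cf"]}
    }
}
"""

watch = validate_watch(example)
# 'params' is what ends up as keyword arguments to the handler
print(watch['handler'], sorted(watch['params']))
```

A check like this catches the common mistake of writing "path" as a single string instead of an array before the configuration ever reaches the PO box.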