Feature #2690 tmp_files Contributor's Guide (#2693)
JohnHalleyGotway authored Sep 27, 2023
1 parent f72d85d commit a45f5c8
Showing 17 changed files with 199 additions and 18 deletions.
1 change: 0 additions & 1 deletion data/config/PlotPointObsConfig_default
Original file line number Diff line number Diff line change
@@ -79,7 +79,6 @@ point_data = [

////////////////////////////////////////////////////////////////////////////////

tmp_dir = "/tmp";
version = "V12.0.0";

////////////////////////////////////////////////////////////////////////////////
11 changes: 11 additions & 0 deletions docs/Contributors_Guide/dev_details/index.rst
@@ -0,0 +1,11 @@
*******************
Development Details
*******************

This chapter provides specific details about select topics within the
MET code base. The list of topics is certainly not comprehensive.

.. toctree::
:titlesonly:

tmp_file_use
169 changes: 169 additions & 0 deletions docs/Contributors_Guide/dev_details/tmp_file_use.rst
@@ -0,0 +1,169 @@
.. _tmp_file_use:

Use of Temporary Files
======================

The MET application and library code uses temporary files in several
places. Each specific use of temporary files is described below. The
directory in which temporary files are stored is configurable, as
described in :numref:`User's Guide Section %s <config_tmp_dir>`.

Whenever a MET application is run, the operating system assigns it a
process identification number (PID). All temporary files created by
MET include the PID in the file name so that multiple instances can
run concurrently without conflict. In addition, when creating a
temporary file name, the :code:`make_temp_file_name(...)` utility
function appends :code:`_0` to the PID, checks to see if the
corresponding file name is already in use, and if so, tries
:code:`_1`, :code:`_2` and so on, until an available file name is
found.
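
The name-search logic described above can be sketched as follows. This is an illustrative Python simplification of MET's C++ :code:`make_temp_file_name(...)` utility (the function signature and prefix handling here are assumptions, not MET's actual API):

```python
import os

def make_temp_file_name(tmp_dir, prefix, pid):
    """Append _0, _1, _2, ... to the PID until a file name is found
    that is not already in use (sketch of MET's name-search logic)."""
    counter = 0
    while True:
        name = os.path.join(tmp_dir, f"{prefix}_{pid}_{counter}")
        if not os.path.exists(name):
            return name
        counter += 1
```

Because the PID is part of the name, concurrent runs of the same tool never collide; the trailing counter handles the rarer case of a leftover file from a previous run with the same PID.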

Note that creating, reading, and deleting temporary files from the
local filesystem is much more efficient than performing these
operations across a network filesystem. Using the default
:code:`/tmp` directory is recommended, unless prohibited by policies
on your system.

In general, MET applications delete any temporary files they create
when they are no longer needed. However, if the application exits
abnormally, the temporary files may remain.

.. _tmp_files_pb2nc:

PB2NC Tool
^^^^^^^^^^

The PB2NC tool reads input binary files in the BUFR or PrepBUFR
format, extracts and/or derives observations from them, filters
those observations, and writes the result to a NetCDF output file.

PB2NC creates the following temporary files when running:

* :code:`tmp_pb2nc_blk_{PID}`, :code:`tmp_pb2nc_meta_blk_{PID}`,
:code:`tmp_pb2nc_tbl_blk_{PID}`

PB2NC assumes that each input binary file requires Fortran
blocking prior to being read by the BUFRLIB library. It applies
Fortran blocking, writes the result to these temporary files, and
uses BUFRLIB to read their contents.

* :code:`tmp_pb2nc_bufr_{PID}_tbl`: PB2NC extracts BUFR table data
that is embedded in input files and writes it to this temporary
file for later use.

.. note::
The first three files listed above are identical. They are all
Fortran-blocked versions of the same input file. Consider modifying
the logic to apply Fortran blocking only once.
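
Fortran sequential unformatted I/O brackets each record with length markers, which is what BUFRLIB expects from blocked input. A minimal sketch of that framing (the 4-byte little-endian marker is an assumption; MET's actual blocking is done in compiled code, not Python):

```python
import struct

def fortran_block(records):
    """Wrap each byte record with 4-byte length markers, mimicking
    Fortran sequential unformatted record framing."""
    out = bytearray()
    for rec in records:
        marker = struct.pack("<i", len(rec))  # record length, before and after
        out += marker + rec + marker
    return bytes(out)
```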

.. _tmp_files_point2grid:

Point2Grid Tool
^^^^^^^^^^^^^^^

The Point2Grid tool reads point observations from a variety of
inputs and summarizes them on a grid. When processing GOES input
files, a temporary NetCDF file is created to store the mapping of
input pixel locations to output grid cells, unless the
MET_GEOSTATIONARY_DATA environment variable specifies an existing grid
navigation file to use.

If that temporary geostationary grid mapping file already exists, it
is used directly and not recreated. If not, it is created as needed.

Note that this temporary file is *not* deleted by the Point2Grid
tool. Once created, it is intended to be reused in future runs.
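
The reuse-or-create behavior described above is a standard caching pattern, sketched below (function names and the text payload are illustrative, not Point2Grid's actual API; the real file is NetCDF):

```python
import os

def get_grid_mapping(cache_path, compute_mapping):
    """Use the cached grid mapping file if it exists; otherwise compute
    the mapping once, store it, and reuse it in future runs. The file
    is intentionally never deleted."""
    if not os.path.exists(cache_path):
        with open(cache_path, "w") as f:
            f.write(compute_mapping())  # expensive pixel-to-grid computation
    with open(cache_path) as f:
        return f.read()
```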

.. _tmp_files_bootstrap:

Bootstrap Confidence Intervals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Several MET tools support the computation of bootstrap confidence
intervals, as described in :numref:`User's Guide Section %s <config_boot>`
and :numref:`User's Guide Appendix D, Section %s <App_D-Confidence-Intervals>`.
When bootstrap confidence intervals are requested, up to two
temporary files are created for each CNT, CTS, MCTS, NBRCNT, or
NBRCTS line type written to the output.

* :code:`tmp_{LINE_TYPE}_i_{PID}`: When the BCA bootstrapping method
is requested, jackknife resampling is applied to the input matched
pairs. Statistics are computed for each jackknife resample and
written to this temporary file.

* :code:`tmp_{LINE_TYPE}_r_{PID}`: For each bootstrap replicate
computed from the input matched pairs, statistics are computed
and written to this temporary file.

Here, {LINE_TYPE} is :code:`cnt`, :code:`cts`, :code:`mcts`,
:code:`nbrcnt`, or :code:`nbrcts`.
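
The replicate computation behind the :code:`tmp_{LINE_TYPE}_r_{PID}` file can be sketched as follows. This toy version holds the replicate statistics in a list rather than a temporary file, and the mean-error statistic is just one example (MET computes the full set of line-type statistics per replicate):

```python
import random
import statistics

def bootstrap_replicates(pairs, stat_fn, n_rep, seed=0):
    """Draw bootstrap resamples of the matched pairs (with replacement)
    and compute the statistic for each replicate."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_rep):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        reps.append(stat_fn(sample))
    return reps

def mean_error(sample):
    """Example statistic: mean of forecast minus observation."""
    return statistics.fmean(f - o for f, o in sample)
```

Confidence intervals are then taken from quantiles of the replicate distribution (percentile method) or from the bias-corrected BCa adjustment, which additionally uses the jackknife values written to :code:`tmp_{LINE_TYPE}_i_{PID}`.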

.. note::
Consider whether it is feasible to hold the resampled statistics
in memory rather than writing them to temporary files. Doing so
would reduce the I/O.

.. _tmp_files_stat_analysis:

Stat-Analysis Tool
^^^^^^^^^^^^^^^^^^

The Stat-Analysis tool reads ASCII output created by the MET
statistics tools. A single job can be specified on the command line
or one or more jobs can be specified in an optional configuration
file. When a configuration file is provided, any filtering options
specified are applied to all entries in the :code:`jobs` array.

Rather than reading all of the input data for each job, Stat-Analysis
reads all the input data once, applies any common filtering options,
and writes the result to a temporary file.

* :code:`tmp_stat_analysis_{PID}`: Stat-Analysis reads all of the
input data, applies common filtering logic, and writes the result
to this temporary file. All of the specified jobs read data from
this temporary file, apply any additional job-specific filtering
criteria, and perform the requested operation.

.. note::
Consider revising the logic to use a temporary file only when
actually necessary, i.e. when multiple jobs are specified along with
non-empty common filtering criteria.

.. _tmp_files_python_embedding:

Python Embedding
^^^^^^^^^^^^^^^^

As described in
:numref:`User's Guide Appendix F, Section %s <appendixF>`, when the
:code:`MET_PYTHON_EXE` environment variable is set, the MET tools run
any Python embedding commands using the specified Python executable.

* :code:`tmp_mpr_{PID}`: When Python embedding of matched pair data
is performed, a Python wrapper is run to execute the user-specified
Python script and write the result to this temporary ASCII file.

* :code:`tmp_met_nc_{PID}`: When Python embedding of gridded data or
point observations is performed, a Python wrapper is run to
execute the user-specified Python script and write the result to
this temporary NetCDF file.

The compile-time Python instance is run to read data from these
temporary files.
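
The handoff can be sketched as follows: the user's interpreter (from MET_PYTHON_EXE) runs the embedding code and serializes its result to a PID-named temporary file, which the calling process then reads back. The function name and the plain-text payload are illustrative; MET actually exchanges ASCII or NetCDF files via its Python wrappers:

```python
import os
import subprocess
import sys
import tempfile

def run_user_command(user_code, pid):
    """Run user code under MET_PYTHON_EXE (or this interpreter if the
    variable is unset), passing the temp file path via {out}."""
    python_exe = os.environ.get("MET_PYTHON_EXE", sys.executable)
    tmp_path = os.path.join(tempfile.gettempdir(), f"tmp_mpr_{pid}")
    # The user's interpreter writes its result to the temporary file ...
    subprocess.run([python_exe, "-c", user_code.format(out=tmp_path)],
                   check=True)
    # ... and the calling process reads it back and cleans up.
    with open(tmp_path) as f:
        data = f.read()
    os.remove(tmp_path)
    return data
```

This indirection is what lets the user's Python environment differ from the one MET was compiled against: the two interpreters never share memory, only the temporary file.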

.. _tmp_files_tc_diag:

TC-Diag Tool
^^^^^^^^^^^^

The TC-Diag tool requires the use of Python embedding. It processes
one or more ATCF tracks and computes model diagnostics. For each
track point, it converts gridded model data to cylindrical
coordinates centered at that point, writes it to a temporary NetCDF
file, and passes it to Python scripts to compute model diagnostics.

* :code:`tmp_met_nc_{PID}`: Cylindrical coordinate model data is
written to this temporary NetCDF file for each track point
and passed to Python scripts to compute diagnostics. If requested,
these temporary NetCDF files for each track point are combined into
a single NetCDF cylindrical coordinates output file for each track.
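
The cylindrical-coordinate conversion can be illustrated with a toy nearest-neighbor resampler: values are sampled on rings of increasing range at evenly spaced azimuths around the track point. This is a stand-in sketch only; TC-Diag operates on lat/lon model grids with proper interpolation and writes the result to NetCDF:

```python
import math

def to_cylindrical(grid, center_xy, n_r, n_theta):
    """Sample a 2-D grid at (range, azimuth) positions around a center
    point using nearest-neighbor lookup; cells outside the domain
    become None."""
    cx, cy = center_xy
    rings = []
    for ir in range(n_r):
        ring = []
        for it in range(n_theta):
            theta = 2 * math.pi * it / n_theta
            x = int(round(cx + (ir + 1) * math.cos(theta)))
            y = int(round(cy + (ir + 1) * math.sin(theta)))
            if 0 <= y < len(grid) and 0 <= x < len(grid[0]):
                ring.append(grid[y][x])
            else:
                ring.append(None)  # outside the model domain
        rings.append(ring)
    return rings
```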
2 changes: 2 additions & 0 deletions docs/Contributors_Guide/index.rst
@@ -7,9 +7,11 @@ Welcome to the Model Evaluation Tools (MET) Contributor's Guide.
.. toctree::
:titlesonly:
:numbered:
:maxdepth: 1

coding_standards
dev_env
dev_details/index
github_workflow
testing
continuous_integration
2 changes: 1 addition & 1 deletion docs/Users_Guide/appendixF.rst
@@ -56,7 +56,7 @@ If a user attempts to invoke Python embedding with a version of MET that was not
Controlling Which Python MET Uses When Running
==============================================

When MET is compiled with Python embedding support, MET uses the Python executable in that Python installation by default when Python embedding is used. However, for users of highly configurable Python environments, the Python instance set at compilation time may not be sufficient. Users may want to use an alternate Python installation if they need additional packages not available in the Python installation used when compiling MET. In MET versions 9.0+, users have the ability to use a different Python executable when running MET than the version used when compiling MET by setting the environment variable **MET_PYTHON_EXE**.
When MET is compiled with Python embedding support, MET uses the Python executable in that Python installation by default when Python embedding is used. However, for users of highly configurable Python environments, the Python instance set at compilation time may not be sufficient. Users may want to use an alternate Python installation if they need additional packages not available in the Python installation used when compiling MET. In MET versions 9.0+, users have the ability to use a different Python executable when running MET than the version used when compiling MET by setting the environment variable **MET_PYTHON_EXE**. Whenever **MET_PYTHON_EXE** is set, MET writes a temporary file, as described in :numref:`Contributor's Guide Section %s <tmp_files_python_embedding>`.

If a user's Python script requires packages that are not available in the Python installation used when compiling the MET software, they will encounter a runtime error when using MET. In this instance, the user will need to change the Python MET is using to a different installation with the required packages for their script. It is the responsibility of the user to manage this Python installation, and one popular approach is to use a custom Anaconda (Conda) Python environment. Once the Python installation meeting the user's requirements is available, the user can force MET to use it by setting the **MET_PYTHON_EXE** environment variable to the full path of the Python executable in that installation. For example:

7 changes: 7 additions & 0 deletions docs/Users_Guide/config_options.rst
@@ -533,6 +533,8 @@ override the default value set in ConfigConstants.
output_precision = 5;
.. _config_tmp_dir:

tmp_dir
^^^^^^^

@@ -546,6 +548,9 @@ Some tools override the temporary directory by the command line argument
tmp_dir = "/tmp";
A description of the use of temporary files in MET can be found in
:numref:`Contributor's Guide Section %s <tmp_file_use>`.

message_type_group_map
^^^^^^^^^^^^^^^^^^^^^^

@@ -1684,6 +1689,8 @@ interval.
ci_alpha = [ 0.05, 0.10 ];
.. _config_boot:

boot
^^^^

1 change: 0 additions & 1 deletion docs/Users_Guide/plotting.rst
@@ -86,7 +86,6 @@ ________________________

.. code-block:: none
tmp_dir = "/tmp";
version = "VN.N";
The configuration options listed above are common to multiple MET tools and are described in :numref:`config_options`.
2 changes: 1 addition & 1 deletion docs/Users_Guide/point-stat.rst
@@ -197,7 +197,7 @@ For continuous fields (e.g., temperature), it is possible to estimate confidence

For the measures relating the two fields (i.e., mean error, correlation and standard deviation of the errors), confidence intervals are based on either the joint distributions of the two fields (e.g., with correlation) or on a function of the two fields. For the correlation, the underlying assumption is that the two fields follow a bivariate normal distribution. In the case of the mean error and the standard deviation of the mean error, the assumption is that the errors are normally distributed, which for continuous variables, is usually a reasonable assumption, even for the standard deviation of the errors.

Bootstrap confidence intervals for any verification statistic are available in MET. Bootstrapping is a nonparametric statistical method for estimating parameters and uncertainty information. The idea is to obtain a sample of the verification statistic(s) of interest (e.g., bias, ETS, etc.) so that inferences can be made from this sample. The assumption is that the original sample of matched forecast-observation pairs is representative of the population. Several replicated samples are taken with replacement from this set of forecast-observation pairs of variables (e.g., precipitation, temperature, etc.), and the statistic(s) are calculated for each replicate. That is, given a set of n forecast-observation pairs, we draw values at random from these pairs, allowing the same pair to be drawn more than once, and the statistic(s) is (are) calculated for each replicated sample. This yields a sample of the statistic(s) based solely on the data without making any assumptions about the underlying distribution of the sample. It should be noted, however, that if the observed sample of matched pairs is dependent, then this dependence should be taken into account somehow. Currently, the confidence interval methods in MET do not take into account dependence, but future releases will support a robust method allowing for dependence in the original sample. More detailed information about the bootstrap algorithm is found in the :numref:`Appendix D, Section %s. <appendixD>`
Bootstrap confidence intervals for any verification statistic are available in MET. Bootstrapping is a nonparametric statistical method for estimating parameters and uncertainty information. The idea is to obtain a sample of the verification statistic(s) of interest (e.g., bias, ETS, etc.) so that inferences can be made from this sample. The assumption is that the original sample of matched forecast-observation pairs is representative of the population. Several replicated samples are taken with replacement from this set of forecast-observation pairs of variables (e.g., precipitation, temperature, etc.), and the statistic(s) are calculated for each replicate. That is, given a set of n forecast-observation pairs, we draw values at random from these pairs, allowing the same pair to be drawn more than once, and the statistic(s) is (are) calculated for each replicated sample. This yields a sample of the statistic(s) based solely on the data without making any assumptions about the underlying distribution of the sample. It should be noted, however, that if the observed sample of matched pairs is dependent, then this dependence should be taken into account somehow. Currently, the confidence interval methods in MET do not take into account dependence, but future releases will support a robust method allowing for dependence in the original sample. More detailed information about the bootstrap algorithm is found in the :numref:`Appendix D, Section %s <appendixD>`. Note that MET writes temporary files whenever bootstrap confidence intervals are computed, as described in :numref:`Contributor's Guide Section %s <tmp_files_bootstrap>`.

Confidence intervals can be calculated from the sample of verification statistics obtained through the bootstrap algorithm. The most intuitive method is to simply take the appropriate quantiles of the sample of statistic(s). For example, if one wants a 95% CI, then one would take the 2.5 and 97.5 percentiles of the resulting sample. This method is called the percentile method, and has some nice properties. However, if the original sample is biased and/or has non-constant variance, then it is well known that this interval is too optimistic. The most robust, accurate, and well-behaved way to obtain accurate CIs from bootstrapping is to use the bias corrected and adjusted percentile method (or BCa). If there is no bias, and the variance is constant, then this method will yield the usual percentile interval. The only drawback to the approach is that it is computationally intensive. Therefore, both the percentile and BCa methods are available in MET, with the considerably more efficient percentile method being the default.

3 changes: 2 additions & 1 deletion docs/Users_Guide/reformat_point.rst
@@ -108,6 +108,7 @@ ____________________
version = "VN.N";
The configuration options listed above are common to many MET tools and are described in :numref:`config_options`.
The use of temporary files in PB2NC is described in :numref:`Contributor's Guide Section %s <tmp_files_pb2nc>`.

_____________________

@@ -1082,7 +1083,7 @@ Optional arguments for point2grid

Only four interpolation methods are applied to the field variables: MIN/MAX/MEDIAN/UW_MEAN. The GAUSSIAN method is applied to the probability variable only. Unlike regrid_data_plane, the MAXGAUSS method applies the MAX method to the field variable and the GAUSSIAN method to the probability variable. If the probability variable is not requested, the MAXGAUSS method is the same as the MAX method.

For the GOES-16 and GOES-17 data, the computing lat/long is time consuming. So the computed coordinate (lat/long) is saved into the NetCDF file to the environment variable MET_TMP_DIR or */tmp* if MET_TMP_DIR is not defined. The computing lat/long step can be skipped if the coordinate file is given through the environment variable MET_GEOSTATIONARY_DATA. The grid mapping to the target grid is saved to MET_TMP_DIR to save the execution time. Once this file is created, the MET_GEOSTATIONARY_DATA is ignored. The grid mapping file should be deleted manually in order to apply a new MET_GEOSTATIONARY_DATA environment variable or to re-generate the grid mapping file. An example of call point2grid to process GOES-16 AOD data is shown below:
For the GOES-16 and GOES-17 data, computing the lat/long is time consuming. The computed coordinates (lat/long) are saved to a temporary NetCDF file, as described in :numref:`Contributor's Guide Section %s <tmp_files_point2grid>`. The lat/long computation step can be skipped if the coordinate file is provided through the environment variable MET_GEOSTATIONARY_DATA. The grid mapping to the target grid is saved to MET_TMP_DIR to reduce the execution time. Once this file is created, MET_GEOSTATIONARY_DATA is ignored. The grid mapping file should be deleted manually in order to apply a new MET_GEOSTATIONARY_DATA environment variable or to regenerate the grid mapping file. An example of calling point2grid to process GOES-16 AOD data is shown below:

.. code-block:: none
2 changes: 1 addition & 1 deletion docs/Users_Guide/stat-analysis.rst
@@ -324,7 +324,7 @@ The configuration file for the Stat-Analysis tool is optional. Users may find it

Most of the user-specified parameters listed in the Stat-Analysis configuration file are used to filter the ASCII statistical output from the MET statistics tools down to a desired subset of lines over which statistics are to be computed. Only output that meets all of the parameters specified in the Stat-Analysis configuration file will be retained.

The Stat-Analysis tool actually performs a two step process when reading input data. First, it stores the filtering information defined top section of the configuration file. It applies that filtering criteria when reading the input STAT data and writes the filtered data out to a temporary file. Second, each job defined in the **jobs** entry reads data from that temporary file and performs the task defined for the job. After all jobs have run, the Stat-Analysis tool deletes the temporary file.
The Stat-Analysis tool actually performs a two-step process when reading input data. First, it stores the filtering information defined in the top section of the configuration file. It applies those filtering criteria when reading the input STAT data and writes the filtered data out to a temporary file, as described in :numref:`Contributor's Guide Section %s <tmp_files_stat_analysis>`. Second, each job defined in the **jobs** entry reads data from that temporary file and performs the task defined for the job. After all jobs have run, the Stat-Analysis tool deletes the temporary file.

This two step process enables the Stat-Analysis tool to run more efficiently when many jobs are defined in the configuration file. If only operating on a small subset of the input data, the common filtering criteria can be applied once rather than re-applying it for each job. In general, filtering criteria common to all tasks defined in the **jobs** entry should be moved to the top section of the configuration file.
