
Plan for extensive unit testing of ESGF data #222

Open · agstephens opened this issue Feb 15, 2023 · 9 comments

agstephens commented Feb 15, 2023

See updated (improved) text below...

agstephens converted this from a draft issue Feb 15, 2023
@agstephens

@alaniwi @cehbrecht @huard: here are my thoughts on building a more automated system for constructing and running unit tests against ESGF datasets. Do you have any thoughts about how we can best do it?


huard commented Feb 15, 2023

Not sure I understand the need to template the test builder, but I'm probably missing something.

A few ideas in no particular order...

One heuristic we can use here to reduce the test volume is to assume that all files are structured identically under a given directory structure. I've used this to "walk" through the catalog, and pick only one dataset per "level". This will make sure that every model is tested, without going through every variable, member and time step for a given model. This single dataset can be randomized to increase coverage over time.
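A minimal sketch of that walk heuristic, assuming a POSIX directory tree that mirrors the DRS; the root path and the depth of the model level are hypothetical placeholders.

```python
# Sketch only: walk the data pool, visit every model, but below the model
# level keep just one randomly chosen child per directory (one member, one
# variable, one version). DRS_ROOT and MODEL_DEPTH are hypothetical.
import os
import random

DRS_ROOT = "/data/cmip6/CMIP6"   # hypothetical root of the local data pool
MODEL_DEPTH = 3                  # hypothetical depth of the model level below the root


def sample_datasets(root=DRS_ROOT, model_depth=MODEL_DEPTH, seed=None):
    rng = random.Random(seed)
    for dirpath, dirnames, _filenames in os.walk(root):
        level = dirpath[len(root):].count(os.sep)
        if level >= model_depth and dirnames:
            # Below the model level: prune the walk to a single random child.
            dirnames[:] = [rng.choice(sorted(dirnames))]
        if not dirnames:
            # Leaf directory: this is the sampled dataset for its branch.
            yield dirpath
```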

Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitudes 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset.

Use pytest.mark.parametrize to apply tests to a programmatically defined list of datasets.
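For example, a minimal parametrize sketch along those lines; `build_dataset_list()` and the example dataset ID are hypothetical placeholders:

```python
# Sketch only: parametrize a single test over a programmatically built list
# of dataset IDs. The builder function and the example ID are placeholders.
import pytest


def build_dataset_list():
    # In practice this would come from the catalogue walk / sampling above.
    return [
        "CMIP6.CMIP.MOHC.UKESM1-0-LL.historical.r1i1p1f2.Amon.tas.gn.v20190627",
    ]


@pytest.mark.parametrize("dataset_id", build_dataset_list())
def test_dataset_structure(dataset_id):
    assert dataset_id.split(".")[0] in ("CMIP6", "CMIP5", "CORDEX")
```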


agstephens commented Feb 17, 2023

Hi @huard: thanks for your response. It's a very good point: using parametrize is a better approach than building millions of tests from a template. I think I was confusing "needing to test lots of datasets" with "needing lots of tests" - the reason being that we might find corner cases where we want specific tests. However, the actual code (i.e. not the tests) should be where we fix how those corner cases are handled - the tests themselves then just run. I will adjust the suggested plan below, simplified to:

Plan for extensive unit testing of ESGF data in "roocs" - with subset

Need tests per:

  • project
  • node/site
  • that can simply run as unit tests, if flagged, for each relevant site

Need tests that cover the functionality exposed, and deal with corner cases, e.g.:

  • get dimensions
  • assign a small bounding box inside dims
    • Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitudes 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset (see the sketch after this list).
  • assign subsets in other dimensions (time/level)
  • run subset
  • check valid array with min and max being different
  • check each dimension is in range of the subset specified
  • check output array is not all missing_values
  • any other assertions that are useful
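A minimal sketch of how the per-domain bounding boxes mentioned above could be defined; the box values and the CORDEX domain name are illustrative assumptions, not agreed values.

```python
# Sketch only: per-domain test bounding boxes as (lon_min, lat_min, lon_max, lat_max),
# chosen to hit the corner cases (poles, the 0 and 180 degree meridians).
TEST_BBOXES = {
    "global": [
        (-10.0, 40.0, 10.0, 60.0),      # straddles longitude 0
        (170.0, -10.0, -170.0, 10.0),   # straddles longitude 180 (dateline wrap)
        (-180.0, 80.0, 180.0, 90.0),    # North Pole cap
        (-180.0, -90.0, 180.0, -80.0),  # South Pole cap
    ],
    "cordex-europe": [
        (5.0, 45.0, 15.0, 55.0),        # well inside a European CORDEX domain
    ],
}


def bboxes_for(dataset_id):
    """Infer the domain from the DRS dataset ID and return its test boxes."""
    project = dataset_id.split(".")[0].lower()
    return TEST_BBOXES["cordex-europe" if project == "cordex" else "global"]
```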

Use pytest.mark.parametrize to handle lists/dictionaries of inputs that cover the myriad datasets we want to test:

  • the data structure that we use for the test inputs can grow and grow
  • it will run differently at different sites based on what data they have
  • make sure that every model is tested without going through every variable, member and time step for a given model.


agstephens commented Apr 24, 2023

Some thoughts about how we tackle this problem:

  • Build a single test function called test_subset_in_data_pools in the module tests/test_data_pools.py (see the sketch after this list)
    • Define a simple variable, data_pool_tests_db which is a list with one dataset ID in it.
    • Pick a CMIP6 dataset at CEDA and make it the only record in data_pool_tests_db (a list - to start with)
    • Use @pytest.mark.parametrize("record", data_pool_tests_db)
    • And: def test_subset_in_data_pools(record):
    • get dimensions
    • assign a small bounding box inside dims
    • Define a few test bounding boxes for each domain (CORDEX, global), making sure to cover corner cases (the poles, longitudes 0 and 180). The DRS should be sufficient to infer what the domain is for each dataset.
    • assign subsets in other dimensions (time/level)
    • run subset - based on these example tests (invoking the Python library interface to daops...subset()):
    • check valid array with min and max being different
    • check each dimension is in range of the subset specified
    • check output array is not all missing_values
    • any other assertions that are useful
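A minimal sketch of the test module described above, under some explicit assumptions: that daops exposes a subset() operation accepting a dataset ID plus time/area keywords and returning an object with output file URIs (the exact keyword names and return attributes are assumptions here), and that the single CEDA dataset ID is a placeholder.

```python
# tests/test_data_pools.py -- sketch only, not the agreed implementation.
# Assumes daops.ops.subset.subset(collection, time=..., area=..., output_dir=...)
# and a result object exposing `file_uris`; both are assumptions, not confirmed API.
import pytest
import xarray as xr
from daops.ops.subset import subset

# To start with: a single CMIP6 dataset ID held at CEDA (placeholder value).
data_pool_tests_db = [
    "CMIP6.CMIP.MOHC.UKESM1-0-LL.historical.r1i1p1f2.Amon.tas.gn.v20190627",
]


@pytest.mark.parametrize("record", data_pool_tests_db)
def test_subset_in_data_pools(record, tmp_path):
    bbox = (0.0, -10.0, 120.0, 40.0)       # lon_min, lat_min, lon_max, lat_max
    time_range = "1950-01-01/1960-12-30"   # illustrative time subset

    result = subset(record, time=time_range, area=bbox, output_dir=tmp_path)

    ds = xr.open_mfdataset(result.file_uris)   # assumes local file paths are returned
    var = ds[record.split(".")[7]]             # variable name from the DRS (e.g. "tas")

    # Valid array: min and max differ (not constant, not all missing).
    assert float(var.min()) < float(var.max())

    # Latitude lies within the requested bounding box (longitude checks would
    # need wrap-around handling for boxes crossing 0 or 180).
    assert float(ds.lat.min()) >= bbox[1] and float(ds.lat.max()) <= bbox[3]

    # The output array is not entirely missing values.
    assert int(var.count()) > 0
```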

Later, we'll work out the following:

  • Only deal with the database element when the test is well-formed
  • Some thoughts about the db:
    • we can split it into multiple DBs / files (per site) - because there is no interaction
    • we could use sqlite for each DB, then combine them before each git commit into a CSV file for visibility.
    • should it be an actual database, or a pickle, or CSV file?
    • Fields it might contain (see the sketch after this list):
      • dataset_id(s)
      • site - e.g. ceda, dkrz etc.
      • domain - e.g. actual domain of dataset in time and space
      • last_ran_subset_parameters - i.e. the time and space constraints applied in the last test
      • result - of last test (success or fail)
      • code_version(s) - of key libraries
      • last_updated - when was it last run?
    • Might want to be able to say:
      • only run if not run before
      • only run if last_updated older than 2 years
      • re-run everything
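A minimal sketch of what one record in the per-site results file could look like, assuming the CSV option; the field names follow the list above and the helper function is hypothetical.

```python
# Sketch only: a per-site results record and a CSV appender. Field names
# follow the list above; the file layout is an assumption, not a decision.
import csv
import os
from dataclasses import asdict, dataclass, fields


@dataclass
class TestRecord:
    dataset_id: str
    site: str                        # e.g. "ceda", "dkrz"
    domain: str                      # actual time/space domain of the dataset
    last_ran_subset_parameters: str  # constraints applied in the last test
    result: str                      # "success" or "fail"
    code_versions: str               # versions of key libraries
    last_updated: str                # ISO timestamp of the last run


def append_record(path, record):
    """Append one record to a per-site CSV file, writing the header if the file is new."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(TestRecord)])
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(record))
```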

@agstephens

Discussing with @cehbrecht how we might decide which dataset IDs to send into this test...

We might assume...:

  • we want to test a large coverage of the overall project simulations (which can be considered as a sparse hypercube of facet key/values)
  • e.g. we want to test some data from each model, for each frequency, for each variable
  • we should always test the latest version
  • there are some facets that we might be able to ignore when sampling to get a representative coverage, e.g.:
    • If 3 institutions ran the EC-EARTH model: we can (hopefully) assume that we only need that model from one institution - so we do not need to sample across institutions.
    • Most variables within a given ensemble should have a common structure - so we don't need to test them all

The process could be:

  1. Get a list of all datasets - maybe cached as a .csv.gz file (or other compression) alongside the tests
  2. Work out a subset of that list based on sampling across each combination of facets (see the sketch after this list)
  3. Store the list of samples ready for testing
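A minimal sketch of steps 1 and 2, assuming the cached list holds one CMIP6 DRS dataset ID per line; the file name and facet positions are assumptions.

```python
# Sketch only: load the cached dataset-ID list and sample one dataset per
# (model, table, variable) combination, ignoring institution and member,
# keeping only the latest version. File name and layout are assumed.
import gzip


def load_dataset_ids(path="all_datasets.csv.gz"):
    with gzip.open(path, "rt") as f:
        return [line.strip() for line in f if line.strip()]


def sample_by_facets(dataset_ids):
    best = {}
    for ds_id in dataset_ids:
        # CMIP6 DRS: mip_era.activity.institution.model.experiment.member.table.variable.grid.version
        facets = ds_id.split(".")
        model, table, variable, version = facets[3], facets[6], facets[7], facets[9]
        key = (model, table, variable)
        # Keep only the latest version seen for each sampled combination.
        if key not in best or version > best[key][1]:
            best[key] = (ds_id, version)
    return sorted(ds_id for ds_id, _version in best.values())
```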

@agstephens

@alaniwi here is the image that I shared today:

[image attachment]


agstephens commented Sep 18, 2023

Provide multi-site support as follows:

  • Write the site name (e.g. "ceda" or "dkrz") into each site's version of the "csv.gz" results file.
  • Include a consolidate command/script to merge all results "csv.gz" files from multiple sites into a single file (see the sketch below).
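For illustration, a minimal sketch of the consolidation step, assuming each site produces a results "csv.gz" with a common header that includes a site column; the actual merge-test-logs implementation in daops may differ.

```python
# Sketch only: merge per-site results files into one. Assumes a shared CSV
# header with a "site" column; not the actual merge-test-logs implementation.
import glob

import pandas as pd


def consolidate(pattern="results_*.csv.gz", out_path="results_all_sites.csv.gz"):
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)   # pandas infers gzip from the ".gz" suffix
    return merged
```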


alaniwi commented Sep 22, 2023

Code for the multi-site support is implemented, and a command-line script merge-test-logs has been added -- in addition to the data-pools-checks command-line script that generates the logs in the first place.

https://github.com/roocs/daops/blob/52e32b1697f607eb93828dc509cde6ddd7ab6bad/setup.py#L83-L84

Currently this is in the test_data_pools_new branch. @cehbrecht I'll generate a new PR. With the exception of the above two added lines in setup.py (and a couple of gitignore lines), this only adds new files under a new subdirectory, so should hopefully be an easy merge.


alaniwi commented Sep 22, 2023

@cehbrecht PR is at roocs/daops#108 but one of the tests is failing. I'll take a look next week.
