
Scenario File Checks

Lucie Contamin edited this page Jul 9, 2024 · 14 revisions

File Check List

The complete list of checks is available on the Scenario Modeling Hub website - Validation Documentation

File Checks Running Locally

Starting with Round 13, each submission is validated using the validate_submission() function from the SMHvalidation R package.

The package is currently available on GitHub. To install it, follow these steps:

install.packages("remotes")
remotes::install_github("midas-network/SMHvalidation", 
                        build_vignettes = TRUE) 

This package can also be manually installed by directly cloning/forking/downloading the package from GitHub.

To load the package, execute the following command:

library(SMHvalidation)

The package contains a validate_submission() function allowing the user to check their SMH submissions locally.

Prerequisite

To test a submission file, the function requires multiple parameters:

  • path: path to the submission file (or folder, for partitioned data) to test. A vector of parquet files can also be supplied; in this case, the validation is run on the aggregation of all the parquet files together, and each file individually should match the expected SMH standard. If partition is not set to NULL, path should point to a folder containing only the partitioned data.

  • js_def: path to JSON file containing round definitions: names of columns, target names, ... following the tasks.json Hubverse format

    • the information in the JSON file can be separated into multiple groups for each round:
      • The "task_ids" object defines both labels and contents for each column in submission files. Any unique combination of the values defines a single modeling task. For example, for SMH it can be the columns: "scenario_id", "location", "origin_date", "horizon", "target", "age_group", "race_ethnicity".
      • The "output_type" object defines accepted representations for each task. For example, for SMH it concerns the columns: "output_type", "output_type_id", "run_grouping", "stochastic_run" and "value".

Some additional optional parameters are available:

  • lst_gs: named list of the data frame containing the observed data. We highly recommend using the output of the SMHvalidation::pull_gs_data() function as input. This function generates the output in the expected format with the expected data. For more information, please see ?pull_gs_data(). This parameter can be set to NULL (default) to skip the comparison between the values and the observed data.
  • pop_path: path to a table containing the population size of each geographical entity by FIPS (in a column "location") and by location name. For example, path to the locations file in the COVID19 Scenario Modeling Hub GitHub repository. This parameter can be set to NULL (default) to not run a comparison between the value and the population size data.
  • merge_sample_col: Boolean indicating whether, in the submission file(s), the output type "sample" has the "output_type_id" column set to NA and the information is instead contained in 2 columns: "run_grouping" and "stochastic_run". By default, FALSE.
  • partition: character vector indicating if the submission file is partitioned and if so, which field (or column) names correspond to the path segments. By default, NULL (no partition). See the arrow R package for more information, especially the arrow::write_dataset() and arrow::open_dataset() functions. Warning: If the submission files are in a "partitioned" format, the path parameter should point to a folder containing ONLY the "partitioned" files. If any other file is present in the folder, it will be included in the validation.
  • n_decimal: integer, number of decimal points accepted in the column "value" (only for "sample" output type), if NULL (default) no limit expected.
  • round_id: character string, round identifier. This identifier is used to extract the associated round information from the js_def parameter. If NULL (default), extracted from path.
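
To make the merge_sample_col and n_decimal descriptions above more concrete, here is a minimal base R sketch. The rows are made-up toy values, and has_too_many_decimals() is a hypothetical helper written for this illustration; it is not part of SMHvalidation and may not match how the package performs its own check.

```r
# Toy rows (made-up values) in the two-column "sample" format:
# "output_type_id" is NA and the pairing information is stored in
# "run_grouping" and "stochastic_run".
toy <- data.frame(
  output_type = "sample",
  output_type_id = NA,
  run_grouping = c(1L, 1L, 2L),
  stochastic_run = c(1L, 2L, 1L),
  value = c(10.5, 12.0, 9.25)
)

# Hypothetical helper: TRUE where a value has more than `n_decimal` decimals.
# A small tolerance absorbs floating point noise.
has_too_many_decimals <- function(x, n_decimal, tol = 1e-8) {
  abs(x - round(x, n_decimal)) > tol
}

# Rows that would exceed a limit of 1 decimal
toy[has_too_many_decimals(toy$value, n_decimal = 1), ]
```

Running a quick pre-check like this before calling the validation can help spot rounding problems early, but the package output remains the reference.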

Run the validation

To test the model output projections from Round 18, please use at least version 0.1.0 of the validation package.
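
One possible way to confirm the installed version before running the validation is sketched below. check_min_version() is a hypothetical helper defined here for illustration, built only on base R functions (requireNamespace() and utils::packageVersion()); it is not provided by SMHvalidation.

```r
# Hypothetical helper: returns "missing", "too_old", or "ok" for a package
# and a minimum version string.
check_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return("missing")
  }
  if (utils::packageVersion(pkg) < min_version) {
    return("too_old")
  }
  "ok"
}

status <- check_min_version("SMHvalidation", "0.1.0")
message("SMHvalidation status: ", status)
```

If the status is "missing" or "too_old", reinstalling from GitHub with the installation steps given earlier should bring the package up to date.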

Prerequisite

It is important to set the working directory to the folder containing all the data required to run the validation. Here, for example, we will use a path to covid19-scenario-modeling-hub/ and the projection from the "MyTeam-MyModel" group.

As stated previously, if the data are partitioned, the path for validation should be set to a folder containing only the partitioned data.

So first, copy the "data-processed/MyTeam-MyModel/2024-04-28/" folder to "validation/MyTeam-MyModel/2024-04-28/".
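
The copy step can be done by hand or scripted. The sketch below uses only base R; stage_submission() and non_parquet_files() are hypothetical helpers written for this example (not SMHvalidation functions), and the second one assumes that the partitioned submission consists only of files ending in ".parquet".

```r
# Hypothetical helper: copy a round folder into a separate parent folder
# and return the path of the copy.
stage_submission <- function(src_dir, dest_parent) {
  stopifnot(dir.exists(src_dir))
  dir.create(dest_parent, recursive = TRUE, showWarnings = FALSE)
  file.copy(src_dir, dest_parent, recursive = TRUE)
  staged <- file.path(dest_parent, basename(src_dir))
  if (!dir.exists(staged)) {
    stop("Could not copy ", src_dir, " to ", dest_parent)
  }
  staged
}

# Hypothetical helper: list files under `dir` that are not parquet files,
# since any extra file would be included in the validation.
non_parquet_files <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  files[!grepl("\\.parquet$", files)]
}

# Run from the root of the hub repository; skipped if the folder is absent.
src <- "data-processed/MyTeam-MyModel/2024-04-28"
if (dir.exists(src)) {
  stage_submission(src, "validation/MyTeam-MyModel")
  print(non_parquet_files("validation/MyTeam-MyModel"))
}
```

An empty result from non_parquet_files() suggests the staged folder holds only the partitioned files.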

Then, all the parameters can be set:

setwd("~/covid19-scenario-modeling-hub")
# Path to the folder containing the projection file
projection_path <- "validation/MyTeam-MyModel/"
# Path to JSON file containing round definitions
js_def <- "hub-config/tasks.json"
# path to a table containing the population size of each geographical entity by FIPS
pop_path <- "data-locations/locations.csv"

Validation

Following the documentation associated with the round, available in the data-processed/README.md, some optional parameters should be set:

  • partition = c("origin_date", "target")
  • n_decimal = 1
  • merge_sample_col = TRUE, as the sample pairing information is expected to be available in two columns (run_grouping and stochastic_run)

validate_submission(projection_path, js_def, pop_path = pop_path,
                    partition = c("origin_date", "target"), n_decimal = 1,
                    merge_sample_col = TRUE)

Output

The function can generate 3 different outputs:

  • message when the submission does not contain any issues
  • warning + report message when the submission contains one or multiple minor issues that do not prevent the submission from being included.
  • error + report message when the submission contains one or multiple minor and/or major issues that prevent the submission from being included. In this case the submission file will have to be updated to be included in the corresponding SMH round.

The example run should return a message:

Run validation on files: 2024-04-28/cum death/2024-04-28-MyTeam-MyModel0.gz.parquet, 2024-04-28/cum hosp/2024-04-28-MyTeam-MyModel0.gz.parquet, 2024-04-28/inc death/2024-04-28-MyTeam-MyModel0.gz.parquet, 2024-04-28/inc hosp/2024-04-28-MyTeam-MyModel0.gz.parquet
End of validation check: all the validation checks were successful
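
When the validation is run from a script, the three kinds of output can be told apart programmatically. The sketch below assumes that validate_submission() signals its warnings and errors as standard R conditions, which the description above suggests but which is not verified here; classify_outcome() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: run a function and report "ok", "warning", or "error".
# Note that tryCatch() stops at the first warning or error it catches.
classify_outcome <- function(run) {
  tryCatch(
    {
      run()
      "ok"
    },
    warning = function(w) "warning",
    error = function(e) "error"
  )
}

# Only runs when the package and the example folder are available.
if (requireNamespace("SMHvalidation", quietly = TRUE) &&
    dir.exists("validation/MyTeam-MyModel")) {
  outcome <- classify_outcome(function() {
    SMHvalidation::validate_submission(
      "validation/MyTeam-MyModel/", "hub-config/tasks.json",
      pop_path = "data-locations/locations.csv",
      partition = c("origin_date", "target"), n_decimal = 1,
      merge_sample_col = TRUE
    )
  })
  message("Validation outcome: ", outcome)
}
```

Because the wrapper hides the report text, it is best used alongside a normal interactive run rather than instead of it.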

Before submitting, please verify that the submission file(s) are in the expected data-processed/ team-model folder and that no additional folders are in the repository, to avoid issues during the automatic validation.

Previous round

If you want to test a previous round's submission:

As the submission file format was updated in 2024, please use a past version of the package to validate a previous round. As the validation requirements and parameter behavior have evolved over time, please refer to the past version of the documentation included in the package to run the validation function.

As a warning:

  • The vast majority of submissions made in 2021 will return an error on the model_projection_date column, as the rules and tests on this column were only introduced in 2022 (round 12).
  • Also, the observed data might have been corrected after the previous round's submissions, so previous submissions might return an error saying: "Error: Some value(s) are inferior than the last observed cumulative death count. Please check location(s): 39 " (location 39 for example)

File Visualization Running Locally (only for quantiles values)

The SMHvalidation R package contains plotting functionality to output a plot of each location and target, with all scenarios and observed data incorporated. The visualization function accepts only the quantile output type.

To run this visualization locally:

# Observed data; set to NULL to not compare to observed data
lst_gs <- NULL
generate_validation_plots(projection_path, lst_gs, save_path = getwd(),
                          partition = c("origin_date", "target"))

The function will generate a PDF file with the visualizations.

If projection files are submitted with the quantile output type, the visualization function is called in the automatic validation and the output PDF file is available as an "artifact" of the GitHub Action. Please click on 'details' on the right of the 'Validate submission' GitHub Action checks. The PDF is available in a ZIP file as an artifact of the GitHub Action. For more information, please see here