
Discussion: scaling function design #53

Open · 2 of 5 tasks
chacalle opened this issue Nov 23, 2020 · 2 comments
Labels: enhancement (New feature or request)

chacalle (Collaborator) commented Nov 23, 2020

Basic Description

The scaling function should be used to scale values within a hierarchy so that the more detailed levels, when aggregated together, equal the aggregate level. Input data dt is defined by a set of id_cols, along with the column to be scaled over, col_stem.

The column to be scaled over can be one of two types of variables (col_type):

  1. A categorical variable like location or sex. col_type = categorical
  2. A numeric interval variable like age or year, defined by the start and end of each interval. col_type = interval
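
As a minimal sketch of these arguments, here is what a categorical scaling call might look like. The parent/child mapping format is an assumption for illustration, not something spelled out in this issue:

library(data.table)

# Hypothetical square input: two subnationals plus their national aggregate.
dt <- data.table(
  location = c("subnat_a", "subnat_b", "national"),
  year_start = 2020,
  val = c(4, 4, 10)
)

# Assumed mapping format: one row per parent-child pair in the hierarchy.
mapping <- data.table(
  parent = c("national", "national"),
  child = c("subnat_a", "subnat_b")
)

dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping
)
# Each subnational value is multiplied by 10 / (4 + 4) so that the
# subnationals sum to the national value.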

There are two basic types of use cases:

  1. Input data is expected to be "square" and exactly matches the pre-defined hierarchy. Only basic assertions need to be made, and the function should be optimized for speed.
  2. Input data is not "square" and may not match up exactly with the pre-defined hierarchy. More detailed assertions and standardization are needed.
    • Example 1: when aggregating across locations, some years may have different sets of locations available.
    • Example 2: when aggregating across locations, some locations may have different age groups available, so they need to be collapsed to the most detailed common age groups prior to each level of aggregation (sketched below).
    • Example 3: when aggregating across age groups, some locations may have different age groups available, so each detailed age group needs to be mapped correctly to the aggregate, and each location needs to cover the entire expected age range.
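
As a concrete sketch of Example 2, a hypothetical non-square input where the locations being aggregated report different age groups:

library(data.table)

# subnat_a reports single-year age groups, subnat_b a single five-year
# group; before aggregating to national, subnat_a would need to be
# collapsed to the common 0-5 interval.
dt <- data.table(
  location = c(rep("subnat_a", 5), "subnat_b"),
  age_start = c(0:4, 0),
  age_end = c(1:5, 5),
  val = c(1, 1, 1, 1, 1, 3)
)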

Implementation Details

Assertions

Square datasets only

  • All combinations of the unique values of each of the id_cols exist (see the sketch below).
  • All of the most detailed nodes in the hierarchy exist.
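
A sketch of what the squareness check could look like with data.table; assert_square is a hypothetical helper for illustration, not an existing function in the package:

library(data.table)

# Hypothetical helper: every combination of the unique values of each
# id column must be present in the input data.
assert_square <- function(dt, id_cols) {
  expected <- do.call(CJ, lapply(dt[, id_cols, with = FALSE], unique))
  missing_rows <- expected[!dt, on = id_cols]
  if (nrow(missing_rows) > 0) {
    stop("input data is missing ", nrow(missing_rows), " id combinations")
  }
  invisible(dt)
}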

Non-square datasets only

  • Each level of scaling needs to check for square data and potentially collapse interval columns to the most detailed common intervals.

What is the expected behavior when...

it is not possible to scale to one aggregate given the available input data? missing_dt_severity

  • Implemented

For example, when scaling to a national location, one subnational may be missing.

  1. Default is to throw an error.
  2. Warn or ignore, then skip the impossible scaling and continue with the others.
  3. Skip the check and attempt scaling anyway.
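
A sketch of how the proposed missing_dt_severity argument might be used, reusing the hypothetical dt and mapping from the earlier sketch; the exact accepted values ("stop", "warning", etc.) are an assumption here:

# With a "warning" severity, scaling an aggregate with a missing
# subnational would be skipped with a warning instead of erroring.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  missing_dt_severity = "warning"
)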

when interval variables do not exactly match up in the input data? collapse_interval_cols

  • Implemented

For example, when scaling to a national location, one subnational may have five-year age groups while another has single-year age groups.

  1. Default is to throw an error.
  2. Option to automatically collapse to the most detailed common intervals.
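
A sketch of the mixed-age-group case with the proposed collapse_interval_cols option, again reusing the hypothetical mapping from above:

library(data.table)

# subnat_a has single-year age groups, subnat_b one five-year group,
# plus the national aggregate to scale to.
dt <- data.table(
  location = c(rep("subnat_a", 5), "subnat_b", "national"),
  age_start = c(0:4, 0, 0),
  age_end = c(1:5, 5, 5),
  year_start = 2020,
  val = c(1, 1, 1, 1, 1, 3, 10)
)

# collapse_interval_cols = TRUE would first collapse subnat_a to the most
# detailed common interval (0-5) before scaling to the national value.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  collapse_interval_cols = TRUE
)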

when scaling a categorical variable, and one of the interval id_cols has overlapping intervals? overlapping_dt_severity

  • Implemented

For example, when scaling subnational to national values, the subnationals may have a mix of five-year and single-year age groups, and some subnationals may have both.

  1. Default is to throw an error.
  2. Warn or ignore, then drop the overlapping intervals and continue.
  3. Skip the check and continue with scaling.
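
A sketch of the proposed overlapping_dt_severity argument; as above, the accepted severity values are an assumption:

# A subnational reporting both 0-5 and its component single-year age
# groups has overlapping intervals; "warning" would drop the overlapping
# rows and continue rather than erroring.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  overlapping_dt_severity = "warning"
)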

when scaling a categorical variable with multiple levels in the mapping, but one level is missing? collapse_missing

  • Implemented

For example, when scaling county to state to national values, some years may not have state values. It should be possible to collapse the mapping so that it is known how the county level maps directly to the national level.

  1. Default is to throw an error.
  2. Option to automatically drop missing nodes from the mapping.
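
A sketch of the county-to-state-to-national case with the proposed collapse_missing option, assuming a dt with county/state/national location rows; the parent/child mapping format remains an assumption:

library(data.table)

# Hypothetical three-level mapping: counties nest in states, states in
# national.
mapping <- data.table(
  parent = c("national", "national", "state_a", "state_a"),
  child = c("state_a", "state_b", "county_1", "county_2")
)

# If state-level rows are missing for some years, collapse_missing = TRUE
# would drop the state nodes from the mapping so that counties map
# directly to national.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  collapse_missing = TRUE
)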

when value_cols have NA values, like #49? na_value_severity

  • Implemented
  1. Default is to throw an error.
  2. Warn or ignore, then drop missing values and continue with scaling.
  3. Skip the check for NA values and include them in scaling.
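
A sketch of the proposed na_value_severity argument (#49); the accepted values are again an assumption:

# "warning" would drop NA values with a warning and scale using the
# remaining rows; skipping the check would include NAs in scaling.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  na_value_severity = "warning"
)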

Implementation steps

  1. Clean up the testing script for scaling (right now it is potentially too long and hard to follow).
  2. Add a square argument to determine the amount of flexibility allowed in inputs.
  3. Add the na_value_severity argument.
chacalle added the enhancement label on Nov 23, 2020
krpaulson (Collaborator) commented

How about cases like this, where some groups need to be scaled and some don't (i.e. some groups only have the highest level of detail)?

library(data.table)

# Year 1 has single-year age groups (0-1, 1-2, 2-3) plus their aggregate
# (0-3); year 2 only has the aggregate age group (0-5) with no detail.
dt <- data.table(
  age_start = c(0, 1, 2, 0, 0),
  age_end = c(1, 2, 3, 3, 5),
  year_start = c(rep(1, 4), 2),
  val = c(1, 2, 3, 4, 5)
)

dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "age",
  col_type = "interval"
)

meghanfrisch commented

Would that also apply to accounting for historical locations?
