
Discussion: scaling function design #53

Open · 2 of 5 tasks
chacalle opened this issue Nov 23, 2020 · 2 comments
Labels: enhancement (New feature or request)

chacalle (Collaborator) commented Nov 23, 2020

Basic Description

The scaling function should be used to scale values within a hierarchy so that the more detailed levels, when aggregated together, equal the aggregate level. Input data dt is defined by a set of id_cols, along with the column to be scaled over, col_stem.

The column to be scaled over can be one of two types of variables (col_type):

  1. A categorical variable like location or sex. col_type = categorical
  2. A numeric interval variable like age or year, defined by the start and end of each interval. col_type = interval
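
As a minimal sketch of these arguments, here is what a categorical scaling call might look like. The parent/child mapping format is an assumption for illustration, not something spelled out in this issue:

library(data.table)

# Hypothetical square input: two subnationals plus their national aggregate.
dt <- data.table(
  location = c("subnat_a", "subnat_b", "national"),
  year_start = 2020,
  val = c(4, 4, 10)
)

# Assumed mapping format: one row per parent-child pair in the hierarchy.
mapping <- data.table(
  parent = c("national", "national"),
  child = c("subnat_a", "subnat_b")
)

dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping
)
# Each subnational value is multiplied by 10 / (4 + 4) so that the
# subnationals sum to the national value.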

There are two basic types of use cases:

  1. Input data is expected to be "square" and exactly matches the pre-defined hierarchy. Only basic assertions need to be made, and the function should be optimized for speed.
  2. Input data is not "square" and may not match up exactly with the pre-defined hierarchy. More detailed assertions and standardization are needed.
    • Example 1: when aggregating across locations, some years may have different sets of locations available.
    • Example 2: when aggregating across locations, some locations may have different age groups available, so they need to be collapsed to the most detailed common age groups prior to each level of aggregation (sketched below).
    • Example 3: when aggregating across age groups, some locations may have different age groups available, so each detailed age group needs to be mapped correctly to the aggregate, and each location needs to cover the entire expected age range.
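
As a concrete sketch of Example 2, a hypothetical non-square input where the locations being aggregated report different age groups:

library(data.table)

# subnat_a reports single-year age groups, subnat_b a single five-year
# group; before aggregating to national, subnat_a would need to be
# collapsed to the common 0-5 interval.
dt <- data.table(
  location = c(rep("subnat_a", 5), "subnat_b"),
  age_start = c(0:4, 0),
  age_end = c(1:5, 5),
  val = c(1, 1, 1, 1, 1, 3)
)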

Implementation Details

Assertions

Square datasets only

  • All combinations of the unique values of each of the id_cols exist (see the sketch below).
  • All of the most detailed nodes in the hierarchy exist.
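
A sketch of what the squareness check could look like with data.table; assert_square is a hypothetical helper for illustration, not an existing function in the package:

library(data.table)

# Hypothetical helper: every combination of the unique values of each
# id column must be present in the input data.
assert_square <- function(dt, id_cols) {
  expected <- do.call(CJ, lapply(dt[, id_cols, with = FALSE], unique))
  missing_rows <- expected[!dt, on = id_cols]
  if (nrow(missing_rows) > 0) {
    stop("input data is missing ", nrow(missing_rows), " id combinations")
  }
  invisible(dt)
}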

Non-square datasets only

  • Each level of scaling needs to check for square data and potentially collapse interval columns to the most detailed common intervals.

What is the expected behavior when...

it is not possible to scale to one aggregate given the available input data? missing_dt_severity

  • Implemented

For example, when scaling to a national location, one subnational may be missing.

  1. Default is to throw an error.
  2. Warn or ignore, then skip the impossible scaling and continue with the others.
  3. Skip the check and attempt scaling anyway.
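
A sketch of how the proposed missing_dt_severity argument might be used, reusing the hypothetical dt and mapping from the earlier sketch; the exact accepted values ("stop", "warning", etc.) are an assumption here:

# With a "warning" severity, scaling an aggregate with a missing
# subnational would be skipped with a warning instead of erroring.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  missing_dt_severity = "warning"
)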

when interval variables do not exactly match up in the input data? collapse_interval_cols

  • Implemented

For example, when scaling to a national location, one subnational may have five-year age groups while another has single-year age groups.

  1. Default is to throw an error.
  2. Option to automatically collapse to the most detailed common intervals.
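
A sketch of the mixed-age-group case with the proposed collapse_interval_cols option, again reusing the hypothetical mapping from above:

library(data.table)

# subnat_a has single-year age groups, subnat_b one five-year group,
# plus the national aggregate to scale to.
dt <- data.table(
  location = c(rep("subnat_a", 5), "subnat_b", "national"),
  age_start = c(0:4, 0, 0),
  age_end = c(1:5, 5, 5),
  year_start = 2020,
  val = c(1, 1, 1, 1, 1, 3, 10)
)

# collapse_interval_cols = TRUE would first collapse subnat_a to the most
# detailed common interval (0-5) before scaling to the national value.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  collapse_interval_cols = TRUE
)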

when scaling a categorical variable, and one of the interval id_cols has overlapping intervals? overlapping_dt_severity

  • Implemented

For example, when scaling subnational to national values, the subnationals may have a mix of five-year and single-year age groups, and some subnationals may have both.

  1. Default is to throw an error.
  2. Warn or ignore, then drop the overlapping intervals and continue.
  3. Skip the check and continue with scaling.
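
A sketch of the proposed overlapping_dt_severity argument; as above, the accepted severity values are an assumption:

# A subnational reporting both 0-5 and its component single-year age
# groups has overlapping intervals; "warning" would drop the overlapping
# rows and continue rather than erroring.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  overlapping_dt_severity = "warning"
)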

when scaling a categorical variable with multiple levels in the mapping, but one level is missing? collapse_missing

  • Implemented

For example, when scaling county to state to national values, some years may not have state values. It should be possible to collapse the mapping so that it is known how the county level maps directly to the national level.

  1. Default is to throw an error.
  2. Option to automatically drop missing nodes from the mapping.
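
A sketch of the county-to-state-to-national case with the proposed collapse_missing option, assuming a dt with county/state/national location rows; the parent/child mapping format remains an assumption:

library(data.table)

# Hypothetical three-level mapping: counties nest in states, states in
# national.
mapping <- data.table(
  parent = c("national", "national", "state_a", "state_a"),
  child = c("state_a", "state_b", "county_1", "county_2")
)

# If state-level rows are missing for some years, collapse_missing = TRUE
# would drop the state nodes from the mapping so that counties map
# directly to national.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  collapse_missing = TRUE
)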

when value_cols have NA values, like #49? na_value_severity

  • Implemented
  1. Default is to throw an error.
  2. Warn or ignore, then drop missing values and continue with scaling.
  3. Skip the check for NA values and include them in scaling.
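
A sketch of the proposed na_value_severity argument (#49); the accepted values are again an assumption:

# "warning" would drop NA values with a warning and scale using the
# remaining rows; skipping the check would include NAs in scaling.
dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("location", "year_start"),
  value_cols = "val",
  col_stem = "location",
  col_type = "categorical",
  mapping = mapping,
  na_value_severity = "warning"
)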

Implementation steps

  1. Clean up the testing script for scaling (right now it is potentially too long and hard to follow).
  2. Add a square argument to determine the amount of flexibility allowed in inputs.
  3. Add the na_value_severity argument.
chacalle added the enhancement label on Nov 23, 2020
krpaulson (Collaborator) commented

How about cases like this, where some groups need to be scaled and some don't (i.e. some groups only have the highest level of detail)?

library(data.table)

# Year 1 has single-year age groups (0-1, 1-2, 2-3) plus their aggregate
# (0-3); year 2 only has the aggregate age group (0-5) with no detail.
dt <- data.table(
  age_start = c(0, 1, 2, 0, 0),
  age_end = c(1, 2, 3, 3, 5),
  year_start = c(rep(1, 4), 2),
  val = c(1, 2, 3, 4, 5)
)

dt <- hierarchyUtils::scale(
  dt = dt,
  id_cols = c("age_start", "age_end", "year_start"),
  value_cols = "val",
  col_stem = "age",
  col_type = "interval"
)

meghanfrisch commented

Would that also apply to accounting for historical locations?
