Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(baking): add variables.type column that explicitly specifies type information #2039

Closed
wants to merge 5 commits into from

Conversation

Marigold
Copy link
Contributor

@Marigold Marigold commented Mar 24, 2023

See the original proposal.

Add column JSON column variables.type that can contain explicit type of a variable. Some examples are {"type": "int"}, {"type": "string"} or {"type": "ordinal", "sort": ["Low", "Medium", "High"]}. If null, then type is determined from data (as int, float, string or mixed) to be backward compatible.

If you're trying this locally, first apply the latest migration with yarn buildTsc && yarn runDbMigrations and then set type for income groups

update variables
set type = '{"type": "ordinal", "sort": ["Not categorized", "Low income", "Lower-middle income", "Upper-middle income", "High income"]}'
where id = 42563

Baked metadata file for that variable would then contain following fields:

{
  ...
  "type": "ordinal",
  "sort": ["Not categorized", "Low income", "Lower-middle income", "Upper-middle income", "High income"]
}

Related: #1692, #1324

@Marigold Marigold marked this pull request as draft March 24, 2023 06:26
@sophiamersmann
Copy link
Member

I made the (minimal) changes necessary to sort the scatterplot legend using the new metadata field (it works!).

I'm unsure though what's the best way to let Grapher know that a variable is known to be "ordinal". If I am understanding correctly, Grapher seems to ignore the metadata field type at the moment and simply marks every value column as NumberOrStringColumn (which is essentially a "mixed" type):

type: isContinent
? ColumnTypeNames.Continent
: ColumnTypeNames.NumberOrString,

A code comment regarding NumberOrStringColumn:

/**
* We strive to have clearly typed variables in the future, but for now our
* grapher variables are still untyped. Most are number-only, but we also have some
* string-only, and even some mixed ones.
* Hence, NumberOrStringColumn is used to store grapher variables.
* It extends AbstractColumnWithNumberFormatting, which ensures that we have
* implementations of formatValueShortWithAbbreviations and the like already.
* -- @marcelgerber, 2022-07-01
*/

I'm not confident making changes at the moment. I'll have a chat with Marcel once he's back and then get back to this :)

@sophiamersmann
Copy link
Member

After discussion with @danyx23 and @samizdatco, how to best represent ordinal variables remains an open question. The options are:

  1. Store ordinal variables as string values and attach an ordering to them (current behaviour, inspired by Vega types)
  2. Store ordinal variables as numerical values (that define an ordering implicitly) and attach labels to them

I believe (1) would make our life a bit easier since turning a categorial variable into an ordinal one would require us to add a "sort" array but no data migration would be needed. (Though we do have some ordinal variables that are currently stored as numerical values as well.) The advantage of (2) is that the extra bit of encoded information (negative/neutral/positive) would help us auto-pick appropriate colour palettes (e.g. if we had an ordinal variable with categories "disagree" / "don't care" / "agree" encoded by -1, 0, 1, Grapher could infer that a diverging colour palette is needed).

One additional thing to keep in mind, and this might be controversial, is that I think it would be beneficial to attach an ordering to the "continents" variable as well. We could then customise the order of the continents such that "North America" and "South America", for example, would appear next to each other in legends (example chart). If we wanted to do this, approach (1) would be less awkward.

@Marigold
Copy link
Contributor Author

(It doesn't make a difference on ETL end, so I yield and let you decide.)

@marcelgerber
Copy link
Member

Ideally, our new handling of ordinal data would be able to cover:

  • all the cases where we already have ordinal data points, and are currently using (unordered) categorical legends, with underlying string data (example)
  • all the cases where we already have ordinal data points, and are currently encoding them as numbers, with custom labels applied in the legend (example)
  • bonus points if it can handle scenarios where we mix ordinal categories with actual numerical observations (example)

In terms of data representation, I would prefer some sort of shorthand for categorical values, which we would then put into the metadata file, similar to how we already handle entities (where there's a mapping entityId -> entityName in metadata.json, and the data.json only contains the entity ids).
The reason here is that the (uncompressed) data files can become very unwieldy very quickly otherwise, and we will always have a restricted domain of possible categorical values.

(See also: quick analysis of categorical variables that are currently in use)

What's unclear to me at this point, though, is how we would differentiate "actual" numeric values from named values, then, in the case of mixed types.

@sophiamersmann
Copy link
Member

Thanks for having a look at this, Marcel!

Ideally, our new handling of ordinal data would be able to cover:

  • all the cases where we already have ordinal data points, and are currently using (unordered) categorical legends, with underlying string data (example)
  • all the cases where we already have ordinal data points, and are currently encoding them as numbers, with custom labels applied in the legend (example)

Am I understanding correctly that you would want a system that handles both cases, underlying string data and underlying numerical data, at the same time? My thinking was that we decide on one strategy to represent ordinal data and then migrate offending ordinal variables to the new system, therefore reducing complexity in the long run.

  • bonus points if it can handle scenarios where we mix ordinal categories with actual numerical observations (example)

Do you have an estimate of how many variables we have that have a mixed type with numerical and ordinal categories? My guess is that most mixed types contain categorical data rather than ordinal data (e.g. different types of missing data). If this turns out to be a small number of variables, then I'm happy to treat those as "mixed" and wait for a more pressing need to address this.

In terms of data representation, I would prefer some sort of shorthand for categorical values, which we would then put into the metadata file, similar to how we already handle entities (where there's a mapping entityId -> entityName in metadata.json, and the data.json only contains the entity ids). The reason here is that the (uncompressed) data files can become very unwieldy very quickly otherwise, and we will always have a restricted domain of possible categorical values.

Good idea! Maybe we could then for categorical/ordinal variables add a metadata field dimensions.values.values with a list of categories, e.g. [ { id: 1, name: "Low income" }, { id: 2, name: "Lower-middle income" }, ...]. In the case of ordinal variables, the ids would imply an ordering (this would mean we go for the numerical representation).

(See also: quick analysis of categorical variables that are currently in use)

Nice!

@marcelgerber
Copy link
Member

My thinking was that we decide on one strategy to represent ordinal data and then migrate offending ordinal variables to the new system, therefore reducing complexity in the long run.

Ah, yes, that's exactly what I mean. Definitely pro reducing complexity here.

Do you have an estimate of how many variables we have that have a mixed type with numerical and ordinal categories?

Yeah, as you say this is a tiny minority. Not a pressing need.

@sophiamersmann
Copy link
Member

For reference: the original proposal has been updated to reflect the outcome of recent discussions

@github-actions
Copy link

github-actions bot commented Aug 4, 2023

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

@github-actions
Copy link

github-actions bot commented Sep 9, 2023

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

@github-actions github-actions bot added the stale label Sep 9, 2023
@Marigold
Copy link
Contributor Author

We've talked about this recently and have an appetite for finishing it.

@github-actions github-actions bot removed the stale label Sep 11, 2023
@github-actions
Copy link

This PR has had no activity within the last two weeks. It is considered stale and will be closed in 3 days if no further activity is detected.

@github-actions github-actions bot added the stale label Sep 26, 2023
@Marigold
Copy link
Contributor Author

Hold on bot!

@sophiamersmann
Copy link
Member

I think adding the pinned label keeps the bot away

@sophiamersmann
Copy link
Member

Hey @Marigold,

This PR died a long time ago (rip), but we're still keen on getting type information into Grapher. If we wanted to resurrect this PR, how much effort/time do you think it would be on your part?

@Marigold
Copy link
Contributor Author

If we wanted to resurrect this PR, how much effort/time do you think it would be on your part?

I'm not sure, but we put it on this cycle's list. I'll prioritize it to unblock you.

@Marigold
Copy link
Contributor Author

Killed in favour of #3362

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants